From owner-freebsd-questions@FreeBSD.ORG  Tue Jan 27 00:10:06 2009
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C21721065674
	for <freebsd-questions@freebsd.org>;
	Tue, 27 Jan 2009 00:10:06 +0000 (UTC)
	(envelope-from freebsd@edvax.de)
Received: from mx01.qsc.de (mx01.qsc.de [213.148.129.14])
	by mx1.freebsd.org (Postfix) with ESMTP id 7E22A8FC16
	for <freebsd-questions@freebsd.org>;
	Tue, 27 Jan 2009 00:10:06 +0000 (UTC)
	(envelope-from freebsd@edvax.de)
Received: from r55.edvax.de (port-92-196-68-197.dynamic.qsc.de [92.196.68.197])
	by mx01.qsc.de (Postfix) with ESMTP id 15AB43CB46;
	Tue, 27 Jan 2009 01:10:02 +0100 (CET)
Received: from r55.edvax.de (localhost [127.0.0.1])
	by r55.edvax.de (8.14.2/8.14.2) with SMTP id n0R09tUl006091;
	Tue, 27 Jan 2009 01:09:55 +0100 (CET)
	(envelope-from freebsd@edvax.de)
Date: Tue, 27 Jan 2009 01:09:55 +0100
From: Polytropon <freebsd@edvax.de>
To: cpghost <cpghost@cordula.ws>
Message-Id: <20090127010955.88683592.freebsd@edvax.de>
In-Reply-To: <20090126223906.GA2444@phenom.cordula.ws>
References: <20090126001822.GA38314@thought.org>
	<20090126005156.GJ66858@comcast.net> <497D0FF3.6090402@telenix.org>
	<20090126080618.GA51983@thought.org>
	<20090126091623.a0b50f64.freebsd@edvax.de>
	<20090126220623.GA76673@thought.org>
	<20090126223906.GA2444@phenom.cordula.ws>
Organization: EDVAX
X-Mailer: Sylpheed 2.4.7 (GTK+ 2.12.1; i386-portbld-freebsd7.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Gary Kline <kline@thought.org>,
	FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: can i split a pdf file?
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Polytropon <freebsd@edvax.de>
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jan 2009 00:10:07 -0000

On Mon, 26 Jan 2009 23:39:06 +0100, cpghost <cpghost@cordula.ws> wrote:
> Those PDFs are usually scanned,
> and the scanner software (usually on Windows) assembles all screenshots
> into a PDF of images.

Handy for printing, but not for OCR postprocessing.


> That's what you find on the Net.

On the Web. :-)


> This is not such a bad idea, esp. when it comes to technical textbooks,
> which usually contain a lot of diagrams, formulae, tables etc...; since
> an OCR software that would be able to reverse all this into LaTeX and
> EPS figures has yet to be programmed (that's a difficult task).

As I've already mentioned, scanning the characters is only one part.
Your example of diagrams and formulas is good to illustrate this.
And because LaTeX is the only professional typesetting system
(and no, "Word" isn't such a tool), it would be really great to
have a tool pdf2tex which would extract the characters of the text,
typeset them as in the original (paragraphing, hyphenation etc.),
include embedded pictures as pictures (of course), and re-create
formulas, so that the result would run through pdfLaTeX and
produce an improved version of the source PDF file.

But that's a task for the next generation of mankind. :-)


> Some PDFs encode the fonts
> in a special section, and then use text (sometimes compressed
> or encrypted), which refers to those fonts. In such a case, you
> could extract the pure text from the PDF.

It's worth mentioning that if the original text contains characters
(represented in the additionally stored fonts) with special accents
or other diacritics (as is usual in non-English languages), the
target system needs to support them, which it usually does by
means of UTF-8.
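
To illustrate the point about accented characters, here is a minimal
Python sketch. The helper name encodable() and the sample string are
made up for this example; it only shows that text with umlauts or
other diacritics does not fit into plain ASCII but does fit into
UTF-8, at the cost of more than one byte per such character.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample text; \u00dc is U-umlaut and \u00ef is i-diaeresis.
sample = "\u00dcberfraktur na\u00efve"

ascii_ok = encodable(sample, "ascii")   # the two accented letters do not fit
utf8_ok = encodable(sample, "utf-8")    # UTF-8 can represent all of them

# Each of the two accented letters takes two bytes in UTF-8.
extra_bytes = len(sample.encode("utf-8")) - len(sample)

print(ascii_ok, utf8_ok, extra_bytes)
```

So whatever receives the extracted text (terminal, editor, locale)
has to be set up for UTF-8, or those characters get lost or garbled.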


> Other PDFs simply encode the book as a set of bitmaps (see above);
> and then your only chance is to find an OCR software that would not
> only be able to recognize the characters in the bitmaps, but also
> to cope with those Fraktur- or other exotic fonts.

Yes, das Doytsh Uberfrucktoor makes everything unreadable. :-)

It gets even more complicated with hand-written books...
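
As a rough way to tell the two kinds of PDF apart before trying any
extraction, one could look for font and image dictionaries in the raw
file. The following Python sketch is only a heuristic under stated
assumptions: the function name classify_pdf_bytes() and its return
strings are invented here, and because newer PDFs may keep these
dictionaries inside compressed object streams, "unknown" does not
prove anything.

```python
import re


def classify_pdf_bytes(data: bytes) -> str:
    """Guess from raw PDF bytes whether the file carries text or only images.

    Heuristic only: dictionaries hidden in compressed object streams
    are not visible to this scan.
    """
    # "/Type /Font" marks a font dictionary; \b keeps /FontDescriptor out.
    has_font = re.search(rb"/Type\s*/Font\b", data) is not None
    # "/Subtype /Image" marks an embedded bitmap (image XObject).
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_font:
        return "text"          # fonts present, text is probably extractable
    if has_image:
        return "images-only"   # looks like a scan, OCR would be needed
    return "unknown"


# Tiny made-up fragments instead of real files:
print(classify_pdf_bytes(b"<< /Type /Font /Subtype /Type1 >>"))
print(classify_pdf_bytes(b"<< /Type /XObject /Subtype /Image >>"))
```

For a real file, read it in binary mode and pass the bytes to the
function. A scanned PDF with an added OCR text layer contains both
fonts and images and would be reported as "text" here.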


> Some OCR programs
> are interactive and trainable, so that you can say: this is an 'S',
> and that is a 'T'..., but AFAIK, there's no free and open source
> OCR program with this capability (yet).

Wow, I've never heard of this concept, but it's a really intelligent
solution. If this really works, it still has the "disadvantage" of
needing much time for training the program, and for postprocessing.

It's easier to \usepackage[german]{uberfraktur} to make the text
unreadable again. :-)


-- 
Polytropon
From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...