From: Polytropon
To: cpghost
Cc: Gary Kline, FreeBSD Mailing List
Date: Tue, 27 Jan 2009 01:09:55 +0100
Message-Id: <20090127010955.88683592.freebsd@edvax.de>
In-Reply-To: <20090126223906.GA2444@phenom.cordula.ws>
Organization: EDVAX
Subject: Re: can i split a pdf file?

On Mon, 26 Jan 2009 23:39:06 +0100, cpghost wrote:
> Those PDFs are usually scanned,
> and the scanner software (usually on Windows) assembles all screenshots
> into a PDF of images. Handy for printing, but not for OCR postprocessing.
> That's what you find on the Net.

On the Web. :-)

> This is not such a bad idea, esp. when it comes to technical textbooks,
> which usually contain a lot of diagrams, formulae, tables etc...; since
> an OCR software that would be able to reverse all this into LaTeX and
> EPS figures has yet to be programmed (that's a difficult task).

As I've already mentioned, recognizing the characters is only one
part of the job. Your example of diagrams and formulas illustrates
this well. And because LaTeX is the only professional typesetting
system (and no, "Word" isn't such a tool), it would be really great
to have a tool, pdf2tex, that would extract the characters of the
text, typeset them as in the original (paragraphing, hyphenation
etc.), include embedded pictures as pictures (of course), and
re-create formulas, so that the result would run through pdfLaTeX
and produce an improved version of the source PDF file. But that's
a task for the next generation of mankind. :-)

> Some PDFs encode the fonts
> in a special section, and then use text (sometimes compressed
> or encrypted), which refers to those fonts. In such a case, you
> could extract the pure text from the PDF.

It's worth mentioning that if the original text contains characters
(represented in the additionally stored fonts) with special accents
or orientations (as is usual in non-English languages), the target
system needs to support them, which it usually does by means of
UTF-8.
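
To illustrate the point about accented characters, here is a minimal
Python sketch. The sample string `extracted` and the helper
`encodable` are made up for this example and do not come from any PDF
tool; the sketch only shows that text with accents cannot be stored
in a plain US-ASCII target, while UTF-8 can hold it.

```python
# Sketch only: the sample string stands in for text extracted from a PDF.
# It is invented for illustration and not taken from a real document.
extracted = "Überfraktur: größer, schöner, café"


def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text can be stored in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Check a few common target encodings.
for enc in ("us-ascii", "latin-1", "utf-8"):
    print(enc, encodable(extracted, enc))

# Lossy fallback for an ASCII-only target:
# every unsupported character is replaced by '?'.
print(extracted.encode("us-ascii", errors="replace").decode("us-ascii"))
```

Whether a given extraction tool emits UTF-8 by default depends on the
tool and its options, so check its documentation rather than relying
on this sketch.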

> Other PDFs simply encode the book as a set of bitmaps (see above);
> and then your only chance is to find an OCR software that would not
> only be able to recognize the characters in the bitmaps, but also
> to cope with those Fraktur- or other exotic fonts.

Yes, das Doytsh Uberfrucktoor makes everything unreadable. :-)
It gets even more complicated with hand-written books...

> Some OCR programs
> are interactive and trainable, so that you can say: this is an 'S',
> and that is a 'T'..., but AFAIK, there's no free and open source
> OCR program with this capability (yet).

Wow, I had never heard of this concept, but it's a really
intelligent solution. If it really works, it still has the
"disadvantage" of needing a lot of time for training the program
and for postprocessing. It's easier to \usepackage[german]{uberfraktur}
to make the text unreadable again. :-)

-- 
Polytropon
From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...