From: Gary Kline
To: Roland Smith
Cc: FreeBSD Mailing List
Date: Tue, 2 Dec 2008 15:59:59 -0800
Subject: Re: any way to turn a pdf file into something OCR-able?

On Tue, Dec 02, 2008 at 02:07:30AM +0100, Roland Smith wrote:
> On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
> > pdftotext fails on the large [32MB] file I've got. Is there any
> > other way I can translate this huge textfile to ascii or html or
> > text?
>
> Please define "fail" in this context? I've used pdftotext on documents
> exceeding 40MB. However, there are of course things that don't work:
>
> 1) Some PDFs are just wrappers around JPEG images. In this case there is
>    no text for pdftotext to convert => epic fail.
>
> 2) If the text contains ligatures etc. you should use the proper
>    encoding that contains such characters (e.g. '-enc UTF-8') or you will
>    lose them.
>
> 3) Things like equations will not render well, if at all. This also
>    depends on the encoding.

It probably was a PDF wrapped around a JPEG. I was able to convert
another PDF to plaintext in a flash.

(*sigh*) It wasn't a total waste of time, because I found the entire
text transferred to buggy ASCII somewhere [[ thanks to some prof ]].
So, if I ever want to run aspell against a 900-page file, at least I
have that option!

gary

> Roland
> --
> R.F.Smith                                   http://www.xs4all.nl/~rsmith/
> [plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
> pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)

--
Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org  http://transfinite.thought.org
Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php
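
For anyone who hits the same wall: a minimal sketch of how one might check for the image-only case from point 1 before assuming pdftotext is broken. It runs pdftotext with the '-enc UTF-8' option from point 2 and looks at how much text comes back per page. The function names and the 20-characters-per-page threshold are made up for illustration, and the page counting assumes pdftotext's default behaviour of ending each page with a form feed.

```python
import shutil
import subprocess


def build_command(pdf_path, encoding="UTF-8"):
    """Return the pdftotext argv; a final '-' sends the text to stdout."""
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def looks_image_only(text, min_chars_per_page=20):
    """Heuristic guess that a PDF is just scanned images.

    Assumes pages are separated by form feeds (pdftotext's default).
    The threshold is an arbitrary assumption, not a pdftotext feature.
    """
    pages = text.split("\f")
    # A trailing form feed leaves one empty chunk at the end; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    if not pages:
        return True
    total_chars = sum(len(page.strip()) for page in pages)
    return total_chars < min_chars_per_page * len(pages)


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling looks_image_only(extract_text("big.pdf")) on a hypothetical big.pdf would return True when almost no text comes out, which suggests the file needs OCR rather than a different pdftotext option.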