From: Robert Huff
Date: Mon, 1 Dec 2008 20:23:09 -0500
To: FreeBSD Mailing List
Subject: Re: any way to turn a pdf file into something OCR-able?

Roland Smith writes:

> > pdftotext fail on the large [32MB] file I've got.  Is there any
> > other way I can translate this huge textfile to ascii or html or
> > text?
>
> Please define "fail" in this context.  I've used pdftotext on
> documents exceeding 40MB.  However, there are of course things that
> don't work:
>
> 1) Some PDFs are just wrappers around JPEG images.  In this case
> there is no text for pdftotext to convert => epic fail.

	In this case, "convert" from the ImageMagick port will get you
a series of image files (.jpg, .gif, ...) that an OCR program can
work on.
	Read the manual carefully before attempting; also note this
can be a slow process.


				Robert Huff
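
A minimal sketch of the "convert" step mentioned above, written in Python so the command line can be inspected before anything is run. The function names, the 300 dpi default, and the "page-%03d.jpg" output pattern are assumptions for illustration, not something from the thread. It also assumes ImageMagick is installed with a "convert" binary in PATH and a working Ghostscript for reading PDFs.

```python
import shutil
import subprocess


def build_convert_command(pdf_path, out_pattern="page-%03d.jpg", density=300):
    """Return the argv list for an ImageMagick 'convert' call.

    -density sets the resolution (dpi) used when rendering the PDF
    pages; 300 is an assumed default, not a recommendation from the
    thread. A %03d in the output name makes ImageMagick number the
    output files.
    """
    return ["convert", "-density", str(density), pdf_path, out_pattern]


def rasterize(pdf_path, out_pattern="page-%03d.jpg", density=300):
    """Run the conversion; this can be slow on a large PDF."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' not found in PATH")
    # check=True raises CalledProcessError if convert exits non-zero.
    subprocess.run(
        build_convert_command(pdf_path, out_pattern, density), check=True
    )
```

Calling rasterize("scan.pdf") would, under those assumptions, write numbered .jpg files into the current directory. Printing the result of build_convert_command() first is a cheap way to check the arguments against the manual before starting a long run.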