From owner-freebsd-questions@FreeBSD.ORG  Wed Dec  3 00:00:07 2008
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 176411065672
	for <freebsd-questions@freebsd.org>;
	Wed,  3 Dec 2008 00:00:07 +0000 (UTC)
	(envelope-from kline@thought.org)
Received: from aristotle.thought.org (ns1.thought.org [209.180.213.210])
	by mx1.freebsd.org (Postfix) with ESMTP id AE6DE8FC13
	for <freebsd-questions@freebsd.org>;
	Wed,  3 Dec 2008 00:00:06 +0000 (UTC)
	(envelope-from kline@thought.org)
Received: from thought.org (tao.thought.org [10.47.0.250])
	(authenticated bits=0)
	by aristotle.thought.org (8.14.2/8.14.2) with ESMTP id mB300WXD067764; 
	Tue, 2 Dec 2008 16:00:32 -0800 (PST)
	(envelope-from kline@thought.org)
Received: by thought.org (nbSMTP-1.00) for uid 1002
	kline@thought.org; Tue,  2 Dec 2008 15:59:59 -0800 (PST)
Date: Tue, 2 Dec 2008 15:59:59 -0800
From: Gary Kline <kline@thought.org>
To: Roland Smith <rsmith@xs4all.nl>
Message-ID: <20081202235959.GB63279@thought.org>
References: <20081201231440.GA30682@thought.org>
	<20081202010730.GA15970@slackbox.xs4all.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20081202010730.GA15970@slackbox.xs4all.nl>
User-Agent: Mutt/1.4.2.3i
X-Organization: Thought Unlimited. Public service Unix since 1986.
X-Of_Interest: With 22 years  of service to the Unix community.
X-Spam-Status: No, score=-4.4 required=3.6 tests=ALL_TRUSTED,BAYES_00
	autolearn=ham version=3.2.3
X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on
	aristotle.thought.org
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: any way to turn a pdf file into something OCR-able?
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 03 Dec 2008 00:00:07 -0000

On Tue, Dec 02, 2008 at 02:07:30AM +0100, Roland Smith wrote:
> On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
> > 	pdftotext fails on the large [32MB] file I've got.  Is there any
> > 	other way I can translate this huge file to ascii or html or
> > 	text?
> 
> Please define "fail" in this context. I've used pdftotext on documents
> exceeding 40MB. However, there are of course things that don't work:
> 
> 1) Some PDFs are just wrappers around JPEG images. In this case there is
> no text for pdftotext to convert => epic fail.
> 
> 2) If the text contains ligatures etc. you should use the proper
> encoding that contains such characters (e.g. '-enc UTF-8') or you will
> lose them.
> 
> 3) Things like equations will not render well, if at all. This also
> depends on the encoding.
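
Case 1 above (a PDF that only wraps scanned images) can be checked for
before blaming pdftotext. The following is a minimal sketch, not a
definitive test: it assumes pdftotext is on the PATH, relies on pdftotext
ending each page with a form feed, and uses hypothetical helper names
(extract_text, looks_image_only) and an arbitrary threshold of 20
characters per page.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext with UTF-8 output; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_image_only(text, min_chars_per_page=20):
    """Guess whether pdftotext output came from a scanned, image-only PDF.

    The threshold is an assumption chosen for illustration, not a value
    taken from pdftotext itself.
    """
    pages = text.split("\f")
    # pdftotext ends every page with a form feed, so drop the empty tail.
    if pages and pages[-1].strip() == "":
        pages.pop()
    if not pages:
        return True
    # Count non-whitespace characters across all pages.
    total = sum(len("".join(page.split())) for page in pages)
    return total / len(pages) < min_chars_per_page
```

If looks_image_only(extract_text("big.pdf")) returns True, the file most
likely has no text layer, and OCR on the page images is the only route
to plain text. The file name big.pdf is a placeholder.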


	It probably was a pdf wrapped around a jpeg.   I was able to convert
	another pdf to plaintext in a flash.   (*sigh*)  It wasn't a total
	waste of time, because I found the entire text transferred to buggy
	ASCII somewhere [[ thanks to some prof ]].  So, if I ever want to run
	aspell against a 900-page file, at least I have that option!

	gary


> 
> Roland
> -- 
> R.F.Smith                                   http://www.xs4all.nl/~rsmith/
> [plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
> pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)


-- 
 Gary Kline  kline@thought.org  http://www.thought.org  Public Service Unix
        http://jottings.thought.org   http://transfinite.thought.org
 Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php