From owner-freebsd-questions@FreeBSD.ORG  Tue Dec  2 01:07:33 2008
Date: Tue, 2 Dec 2008 02:07:30 +0100
From: Roland Smith <rsmith@xs4all.nl>
To: Gary Kline <kline@thought.org>
Message-ID: <20081202010730.GA15970@slackbox.xs4all.nl>
References: <20081201231440.GA30682@thought.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="82I3+IH0IqGh5yIs"
Content-Disposition: inline
In-Reply-To: <20081201231440.GA30682@thought.org>
X-GPG-Fingerprint: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725
X-GPG-Key: http://www.xs4all.nl/~rsmith/pubkey.txt
X-GPG-Notice: If this message is not signed, don't assume I sent it!
User-Agent: Mutt/1.5.18 (2008-05-17)
X-Virus-Scanned: by XS4ALL Virus Scanner
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: any way to turn a pdf file into something OCR-able?


--82I3+IH0IqGh5yIs
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
> 	pdftotext fail on the large [32MB] file I've got.  Is there any
> 	other way I can translate this huge textfile to ascii or html or
> 	text?

What exactly do you mean by "fail"? I have used pdftotext on documents
exceeding 40MB. There are, however, some things that do not work:

1) Some PDFs are just wrappers around JPEG images. In this case there is
no text for pdftotext to convert, so it produces no usable output.

2) If the text contains ligatures etc., you should use an output
encoding that contains such characters (e.g. '-enc UTF-8') or you will
lose them.

3) Things like equations will not render well, if at all. This also
depends on the encoding.
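
To tell case 1 apart from a real failure, something along these lines
might help. It is only a rough sketch in Python: it assumes pdftotext is
in your PATH and that passing '-' as the output file sends the text to
stdout. The helper names and the 20-character threshold are made up for
illustration, not anything pdftotext itself provides.

```python
import shutil
import subprocess
import sys


def build_command(pdf_path):
    """Return the pdftotext argument list; '-' sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def looks_image_only(text, min_chars=20):
    """Guess that a PDF holds only images if almost no text came out.

    min_chars is an arbitrary, hypothetical threshold. Whitespace and the
    form feeds that separate pages are not counted.
    """
    return len("".join(text.split())) < min_chars


def extract_text(pdf_path):
    """Run pdftotext with UTF-8 output and return the text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found in PATH")
    result = subprocess.run(build_command(pdf_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    text = extract_text(sys.argv[1])
    if looks_image_only(text):
        print("Almost no text found; the PDF is probably scanned images.")
    else:
        print(text)
```

Run it with the PDF as the only argument. If it reports almost no text,
the file is most likely a set of page images and pdftotext has nothing
to work with.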

Roland
--=20
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)

--82I3+IH0IqGh5yIs
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (FreeBSD)

iEYEARECAAYFAkk0ilIACgkQEnfvsMMhpyX9GwCgljxePhLFAy/thtzNiyTbvHeM
nhMAn34OELIwnwlX7OqyRa4rEg46fVG4
=1x24
-----END PGP SIGNATURE-----

--82I3+IH0IqGh5yIs--