From owner-freebsd-questions@FreeBSD.ORG  Sat Oct 13 21:21:25 2012
Date: Sat, 13 Oct 2012 23:21:24 +0200
From: Polytropon <freebsd@edvax.de>
To: Gary Kline <kline@thought.org>
Subject: Re: editing pdf files
Message-Id: <20121013232124.86e6355d.freebsd@edvax.de>
In-Reply-To: <20121013203816.GD14155@ethic.thought.org>
References: <5074A6B9.8040209@dreamchaser.org> <5078641D.4050905@passap.ru>
 <20121012234628.GA11112@ethic.thought.org>
 <20121013131907.c666bfc2.freebsd@edvax.de>
 <20121013203816.GD14155@ethic.thought.org>
Organization: EDVAX
X-Mailer: Sylpheed 3.1.1 (GTK+ 2.24.5; i386-portbld-freebsd8.2)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Mailing List <freebsd-questions@freebsd.org>

On Sat, 13 Oct 2012 13:38:16 -0700, Gary Kline wrote:
> On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> > On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> > > 	ive got a question that fits in here.  hopefully.
> > > 
> > > 	last week  I found a book from 1901 that google had scanned and listed
> > > 	as a pdf file.  it was text plus photos of the rich/famous of the 
> > > 	1800s.  somehow, google found the exact string that matched my great
> > > 	grandfather [from the civil war].  I d'loaded the file (maybe 2mbytes)
> > > 	and searched using acroread.  nada.  I used the pdftotext utility.
> > > 	same: nothing but  some 600 page numbers.
> > > 
> > > 	my guess is that google just took photos of the book and used other
> > > 	tools to create a pdf file.  I am not =that= serious  about genealogy,
> > > 	but I would like to know if there are any tools to edit this kind of
> > > 	pdf file.
> > 
> > In case the PDF is nothing more than a compilation of images,
> > there's a way to deal with it for editing:
> 
> 
> 	the images in this book aren't what I am interested in.
> 	just text.

If the text is "in" the images (i.e. the pages are scans and the
PDF contains no actual text), running those images through OCR
is the only way to obtain the text.


> 	what fmt works best with the ocr suites?  or are they about the 
> 	same?  for the section I got in that 1901 book on my g-grandfather,
> 	it was only about 1.5 pages.  there was no photo, just his name 
> 	and some bio.  Still, things I had no knowledge of.  I'm sure 
> 	that my father didnt know either!

It should work with any lossless (!) format, especially one that
stores only two colors (black and white), which PBM, GIF and PNG
can do and JPEG can't. If tesseract does not read PBM files
directly, convert them into something it handles better, such as
TIFF or PNG; you can use

	% convert .-530.pbm 530.png
	% convert .-531.pbm 531.png

manually (as you will only process two files) and then run the OCR
process on them.
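
If you end up with more than a couple of pages, the two steps can be
wrapped in a small sh function. This is only a sketch: it assumes the
images are named .-NNN.pbm as above, that ImageMagick's convert and
tesseract are installed, and that English ("-l eng") is the right
language. The names run, ocr_pages and DRY_RUN are made up for
illustration.

```shell
#!/bin/sh
# Sketch: convert selected pdfimages output files to PNG, then OCR them.
# Assumption: input files are named .-NNN.pbm in the current directory.

# Print the command instead of executing it when DRY_RUN=1 (hypothetical
# switch, handy for checking file names before doing any work).
run() {
	if [ "${DRY_RUN:-0}" = 1 ]; then
		echo "$@"
	else
		"$@"
	fi
}

# ocr_pages NNN [NNN ...]: one convert and one tesseract call per page.
ocr_pages() {
	for n in "$@"; do
		run convert ".-$n.pbm" "$n.png" || return 1
		# tesseract appends .txt to the output base, e.g. 530.txt
		run tesseract "$n.png" "$n" -l eng || return 1
	done
}
```

Calling "ocr_pages 530 531" should then leave 530.txt and 531.txt next
to the PNG files; with DRY_RUN=1 set it only prints the commands.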

Note that pdfimages can also output color images (if the images in
the source are in color); for example, I got .-000.ppm (PPM format)
containing a diagram from "Good Ideas, Through the Looking Glass"
by N. Wirth.
I'm not sure if there could also "directly" be PNG or EPS files
in a PDF file...


-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...