From: Gary Kline
Date: Sat, 13 Oct 2012 13:04:27 -0700
To: "C. P. Ghost"
Cc: freebsd-questions@freebsd.org
Subject: Re: editing pdf files
Message-ID: <20121013200427.GC14155@ethic.thought.org>

On Sat, Oct 13, 2012 at 04:40:23AM +0200, C. P. Ghost wrote:
> On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline wrote:
> > On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> >> 10.10.2012 02:35, Gary Aitken wrote:
> >>
> >> > Can someone give me advice on editing pdf files?
> >>
> >> Take a look at graphics/inkscape.
> >>
> >> --
> >> WBR, Boris Samorodov (bsam)
> >> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
> >
> >
> > ive got a question that fits in here. hopefully.
> >
> > last week I found a book from 1901 that google had scanned and listed
> > as a pdf file. it was text plus photos of the rich/famous of the
> > 1800s. somehow, google found the exact string that matched my great
> > grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> > and searched using acroread. nada. I used the pdftotext utility.
> > same: nothing but some 600 page numbers.
> >
> > my guess is that google just took photos of the book and used other
> > tools to create a pdf file. I am not =that= serious about genealogy,
> > but I would like to know if there are any tools to edit this kind of
> > pdf file.
>
> I suspect the following: they scanned the book and put all the images
> into the PDF. The PDF itself is merely a container for scanned pages;
> it thus contains no text (save for the page numbers).
>
> That Google was able to search in this file is probably due to them running
> some OCR program on the image files, and then indexing the (approximate)
> text that the OCR program generated. Probably they used something like
> tesseract-ocr from ports graphics/tesseract:
> http://code.google.com/p/tesseract-ocr/
>

in more recent google stuff--text--sci-tech zines or whatever--it seems
like they have used some very high-end ocr programs and =then= turned
the file into pdf. I have been able to get very good textfiles from a
small sample of google's work.

a few years ago I tried the ocr ports we have. very poor results. it
may be time to see if the newer versions give me better results.

gary

ps: tesseract was one I tried [circa '10] ... time to look at the actual
Code!

>
> -cpghost.
>
> --
> Cordula's Web. http://www.cordula.ws/

--
Gary Kline kline@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
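
For anyone who wants to repeat the experiment described in this thread, here is a rough Python sketch of the two steps: check whether pdftotext finds real text, and if not, render the pages and run them through tesseract. It assumes the pdftotext and pdftoppm tools (from poppler) and the tesseract command-line program are installed and on the PATH. The function names, the 20-word threshold, and the 300 dpi setting are illustrative choices, not anything taken from the thread, and this is not a claim about how Google processed the book.

```python
import re
import subprocess
from pathlib import Path


def looks_like_scan(extracted_text, min_words=20):
    """Heuristic: True if the text holds fewer than min_words alphabetic words.

    A PDF that yields only page numbers (as described above) has almost no
    alphabetic words. The threshold of 20 is an arbitrary assumption.
    """
    words = re.findall(r"[A-Za-z]{3,}", extracted_text)
    return len(words) < min_words


def extract_text(pdf_path):
    """Return whatever text layer pdftotext finds ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_pdf(pdf_path, workdir, dpi=300):
    """Render each page to PNG with pdftoppm, then OCR each image with tesseract."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(workdir / "page")],
        check=True,
    )
    pages = []
    for image in sorted(workdir.glob("page-*.png")):
        # "stdout" as the output base makes tesseract print the recognized text.
        result = subprocess.run(
            ["tesseract", str(image), "stdout"],
            check=True, capture_output=True, text=True,
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

A typical use would be to call extract_text() on the downloaded file, pass the result to looks_like_scan(), and only fall back to ocr_pdf() when that returns True, since OCR on some 600 pages is slow and the output is approximate at best.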