Date:      Fri, 22 Jan 2021 20:58:38 -0800
From:      Weaver <weaver@riseup.net>
To:        freebsd-questions@freebsd.org
Subject:   Re: Convert PDF to Excel
Message-ID:  <8236dabc52a7801b0cb7edce8c954623@riseup.net>
In-Reply-To: <20210123054209.f03ac420.freebsd@edvax.de>
References:  <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com> <20210123054209.f03ac420.freebsd@edvax.de>

On 23-01-2021 14:42, Polytropon wrote:
> On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
>> I have a situation where I'd like to convert PDF to XLSX.
>> The documents are 35MB and 105MB but contain several thousand pages.
>>
>> Does anyone know a good tool that can handle this?
> 
> Depends on what is in the PDFs.
> 
> If this is rendered text, you can maybe extract the text with
> the tool pdftotext and convert it to CSV, then import the CSV
> in "Excel".

Alternatively, AbiWord has a PDF import plugin.
Such importers are never perfect, however, and some extensive editing
may be necessary, depending on the document.
You could then import the result into Gnumeric and use ssconvert (which
ships with Gnumeric) to convert it into any of four Excel formats.
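
For what it's worth, the text route could be sketched in Python roughly
as below. This is only a sketch under assumptions: pdftotext and
ssconvert are on the PATH, and the columns in the `pdftotext -layout`
output are separated by runs of two or more spaces. That column rule is
a guess about the document, not something either tool guarantees, and
the function names are made up for illustration.

```python
import csv
import re
import subprocess


def split_columns(line):
    """Split one line of layout-preserved text into cells.

    Assumption: a run of two or more spaces marks a column gap.
    """
    return [c.strip() for c in re.split(r" {2,}", line.strip()) if c.strip()]


def text_to_rows(text):
    """Turn extracted text into rows, skipping blank lines and page breaks."""
    rows = []
    # pdftotext separates pages with a form feed; treat it as a line break.
    for line in text.replace("\f", "\n").splitlines():
        cells = split_columns(line)
        if cells:
            rows.append(cells)
    return rows


def pdf_to_xlsx(pdf_path, csv_path, xlsx_path):
    """Hypothetical helper: PDF -> text -> CSV -> XLSX."""
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, encoding="utf-8",
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(text_to_rows(result.stdout))
    # ssconvert chooses the output format from the file extension.
    subprocess.run(["ssconvert", csv_path, xlsx_path], check=True)
```

Something like pdf_to_xlsx("report.pdf", "report.csv", "report.xlsx")
would then run the whole chain; expect to adjust split_columns to the
actual layout.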

 
> But if it's images of text, use the tool pdfimages to extract the
> images, and then an OCR tool (maybe tesseract) to obtain the data.
> 
> It might be worth checking if LibreOffice can open a PDF file and
> export it directly to (or save it as) an "Excel"-compatible file,
> either CSV or one of the binary formats (XLS, XLSX).
> 
> Restructuring with some sed / awk / perl might be needed, though.
> Keep in mind those steps can be automated, so if you have lots of
> PDF files, write a simple shell wrapper that processes all of them,
> so you get a bunch of result files without further handholding. :-)
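
A wrapper of that kind could look roughly like this in Python. It is a
sketch, not a tested pipeline: it assumes pdftotext is on the PATH, it
only covers the text-extraction step, and plan_jobs / run_jobs are
hypothetical names chosen for the example.

```python
import subprocess
from pathlib import Path


def plan_jobs(paths):
    """Pair each PDF with a .txt output path, in sorted order.

    Files whose suffix is not .pdf (compared case-insensitively) are skipped.
    """
    pdfs = sorted(Path(p) for p in paths if Path(p).suffix.lower() == ".pdf")
    return [(pdf, pdf.with_suffix(".txt")) for pdf in pdfs]


def run_jobs(jobs):
    """Run pdftotext on every pair and return the PDFs that failed.

    Failures are collected rather than raised, so one bad file does not
    stop the rest of the batch.
    """
    failed = []
    for pdf, txt in jobs:
        proc = subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)])
        if proc.returncode != 0:
            failed.append(pdf)
    return failed


def convert_directory(directory):
    """Process every PDF found directly inside the given directory."""
    return run_jobs(plan_jobs(Path(directory).iterdir()))
```

Calling convert_directory(".") would leave one .txt file next to each
PDF and hand back a list of the files that need a second look.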

-- 
`The greatest obstacle to discovery is not ignorance; 
it is the illusion of knowledge'.
--Daniel J. Boorstin


