Date:      Sat, 23 Jan 2021 05:42:09 +0100
From:      Polytropon <freebsd@edvax.de>
To:        Odhiambo Washington <odhiambo@gmail.com>
Cc:        User Questions <freebsd-questions@freebsd.org>
Subject:   Re: Convert PDF to Excel
Message-ID:  <20210123054209.f03ac420.freebsd@edvax.de>
In-Reply-To: <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com>
References:  <CAAdA2WPoqEaew-OuDwAJ4pTbNUJsAzc2MpZE9di5HrJfGu%2Bexw@mail.gmail.com>

On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> I have a situation where I'd like to convert PDF to XLSX.
> The documents are 35MB and 105MB but contain several thousand pages.
> 
> Does anyone know a good tool that can handle this?

Depends on what is in the PDFs.

If this is rendered text, you can maybe extract the text with
the tool pdftotext and convert it to CSV, then import the CSV
in "Excel".
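
A rough sketch of that route in Python, assuming pdftotext (from
poppler-utils) is installed and that the columns in its -layout output
are separated by two or more spaces. That second part is a guess about
your PDFs, and the function names are made up for illustration:

```python
import csv
import io
import re
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the text it prints.

    -layout keeps the physical column layout; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def text_to_rows(text):
    """Split each non-empty line into cells on runs of 2+ whitespace chars.

    ASSUMPTION: columns are separated by at least two spaces. Real
    documents may need a different rule (fixed offsets, regexes, ...).
    """
    rows = []
    # splitlines() also breaks on the form feeds pdftotext puts between pages
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Something like rows_to_csv(text_to_rows(extract_text("report.pdf")))
would then give you CSV text to write to a file ("report.pdf" being a
placeholder name).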

But if it's images of text, use the tool pdfimages to extract the
images, and then an OCR tool (maybe tesseract) to obtain the data.
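
For the image case, here is a minimal sketch in Python, assuming both
pdfimages (poppler-utils) and tesseract are installed. The helper
names and the "page" file prefix are invented for the example, and OCR
quality will depend heavily on the scans:

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Command that writes the embedded images as <prefix>-NNN.png."""
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    """Command that OCRs one image; "stdout" prints the text instead of
    writing an output file."""
    return ["tesseract", str(image_path), "stdout"]


def image_index(image_path):
    """Numeric part of a name like page-012.png, used for sorting.

    Sorting by number keeps the order right once the counter grows past
    three digits, which matters with several thousand pages.
    """
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, work_dir):
    """Extract all images from the PDF, OCR them in order, return the text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, work / "page"), check=True)
    texts = []
    for image in sorted(work.glob("page-*.png"), key=image_index):
        done = subprocess.run(
            tesseract_command(image),
            check=True, capture_output=True, text=True,
        )
        texts.append(done.stdout)
    return "\n".join(texts)
```

The text that comes back is unstructured, so it still has to be split
into columns before it is of any use in a spreadsheet.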

It might be worth checking if LibreOffice can open a PDF file and
directly export to (or save as) an "Excel"-compatible file, either
CSV or one of the native Excel formats (XLS, XLSX).

Restructuring with some sed / awk / perl might be needed, though.
Keep in mind those steps can be automated, so if you have lots of
PDF files, write a simple shell wrapper that processes all of them,
so you get a bunch of result files without further handholding. :-)
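
Such a wrapper could look roughly like this. It is shown in Python
rather than sh only to keep it testable; the function names and the
directory layout are assumptions, and it covers just the pdftotext
step:

```python
import subprocess
from pathlib import Path


def plan_jobs(pdf_dir, out_dir):
    """Pair every *.pdf in pdf_dir with a .txt target in out_dir.

    PDFs whose target already exists are skipped, so an interrupted
    run can simply be started again.
    """
    jobs = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = Path(out_dir) / (pdf.stem + ".txt")
        if not target.exists():
            jobs.append((pdf, target))
    return jobs


def run_jobs(jobs):
    """Run pdftotext for each job and return the PDFs that failed."""
    failed = []
    for pdf, target in jobs:
        target.parent.mkdir(parents=True, exist_ok=True)
        proc = subprocess.run(["pdftotext", "-layout", str(pdf), str(target)])
        if proc.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling run_jobs(plan_jobs("pdfs", "out")) would process everything in
a directory named "pdfs" (a placeholder) and tell you which files need
a closer look.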



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


