Project description

Turning a PDF dataset into CSV

On the rare occasion that researchers publish an entire dataset as a PDF, as in the case of this research paper, you need a quick tool to turn that PDF into a CSV file.

What happens under the hood

Embedded Node.js runtime

Using nodejs-bin, a Node.js runtime packaged for Python, we can call npx to execute the pdf2json npm package.

When you type

pdfdataprocess mkjson

The npx call is applied to the first PDF found in your current directory. If none is found, the path you pass to the -f flag is used instead.
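The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the package's actual code: the function name choose_pdf and its parameters are hypothetical, and sorting the directory alphabetically is an assumption about what "first" means.

```python
from pathlib import Path
from typing import Optional


def choose_pdf(directory: str = ".", fallback: Optional[str] = None) -> Path:
    """Return the first PDF in `directory`, else the fallback path (hypothetical helper)."""
    # Assumption: "first" means first in alphabetical order; the real tool may differ.
    candidates = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if candidates:
        return candidates[0]
    if fallback is not None:
        # Mirrors the documented behaviour: the -f path is used when no PDF is found.
        return Path(fallback)
    raise FileNotFoundError("no PDF found and no fallback path given")
```

Raising an error when neither source yields a file is a design choice of this sketch; the source does not say what the tool does in that case.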

Parsing the JSON-PDF

Then a rather hacky Python processing step is applied to the JSON-PDF file.

The JSON-PDF is a JSON file containing the content of the PDF.

The python script iterates through the JSON-PDF and extracts the text content of each page.
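As a rough sketch of that iteration, the snippet below walks a parsed JSON-PDF and collects the text of each page. It assumes the layout pdf2json commonly emits (a top-level "Pages" list, each page holding "Texts" blocks whose "R" runs carry percent-encoded text under "T"); some pdf2json versions nest this differently, so treat the keys as assumptions. The function name page_texts is hypothetical.

```python
from urllib.parse import unquote


def page_texts(doc: dict) -> list[list[str]]:
    """Return, for each page, its decoded text blocks in file order."""
    pages = []
    for page in doc.get("Pages", []):
        blocks = []
        for block in page.get("Texts", []):
            # Assumption: each block has one or more runs, with "T" percent-encoded.
            text = "".join(unquote(run.get("T", "")) for run in block.get("R", []))
            blocks.append(text)
        pages.append(blocks)
    return pages


# Tiny hand-written sample in the assumed layout (not real tool output).
sample = {
    "Pages": [
        {"Texts": [
            {"x": 1.0, "y": 2.0, "R": [{"T": "site%20id"}]},
            {"x": 5.0, "y": 2.0, "R": [{"T": "42"}]},
        ]}
    ]
}
print(page_texts(sample))  # [['site id', '42']]
```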

How this is done is quite simple:

  • if your dataset is really a CSV file printed to PDF, then on each page the values of one row sit at the same height (same top entry) and differ only in their horizontal position (left entry), so the data entries have to be regrouped into rows
  • this requires a few layers of pre- and post-processing
  • the final result is a CSV file with the text content of each page
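The regrouping idea in the list above can be illustrated as follows. This is a minimal sketch, not the package's implementation: it assumes each text cell has already been reduced to a dict with "top", "left" and "text" keys (names borrowed from the description above), and the helper names and the rounding precision are hypothetical.

```python
import csv
import io
from collections import defaultdict


def cells_to_rows(cells, precision=2):
    """Group cells sharing the same `top`, then order each group by `left`."""
    rows = defaultdict(list)
    for cell in cells:
        # Rounding absorbs tiny float differences between cells on one line.
        rows[round(cell["top"], precision)].append(cell)
    return [
        [c["text"] for c in sorted(rows[top], key=lambda c: c["left"])]
        for top in sorted(rows)
    ]


def rows_to_csv(rows):
    """Serialise rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Cells deliberately out of order, as they might appear in the JSON-PDF.
cells = [
    {"top": 2.0, "left": 5.0, "text": "42"},
    {"top": 1.0, "left": 1.0, "text": "id"},
    {"top": 2.0, "left": 1.0, "text": "a"},
    {"top": 1.0, "left": 5.0, "text": "value"},
]
print(rows_to_csv(cells_to_rows(cells)))
# id,value
# a,42
```

Real PDFs usually need more than this, for example merging wrapped cells or dropping page headers, which is presumably where the extra pre- and post-processing layers come in.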

Usage

pdfdataprocess mkjson # outputs the json file

then

pdfdataprocess pjson # outputs the csv file
