Turning a PDF dataset into CSV

On the rare occasion that researchers publish an entire dataset as a PDF, as in the case of this research paper, you need a quick tool to turn that PDF into a CSV file.

What happens under the hood

Embedded Node.js runtime

Using nodejs-bin, a Python package that ships a Node.js runtime, we can call npx to execute the pdf2json npm package.
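As a rough illustration of that call, here is a minimal sketch. It assumes nodejs-bin exposes `npx.call` (mirroring `subprocess.call`) and that the pdf2json CLI accepts `-f` for the input file and `-o` for the output directory; both should be checked against the packages' own documentation. The helper names `build_pdf2json_args` and `run_pdf2json` are hypothetical and are not taken from pdfdataprocess.

```python
from pathlib import Path


def build_pdf2json_args(pdf_path: str, out_dir: str = ".") -> list[str]:
    """Build the argument list handed to npx (hypothetical helper)."""
    # Assumption: pdf2json's CLI takes -f <input pdf> and -o <output directory>.
    return ["pdf2json", "-f", str(Path(pdf_path)), "-o", str(Path(out_dir))]


def run_pdf2json(pdf_path: str, out_dir: str = ".") -> int:
    """Run pdf2json through the Node.js runtime bundled by nodejs-bin."""
    # Imported lazily so build_pdf2json_args works without nodejs-bin installed.
    from nodejs import npx

    # Assumption: npx.call behaves like subprocess.call and returns the exit code.
    return npx.call(build_pdf2json_args(pdf_path, out_dir))
```

Calling `run_pdf2json("paper.pdf")` would need network access the first time, since npx has to fetch pdf2json.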

When you type

pdfdataprocess mkjson

the npx call is applied to the first PDF found in your current directory. If none is found, the path you pass to the -f flag is used instead.
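The selection rule described above could look roughly like the sketch below. The function name `pick_pdf` and the alphabetical ordering are assumptions made for illustration; the actual tool may pick the "first" PDF differently.

```python
from pathlib import Path
from typing import Optional


def pick_pdf(directory: str = ".", fallback: Optional[str] = None) -> Path:
    """Return the first PDF in `directory`, else the fallback path (the -f value)."""
    # Sorted so that "first" is deterministic; the real tool's ordering is unknown.
    candidates = sorted(
        p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf"
    )
    if candidates:
        return candidates[0]
    if fallback is not None:
        return Path(fallback)
    raise FileNotFoundError("no PDF in the directory and no -f path given")
```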

Parsing the JSON-PDF

Then some highly hacky Python processing is applied to the JSON-PDF file.

The JSON-PDF is a JSON file containing the content of the PDF.

The Python script iterates through the JSON-PDF and extracts the text content of each page.
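To make the iteration concrete, here is a minimal sketch of walking a pdf2json-style structure. It assumes the layout pdf2json is known to emit: a "Pages" list (nested under "formImage" in older releases), each page holding "Texts" items with "x", "y" and a list of runs "R" whose "T" field is URI-encoded. The function name `iter_page_texts` is hypothetical, and this is not the package's actual code.

```python
from urllib.parse import unquote


def iter_page_texts(pdf_json: dict):
    """Yield (page_index, x, y, text) for every text item in a pdf2json-style dict."""
    # Assumption: pages sit under "Pages"; older output nested them under "formImage".
    pages = pdf_json.get("Pages") or pdf_json.get("formImage", {}).get("Pages", [])
    for page_index, page in enumerate(pages):
        for item in page.get("Texts", []):
            # Each item carries one or more runs in "R"; "T" is URI-encoded text.
            text = "".join(unquote(run["T"]) for run in item.get("R", []))
            yield page_index, item["x"], item["y"], text
```

In practice the dict would come from `json.load` on the file written by the mkjson step.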

How this is done is quite simple:

  • if your dataset is an actual CSV file printed to PDF, then on each page the values of one line share the same height (the top entry) and differ only in horizontal position (the left entry), so the data entries have to be rearranged into rows
  • this requires a few layers of pre- and post-processing
  • the final result is a CSV file with the text content of each page
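The rearrangement described in the list above can be sketched as follows: group text items by height, then order each group from left to right. This is an illustrative sketch only. The `texts_to_csv` name, the `(x, y, text)` input shape and the rounding tolerance are assumptions, and the package's own pre- and post-processing layers are not reproduced here.

```python
import csv
import io
from collections import defaultdict


def texts_to_csv(items, precision: int = 1) -> str:
    """Turn (x, y, text) items from one page into CSV text."""
    rows = defaultdict(list)
    for x, y, text in items:
        # Items at (nearly) the same height belong to the same row.
        rows[round(y, precision)].append((x, text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for y in sorted(rows):
        # Within a row, order the cells from left to right.
        writer.writerow([text for _, text in sorted(rows[y])])
    return out.getvalue()
```

Rounding the height is a crude way to tolerate small vertical jitter; a real dataset may need a different tolerance or explicit column boundaries.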

Usage

pdfdataprocess mkjson # outputs the json file

then

pdfdataprocess pjson # outputs the csv file
