Turning a PDF dataset into CSV

On the rare occasion that researchers publish an entire dataset in PDF format, as in the case of this research paper, you need a quick tool to turn that PDF into a CSV file.

What happens under the hood

Embedded Node.js runtime

Using nodejs-bin, a Python package that provides a Node.js runtime, we can call npx to execute the pdf2json npm package.

When you type

pdfdataprocess mkjson

The npx call is applied to the first PDF found in your directory. If no PDF is found, the path you pass to the -f flag is used instead.
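The selection rule above can be sketched as follows. This is only an illustration, not the package's actual code: the helper names `pick_pdf` and `build_npx_args` are hypothetical, "first" is assumed to mean first by file name, and the `-f`/`-o` arguments passed to pdf2json are an assumption about its command line. The argument list is only built here, not executed.

```python
from pathlib import Path
from typing import List, Optional


def pick_pdf(directory: Path, fallback: Optional[Path] = None) -> Path:
    """Return the first PDF in `directory` (by file name), else `fallback`."""
    pdfs = sorted(p for p in directory.iterdir() if p.suffix.lower() == ".pdf")
    if pdfs:
        return pdfs[0]
    if fallback is not None:
        # Corresponds to the path given with the -f flag.
        return fallback
    raise FileNotFoundError(f"no PDF in {directory} and no fallback path given")


def build_npx_args(pdf: Path, out_dir: Path) -> List[str]:
    """Build the argument list that would be handed to npx (assumed flags)."""
    return ["pdf2json", "-f", str(pdf), "-o", str(out_dir)]
```

The resulting list could then be handed to whatever npx entry point the embedded runtime exposes.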

Parsing the JSON-PDF

Next, a rather hacky Python processing step is applied to the JSON-PDF file.

The JSON-PDF is a JSON file containing the content of the PDF.

The Python script iterates through the JSON-PDF and extracts the text content of each page.

How this is done is quite simple:

  • if your dataset is an actual CSV file that was printed to PDF, then on each page the values of one row sit at the same height (same top entry) but at different horizontal positions (the left entry changes), so the data entries have to be rearranged into rows
  • this requires a few layers of pre- and post-processing
  • the final result is a CSV file with the text content of each page
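The row reconstruction described in the list above can be sketched like this. It is a minimal illustration under stated assumptions, not the package's implementation: each text item is assumed to be a dict with hypothetical keys `top`, `left` and `text`, and items in the same row are assumed to share exactly the same `top` value.

```python
import csv
import io
from collections import defaultdict


def page_to_rows(items):
    """Group items sharing a `top` value into one row, ordered by `left`."""
    by_top = defaultdict(list)
    for item in items:
        by_top[item["top"]].append(item)
    rows = []
    for top in sorted(by_top):  # rows from the top of the page downwards
        ordered = sorted(by_top[top], key=lambda it: it["left"])
        rows.append([it["text"] for it in ordered])
    return rows


def pages_to_csv(pages):
    """Concatenate the rows of every page into a single CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for items in pages:
        writer.writerows(page_to_rows(items))
    return buffer.getvalue()
```

In real extractions the heights of one row may differ slightly, so rounding `top` before grouping is one of the pre-processing layers such a script would likely need.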

Usage

pdfdataprocess mkjson # outputs the json file

then

pdfdataprocess pjson # outputs the csv file

Download files

Source Distribution

pdfdataprocess-0.1.1.tar.gz (3.0 kB, Source)

Built Distribution

pdfdataprocess-0.1.1-py3-none-any.whl (3.9 kB, Python 3)
