Turning a PDF dataset into CSV

On the rare occasions when researchers publish an entire dataset as a PDF, as in the case of this research paper, you need a quick tool to turn that PDF into a CSV file.

What happens under the hood

Embedded Node.js runtime

Using the Node.js runtime bundled in the nodejs-bin Python package, we can call npx to execute the pdf2json npm package.
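
As a rough illustration of that call, the sketch below builds the pdf2json command line and runs it through npx. It is not the package's actual code: it assumes an npx executable on PATH and uses subprocess rather than the runtime bundled in nodejs-bin, and the helper names build_pdf2json_command and run_pdf2json are hypothetical. The -f (input) and -o (output directory) options are those of the pdf2json command-line tool.

```python
import subprocess
from pathlib import Path


def build_pdf2json_command(pdf_path, out_dir="."):
    """Return the argument list for running pdf2json through npx."""
    # -f names the input PDF, -o the directory that receives the JSON file.
    return ["npx", "pdf2json", "-f", str(Path(pdf_path)), "-o", str(Path(out_dir))]


def run_pdf2json(pdf_path, out_dir="."):
    """Run the conversion; raises CalledProcessError if npx exits non-zero."""
    # Assumption: npx is resolved from PATH, not from the nodejs-bin package.
    return subprocess.run(build_pdf2json_command(pdf_path, out_dir), check=True)
```

Keeping the command construction separate from the execution makes it possible to inspect or log the exact call before anything is downloaded or run.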

When you type

pdfdataprocess mkjson

the npx call is applied to the first PDF in your directory. If none is found, the path you pass to the -f flag is used instead.
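
The selection rule described here could look roughly like the following. This is a sketch under assumptions: the function name pick_input_pdf is hypothetical, and "first" is taken to mean first in alphabetical order, which the original text does not specify.

```python
from pathlib import Path


def pick_input_pdf(directory=".", fallback=None):
    """Return the first PDF in `directory`, else the fallback path, else None."""
    # Sorting makes "first" deterministic (alphabetical by file name).
    pdfs = sorted(p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf")
    if pdfs:
        return pdfs[0]
    # No PDF in the directory: use the path given to -f, if any.
    return Path(fallback) if fallback else None
```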

Parsing the JSON-PDF

A rather hacky Python processing step is then applied to the JSON-PDF file.

The JSON-PDF is a JSON file containing the content of the PDF.

The Python script iterates through the JSON-PDF and extracts the text content of each page.

How this is done is quite simple:

if your dataset is really a CSV file printed to PDF, then on each page the values of one row sit at the same height (the top entry) and differ only in their horizontal position (the left entry), so entries sharing a height are collected and rearranged into a row
this requires a few layers of pre- and post-processing
the final result is a CSV file with the text content of each page
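
The row-grouping idea in the list above can be sketched as follows. This is not the package's implementation. It assumes the pdf2json layout in which each entry of "Pages" holds a "Texts" list, and each text has "x" and "y" coordinates plus an "R" list of runs whose "T" field is URI-encoded; the function names page_to_rows and pdf_json_to_csv and the rounding precision are hypothetical choices.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import unquote


def page_to_rows(page, ndigits=1):
    """Group a page's text fragments into rows by their vertical position."""
    lines = defaultdict(list)
    for text in page.get("Texts", []):
        # Fragments printed on the same line share (roughly) the same y value.
        y = round(text["y"], ndigits)
        value = unquote("".join(run["T"] for run in text["R"]))
        lines[y].append((text["x"], value))
    rows = []
    # Top to bottom over lines, then left to right within each line.
    for _, cells in sorted(lines.items()):
        rows.append([value for _, value in sorted(cells, key=lambda c: c[0])])
    return rows


def pdf_json_to_csv(doc):
    """Return CSV text covering every page of a parsed JSON-PDF document."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for page in doc.get("Pages", []):
        writer.writerows(page_to_rows(page))
    return buffer.getvalue()
```

Rounding y is the simplest way to bucket fragments into lines; values that fall just either side of a rounding boundary would land in different rows, which is one reason real inputs need the extra pre- and post-processing mentioned above.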

Usage

pdfdataprocess mkjson # outputs the JSON file

then

pdfdataprocess pjson # outputs the CSV file

Release history

0.1.2 (this version): May 28, 2023
0.1.1: May 28, 2023
0.1.0: May 28, 2023

Download files


Source Distribution

pdfdataprocess-0.1.2.tar.gz (3.0 kB view details)

Uploaded May 28, 2023 Source

Built Distribution


pdfdataprocess-0.1.2-py3-none-any.whl (3.9 kB view details)

Uploaded May 28, 2023 Python 3

File details

Details for the file pdfdataprocess-0.1.2.tar.gz.

File metadata

Download URL: pdfdataprocess-0.1.2.tar.gz
Upload date: May 28, 2023
Size: 3.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.11.3 Linux/6.2.0-20-generic

File hashes

Hashes for pdfdataprocess-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`fcdbe18eab9e6203e234e46241264de15c3aa1c8b37c4a42537ccb8d736e74df`
MD5	`30c869c3429ffad51236828a91ff955d`
BLAKE2b-256	`70734d56b7fcd532e3ea71f76e464507adfaeca7d387d1853aa4f7d74dec71cb`


File details

Details for the file pdfdataprocess-0.1.2-py3-none-any.whl.

File metadata

Download URL: pdfdataprocess-0.1.2-py3-none-any.whl
Upload date: May 28, 2023
Size: 3.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.11.3 Linux/6.2.0-20-generic

File hashes

Hashes for pdfdataprocess-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d17368cc8bb1df5fe2b667388577a639cfacba2cf18bfe2092d28e04d3bf3693`
MD5	`a56615f24d850b57ef262f2da5c801e7`
BLAKE2b-256	`31800328d94cbf650e71610371a1945a3b1147e1578080125559ced2940b9fb9`

