Project description
Turning a PDF dataset into CSV
On the rare occasion when researchers publish an entire dataset in PDF format, as in the case of this research paper, you need a quick tool to turn that PDF into a CSV file.
What happens under the hood
Embedded Node.js runtime
Using nodejs-bin, a Node.js runtime packaged for Python, we are able to call npx to execute the pdf2json npm package.
When you type
pdfdataprocess mkjson
The npx call is applied to the first PDF in your directory. If no PDF is found there, the path you pass to the -f flag is used instead.
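The selection rule above can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper names pick_pdf and build_npx_args are hypothetical, "first PDF" is assumed to mean first in alphabetical order, and the argument list assumes pdf2json's -f input flag. The list would then be handed to npx (for example through nodejs-bin); nothing is executed here.

```python
from pathlib import Path
from typing import List, Optional


def pick_pdf(directory: str = ".", fallback: Optional[str] = None) -> Path:
    """Return the first PDF in `directory`, else the fallback path (hypothetical helper)."""
    # Assumption: "first" means first in alphabetical order.
    pdfs = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if pdfs:
        return pdfs[0]
    if fallback is not None:
        # No PDF in the directory: use the path given to the -f flag.
        return Path(fallback)
    raise FileNotFoundError("no PDF in directory and no -f path given")


def build_npx_args(pdf: Path) -> List[str]:
    """Build the argument list for an npx call to pdf2json (assumed -f input flag)."""
    return ["pdf2json", "-f", str(pdf)]
```

Keeping the file selection separate from the npx call makes the fallback behaviour easy to check without a Node.js runtime installed.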
Parsing the JSON-PDF
Then some highly hacky Python processing is applied to the JSON-PDF file.
The JSON-PDF is a JSON file containing the content of the PDF.
The python script iterates through the JSON-PDF and extracts the text content of each page.
How this is done is quite simple:
- if your dataset is an actual CSV file that was printed to PDF, then on each page you will have, at the same height (top entry), a line of different values (the left entry changes, and the data entries should be rearranged); this requires a few layers of pre- and post-processing
- the final result is a CSV file with the text content of each page
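The row reconstruction described above can be sketched roughly as below. This is an illustrative assumption rather than the package's real implementation: each text entry is modelled as a dict with hypothetical keys top, left and data, entries sharing the same top value form one CSV row, and cells within a row are ordered by left.

```python
import csv
import io
from collections import defaultdict


def entries_to_rows(entries):
    """Group text entries by `top` and order each row by `left` (hypothetical keys)."""
    by_top = defaultdict(list)
    for entry in entries:
        by_top[entry["top"]].append(entry)
    rows = []
    for top in sorted(by_top):  # rows from the top of the page downwards
        line = sorted(by_top[top], key=lambda e: e["left"])  # cells left to right
        rows.append([e["data"] for e in line])
    return rows


def pages_to_csv(pages):
    """Write the rows of every page, in page order, into one CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for entries in pages:
        writer.writerows(entries_to_rows(entries))
    return out.getvalue()


# Entries may arrive in any order; grouping and sorting restores the table.
page = [
    {"top": 2.0, "left": 5.0, "data": "4"},
    {"top": 1.0, "left": 5.0, "data": "y"},
    {"top": 1.0, "left": 1.0, "data": "x"},
    {"top": 2.0, "left": 1.0, "data": "3"},
]
print(pages_to_csv([page]))
```

In a real PDF the top values on one visual line may differ slightly, so rounding them before grouping is likely one of the pre-processing layers that would be needed.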
Usage
pdfdataprocess mkjson # outputs the json file
then
pdfdataprocess pjson # outputs the csv file
File details
Details for the file pdfdataprocess-0.1.2.tar.gz.
File metadata
- Download URL: pdfdataprocess-0.1.2.tar.gz
- Upload date:
- Size: 3.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.11.3 Linux/6.2.0-20-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcdbe18eab9e6203e234e46241264de15c3aa1c8b37c4a42537ccb8d736e74df |
| MD5 | 30c869c3429ffad51236828a91ff955d |
| BLAKE2b-256 | 70734d56b7fcd532e3ea71f76e464507adfaeca7d387d1853aa4f7d74dec71cb |
File details
Details for the file pdfdataprocess-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pdfdataprocess-0.1.2-py3-none-any.whl
- Upload date:
- Size: 3.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.11.3 Linux/6.2.0-20-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d17368cc8bb1df5fe2b667388577a639cfacba2cf18bfe2092d28e04d3bf3693 |
| MD5 | a56615f24d850b57ef262f2da5c801e7 |
| BLAKE2b-256 | 31800328d94cbf650e71610371a1945a3b1147e1578080125559ced2940b9fb9 |