Project description
Turning a PDF dataset into CSV
On the rare occasion that researchers publish an entire dataset in PDF format, as in the case of this research paper, you need a quick tool to turn that PDF into a CSV file.
What happens under the hood
Embedded Node.js runtime
Using nodejs-bin, a Python package that bundles a Node.js runtime, we are able to call npx to execute the pdf2json npm package.
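The call described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual source: the helper names build_pdf2json_args and run_pdf2json are hypothetical, and it assumes that nodejs-bin exposes npx through `from nodejs import npx` and that pdf2json accepts `-f` (input file) and `-o` (output directory).

```python
from pathlib import Path


def build_pdf2json_args(pdf_path, out_dir="."):
    """Build the argument list handed to npx (hypothetical helper).

    Assumes the pdf2json CLI takes -f for the input PDF and -o for the
    output directory.
    """
    return ["pdf2json", "-f", str(Path(pdf_path)), "-o", str(Path(out_dir))]


def run_pdf2json(pdf_path, out_dir="."):
    """Run pdf2json through the Node.js runtime bundled by nodejs-bin."""
    # Imported lazily so build_pdf2json_args stays usable without nodejs-bin.
    from nodejs import npx

    # Assumption: npx.call behaves like subprocess.call and returns the
    # exit status of the npx process.
    return npx.call(build_pdf2json_args(pdf_path, out_dir))
```

Keeping the argument list in its own function makes it possible to inspect the command without actually downloading or running the npm package.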
When you type
pdfdataprocess mkjson
the npx call is applied to the first PDF found in your directory. If no PDF is found there, the path you provide to the -f flag is used instead.
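The selection rule above could look something like the sketch below. The function name resolve_pdf is hypothetical, and "first" is assumed here to mean first in alphabetical order, which the description does not confirm.

```python
from pathlib import Path


def resolve_pdf(directory=".", fallback=None):
    """Pick the PDF to convert (hypothetical helper).

    Returns the first PDF in `directory` (alphabetical order is an
    assumption); otherwise falls back to the path given with -f.
    """
    candidates = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if candidates:
        return candidates[0]
    if fallback is not None:
        return Path(fallback)
    raise FileNotFoundError("no PDF in the directory and no -f path given")
```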
Parsing the JSON-PDF
Then a fairly hacky Python processing step is applied to the JSON-PDF file.
The JSON-PDF is a JSON file containing the content of the PDF.
The Python script iterates through the JSON-PDF and extracts the text content of each page.
How this is done is quite simple:
- if your dataset is an actual CSV file that was printed to PDF, then on each page the values of one row sit at the same height (same top entry) and differ only in horizontal position (left entry), so the data entries have to be rearranged into rows; this requires a few layers of pre- and post-processing
- the final result is a CSV file with the text content of each page
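The core idea in the list above, grouping entries by height and ordering them left to right, can be sketched as follows. This is an illustration under assumptions rather than the package's code: each text entry is assumed to be a dict with top, left and data keys, and the helper names are hypothetical. The pre- and post-processing layers mentioned above are left out.

```python
import csv
import io
from collections import defaultdict


def entries_to_rows(entries):
    """Group entries sharing a 'top' value into rows ordered by 'left'."""
    lines = defaultdict(list)
    for entry in entries:
        lines[entry["top"]].append(entry)

    rows = []
    for top in sorted(lines):  # smaller 'top' is assumed to be higher on the page
        ordered = sorted(lines[top], key=lambda e: e["left"])
        rows.append([e["data"] for e in ordered])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


page = [
    {"top": 10, "left": 50, "data": "b"},
    {"top": 10, "left": 5, "data": "a"},
    {"top": 20, "left": 5, "data": "c"},
    {"top": 20, "left": 50, "data": "d"},
]
print(rows_to_csv(entries_to_rows(page)))
```

Grouping on exact top values only works when every cell of a row reports the same height; a real PDF may need a small tolerance when comparing heights.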
Usage
pdfdataprocess mkjson # outputs the json file
then
pdfdataprocess pjson # outputs the csv file
File details
Details for the file pdfdataprocess-0.1.1.tar.gz.
File metadata
- Download URL: pdfdataprocess-0.1.1.tar.gz
- Upload date:
- Size: 3.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.11.3 Linux/6.2.0-20-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8bb70474e748f3ba06184cd0c602a34c107e814a1871879a2e34a8ec8dc7fca7
MD5 | 9f3ef8c960cddfe544d435f5ed51a929
BLAKE2b-256 | 86abef83518f9a067ff9d57a9cf9d3353666915d3f1e1ab673779dbbffb36d1e
File details
Details for the file pdfdataprocess-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pdfdataprocess-0.1.1-py3-none-any.whl
- Upload date:
- Size: 3.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.11.3 Linux/6.2.0-20-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 77d0d7bdb63500ed8d04516b5497eac12a86428318f2d6d5a063038817d7fc78
MD5 | cc0e7d022a6446eca445c65e8b57cddb
BLAKE2b-256 | ecea394a893043009518e4877660caabdc5329dd120e490e2aef5093e3b989e7