Transforms PDF files into machine readable JSON files
pdf2data
Transforms PDF files into machine-readable JSON files. Extracts tables, figures, text blocks, metadata, and references from scientific papers and documents.
Note: This repository is under active development ahead of an article publication, so some errors are expected. Please report any problems on the issues page.
Installation
From PyPI (recommended)
pip install pdf2data-tools
With optional dependencies
# For the full PDF2Data pipeline (layout detection, OCR, etc.)
pip install "pdf2data-tools[pdf2data_pipeline]"
From source (development)
conda create --name pdf2data python=3.10
conda activate pdf2data
git clone git@github.com:Pocoyo7798/pdf2data.git
cd pdf2data
pip install -e .
Usage
As a library
from pdf2data.pdf2data_pipeline import PDF2Data

pipeline = PDF2Data(
    layout_model="DocLayout-YOLO-DocStructBench",
    input_folder="path/to/pdfs",
    output_folder="path/to/results",
)
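After a run has written its results, the output folder can be inspected with the standard library alone. The sketch below is not part of pdf2data: the helper name `load_results` is hypothetical, and it assumes only that results are stored as files ending in `.json` somewhere under the output folder. It makes no assumption about the schema inside each file.

```python
import json
from pathlib import Path


def load_results(output_folder):
    """Return {relative path: parsed JSON} for every .json file under output_folder.

    Hypothetical helper: it only assumes the results are *.json files.
    """
    root = Path(output_folder)
    results = {}
    for path in sorted(root.rglob("*.json")):
        # Key by path relative to the results folder so that files with the
        # same name in different subfolders do not overwrite each other.
        key = path.relative_to(root).as_posix()
        with path.open(encoding="utf-8") as handle:
            results[key] = json.load(handle)
    return results
```

Calling `load_results("path/to/results")` returns a dictionary that can be explored interactively to see which keys each extractor produced.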
Command line
# Extract tables and figures
pdf2data_block path_to_folder path_to_results
# Extract text
pdf2data_text path_to_folder path_to_results
# Extract metadata
pdf2data_metadata path_to_folder path_to_results
# Extract references
pdf2data_references path_to_folder path_to_results
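The four commands above take the same two arguments, so they are easy to chain from a script. The following is a minimal sketch, not something shipped with the package: `build_commands` and `run_all` are hypothetical names, and it assumes the console scripts are on `PATH` and that the extractors can share one results folder. Use separate folders if their outputs turn out to collide.

```python
import subprocess

# Console scripts listed in this README, in the order shown above.
COMMANDS = (
    "pdf2data_block",
    "pdf2data_text",
    "pdf2data_metadata",
    "pdf2data_references",
)


def build_commands(input_folder, output_folder):
    """Return one argument list per extractor, ready for subprocess.run."""
    return [[name, str(input_folder), str(output_folder)] for name in COMMANDS]


def run_all(input_folder, output_folder):
    """Run every extractor in turn and stop at the first failure."""
    for command in build_commands(input_folder, output_folder):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

For example, `run_all("path_to_folder", "path_to_results")` runs the block, text, metadata, and references extractors one after another.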
Update and Publish (PyPI)
Use this flow when you make changes and want to publish a new package version.
# 1) Bump version in pyproject.toml
# [project]
# version = "0.0.2"
# 2) (Optional) Keep __version__ in sync
# edit pdf2data/__init__.py
# 3) Install/reinstall build tools
python -m pip install --upgrade build twine
# 4) Clean previous artifacts
rm -rf dist build *.egg-info
# 5) Build package
python -m build
# 6) Validate distribution files
python -m twine check dist/*
# 7) Upload to PyPI
python -m twine upload dist/*
When prompted by twine:
- Username: __token__
- Password: your PyPI token (pypi-...)
Verify the release:
pip install --upgrade pdf2data-tools
pip show pdf2data-tools
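Steps 1 and 2 of the publish flow are easy to get out of sync, so a small check before building can help. This is a sketch under stated assumptions, not a script from the repository: the function names are hypothetical, and it assumes the version is a static `version = "..."` line in pyproject.toml and that pdf2data/__init__.py assigns `__version__` as a plain string.

```python
import re
from pathlib import Path


def read_versions(project_root="."):
    """Return (pyproject version, __version__), with None for any value not found."""
    root = Path(project_root)
    pyproject = (root / "pyproject.toml").read_text(encoding="utf-8")
    init = (root / "pdf2data" / "__init__.py").read_text(encoding="utf-8")
    # Assumption: the first line starting with `version =` is the one in [project].
    project_match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject, re.MULTILINE)
    # Assumption: __version__ is a simple string literal assignment.
    init_match = re.search(
        r"""^__version__\s*=\s*["']([^"']+)["']""", init, re.MULTILINE
    )
    return (
        project_match.group(1) if project_match else None,
        init_match.group(1) if init_match else None,
    )


def versions_in_sync(project_root="."):
    """True only if both versions were found and are identical."""
    project_version, init_version = read_versions(project_root)
    return project_version is not None and project_version == init_version
```

Running `versions_in_sync()` from the repository root before `python -m build` flags a forgotten bump in either file.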
License
Apache Software License 2.0
Download files
File details
Details for the file pdf2data_tools-0.1.1.tar.gz.
File metadata
- Download URL: pdf2data_tools-0.1.1.tar.gz
- Upload date:
- Size: 72.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 40da823fbc3ff820c27bcb5039931acc6a23553044624131f78b7dea14fc0222 |
| MD5 | 51e9e8f0ea7cec31212ff2b5a21b9b62 |
| BLAKE2b-256 | 165632caaa8f8578e9d67f91d08347e5eecf6c5ebcfaaeadc80a3a11184fde84 |
File details
Details for the file pdf2data_tools-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pdf2data_tools-0.1.1-py3-none-any.whl
- Upload date:
- Size: 79.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 136d0da2ab293500092af8bd50f55b347398b660e5c9628766f635413cca6269 |
| MD5 | a5ae50934bd5124bf7287ac3c1fa3368 |
| BLAKE2b-256 | 4eaee0165a4c13bbd88283d6715b8e344ec8b329d8d5e0e06ebb890332b232f3 |
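The SHA256 digests listed for each file can be used to confirm that a manually downloaded distribution is intact. The sketch below uses only the standard library; the function names are hypothetical, and the file path is whatever local path the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a local file against a published SHA256 digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches_published_hash("pdf2data_tools-0.1.1.tar.gz", "40da823fbc3ff820c27bcb5039931acc6a23553044624131f78b7dea14fc0222")`.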