pdf2data

Transforms PDF files into machine-readable JSON files. Extracts tables, figures, text blocks, metadata, and references from scientific papers and documents.

Note: The repository is under active development ahead of an article publication, so some errors are expected. Please report any problems on the issues page.

Installation

From PyPI (recommended)

pip install pdf2data-tools

With optional dependencies

# For the full PDF2Data pipeline (layout detection, OCR, etc.)
# Quotes keep shells such as zsh from expanding the square brackets
pip install "pdf2data-tools[pdf2data_pipeline]"

From source (development)

conda create --name pdf2data python=3.10
conda activate pdf2data
git clone git@github.com:Pocoyo7798/pdf2data.git
cd pdf2data
pip install -e .

Usage

As a library

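from pdf2data.pdf2data_pipeline import PDF2Data

# Configure the pipeline with a layout-detection model and the input/output folders
pipeline = PDF2Data(
    layout_model="DocLayout-YOLO-DocStructBench",
    input_folder="path/to/pdfs",      # folder containing the PDF files
    output_folder="path/to/results",  # folder where results are written
)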
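
Once results have been written to the output folder, a quick look at what was produced can help. The following is a minimal sketch, not part of pdf2data: it assumes the results are files ending in `.json` somewhere under `output_folder`, and it makes no assumption about their schema. The helper name `summarize_outputs` is hypothetical.

```python
import json
from pathlib import Path


def summarize_outputs(output_folder):
    """Map each JSON file under output_folder to a short description of its top level.

    Assumption: results are stored as *.json files (possibly in subfolders).
    """
    root = Path(output_folder)
    summary = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        key = path.relative_to(root).as_posix()
        if isinstance(data, dict):
            # Report only the top-level keys; the schema itself is not assumed.
            summary[key] = sorted(data.keys())
        else:
            # For lists or scalars, report the Python type name instead.
            summary[key] = type(data).__name__
    return summary
```

Calling `summarize_outputs("path/to/results")` returns a dictionary keyed by relative file path, which is easy to print or compare between runs.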

Command line

# Extract tables and figures
pdf2data_block path_to_folder path_to_results

# Extract text
pdf2data_text path_to_folder path_to_results

# Extract metadata
pdf2data_metadata path_to_folder path_to_results

# Extract references
pdf2data_references path_to_folder path_to_results
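
The four commands above can also be chained from Python. This is a minimal sketch using only the standard library; it assumes the console scripts are on `PATH` after installation and that all four may safely write to the same results folder, which the README does not state. The helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess

# Console scripts listed above; each takes an input folder and a results folder.
COMMANDS = (
    "pdf2data_block",
    "pdf2data_text",
    "pdf2data_metadata",
    "pdf2data_references",
)


def build_commands(input_folder, output_folder):
    """Return one argument list per extractor, in the order listed above."""
    return [[name, str(input_folder), str(output_folder)] for name in COMMANDS]


def run_all(input_folder, output_folder):
    """Run each extractor in turn, stopping at the first failure."""
    for argv in build_commands(input_folder, output_folder):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, check=True)
```

For example, `run_all("path_to_folder", "path_to_results")` would invoke the four extractors one after another.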

Update and Publish (PyPI)

Use this flow when you make changes and want to publish a new package version.

# 1) Bump version in pyproject.toml
# [project]
# version = "0.0.2"

# 2) (Optional) Keep __version__ in sync
# edit pdf2data/__init__.py

# 3) Install/reinstall build tools
python -m pip install --upgrade build twine

# 4) Clean previous artifacts
rm -rf dist build *.egg-info

# 5) Build package
python -m build

# 6) Validate distribution files
python -m twine check dist/*

# 7) Upload to PyPI
python -m twine upload dist/*

When prompted by twine:

  • Username: __token__
  • Password: your PyPI token (pypi-...)

Verify the release:

pip install --upgrade pdf2data-tools
pip show pdf2data-tools
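
Steps 1 and 2 of the flow above are easy to get out of sync. The following sketch compares the two version strings before building. It is an illustration only: it uses a simple regular expression rather than a TOML parser, takes the first `version = "..."` line it finds in `pyproject.toml`, and assumes `pdf2data/__init__.py` defines `__version__` as a plain string literal. The function names are hypothetical.

```python
import re

# Matches lines such as: version = "0.0.2"  or  __version__ = "0.0.2"
VERSION_RE = re.compile(
    r'^\s*(?:__version__|version)\s*=\s*["\']([^"\']+)["\']',
    re.MULTILINE,
)


def extract_version(text):
    """Return the first version string found in text, or None if there is none."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_in_sync(pyproject_text, init_text):
    """True when both texts declare a version and the two values are identical."""
    declared = extract_version(pyproject_text)
    return declared is not None and declared == extract_version(init_text)
```

In practice the two arguments would come from `Path("pyproject.toml").read_text()` and `Path("pdf2data/__init__.py").read_text()`.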

License

Apache Software License 2.0

Download files

Source Distribution

pdf2data_tools-0.0.4.tar.gz (69.7 kB)

Uploaded Source

Built Distribution

pdf2data_tools-0.0.4-py3-none-any.whl (76.2 kB)

Uploaded Python 3

File details

Details for the file pdf2data_tools-0.0.4.tar.gz.

File metadata

  • Download URL: pdf2data_tools-0.0.4.tar.gz
  • Upload date:
  • Size: 69.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for pdf2data_tools-0.0.4.tar.gz
Algorithm Hash digest
SHA256 d69028eee0b4b040344d3e614df4353d255568f57263a9d928e383c98740cab4
MD5 f1ed66f9dca32bfb7a94cfe76f7af267
BLAKE2b-256 25f3ca9d3cb62abac661e4ad94ec12a3b480c9d6fae0562c0eed9a551d947ba3

File details

Details for the file pdf2data_tools-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: pdf2data_tools-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 76.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for pdf2data_tools-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 2fa3896bc06af5b215b46d43a31b5280f6ba0ab8a15d6b791560344b9780a0e7
MD5 7fdfb6d6ea6037f569d09ed557051647
BLAKE2b-256 c10c6421ddbe1182d057eec3ff63142fcc9a5bd9de218c09252b11a5e0b78dfa
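
A downloaded file can be checked against the SHA256 digests listed above. This is a minimal sketch using only the standard library `hashlib` module; the helper names `sha256_of` and `matches_published` are hypothetical, and the file is assumed to have been downloaded to a local path already.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the wheel, the expected value would be the SHA256 entry above, for example `matches_published("pdf2data_tools-0.0.4-py3-none-any.whl", "2fa3896bc06af5b215b46d43a31b5280f6ba0ab8a15d6b791560344b9780a0e7")`.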
