Transforms PDF files into machine-readable JSON files

pdf2data

Transforms PDF files into machine-readable JSON files. Extracts tables, figures, text blocks, metadata, and references from scientific papers and documents.

Note: This repository is under active development in preparation for an article publication, so some errors are expected. Please report any problems on the issues page.

Installation

From PyPI (recommended)

pip install pdf2data-tools

With optional dependencies

# For the full PDF2Data pipeline (layout detection, OCR, etc.)
pip install "pdf2data-tools[pdf2data_pipeline]"

From source (development)

conda create --name pdf2data python=3.10
conda activate pdf2data
git clone git@github.com:Pocoyo7798/pdf2data.git
cd pdf2data
pip install -e .
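
To confirm that the install landed in the active environment, a quick check against the package metadata can help. This is a minimal sketch using only the standard library; the distribution name `pdf2data-tools` is taken from the install command above, and the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pdf2data-tools")
    if version is None:
        print("pdf2data-tools is not installed in this environment")
    else:
        print(f"pdf2data-tools {version} is installed")
```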

Usage

As a library

from pdf2data.pdf2data_pipeline import PDF2Data

pipeline = PDF2Data(
    layout_model="DocLayout-YOLO-DocStructBench",
    input_folder="path/to/pdfs",
    output_folder="path/to/results",
)
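
The layout of the JSON written to the output folder is not documented here, so the sketch below makes no assumption about the schema. It only assumes that results appear as `.json` files somewhere under the output folder, and it reports the top-level shape of each one. The helper name `summarize_results` is hypothetical and not part of pdf2data.

```python
import json
from pathlib import Path


def summarize_results(output_folder):
    """Map each .json file under output_folder to a description of its top level."""
    root = Path(output_folder)
    if not root.is_dir():
        return {}
    summary = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        key = path.relative_to(root).as_posix()
        if isinstance(data, dict):
            summary[key] = f"object with keys {sorted(data)}"
        elif isinstance(data, list):
            summary[key] = f"array with {len(data)} items"
        else:
            summary[key] = f"scalar of type {type(data).__name__}"
    return summary


if __name__ == "__main__":
    for name, description in summarize_results("path/to/results").items():
        print(f"{name}: {description}")
```

Listing the top-level keys first is a cheap way to learn the actual structure before writing code that depends on particular fields.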

Command line

# Extract tables and figures
pdf2data_block path_to_folder path_to_results

# Extract text
pdf2data_text path_to_folder path_to_results

# Extract metadata
pdf2data_metadata path_to_folder path_to_results

# Extract references
pdf2data_references path_to_folder path_to_results
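
To run all four extractors over the same folder from Python, the commands above can be wrapped with `subprocess`. This is a sketch that assumes each command takes exactly the two positional arguments shown and that the commands are on `PATH` after installation; the names `build_commands` and `run_all` are illustrative and not part of pdf2data.

```python
import shutil
import subprocess

# Command names as listed above, with what each one extracts.
COMMANDS = {
    "pdf2data_block": "tables and figures",
    "pdf2data_text": "text",
    "pdf2data_metadata": "metadata",
    "pdf2data_references": "references",
}


def build_commands(input_folder, output_folder):
    """Return one argument list per extractor, in the order listed above."""
    return [[name, str(input_folder), str(output_folder)] for name in COMMANDS]


def run_all(input_folder, output_folder):
    """Run each extractor in turn and return a {command: exit code} mapping."""
    exit_codes = {}
    for argv in build_commands(input_folder, output_folder):
        if shutil.which(argv[0]) is None:
            # Not on PATH: the package is probably not installed here.
            exit_codes[argv[0]] = None
            continue
        exit_codes[argv[0]] = subprocess.run(argv, check=False).returncode
    return exit_codes
```

Calling `run_all("path_to_folder", "path_to_results")` returns a dictionary keyed by command name, with `None` for any command that could not be found.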

License

Apache Software License 2.0

Download files

Source Distribution

pdf2data_tools-0.0.1.tar.gz (64.9 kB)

Uploaded Source

Built Distribution

pdf2data_tools-0.0.1-py2.py3-none-any.whl (65.2 kB)

Uploaded Python 2, Python 3

File details

Details for the file pdf2data_tools-0.0.1.tar.gz.

File metadata

  • Download URL: pdf2data_tools-0.0.1.tar.gz
  • Size: 64.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for pdf2data_tools-0.0.1.tar.gz
Algorithm Hash digest
SHA256 ccbf036a2690b9f81cc0e811cb4232bf9210588f77915c9b2f2cf8ed2cfe8bf4
MD5 2e2ef955984a961534dd017df18f2a0c
BLAKE2b-256 03d453e27188d9256951c7fab82bcb0b4689ddb425323c070a934c29245c4faf
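
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names `sha256_of` and `matches` are made up for illustration, and the file path in the usage note is assumed to be wherever the source distribution was saved.

```python
import hashlib

# SHA256 digest listed above for pdf2data_tools-0.0.1.tar.gz.
EXPECTED_SHA256 = "ccbf036a2690b9f81cc0e811cb4232bf9210588f77915c9b2f2cf8ed2cfe8bf4"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches("pdf2data_tools-0.0.1.tar.gz")` should return `True` for an intact download of the source distribution; a `False` result suggests a corrupted or different file.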

File details

Details for the file pdf2data_tools-0.0.1-py2.py3-none-any.whl.

File hashes

Hashes for pdf2data_tools-0.0.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 60b6dc34151185dae9b63b1176ca6552db8775bcf21635a5eb2f3538d8998301
MD5 b52d73c540d146209128db0e0cf79008
BLAKE2b-256 a13118824280ba103c8bcf91a732eca3b1bb420d5597aedc0e045d0132bbf563
