pdf2struct: extract structured JSON from PDFs (text, metadata, tables, OCR, invoice key-value fields).

pdf2struct

pdf2struct extracts structured JSON from PDF documents.

Supports:

  • text extraction
  • metadata extraction
  • table extraction
  • OCR fallback (optional)
  • invoice-like key-value extraction

Install

pip install pdf2struct

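To enable the optional OCR fallback, install the ocr extra (quoted so that shells such as zsh do not expand the brackets):

pip install "pdf2struct[ocr]"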

CLI

Extract a PDF to a JSON file:

pdf2struct input.pdf --out output.json

Extract with the OCR fallback enabled:

pdf2struct input.pdf --ocr --out output.json
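
For batch work, the two commands above can be driven from a script. The following is a minimal sketch, not part of pdf2struct itself: it shells out to the CLI using only the documented `--out` and `--ocr` flags. The helper names `build_command` and `convert_folder` are hypothetical, and the sketch assumes the `pdf2struct` executable is on PATH.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, out: Path, ocr: bool = False) -> list[str]:
    """Build the argv for one pdf2struct call, using only the documented flags."""
    cmd = ["pdf2struct", str(pdf)]
    if ocr:
        cmd.append("--ocr")
    cmd += ["--out", str(out)]
    return cmd


def convert_folder(folder: Path, ocr: bool = False) -> list[Path]:
    """Run pdf2struct on every *.pdf in `folder`, writing <name>.json beside it.

    Assumes the pdf2struct executable is installed and on PATH.
    """
    written = []
    for pdf in sorted(folder.glob("*.pdf")):
        out = pdf.with_suffix(".json")
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_command(pdf, out, ocr), check=True)
        written.append(out)
    return written


if __name__ == "__main__":
    # Only prints the command; call convert_folder(Path("some_dir")) to run it.
    print(build_command(Path("input.pdf"), Path("output.json"), ocr=True))
```

Building the argument list separately from running it keeps the command easy to inspect or log before any PDF is processed.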

Output JSON structure

  • metadata
  • pages (text + tables)
  • detected_fields (invoice key-value pairs)
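
As an illustration of consuming this structure, here is a small sketch that summarizes a parsed document. Only the three top-level keys come from the list above. The per-page `text` and `tables` key names, the shape of `detected_fields` as a mapping, and the sample values are assumptions made for the example, so check them against a real output file before relying on them.

```python
import json
from pathlib import Path

# Hypothetical sample mirroring the three documented top-level keys.
# The per-page "text" and "tables" key names are assumptions.
SAMPLE = {
    "metadata": {"title": "Invoice 1001"},
    "pages": [
        {"text": "Invoice 1001", "tables": [[["Item", "Price"], ["Widget", "42.00"]]]},
        {"text": "Thank you", "tables": []},
    ],
    "detected_fields": {"invoice_number": "1001", "total": "42.00"},
}


def load(path: Path) -> dict:
    """Load a JSON file such as the one written by `--out output.json`."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(doc: dict) -> dict:
    """Count pages, tables, and text characters, and copy the detected fields."""
    pages = doc.get("pages", [])
    return {
        "page_count": len(pages),
        "table_count": sum(len(p.get("tables", [])) for p in pages),
        "char_count": sum(len(p.get("text", "")) for p in pages),
        "fields": dict(doc.get("detected_fields", {})),
    }


if __name__ == "__main__":
    print(json.dumps(summarize(SAMPLE), indent=2))
```

Using `.get` with defaults keeps the summary from failing on documents where a key is absent, which is a defensive choice rather than documented behavior.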

License

MIT

Download files

Source Distribution

pdf2struct-0.2.0.tar.gz (4.0 kB)

Uploaded Source

Built Distribution

pdf2struct-0.2.0-py3-none-any.whl (4.9 kB)

Uploaded Python 3

File details

Details for the file pdf2struct-0.2.0.tar.gz.

File metadata

  • Download URL: pdf2struct-0.2.0.tar.gz
  • Upload date:
  • Size: 4.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf2struct-0.2.0.tar.gz
Algorithm Hash digest
SHA256 d03b17dddff747d3a75e95d26f1dc906c6444a923f22e9714d0cd227cdb69c5f
MD5 bf4980cf58022dd58a3d7b82bf44c6ec
BLAKE2b-256 3a9073fbb4fc43c1f788b368470d84682e3fa1fc0ffcb4d84ed48d129c9965f1
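
A downloaded copy of the source distribution can be checked against the SHA256 digest listed above. The sketch below uses only the Python standard library. The local file path is an assumption, and the helper names `sha256_of` and `matches` are hypothetical.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for pdf2struct-0.2.0.tar.gz.
EXPECTED_SHA256 = "d03b17dddff747d3a75e95d26f1dc906c6444a923f22e9714d0cd227cdb69c5f"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the sdist was saved.
    sdist = Path("pdf2struct-0.2.0.tar.gz")
    if sdist.exists():
        print("OK" if matches(sdist, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{sdist} not found; download it first")
```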

Provenance

The following attestation bundles were made for pdf2struct-0.2.0.tar.gz:

Publisher: publish.yml on Kubenew/pdf2struct

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pdf2struct-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: pdf2struct-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 4.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf2struct-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6397c71eb4863fdff1872483ba9ccfc083cc8963d2e863a120cefad56b57dcf9
MD5 501911858474faac707b83df862c541b
BLAKE2b-256 c571949157e2b799e55e8443454fbde2af251d9684ac1fade9efd20e9aac7fb7

Provenance

The following attestation bundles were made for pdf2struct-0.2.0-py3-none-any.whl:

Publisher: publish.yml on Kubenew/pdf2struct

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
