LangChain document loader for pdfmux — self-healing PDF extraction

langchain-pdfmux

LangChain document loader for pdfmux -- self-healing PDF extraction for RAG pipelines.

Why pdfmux?

Most PDF loaders rely on a single extraction method and can fail silently on complex layouts. pdfmux automatically routes each page through the extraction pipeline best suited to it:

  • Smart routing -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
  • Confidence scoring -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
  • Self-healing -- retries with alternative extractors when the primary one returns low-quality output
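The self-healing idea above can be pictured as a fallback chain. The sketch below is only an illustration of that general pattern, not pdfmux's internal implementation: `extract_with_fallback`, the two toy extractors, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical extractor signature: raw page bytes in, (text, confidence) out.
Extractor = Callable[[bytes], Tuple[str, float]]


def extract_with_fallback(
    page: bytes,
    extractors: List[Extractor],
    min_confidence: float = 0.8,  # assumed threshold, not a pdfmux default
) -> Tuple[str, float]:
    """Try extractors in order; return the first result that meets the
    threshold, otherwise the highest-confidence result seen."""
    if not extractors:
        raise ValueError("at least one extractor is required")
    best: Tuple[str, float] = ("", -1.0)
    for extractor in extractors:
        text, confidence = extractor(page)
        if confidence >= min_confidence:
            return text, confidence
        if confidence > best[1]:
            best = (text, confidence)
    return best


# Toy extractors standing in for real parsers (illustrative only).
def fast_text_layer(page: bytes) -> Tuple[str, float]:
    return "Qu4rt3rly r3sults", 0.41


def ocr_pass(page: bytes) -> Tuple[str, float]:
    return "Quarterly results", 0.93


text, confidence = extract_with_fallback(b"%PDF-", [fast_text_layer, ocr_pass])
print(text, confidence)  # Quarterly results 0.93
```

If no extractor reaches the threshold, the sketch returns the best attempt rather than raising, so a downstream consumer can still decide what to do with a low-confidence result.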

Install

pip install langchain-pdfmux

Usage

from langchain_pdfmux import PDFMuxLoader

docs = PDFMuxLoader("report.pdf").load()

Each Document includes metadata with extraction quality signals:

loader = PDFMuxLoader("report.pdf", quality="high")
for doc in loader.lazy_load():
    print(doc.metadata)
    # {
    #   "source": "report.pdf",
    #   "title": "Q4 Results",
    #   "page_start": 1,
    #   "page_end": 3,
    #   "tokens": 820,
    #   "confidence": 0.94
    # }
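One way to use the "confidence" field shown above is to drop low-confidence documents before indexing and rank the rest. This is a minimal sketch under stated assumptions: `Doc` is a stand-in dataclass that mirrors the `page_content`/`metadata` shape of a LangChain Document so the example runs without any dependencies, and `filter_by_confidence` and its 0.8 threshold are hypothetical, not part of langchain-pdfmux.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def confidence_of(doc: Doc) -> float:
    # Treat a missing score as 0.0 so unscored documents sort last.
    return float(doc.metadata.get("confidence", 0.0))


def filter_by_confidence(docs: Iterable[Doc], threshold: float = 0.8) -> List[Doc]:
    """Keep documents at or above the threshold, highest confidence first."""
    kept = [d for d in docs if confidence_of(d) >= threshold]
    return sorted(kept, key=confidence_of, reverse=True)


docs = [
    Doc("Revenue grew 12%.", {"source": "report.pdf", "confidence": 0.94}),
    Doc("R3v3nu3 t@ble", {"source": "report.pdf", "confidence": 0.35}),
    Doc("Outlook for next year.", {"source": "report.pdf", "confidence": 0.99}),
]

trusted = filter_by_confidence(docs)
print([d.metadata["confidence"] for d in trusted])  # [0.99, 0.94]
```

With real loader output, the same function should work on the documents yielded by `lazy_load()`, provided each one carries a numeric "confidence" entry in its metadata.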

Options

# Quality presets: "fast", "standard" (default), "high"
loader = PDFMuxLoader("report.pdf", quality="high")

# Load all PDFs in a directory
loader = PDFMuxLoader("./papers/")

# Custom glob pattern
loader = PDFMuxLoader("./papers/", glob="**/*.pdf")

# Stream documents one at a time with lazy_load
# (process is a placeholder for your own handler)
for doc in PDFMuxLoader("large.pdf").lazy_load():
    process(doc)
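When streaming with `lazy_load`, it is often convenient to hand documents to an embedder or vector store in small batches rather than one by one. The helper below is a generic sketch using only the standard library; `batched` and `fake_lazy_load` are hypothetical names, and the generator simply stands in for `PDFMuxLoader(...).lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load() -> Iterator[str]:
    # Stand-in for a lazy document stream; yields five placeholder chunks.
    for i in range(5):
        yield f"chunk-{i}"


batches = list(batched(fake_lazy_load(), 2))
print(batches)
# [['chunk-0', 'chunk-1'], ['chunk-2', 'chunk-3'], ['chunk-4']]
```

Only one batch is held in memory at a time when you iterate over `batched(...)` directly; the `list(...)` call here is just to show the result.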

License

MIT
