llama-index-readers-pdfmux

LlamaIndex reader for pdfmux -- self-healing PDF extraction for RAG pipelines.

Why pdfmux?

Many PDF loaders rely on a single extraction method, which can fail silently on complex layouts. pdfmux instead routes each page through the extraction pipeline best suited to it:

  • Smart routing -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
  • Confidence scoring -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
  • Self-healing -- retries with alternative extractors when the primary one returns low-quality output
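
The three ideas above can be illustrated with a small, library-independent sketch. This is not pdfmux's implementation: the two extractor functions, the `score_confidence` heuristic, and the 0.8 threshold are hypothetical stand-ins, chosen only to show the general pattern of trying an extractor, scoring its output, and falling back when the score is low.

```python
from typing import Callable, List, Tuple

# Hypothetical extractor type: takes a page (a plain string here) and returns text.
Extractor = Callable[[str], str]


def fast_text_extractor(page: str) -> str:
    """Stand-in for a cheap text-layer parser; returns nothing for 'scanned' pages."""
    return "" if page.startswith("SCANNED:") else page


def ocr_extractor(page: str) -> str:
    """Stand-in for a slower OCR fallback that can read 'scanned' pages."""
    prefix = "SCANNED:"
    return page[len(prefix):] if page.startswith(prefix) else page


def score_confidence(text: str) -> float:
    """Toy heuristic: fraction of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    return readable / len(text)


def extract_page(
    page: str, extractors: List[Extractor], threshold: float = 0.8
) -> Tuple[str, float]:
    """Try extractors in order; stop once one clears the threshold."""
    best_text, best_score = "", 0.0
    for extractor in extractors:
        text = extractor(page)
        score = score_confidence(text)
        if score > best_score:
            best_text, best_score = text, score
        if best_score >= threshold:
            break  # good enough, no retry needed
    return best_text, best_score


pages = ["Revenue grew 12 percent", "SCANNED:Net income fell"]
results = [extract_page(p, [fast_text_extractor, ocr_extractor]) for p in pages]
for text, score in results:
    print(f"{score:.2f}  {text}")
```

In this toy setup the first page is handled by the cheap extractor alone, while the second only yields text after the fallback runs. A real router would pick the first extractor per page type rather than always starting with the same one.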

Install

pip install llama-index-readers-pdfmux

Usage

from llama_index_readers_pdfmux import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")

Each Document includes metadata with extraction quality signals:

reader = PDFMuxReader(quality="high")
for doc in reader.load_data("report.pdf"):
    print(doc.metadata)
    # {
    #   "source": "report.pdf",
    #   "title": "Q4 Results",
    #   "page_start": 1,
    #   "page_end": 3,
    #   "tokens": 820,
    #   "confidence": 0.94
    # }

Options

# Quality presets: "fast", "standard" (default), "high"
reader = PDFMuxReader(quality="high")

# Load all PDFs in a directory
docs = reader.load_data("./papers/")

# Custom glob pattern
reader = PDFMuxReader(glob="**/*.pdf")
docs = reader.load_data("./papers/")

# Attach extra metadata
docs = reader.load_data("report.pdf", extra_info={"project": "Q4 analysis"})
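
To preview which files a pattern such as `**/*.pdf` would select, the standard library's `pathlib` can be used on its own. This sketch assumes the reader's `glob` option follows `pathlib`-style pattern semantics, which the description above does not confirm; `find_pdfs` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def find_pdfs(root: Union[str, Path], pattern: str = "**/*.pdf") -> List[str]:
    """Return matching file paths relative to root, sorted, in POSIX form."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


# Build a throwaway directory tree to demonstrate the two patterns.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "deep").mkdir(parents=True)
    for name in ("a.pdf", "notes.txt", "sub/b.pdf", "sub/deep/c.pdf"):
        (root / name).write_bytes(b"")

    recursive = find_pdfs(root, "**/*.pdf")  # descends into subdirectories
    top_level = find_pdfs(root, "*.pdf")     # root directory only

print(recursive)
print(top_level)
```

Under `pathlib` rules, `**/` matches zero or more directory levels, so the recursive pattern also picks up PDFs directly inside the root.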

With LlamaIndex pipelines

from llama_index.core import VectorStoreIndex
from llama_index_readers_pdfmux import PDFMuxReader

reader = PDFMuxReader(quality="high")
docs = reader.load_data("./papers/")

# Filter low-confidence chunks (chunks with no confidence score are dropped too)
docs = [d for d in docs if d.metadata.get("confidence", 0.0) > 0.8]

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What were the key findings?")

License

MIT
