LlamaIndex reader for OpenDataLoader PDF — fast, accurate, local PDF extraction

opendataloader-pdf-llamaindex

LlamaIndex reader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.

For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.

Features

  • Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
  • Table extraction — Preserves table structure in output
  • Multiple formats — Text, Markdown, JSON (with bounding boxes), HTML
  • Per-page splitting — Each page becomes a separate Document with page number metadata
  • AI safety — Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
  • 100% local — No cloud APIs, your documents never leave your machine
  • Fast — Rule-based extraction, no GPU required

Requirements

  • Python >= 3.10
  • Java 11+ available on system PATH

Verify Java is installed:

java -version
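
The same check can be run from Python before constructing a reader. This is a minimal sketch using only the standard library; `java_on_path` is a hypothetical helper name and is not part of this package.

```python
import shutil


def java_on_path(executable: str = "java") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_on_path():
        print("Java found on PATH")
    else:
        print("Java not found; install Java 11+ and add it to PATH")
```

Note that this only confirms a `java` executable is reachable. It does not check the version, so `java -version` is still the way to confirm Java 11 or later.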

Installation

pip install -U opendataloader-pdf-llamaindex

Quick Start

from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

reader = OpenDataLoaderPDFReader(format="text")
documents = reader.load_data(file_path="document.pdf")

print(documents[0].text)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

SimpleDirectoryReader Integration

Use with LlamaIndex's SimpleDirectoryReader via the file_extractor parameter:

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

reader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
)
documents = reader.load_data()

Usage Examples

Output Formats

from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

# Plain text (default) — best for simple RAG
reader = OpenDataLoaderPDFReader(format="text")

# Markdown — preserves headings, lists, tables
reader = OpenDataLoaderPDFReader(format="markdown")

# JSON — structured data with bounding boxes for source citations
reader = OpenDataLoaderPDFReader(format="json")

# HTML — styled output
reader = OpenDataLoaderPDFReader(format="html")

Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

reader = OpenDataLoaderPDFReader(use_struct_tree=True)

Table Detection

reader = OpenDataLoaderPDFReader(
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

Sensitive Data Sanitization

reader = OpenDataLoaderPDFReader(sanitize=True)
# Replaces emails, phone numbers, IP addresses, credit card numbers, and URLs with placeholders

Page Selection

reader = OpenDataLoaderPDFReader(pages="1,3,5-7")
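
To make the page string easier to read, the sketch below expands a spec such as "1,3,5-7" into explicit page numbers. It assumes pages are 1-based and ranges are inclusive; that is an assumption for illustration, and `expand_pages` is a hypothetical helper, not the parser the library uses.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a spec like "1,3,5-7" into [1, 3, 5, 6, 7].

    Assumes 1-based page numbers and inclusive ranges (illustrative only).
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```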

Headers and Footers

reader = OpenDataLoaderPDFReader(include_header_footer=True)

Password-Protected PDFs

reader = OpenDataLoaderPDFReader(password="secret")

Image Handling

# Embed images as Base64 in output
reader = OpenDataLoaderPDFReader(image_output="embedded")

# Save images to external files
reader = OpenDataLoaderPDFReader(
    image_output="external",
    image_dir="./extracted_images"
)

Hybrid AI Mode

For higher accuracy on complex documents (requires a running hybrid backend):

reader = OpenDataLoaderPDFReader(
    hybrid="docling-fast",
    hybrid_fallback=True  # Fall back to the local Java engine if the backend fails
)

RAG Pipeline Example

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

# Load PDFs
reader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
)
documents = reader.load_data()

# Build index and query
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What are the key findings?")
print(response)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | str | "text" | Output format: "text", "markdown", "json", "html" |
| split_pages | bool | True | Split output into separate Documents per page |
| quiet | bool | False | Suppress CLI logging output |
| content_safety_off | list[str] | None | Safety filters to disable: "all", "hidden-text", "off-page", "tiny", "hidden-ocg" |
| password | str | None | Password for encrypted PDFs |
| keep_line_breaks | bool | False | Preserve original line breaks |
| replace_invalid_chars | str | None | Replacement for unrecognized characters |
| use_struct_tree | bool | False | Use PDF structure tree (tagged PDFs) |
| table_method | str | None | "default" (border-based) or "cluster" (border + cluster) |
| reading_order | str | None | "off" or "xycut" (default when not specified) |
| image_output | str | "off" | "off", "embedded" (Base64), "external" (files) |
| image_format | str | None | "png" or "jpeg" |
| image_dir | str | None | Directory for external images |
| sanitize | bool | False | Mask emails, phones, IPs, credit cards, URLs |
| pages | str | None | Pages to extract, e.g., "1,3,5-7" |
| include_header_footer | bool | False | Include page headers and footers |
| detect_strikethrough | bool | False | Detect strikethrough text (experimental) |
| hybrid | str | None | Hybrid AI backend: "docling-fast" |
| hybrid_mode | str | None | "auto" (complex pages only) or "full" (all pages) |
| hybrid_url | str | None | Custom backend server URL |
| hybrid_timeout | str | None | Backend timeout in milliseconds |
| hybrid_fallback | bool | False | Fall back to Java on backend failure |
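
Several of these parameters accept only a fixed set of strings, so it can help to check a configuration dictionary before passing it to the reader. The sketch below is a hypothetical pre-check built from the values listed in the table; `ALLOWED_VALUES` and `check_options` are illustrative names, and the reader may apply its own validation.

```python
# Allowed string values, copied from the parameter table.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"off", "xycut"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
    "hybrid_mode": {"auto", "full"},
}


def check_options(options: dict) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Parameters without a fixed value set, and None values, are skipped.
        if allowed is not None and value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


options = {"format": "markdown", "table_method": "cluster", "pages": "1,3,5-7"}
print(check_options(options))  # []
print(check_options({"format": "pdf"}))
```

Once the list of problems is empty, the same dictionary can be unpacked into the constructor, for example `OpenDataLoaderPDFReader(**options)`.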

Document Metadata

Each Document includes metadata:

With split_pages=True (default):

{"source": "document.pdf", "format": "text", "page": 1}

With split_pages=False:

{"source": "document.pdf", "format": "text"}

With hybrid mode:

{"source": "document.pdf", "format": "text", "page": 1, "hybrid": "docling-fast"}
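
The `source` and `page` keys make it straightforward to see which pages were loaded from each file, for example when building citations. The following is a small sketch that works on plain metadata dictionaries shaped like the ones above; `pages_by_source` is a hypothetical helper, and "report.pdf" is a made-up sample file name.

```python
from collections import defaultdict


def pages_by_source(metadatas: list[dict]) -> dict[str, list[int]]:
    """Map each source file to the sorted page numbers seen in the metadata."""
    grouped: dict[str, set[int]] = defaultdict(set)
    for meta in metadatas:
        pages = grouped[meta["source"]]  # registers the source on first sight
        if "page" in meta:  # absent when split_pages=False
            pages.add(meta["page"])
    return {source: sorted(pages) for source, pages in grouped.items()}


sample = [
    {"source": "document.pdf", "format": "text", "page": 2},
    {"source": "document.pdf", "format": "text", "page": 1},
    {"source": "report.pdf", "format": "text"},  # loaded with split_pages=False
]
print(pages_by_source(sample))
# {'document.pdf': [1, 2], 'report.pdf': []}
```

With loaded documents, the input could be built as `[doc.metadata for doc in documents]`.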
