LlamaIndex reader for OpenDataLoader PDF — fast, accurate, local PDF extraction

Maintainers

benedict-lee


opendataloader-pdf-llamaindex

LlamaIndex reader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.

For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.

Features

Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
Table extraction — Preserves table structure in output
Multiple formats — Text, Markdown, JSON (with bounding boxes), HTML
Per-page splitting — Each page becomes a separate Document with page number metadata
AI safety — Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
100% local — No cloud APIs; your documents never leave your machine
Fast — Rule-based extraction, no GPU required

Requirements

Python >= 3.10
Java 11+ available on system PATH

Verify Java is installed:

java -version
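
The same check can be scripted before a reader is constructed. The sketch below is a minimal illustration that assumes only that the `java` executable must be discoverable on `PATH`; `java_on_path` is a hypothetical helper and not part of this package. It does not check the Java version.

```python
import shutil


def java_on_path(executable: str = "java") -> bool:
    """Return True if the given executable can be found on the system PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_on_path():
        print("Java found on PATH")
    else:
        print("Java not found; install Java 11+ and add it to PATH")
```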

Installation

pip install -U opendataloader-pdf-llamaindex

Quick Start

from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

reader = OpenDataLoaderPDFReader(format="text")
documents = reader.load_data(file_path="document.pdf")

print(documents[0].text)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

SimpleDirectoryReader Integration

Use with LlamaIndex's SimpleDirectoryReader via the file_extractor parameter:

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

reader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
)
documents = reader.load_data()

Usage Examples

Output Formats

from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

# Plain text (default) — best for simple RAG
reader = OpenDataLoaderPDFReader(format="text")

# Markdown — preserves headings, lists, tables
reader = OpenDataLoaderPDFReader(format="markdown")

# JSON — structured data with bounding boxes for source citations
reader = OpenDataLoaderPDFReader(format="json")

# HTML — styled output
reader = OpenDataLoaderPDFReader(format="html")

Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

reader = OpenDataLoaderPDFReader(use_struct_tree=True)

Table Detection

reader = OpenDataLoaderPDFReader(
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

Sensitive Data Sanitization

reader = OpenDataLoaderPDFReader(sanitize=True)
# Replaces emails, phone numbers, IPs, credit cards, URLs with placeholders

Page Selection

reader = OpenDataLoaderPDFReader(pages="1,3,5-7")
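
To make the page specification string concrete, the sketch below expands it into explicit page numbers. It assumes 1-based page numbers and inclusive ranges, which is an assumption about the format rather than confirmed engine behavior; `expand_page_spec` is a hypothetical helper, since the reader takes the string directly.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-7" into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges (an illustration only;
    the actual parsing is done by the extraction engine).
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_spec("1,3,5-7"))  # [1, 3, 5, 6, 7]
```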

Headers and Footers

reader = OpenDataLoaderPDFReader(include_header_footer=True)

Password-Protected PDFs

reader = OpenDataLoaderPDFReader(password="secret")

Image Handling

# Embed images as Base64 in output
reader = OpenDataLoaderPDFReader(image_output="embedded")

# Save images to external files
reader = OpenDataLoaderPDFReader(
    image_output="external",
    image_dir="./extracted_images"
)

Hybrid AI Mode

For higher accuracy on complex documents (requires a running hybrid backend):

reader = OpenDataLoaderPDFReader(
    hybrid="docling-fast",
    hybrid_fallback=True  # Fall back to Java on backend failure
)

RAG Pipeline Example

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader

# Load PDFs
reader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
)
documents = reader.load_data()

# Build index and query
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What are the key findings?")
print(response)

Parameters

Parameter	Type	Default	Description
`format`	`str`	`"text"`	Output format: `"text"`, `"markdown"`, `"json"`, `"html"`
`split_pages`	`bool`	`True`	Split output into separate Documents per page
`quiet`	`bool`	`False`	Suppress CLI logging output
`content_safety_off`	`list[str]`	`None`	Safety filters to disable: `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`
`password`	`str`	`None`	Password for encrypted PDFs
`keep_line_breaks`	`bool`	`False`	Preserve original line breaks
`replace_invalid_chars`	`str`	`None`	Replacement for unrecognized characters
`use_struct_tree`	`bool`	`False`	Use PDF structure tree (tagged PDFs)
`table_method`	`str`	`None`	`"default"` (border-based) or `"cluster"` (border + cluster)
`reading_order`	`str`	`None`	`"off"` or `"xycut"`; `"xycut"` is used when not specified
`image_output`	`str`	`"off"`	`"off"`, `"embedded"` (Base64), `"external"` (files)
`image_format`	`str`	`None`	`"png"` or `"jpeg"`
`image_dir`	`str`	`None`	Directory for external images
`sanitize`	`bool`	`False`	Mask emails, phones, IPs, credit cards, URLs
`pages`	`str`	`None`	Pages to extract, e.g., `"1,3,5-7"`
`include_header_footer`	`bool`	`False`	Include page headers and footers
`detect_strikethrough`	`bool`	`False`	Detect strikethrough text (experimental)
`hybrid`	`str`	`None`	Hybrid AI backend: `"docling-fast"`
`hybrid_mode`	`str`	`None`	`"auto"` (complex pages only) or `"full"` (all pages)
`hybrid_url`	`str`	`None`	Custom backend server URL
`hybrid_timeout`	`str`	`None`	Backend timeout in milliseconds
`hybrid_fallback`	`bool`	`False`	Fall back to Java on backend failure
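
Several parameters accept only a fixed set of strings. The sketch below checks options against the values listed in the table before they are passed on. `ALLOWED_VALUES` and `check_reader_options` are hypothetical names introduced for illustration; whether the reader itself validates these values is not stated here.

```python
# Allowed values copied from the Parameters table above.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"off", "xycut"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
    "hybrid_mode": {"auto", "full"},
}


def check_reader_options(**options) -> dict:
    """Return the options unchanged, raising ValueError for unlisted values."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Parameters without an enumerated set, and None values, pass through.
        if allowed is not None and value is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


options = check_reader_options(format="markdown", table_method="cluster", pages="1,3,5-7")
print(options)
```

The checked dictionary could then be unpacked into the constructor, for example `OpenDataLoaderPDFReader(**options)`.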

Document Metadata

Each Document includes metadata:

With split_pages=True (default):

{"source": "document.pdf", "format": "text", "page": 1}

With split_pages=False:

{"source": "document.pdf", "format": "text"}

With hybrid mode:

{"source": "document.pdf", "format": "text", "page": 1, "hybrid": "docling-fast"}
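
One use for this metadata is grouping per-page Documents back by source file. The sketch below works on plain dictionaries shaped like the examples above so that it runs without the package installed; in practice the records would come from `doc.metadata` on each returned Document. `pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict

# Metadata dicts shaped like the examples above.
metadata_records = [
    {"source": "a.pdf", "format": "text", "page": 1},
    {"source": "a.pdf", "format": "text", "page": 2},
    {"source": "b.pdf", "format": "text"},  # split_pages=False: no "page" key
]


def pages_by_source(records: list[dict]) -> dict[str, list[int]]:
    """Map each source file to the page numbers present in its metadata."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        pages = grouped[record["source"]]  # creates the entry even without a page
        if "page" in record:
            pages.append(record["page"])
    return dict(grouped)


print(pages_by_source(metadata_records))  # {'a.pdf': [1, 2], 'b.pdf': []}
```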


Release history

0.0.3 (this version): Apr 14, 2026
0.0.2: Apr 10, 2026
0.0.1: Apr 10, 2026

Download files

Source Distribution

opendataloader_pdf_llamaindex-0.0.3.tar.gz (10.8 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

opendataloader_pdf_llamaindex-0.0.3-py3-none-any.whl (12.2 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file opendataloader_pdf_llamaindex-0.0.3.tar.gz.

File metadata

Download URL: opendataloader_pdf_llamaindex-0.0.3.tar.gz
Upload date: Apr 14, 2026
Size: 10.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for opendataloader_pdf_llamaindex-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`deec4c4c21c5224c9fdcf22e08364c46e47b96bc5b13034921f751b2be626b83`
MD5	`69184abb9a3281fdef3899e92ad5522d`
BLAKE2b-256	`dd7a77eda670bf0659b1c03882a74c05d66d95770a7c7b097a36d8f84108f3fa`

Provenance

The following attestation bundles were made for opendataloader_pdf_llamaindex-0.0.3.tar.gz:

Publisher: release.yml on opendataloader-project/opendataloader-pdf-llamaindex

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: opendataloader_pdf_llamaindex-0.0.3.tar.gz
- Subject digest: deec4c4c21c5224c9fdcf22e08364c46e47b96bc5b13034921f751b2be626b83
- Sigstore transparency entry: 1292808066
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: opendataloader-project/opendataloader-pdf-llamaindex@cba73a6f133bfb34d1a6d61ecaeeca26e559aa5e
- Branch / Tag: refs/tags/v0.0.3
- Owner: https://github.com/opendataloader-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cba73a6f133bfb34d1a6d61ecaeeca26e559aa5e
- Trigger Event: push

File details

Details for the file opendataloader_pdf_llamaindex-0.0.3-py3-none-any.whl.

File metadata

Download URL: opendataloader_pdf_llamaindex-0.0.3-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 12.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for opendataloader_pdf_llamaindex-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b3181e0ce0e7f3cc663a24cceb2919fd663813d551dc544129f2f1676805d4ed`
MD5	`16787312abb317c95e8c1a93baa52ccf`
BLAKE2b-256	`39802a4b4209dd4b51624cf221ad51c139d744f580c112b07c6be4cb97438275`

Provenance

The following attestation bundles were made for opendataloader_pdf_llamaindex-0.0.3-py3-none-any.whl:

Publisher: release.yml on opendataloader-project/opendataloader-pdf-llamaindex

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: opendataloader_pdf_llamaindex-0.0.3-py3-none-any.whl
- Subject digest: b3181e0ce0e7f3cc663a24cceb2919fd663813d551dc544129f2f1676805d4ed
- Sigstore transparency entry: 1292808148
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: opendataloader-project/opendataloader-pdf-llamaindex@cba73a6f133bfb34d1a6d61ecaeeca26e559aa5e
- Branch / Tag: refs/tags/v0.0.3
- Owner: https://github.com/opendataloader-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cba73a6f133bfb34d1a6d61ecaeeca26e559aa5e
- Trigger Event: push
