LlamaIndex reader for OpenDataLoader PDF — fast, accurate, local PDF extraction
opendataloader-pdf-llamaindex
LlamaIndex reader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.
For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.
Features
- Accurate reading order: the XY-Cut++ algorithm handles multi-column layouts correctly
- Table extraction: preserves table structure in output
- Multiple formats: Text, Markdown, JSON (with bounding boxes), HTML
- Per-page splitting: each page becomes a separate Document with page number metadata
- AI safety: built-in prompt injection filtering (hidden text, off-page content, invisible layers)
- 100% local: no cloud APIs, your documents never leave your machine
- Fast: rule-based extraction, no GPU required
Requirements
- Python >= 3.10
- Java 11+ available on the system PATH
Verify Java is installed:
java -version
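The same check can be scripted before a reader is constructed. The sketch below is a minimal, standard-library illustration; the helper name java_on_path is hypothetical and not part of this package, and it only confirms that a Java executable runs, not that it is version 11 or newer.

```python
import shutil
import subprocess


def java_on_path(executable: str = "java") -> bool:
    """Return True if the executable is found on PATH and `-version` exits with 0."""
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        # `java -version` writes its banner to stderr and exits with 0 on success.
        result = subprocess.run(
            [path, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Java available:", java_on_path())
```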
Installation
pip install -U opendataloader-pdf-llamaindex
Quick Start
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
reader = OpenDataLoaderPDFReader(format="text")
documents = reader.load_data(file_path="document.pdf")
print(documents[0].text)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
SimpleDirectoryReader Integration
Use with LlamaIndex's SimpleDirectoryReader via the file_extractor parameter:
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
reader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
)
documents = reader.load_data()
Usage Examples
Output Formats
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
# Plain text (default) — best for simple RAG
reader = OpenDataLoaderPDFReader(format="text")
# Markdown — preserves headings, lists, tables
reader = OpenDataLoaderPDFReader(format="markdown")
# JSON — structured data with bounding boxes for source citations
reader = OpenDataLoaderPDFReader(format="json")
# HTML — styled output
reader = OpenDataLoaderPDFReader(format="html")
Tagged PDF Support
For accessible PDFs with structure tags (common in government/legal documents):
reader = OpenDataLoaderPDFReader(use_struct_tree=True)
Table Detection
reader = OpenDataLoaderPDFReader(
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)
Sensitive Data Sanitization
reader = OpenDataLoaderPDFReader(sanitize=True)
# Replaces emails, phone numbers, IPs, credit cards, URLs with placeholders
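To make the idea of placeholder masking concrete, here is a rough standalone sketch using regular expressions. It is not the package's implementation: the patterns, the placeholder strings such as [EMAIL], and the function name mask_sensitive are all assumptions for illustration, and it covers only URLs, emails, and IPv4 addresses.

```python
import re

# Illustrative patterns only; the real sanitizer's rules and placeholders may differ.
_PATTERNS = [
    ("[URL]", re.compile(r"https?://\S+")),
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def mask_sensitive(text: str) -> str:
    """Replace each match of the patterns above with its placeholder."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(mask_sensitive("Mail bob@example.com from 10.0.0.1 or see https://example.com/a"))
# Mail [EMAIL] from [IP] or see [URL]
```

URLs are masked first so that an address embedded in a link is not partially rewritten by the email pattern.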
Page Selection
reader = OpenDataLoaderPDFReader(pages="1,3,5-7")
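As an illustration of how a page specification like "1,3,5-7" can be read, the sketch below expands it into explicit page numbers. The function expand_page_spec is a hypothetical helper, not part of the reader, and it assumes 1-based page numbers with inclusive ranges; the reader's own parsing may differ in details.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-7" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-7" is treated as inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_spec("1,3,5-7"))  # [1, 3, 5, 6, 7]
```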
Headers and Footers
reader = OpenDataLoaderPDFReader(include_header_footer=True)
Password-Protected PDFs
reader = OpenDataLoaderPDFReader(password="secret")
Image Handling
# Embed images as Base64 in output
reader = OpenDataLoaderPDFReader(image_output="embedded")
# Save images to external files
reader = OpenDataLoaderPDFReader(
    image_output="external",
    image_dir="./extracted_images"
)
Hybrid AI Mode
For higher accuracy on complex documents (requires a running hybrid backend):
reader = OpenDataLoaderPDFReader(
    hybrid="docling-fast",
    hybrid_fallback=True  # Fall back to Java on backend failure
)
RAG Pipeline Example
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.readers.opendataloader_pdf import OpenDataLoaderPDFReader
# Load PDFs
reader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": OpenDataLoaderPDFReader(format="markdown")}
)
documents = reader.load_data()
# Build index and query
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What are the key findings?")
print(response)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | str | "text" | Output format: "text", "markdown", "json", "html" |
| split_pages | bool | True | Split output into separate Documents per page |
| quiet | bool | False | Suppress CLI logging output |
| content_safety_off | list[str] | None | Safety filters to disable: "all", "hidden-text", "off-page", "tiny", "hidden-ocg" |
| password | str | None | Password for encrypted PDFs |
| keep_line_breaks | bool | False | Preserve original line breaks |
| replace_invalid_chars | str | None | Replacement for unrecognized characters |
| use_struct_tree | bool | False | Use PDF structure tree (tagged PDFs) |
| table_method | str | None | "default" (border-based) or "cluster" (border + cluster) |
| reading_order | str | None | "off" or "xycut" (default when not specified) |
| image_output | str | "off" | "off", "embedded" (Base64), "external" (files) |
| image_format | str | None | "png" or "jpeg" |
| image_dir | str | None | Directory for external images |
| sanitize | bool | False | Mask emails, phones, IPs, credit cards, URLs |
| pages | str | None | Pages to extract, e.g., "1,3,5-7" |
| include_header_footer | bool | False | Include page headers and footers |
| detect_strikethrough | bool | False | Detect strikethrough text (experimental) |
| hybrid | str | None | Hybrid AI backend: "docling-fast" |
| hybrid_mode | str | None | "auto" (complex pages only) or "full" (all pages) |
| hybrid_url | str | None | Custom backend server URL |
| hybrid_timeout | str | None | Backend timeout in milliseconds |
| hybrid_fallback | bool | False | Fall back to Java on backend failure |
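Several of these parameters accept only a fixed set of strings, so a configuration loaded from a file or environment can be checked before it reaches the reader. The sketch below is one possible pre-flight check; check_reader_options and ALLOWED_VALUES are hypothetical names, the allowed values are copied from the table, and the reader itself may accept other values in other versions.

```python
# Allowed values copied from the parameter table; None means "not specified".
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {None, "default", "cluster"},
    "reading_order": {None, "off", "xycut"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {None, "png", "jpeg"},
    "hybrid": {None, "docling-fast"},
    "hybrid_mode": {None, "auto", "full"},
}


def check_reader_options(options: dict) -> dict:
    """Raise ValueError for a value outside the table; return options unchanged."""
    for name, allowed in ALLOWED_VALUES.items():
        if name in options and options[name] not in allowed:
            choices = sorted(v for v in allowed if v is not None)
            raise ValueError(f"{name}={options[name]!r} is not one of {choices}")
    return options


options = check_reader_options(
    {"format": "markdown", "table_method": "cluster", "pages": "1,3,5-7"}
)
# The checked dict could then be passed as OpenDataLoaderPDFReader(**options).
```

Parameters that are not listed in ALLOWED_VALUES, such as pages, pass through unchecked.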
Document Metadata
Each Document includes metadata:
With split_pages=True (default):
{"source": "document.pdf", "format": "text", "page": 1}
With split_pages=False:
{"source": "document.pdf", "format": "text"}
With hybrid mode:
{"source": "document.pdf", "format": "text", "page": 1, "hybrid": "docling-fast"}
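These keys are enough to build a human-readable source label for citations in a RAG answer. The sketch below works on plain metadata dictionaries shaped like the examples above; format_citation is a hypothetical helper, not part of the package.

```python
def format_citation(metadata: dict) -> str:
    """Build a short source label from the metadata keys shown above."""
    label = metadata["source"]
    if "page" in metadata:  # present only when split_pages=True
        label += f", page {metadata['page']}"
    if "hybrid" in metadata:  # present only in hybrid mode
        label += f" (hybrid: {metadata['hybrid']})"
    return label


print(format_citation({"source": "document.pdf", "format": "text", "page": 1}))
# document.pdf, page 1
```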