A LangChain integration for OpenDataLoader PDF
langchain-opendataloader-pdf
LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.
For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.
Features
- Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
- Table extraction — Preserves table structure in output
- Multiple formats — Text, Markdown, JSON (with bounding boxes), HTML
- Per-page splitting — Each page becomes a separate Document with page number metadata
- AI safety — Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
- 100% local — No cloud APIs, your documents never leave your machine
- Fast — Rule-based extraction, no GPU required
Requirements
- Python >= 3.10
- Java 11+ available on system PATH
Installation
pip install -U langchain-opendataloader-pdf
Quick Start
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
loader = OpenDataLoaderPDFLoader(
file_path="document.pdf",
format="text"
)
documents = loader.load()
print(documents[0].page_content)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
Usage Examples
Batch Processing
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# Single file, multiple files, or directories — all in one call
loader = OpenDataLoaderPDFLoader(
file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()
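When several files are loaded in one call, the returned list mixes pages from all of them. One way to separate them again is to group by the `source` metadata key shown in the Quick Start output. This is a sketch only; `group_by_source` is a hypothetical helper, and it assumes every Document carries a `source` entry.

```python
from collections import defaultdict


def group_by_source(docs):
    """Group loaded documents by their `source` metadata value.

    Works with any object exposing a `.metadata` dict, such as the
    Document objects returned by `loader.load()`.
    """
    grouped = defaultdict(list)
    for doc in docs:
        # Fall back to a placeholder key if `source` is unexpectedly missing.
        grouped[doc.metadata.get("source", "<unknown>")].append(doc)
    return dict(grouped)
```

For example, `{src: len(pages) for src, pages in group_by_source(docs).items()}` gives a per-file count of loaded Documents.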
Output Formats
# Plain text (default) — best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")
# Markdown — preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")
# JSON — structured data with bounding boxes for source citations
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")
# HTML — styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")
Tagged PDF Support
For accessible PDFs with structure tags (common in government/legal documents):
loader = OpenDataLoaderPDFLoader(
file_path="accessible_document.pdf",
use_struct_tree=True # Use native PDF structure
)
Table Detection
loader = OpenDataLoaderPDFLoader(
file_path="financial_report.pdf",
format="markdown",
table_method="cluster" # Better for borderless tables
)
Sensitive Data Sanitization
# Replace emails, phone numbers, IPs, credit cards, URLs with placeholders
loader = OpenDataLoaderPDFLoader(
file_path="document.pdf",
sanitize=True
)
Extract Specific Pages
loader = OpenDataLoaderPDFLoader(
file_path="document.pdf",
pages="1,3,5-10"
)
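The `pages` string combines single pages and inclusive ranges. To make the syntax concrete, here is a small illustrative parser. It is a hypothetical helper that shows how such a spec reads, assuming 1-based page numbers as in the metadata examples; the loader's own parsing may differ in edge cases.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-10" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_spec("1,3,5-10"))  # [1, 3, 5, 6, 7, 8, 9, 10]
```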
Include Headers and Footers
# By default, headers and footers are excluded for cleaner RAG output
loader = OpenDataLoaderPDFLoader(
file_path="document.pdf",
include_header_footer=True
)
Password-Protected PDFs
loader = OpenDataLoaderPDFLoader(
file_path="encrypted.pdf",
password="secret123"
)
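Hardcoding a password as in the snippet above is fine for a demo, but in practice it is usually read from the environment. A minimal sketch follows; the variable name `PDF_PASSWORD` and the helper `loader_kwargs` are assumptions made for illustration, and only the `file_path` and `password` parameters come from this README.

```python
import os


def loader_kwargs(file_path: str, env_var: str = "PDF_PASSWORD", environ=None) -> dict:
    """Build keyword arguments for OpenDataLoaderPDFLoader.

    The password is read from an environment variable (the name is an
    assumption) and omitted entirely when the variable is unset or empty.
    """
    environ = os.environ if environ is None else environ
    kwargs = {"file_path": file_path}
    password = environ.get(env_var)
    if password:
        kwargs["password"] = password
    return kwargs
```

The result can then be unpacked into the loader, for example `OpenDataLoaderPDFLoader(**loader_kwargs("encrypted.pdf"))`.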
Image Handling
# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines
# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
file_path="doc.pdf",
format="markdown",
image_output="embedded",
image_format="jpeg" # or "png"
)
# Save images as files to a local directory
loader = OpenDataLoaderPDFLoader(
file_path="doc.pdf",
format="markdown",
image_output="external",
image_dir="./images", # images saved here; defaults to temp dir if not set
image_format="png"
)
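For multimodal pipelines, the Base64 payloads usually have to be decoded back into bytes. The sketch below assumes that embedded images appear in Markdown output as image links with `data:` URIs; that layout is an assumption and should be checked against real output before relying on it. `extract_embedded_images` is a hypothetical helper.

```python
import base64
import re

# Assumption: embedded images look like ![alt](data:image/png;base64,....).
# Verify against the loader's actual Markdown output.
DATA_URI = re.compile(
    r"!\[[^\]]*\]\(data:image/(png|jpeg);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_embedded_images(markdown: str) -> list[tuple[str, bytes]]:
    """Return (format, raw_bytes) pairs for each embedded image found."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URI.finditer(markdown)
    ]
```

Applied to `documents[0].page_content`, this would yield one tuple per embedded image, which can then be passed to a vision model or written to disk.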
Hybrid AI Mode
For complex documents (tables, charts, scanned content), hybrid mode routes pages to an AI backend for better accuracy while keeping simple pages on the fast local engine:
# Requires a running docling-fast server (default: localhost:5002)
loader = OpenDataLoaderPDFLoader(
file_path="complex_report.pdf",
format="markdown",
hybrid="docling-fast", # Enable hybrid extraction
hybrid_mode="auto", # Auto-triage: only complex pages go to backend
hybrid_url="http://localhost:5002",
)
documents = loader.load()
# Document metadata shows which backend was used
print(documents[0].metadata)
# {'source': 'complex_report.pdf', 'format': 'markdown', 'page': 1, 'hybrid': 'docling-fast'}
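Because hybrid mode depends on a running backend, it can help to probe the server before enabling it. The following sketch only checks that something accepts TCP connections at the URL's host and port; it does not confirm that the listener is actually a docling-fast server. `backend_reachable` is a hypothetical helper, not part of this package.

```python
import socket
from urllib.parse import urlparse


def backend_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's conventional port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

One possible use is `hybrid = "docling-fast" if backend_reachable("http://localhost:5002") else None`, so the loader stays on the local engine when the server is down. The `hybrid_fallback` parameter listed in the Parameters Reference covers failures that happen later, during extraction.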
Suppress Logging
loader = OpenDataLoaderPDFLoader(
file_path="doc.pdf",
quiet=True
)
RAG Pipeline Example
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Load PDF
loader = OpenDataLoaderPDFLoader(
file_path="knowledge_base.pdf",
format="markdown",
quiet=True
)
documents = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
chunks = splitter.split_documents(documents)
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
# Query
results = vectorstore.similarity_search("What is the main topic?")
Parameters Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str \| List[str] | — | (Required) PDF file path(s) or directories |
| format | str | "text" | Output format: "text", "markdown", "json", "html" |
| split_pages | bool | True | Split into separate Documents per page |
| quiet | bool | False | Suppress console logging |
| password | str | None | Password for encrypted PDFs |
| use_struct_tree | bool | False | Use PDF structure tree (tagged PDFs) |
| table_method | str | "default" | "default" (border-based) or "cluster" (border + clustering) |
| reading_order | str | "xycut" | "xycut" or "off" |
| keep_line_breaks | bool | False | Preserve original line breaks |
| image_output | str | "off" | "off", "embedded" (Base64), or "external" |
| image_format | str | "png" | "png" or "jpeg" |
| image_dir | str | None | Directory for extracted images when using image_output="external" |
| sanitize | bool | False | Sanitize sensitive data (emails, phone numbers, IPs, credit cards, URLs) |
| pages | str | None | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| include_header_footer | bool | False | Include page headers and footers in output |
| content_safety_off | List[str] | None | Disable safety filters: "hidden-text", "off-page", "tiny", "hidden-ocg", "all" |
| replace_invalid_chars | str | None | Replacement for invalid characters |
| hybrid | str | None | Hybrid AI backend: "docling-fast". Requires running backend server |
| hybrid_mode | str | None | "auto" (route complex pages) or "full" (route all pages) |
| hybrid_url | str | None | Backend server URL. Default: http://localhost:5002 |
| hybrid_timeout | str | None | Backend timeout in ms. Default: "30000" |
| hybrid_fallback | bool | False | Fall back to Java extraction on backend failure |
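Several parameters accept only a fixed set of strings. When options come from a config file or user input, checking them up front gives a clearer error than a failure deep inside extraction. The sketch below copies the allowed values from the table above; `ALLOWED_VALUES` and `check_options` are hypothetical names, and the loader may perform its own validation independently.

```python
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
    "hybrid_mode": {"auto", "full"},
}


def check_options(**options) -> dict:
    """Raise ValueError for a value not listed in the table; return the options."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Parameters without a fixed value set, and None defaults, pass through.
        if allowed is not None and value is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options
```

A typical call would be `OpenDataLoaderPDFLoader(file_path="doc.pdf", **check_options(format="markdown", table_method="cluster"))`.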
Document Metadata
Each returned Document includes metadata:
doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
# When hybrid mode is active:
# {'source': 'document.pdf', 'format': 'text', 'page': 1, 'hybrid': 'docling-fast'}
When split_pages=False, the page key is omitted.
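Code that reads metadata should therefore treat `page` as optional. A minimal sketch of a citation label that handles both cases, as well as the optional `hybrid` key, is shown here; `describe_origin` is a hypothetical helper.

```python
def describe_origin(metadata: dict) -> str:
    """Build a short citation label from a Document's metadata."""
    source = metadata.get("source", "<unknown>")
    page = metadata.get("page")  # absent when split_pages=False
    label = f"{source}, page {page}" if page is not None else source
    backend = metadata.get("hybrid")  # present only when hybrid mode is active
    return f"{label} (via {backend})" if backend else label


print(describe_origin({"source": "document.pdf", "format": "text", "page": 1}))
# document.pdf, page 1
```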
License
Apache License 2.0. See LICENSE for details.
Links
- Documentation — Full documentation (hybrid mode, benchmarks, accessibility)
- GitHub — Core engine source code
- LangChain Docs — LangChain integration reference
- PyPI Package