A LangChain integration for OpenDataLoader PDF

Maintainers

benedict-lee

langchain-opendataloader-pdf

LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.

For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.

Features

Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
Table extraction — Preserves table structure in output
Multiple formats — Text, Markdown, JSON (with bounding boxes), HTML
Per-page splitting — Each page becomes a separate Document with page number metadata
AI safety — Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
100% local — No cloud APIs, your documents never leave your machine
Fast — Rule-based extraction, no GPU required

Requirements

Python >= 3.10
Java 11+ available on system PATH
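
A quick preflight check can catch a missing prerequisite before the first load. The sketch below is an illustration only and not part of the package: `check_requirements` is a hypothetical helper built on the standard library. It confirms that a `java` executable is on PATH but does not verify that the version is 11 or newer.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Report whether the documented prerequisites appear to be present."""
    java_path = shutil.which("java")  # None when no `java` executable is on PATH
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "java_path": java_path,
        "java_on_path": java_path is not None,
    }


status = check_requirements()
print(status)
if not status["java_on_path"]:
    print("Java was not found on PATH; install Java 11+ before loading PDFs.")
```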

Installation

pip install -U langchain-opendataloader-pdf

Quick Start

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()

print(documents[0].page_content)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

Usage Examples

Batch Processing

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file, multiple files, or directories — all in one call
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()
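
The loader accepts directories directly, so no pre-processing is required. When you want to log or filter the files before passing them in, one possible approach is to build the list yourself. `collect_pdfs` below is a hypothetical helper, not part of the package, and the `.pdf` suffix filter is an assumption about which files you want.

```python
from pathlib import Path


def collect_pdfs(root, recursive=True):
    """Return sorted paths of files under `root` whose suffix is .pdf (any case)."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        str(path)
        for path in Path(root).glob(pattern)
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Example use (hypothetical directory name):
# loader = OpenDataLoaderPDFLoader(file_path=collect_pdfs("documents/"))
```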

Output Formats

# Plain text (default) — best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown — preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON — structured data with bounding boxes for source citations
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML — styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

Table Detection

loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

Sensitive Data Sanitization

# Replace emails, phone numbers, IPs, credit cards, URLs with placeholders
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    sanitize=True
)

Extract Specific Pages

loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    pages="1,3,5-10"
)
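
To make the page specification concrete, the sketch below expands a string such as "1,3,5-10" into explicit page numbers. It is not the library's parser: `expand_page_spec` is a hypothetical helper, and it assumes pages are 1-based and ranges are inclusive.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-10" into a sorted list of page numbers.

    Assumption: pages are 1-based and ranges include both endpoints.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers must be >= 1")
    return sorted(pages)


print(expand_page_spec("1,3,5-10"))  # [1, 3, 5, 6, 7, 8, 9, 10]
```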

Include Headers and Footers

# By default, headers and footers are excluded for cleaner RAG output
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    include_header_footer=True
)

Password-Protected PDFs

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)

Image Handling

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

# Save images as files to a local directory
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="external",
    image_dir="./images",   # images saved here; defaults to temp dir if not set
    image_format="png"
)

Hybrid AI Mode

For complex documents (tables, charts, scanned content), hybrid mode routes pages to an AI backend for better accuracy while keeping simple pages on the fast local engine:

# Requires a running docling-fast server (default: localhost:5002)
loader = OpenDataLoaderPDFLoader(
    file_path="complex_report.pdf",
    format="markdown",
    hybrid="docling-fast",          # Enable hybrid extraction
    hybrid_mode="auto",             # Auto-triage: only complex pages go to backend
    hybrid_url="http://localhost:5002",
)
documents = loader.load()

# Document metadata shows which backend was used
print(documents[0].metadata)
# {'source': 'complex_report.pdf', 'format': 'markdown', 'page': 1, 'hybrid': 'docling-fast'}
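
Because hybrid mode depends on a running backend, it can help to check that the server answers before enabling it. The following is a minimal sketch under stated assumptions: `backend_reachable` and `hybrid_kwargs` are hypothetical helpers, and a successful TCP connection only shows that something is listening at the URL, not that it is a docling-fast server.

```python
import socket
from urllib.parse import urlsplit


def backend_reachable(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    try:
        parts = urlsplit(url)
        host, port = parts.hostname, parts.port
    except ValueError:  # raised for a malformed port
        return False
    if host is None:
        return False
    if port is None:
        port = 443 if parts.scheme == "https" else 80
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def hybrid_kwargs(url: str = "http://localhost:5002") -> dict:
    """Return hybrid keyword arguments only when the backend answers."""
    if not backend_reachable(url):
        return {}  # empty dict: the loader stays on the local engine
    return {"hybrid": "docling-fast", "hybrid_mode": "auto", "hybrid_url": url}
```

The result can be unpacked into the constructor, for example `OpenDataLoaderPDFLoader(file_path="complex_report.pdf", format="markdown", **hybrid_kwargs())`. The `hybrid_fallback` parameter listed in the Parameters Reference covers failures that happen after loading has started.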

Suppress Logging

loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True
)

RAG Pipeline Example

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the main topic?")
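
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window chunker. It is deliberately not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators; `sliding_chunks` is a hypothetical helper that only shows the overlap arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut `text` into windows of `chunk_size` that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 2600
print([len(chunk) for chunk in sliding_chunks(sample)])  # [1000, 1000, 1000]
```

With the values from the pipeline above, each window starts 800 characters after the previous one, so neighbouring chunks share 200 characters of context.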

Parameters Reference

Parameter	Type	Default	Description
`file_path`	`str | List[str]`	—	(Required) PDF file path(s) or directories
`format`	`str`	`"text"`	Output format: `"text"`, `"markdown"`, `"json"`, `"html"`
`split_pages`	`bool`	`True`	Split into separate Documents per page
`quiet`	`bool`	`False`	Suppress console logging
`password`	`str`	`None`	Password for encrypted PDFs
`use_struct_tree`	`bool`	`False`	Use PDF structure tree (tagged PDFs)
`table_method`	`str`	`"default"`	`"default"` (border-based) or `"cluster"` (border + clustering)
`reading_order`	`str`	`"xycut"`	`"xycut"` or `"off"`
`keep_line_breaks`	`bool`	`False`	Preserve original line breaks
`image_output`	`str`	`"off"`	`"off"`, `"embedded"` (Base64), or `"external"`
`image_format`	`str`	`"png"`	`"png"` or `"jpeg"`
`image_dir`	`str`	`None`	Directory for extracted images when using `image_output="external"`
`sanitize`	`bool`	`False`	Sanitize sensitive data (emails, phone numbers, IPs, credit cards, URLs)
`pages`	`str`	`None`	Pages to extract (e.g., `"1,3,5-7"`). Default: all pages
`include_header_footer`	`bool`	`False`	Include page headers and footers in output
`content_safety_off`	`List[str]`	`None`	Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"`
`replace_invalid_chars`	`str`	`None`	Replacement for invalid characters
`hybrid`	`str`	`None`	Hybrid AI backend: `"docling-fast"`. Requires a running backend server
`hybrid_mode`	`str`	`None`	`"auto"` (route complex pages) or `"full"` (route all pages)
`hybrid_url`	`str`	`None`	Backend server URL. Default: `http://localhost:5002`
`hybrid_timeout`	`str`	`None`	Backend timeout in ms. Default: `"30000"`
`hybrid_fallback`	`bool`	`False`	Fall back to Java extraction on backend failure
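
Several parameters in the table accept only a fixed set of strings. One way to catch typos early is to check options against those sets before constructing the loader. This is a sketch, not package behavior: `ALLOWED_VALUES` is copied by hand from the table above and `invalid_options` is a hypothetical helper.

```python
# Enumerated values transcribed from the Parameters Reference table.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
    "hybrid_mode": {"auto", "full"},
}


def invalid_options(options: dict) -> dict:
    """Return the enumerated options whose value is not listed in the table."""
    return {
        name: value
        for name, value in options.items()
        if name in ALLOWED_VALUES
        and value is not None
        and value not in ALLOWED_VALUES[name]
    }


options = {"format": "pdf", "table_method": "cluster", "quiet": True}
print(invalid_options(options))  # {'format': 'pdf'}
```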

Document Metadata

Each returned Document includes metadata:

doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

# When hybrid mode is active:
# {'source': 'document.pdf', 'format': 'text', 'page': 1, 'hybrid': 'docling-fast'}

When split_pages=False, the page key is omitted.
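
The metadata keys make it straightforward to regroup pages by file after a batch load. The sketch below uses a stand-in class so it runs without LangChain installed; `StubDocument` and `group_by_source` are hypothetical names, and the only assumption about real Document objects is that they expose `page_content` and `metadata`.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_source(documents):
    """Group documents by 'source' and order each group by 'page'."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata["source"]].append(doc)
    for docs in grouped.values():
        # 'page' is absent when split_pages=False, so fall back to 0.
        docs.sort(key=lambda d: d.metadata.get("page", 0))
    return dict(grouped)


docs = [
    StubDocument("second page", {"source": "a.pdf", "format": "text", "page": 2}),
    StubDocument("first page", {"source": "a.pdf", "format": "text", "page": 1}),
    StubDocument("whole file", {"source": "b.pdf", "format": "text"}),
]
grouped = group_by_source(docs)
print({source: [d.metadata.get("page") for d in items] for source, items in grouped.items()})
# {'a.pdf': [1, 2], 'b.pdf': [None]}
```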

License

Apache License 2.0. See LICENSE for details.

Release history

Version	Release date
2.0.0 (this version)	Mar 16, 2026
1.2.0	Mar 4, 2026
1.1.1	Jan 2, 2026
1.1.0	Dec 31, 2025
1.0.1	Dec 10, 2025
1.0.0	Dec 8, 2025
0.1.0	Mar 16, 2026
0.0.2	Oct 1, 2025
0.0.1	Sep 30, 2025

Download files

Source Distribution

langchain_opendataloader_pdf-2.0.0.tar.gz (1.8 MB), uploaded Mar 16, 2026, Source

Built Distribution

langchain_opendataloader_pdf-2.0.0-py3-none-any.whl (13.0 kB), uploaded Mar 16, 2026, Python 3

File details

Details for the file langchain_opendataloader_pdf-2.0.0.tar.gz.

File metadata

Download URL: langchain_opendataloader_pdf-2.0.0.tar.gz
Upload date: Mar 16, 2026
Size: 1.8 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for langchain_opendataloader_pdf-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`14c83dc1df9f3ef0c1e5698da6d30db3a8af1151f5ae42f09bffca1156acdd47`
MD5	`a90a2563aecc90bc942f316e1e5dcba9`
BLAKE2b-256	`73e9650020e85d492ffe6cf2658fa2f6ef3476956286e9c3d6d01b0cc0ed4cba`
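
A downloaded archive can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file is assumed to have been downloaded to the path you pass in.

```python
import hashlib

# SHA256 digest listed above for langchain_opendataloader_pdf-2.0.0.tar.gz
EXPECTED_SHA256 = "14c83dc1df9f3ef0c1e5698da6d30db3a8af1151f5ae42f09bffca1156acdd47"


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected.lower()


# Example use, assuming the sdist is in the current directory:
# matches_published_hash("langchain_opendataloader_pdf-2.0.0.tar.gz")
```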


Provenance

The following attestation bundles were made for langchain_opendataloader_pdf-2.0.0.tar.gz:

Publisher: release.yml on opendataloader-project/langchain-opendataloader-pdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langchain_opendataloader_pdf-2.0.0.tar.gz
- Subject digest: 14c83dc1df9f3ef0c1e5698da6d30db3a8af1151f5ae42f09bffca1156acdd47
- Sigstore transparency entry: 1109375526
- Sigstore integration time: Mar 16, 2026
Source repository:
- Permalink: opendataloader-project/langchain-opendataloader-pdf@809163650c084e448a44dd27d88e23961f37c9cd
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/opendataloader-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@809163650c084e448a44dd27d88e23961f37c9cd
- Trigger Event: push

File details

Details for the file langchain_opendataloader_pdf-2.0.0-py3-none-any.whl.

File metadata

Download URL: langchain_opendataloader_pdf-2.0.0-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 13.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for langchain_opendataloader_pdf-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4975f06b5d1a98826f96e328e5a26d67765e87d50a87f637659da35284feeee4`
MD5	`6bdf28f628905b1a58b245956a762979`
BLAKE2b-256	`d86938fd79483e8c3de45086202c562dea38095aedb7dfa10206ab5a7f327ad4`


Provenance

The following attestation bundles were made for langchain_opendataloader_pdf-2.0.0-py3-none-any.whl:

Publisher: release.yml on opendataloader-project/langchain-opendataloader-pdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langchain_opendataloader_pdf-2.0.0-py3-none-any.whl
- Subject digest: 4975f06b5d1a98826f96e328e5a26d67765e87d50a87f637659da35284feeee4
- Sigstore transparency entry: 1109375545
- Sigstore integration time: Mar 16, 2026
Source repository:
- Permalink: opendataloader-project/langchain-opendataloader-pdf@809163650c084e448a44dd27d88e23961f37c9cd
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/opendataloader-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@809163650c084e448a44dd27d88e23961f37c9cd
- Trigger Event: push
