A LangChain integration for OpenDataLoader PDF

langchain-opendataloader-pdf

LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.

For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.

Features

  • Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
  • Table extraction — Preserves table structure in output
  • Multiple formats — Text, Markdown, JSON (with bounding boxes), HTML
  • Per-page splitting — Each page becomes a separate Document with page number metadata
  • AI safety — Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
  • 100% local — No cloud APIs, your documents never leave your machine
  • Fast — Rule-based extraction, no GPU required

Requirements

  • Python >= 3.10
  • Java 11+ available on system PATH
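
Because the loader needs a Java runtime on PATH, it can help to check for one before the first load. The sketch below uses only the Python standard library; `find_java` and `java_version_banner` are hypothetical helper names, not part of this package, and the sketch does not parse the version number.

```python
import shutil
import subprocess
from typing import Optional


def find_java(executable: str = "java") -> Optional[str]:
    """Return the full path of the Java launcher on PATH, or None if absent."""
    return shutil.which(executable)


def java_version_banner(java_path: str) -> str:
    """Return the first line printed by `java -version` (usually on stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    output = result.stderr or result.stdout
    return output.splitlines()[0] if output.strip() else ""


if __name__ == "__main__":
    path = find_java()
    if path is None:
        print("Java not found on PATH; install Java 11 or newer.")
    else:
        print(f"Found {path}: {java_version_banner(path)}")
```

Read the printed banner to confirm the runtime is version 11 or newer.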

Installation

pip install -U langchain-opendataloader-pdf

Quick Start

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()

print(documents[0].page_content)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

Usage Examples

Batch Processing

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file, multiple files, or directories — all in one call
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()
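
After a batch load, the returned list mixes pages from every input file. A minimal sketch of regrouping them by the `source` metadata key follows. It assumes `source` holds the path of the individual PDF each page came from, and it uses `SimpleNamespace` stand-ins for Document objects so it runs without the loader; `group_by_source` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_source(docs):
    """Map each metadata['source'] value to the list of documents from it."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "<unknown>")].append(doc)
    return dict(grouped)


# Stand-ins for LangChain Document objects so the sketch runs on its own.
sample_docs = [
    SimpleNamespace(page_content="intro", metadata={"source": "report1.pdf", "page": 1}),
    SimpleNamespace(page_content="body", metadata={"source": "report1.pdf", "page": 2}),
    SimpleNamespace(page_content="summary", metadata={"source": "report2.pdf", "page": 1}),
]

for source, pages in group_by_source(sample_docs).items():
    print(f"{source}: {len(pages)} page(s)")
```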

Output Formats

# Plain text (default) — best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown — preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON — structured data with bounding boxes for source citations
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML — styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

Table Detection

loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

Sensitive Data Sanitization

# Replace emails, phone numbers, IPs, credit cards, URLs with placeholders
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    sanitize=True
)

Extract Specific Pages

loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    pages="1,3,5-10"
)
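
To make the page syntax concrete, the sketch below expands a specification such as "1,3,5-10" into explicit page numbers. `expand_page_spec` is a hypothetical helper written for illustration, not the library's own parser. It assumes pages are numbered from 1 and that ranges are inclusive, and the library may accept or reject inputs differently.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-10" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(expand_page_spec("1,3,5-10"))
# [1, 3, 5, 6, 7, 8, 9, 10]
```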

Include Headers and Footers

# By default, headers and footers are excluded for cleaner RAG output
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    include_header_footer=True
)

Password-Protected PDFs

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)

Image Handling

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

# Save images as files to a local directory
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="external",
    image_dir="./images",   # images saved here; defaults to temp dir if not set
    image_format="png"
)

Hybrid AI Mode

For complex documents (tables, charts, scanned content), hybrid mode routes pages to an AI backend for better accuracy while keeping simple pages on the fast local engine:

# Requires a running docling-fast server (default: localhost:5002)
loader = OpenDataLoaderPDFLoader(
    file_path="complex_report.pdf",
    format="markdown",
    hybrid="docling-fast",          # Enable hybrid extraction
    hybrid_mode="auto",             # Auto-triage: only complex pages go to backend
    hybrid_url="http://localhost:5002",
)
documents = loader.load()

# Document metadata shows which backend was used
print(documents[0].metadata)
# {'source': 'complex_report.pdf', 'format': 'markdown', 'page': 1, 'hybrid': 'docling-fast'}
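
Since hybrid mode depends on a running backend server, a quick reachability check before loading can give a clearer error than a timeout partway through a batch. The sketch below only tests whether something accepts TCP connections at the URL's host and port. `backend_reachable` is a hypothetical helper, and the URL is the default named above.

```python
import socket
from urllib.parse import urlsplit


def backend_reachable(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    # Fall back to the scheme's standard port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = "http://localhost:5002"
    state = "reachable" if backend_reachable(url) else "not reachable"
    print(f"{url} is {state}")
```

A successful connection shows only that a process is listening on that port, not that it is the expected backend. The `hybrid_fallback` parameter in the reference table covers failures that occur during extraction.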

Suppress Logging

loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True
)

RAG Pipeline Example

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the main topic?")

Parameters Reference

Parameter              Type             Default     Description
file_path              str | List[str]  (Required)  PDF file path(s) or directories
format                 str              "text"      Output format: "text", "markdown", "json", "html"
split_pages            bool             True        Split into separate Documents per page
quiet                  bool             False       Suppress console logging
password               str              None        Password for encrypted PDFs
use_struct_tree        bool             False       Use PDF structure tree (tagged PDFs)
table_method           str              "default"   "default" (border-based) or "cluster" (border + clustering)
reading_order          str              "xycut"     "xycut" or "off"
keep_line_breaks       bool             False       Preserve original line breaks
image_output           str              "off"       "off", "embedded" (Base64), or "external"
image_format           str              "png"       "png" or "jpeg"
image_dir              str              None        Directory for extracted images when using image_output="external"
sanitize               bool             False       Sanitize sensitive data (emails, phone numbers, IPs, credit cards, URLs)
pages                  str              None        Pages to extract (e.g., "1,3,5-7"). Default: all pages
include_header_footer  bool             False       Include page headers and footers in output
content_safety_off     List[str]        None        Disable safety filters: "hidden-text", "off-page", "tiny", "hidden-ocg", "all"
replace_invalid_chars  str              None        Replacement for invalid characters
hybrid                 str              None        Hybrid AI backend: "docling-fast". Requires running backend server
hybrid_mode            str              None        "auto" (route complex pages) or "full" (route all pages)
hybrid_url             str              None        Backend server URL. Default: http://localhost:5002
hybrid_timeout         str              None        Backend timeout in ms. Default: "30000"
hybrid_fallback        bool             False       Fall back to Java extraction on backend failure
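
Several parameters accept only a fixed set of strings, so a typo such as `format="md"` may be easier to catch before the loader is built. The sketch below copies the allowed values from the table into a small check. `check_loader_options` and `ALLOWED_VALUES` are hypothetical names, and how the loader itself reacts to an unlisted value is not covered here.

```python
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
    "hybrid_mode": {"auto", "full"},
}


def check_loader_options(**options) -> dict:
    """Raise ValueError for a value the table does not list; else return options."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Parameters without a fixed value set, and None defaults, pass through.
        if allowed is not None and value is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


options = check_loader_options(format="markdown", table_method="cluster", quiet=True)
print(options)
```

The returned dictionary can then be unpacked into the constructor, for example `OpenDataLoaderPDFLoader(file_path="doc.pdf", **options)`.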

Document Metadata

Each returned Document includes metadata:

doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

# When hybrid mode is active:
# {'source': 'document.pdf', 'format': 'text', 'page': 1, 'hybrid': 'docling-fast'}

When split_pages=False, the page key is omitted.
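
Code that builds source citations from this metadata should therefore tolerate a missing `page` key. A minimal sketch, with `format_citation` as a hypothetical helper:

```python
def format_citation(metadata: dict) -> str:
    """Build a short source label, adding the page only when the key is present."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    return f"{source}, p. {page}" if page is not None else source


print(format_citation({"source": "document.pdf", "format": "text", "page": 1}))
# document.pdf, p. 1
print(format_citation({"source": "document.pdf", "format": "text"}))
# document.pdf
```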

License

Apache License 2.0. See LICENSE for details.
