LangChain document loader for oxidize-pdf — fast, Rust-powered PDF parsing with element-disjoint RAG chunking
langchain-oxidize-pdf
LangChain document loader backed by oxidize-pdf, a fast Rust-powered PDF engine with first-class RAG chunking.
0.1.0 (2026-04-24): Requires oxidize-pdf>=0.4.3 (oxidize-pdf-core 2.5.5). First release. The sibling llama-index-readers-oxidize-pdf 0.1.0 shipped with shape-only tests that missed a quadratic accumulation bug in the underlying chunker; this loader ships from day one with the semantic regression suite (test_loader_disjoint.py) that guarantees the disjointness contract end-to-end.
Install
pip install langchain-oxidize-pdf
Usage
LangChain convention binds the file path to the loader instance and
uses lazy_load() as the primary entry point; load() is inherited
from BaseLoader as a convenience that materializes the iterator.
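The relationship between the two entry points can be sketched with a small stand-in class. SketchLoader below is hypothetical and is neither OxidizePdfLoader nor LangChain's BaseLoader; it yields plain strings instead of Document objects and only illustrates the convention that load() materializes whatever lazy_load() yields.

```python
from typing import Iterator, List


class SketchLoader:
    """Hypothetical stand-in showing the lazy_load()/load() split."""

    def __init__(self, file_path: str) -> None:
        # The file path is bound once, at construction time.
        self.file_path = file_path

    def lazy_load(self) -> Iterator[str]:
        # A generator: each item is produced only when the caller asks for it.
        for index in range(3):
            yield f"{self.file_path}: chunk {index}"

    def load(self) -> List[str]:
        # Convenience wrapper: drain the iterator into a list.
        return list(self.lazy_load())


loader = SketchLoader("paper.pdf")
first = next(loader.lazy_load())   # pulls a single item
everything = loader.load()         # pulls all items at once
```

Streaming code would iterate over lazy_load() directly; load() is the shortcut when the whole result fits comfortably in memory.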
RAG chunks (default)
from langchain_oxidize_pdf import OxidizePdfLoader
loader = OxidizePdfLoader("paper.pdf") # mode="rag" by default
documents = loader.load()
for doc in documents:
    print(doc.metadata["chunk_index"], doc.metadata["heading_context"])
    print(doc.page_content[:200])
Each Document carries:
| Field | Description |
|---|---|
| chunk_index | 0-based index within the document |
| page_numbers | list of 1-indexed pages covered by the chunk |
| element_types | list of semantic types detected (e.g. title, paragraph) |
| heading_context | nearest surrounding heading, or None |
| token_estimate | rough token count for budget planning |
| file_path / file_name / total_pages / pdf_version | source metadata |
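As one way to use token_estimate for budget planning, the sketch below keeps chunks in document order until the next one would exceed a token budget. Chunk is a hypothetical stand-in exposing the same page_content and metadata attributes as a Document, and take_within_budget is an illustrative helper, not part of this package; the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Chunk:
    """Hypothetical stand-in with the two attributes a Document exposes."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def take_within_budget(chunks: Iterable[Chunk], max_tokens: int) -> List[Chunk]:
    """Keep chunks in order until the next one would exceed max_tokens."""
    selected: List[Chunk] = []
    used = 0
    for chunk in chunks:
        # Chunks without an estimate are treated as free in this sketch.
        cost = chunk.metadata.get("token_estimate", 0)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    Chunk("Intro", {"chunk_index": 0, "token_estimate": 40}),
    Chunk("Methods", {"chunk_index": 1, "token_estimate": 70}),
    Chunk("Results", {"chunk_index": 2, "token_estimate": 50}),
]
picked = take_within_budget(chunks, max_tokens=120)
print([c.metadata["chunk_index"] for c in picked])  # prints [0, 1]
```

Because the helper only reads page_content and metadata, it should also accept real Document objects, though that has not been verified against the loader here.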
One document per page
loader = OxidizePdfLoader("paper.pdf", mode="pages")
for doc in loader.lazy_load():
    print(doc.metadata["page_number"], len(doc.page_content))
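Since lazy_load() yields documents one at a time, they can be consumed in fixed-size batches, for example before sending them to an embedding service. The batched helper below is a generic sketch built on itertools.islice; it is not part of this package, and the integer range stands in for the iterator a loader would return.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): any iterator works the same way.
pages = iter(range(5))
batches = list(batched(pages, 2))
print(batches)  # prints [[0, 1], [2, 3], [4]]
```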
Whole PDF as markdown
loader = OxidizePdfLoader("paper.pdf", mode="markdown")
[doc] = loader.load()
print(doc.page_content)
Adding caller metadata
loader = OxidizePdfLoader(
    "paper.pdf",
    extra_info={"source": "arxiv:2501.12345", "collection": "benchmarks"},
)
Keys in extra_info override base metadata (file_path, file_name,
total_pages, pdf_version) when they collide, so explicit caller intent wins.
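The collision rule behaves like a dictionary merge in which the caller's keys are applied last. The merge_metadata function below is a hypothetical illustration of that rule, not the loader's actual implementation, and the base values are invented for the example.

```python
from typing import Any, Dict, Optional


def merge_metadata(
    base: Dict[str, Any], extra_info: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return a new dict where extra_info keys win over base keys."""
    # Later unpacking overrides earlier keys; neither input is mutated.
    return {**base, **(extra_info or {})}


base = {"file_name": "paper.pdf", "total_pages": 12}
merged = merge_metadata(base, {"file_name": "renamed.pdf", "source": "arxiv:2501.12345"})
print(merged["file_name"])    # prints renamed.pdf
print(merged["total_pages"])  # prints 12
```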
Why oxidize-pdf
- Rust parser: fast on large PDFs, low memory footprint.
- Native RAG primitives: element-disjoint semantic chunking, element partitioning, heading-aware context — no post-processing needed. The disjointness contract (no chunk's text is a substring of another's; each source element appears in exactly one chunk) is enforced by regression tests in both this loader and the underlying bridge.
- CJK friendly: compact output for multibyte documents (see oxidize-pdf 2.5.4 subsetter fixes).
- Pure Python install: ships as a wheel for Linux/macOS/Windows via the oxidize-pdf package; no system dependencies.
- Real lazy loading: lazy_load() returns a generator, so large PDFs don't force every Document into memory upfront.
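The substring half of the disjointness contract mentioned in the list above can be spot-checked on any list of chunk texts. The function below is a minimal sketch and is not the project's test_loader_disjoint.py, whose contents are not shown here; it does not check the second half of the contract (each source element appearing in exactly one chunk), which needs element-level information.

```python
from typing import List, Sequence, Tuple


def find_substring_overlaps(texts: Sequence[str]) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where texts[i] is contained in texts[j], i != j."""
    overlaps: List[Tuple[int, int]] = []
    for i, inner in enumerate(texts):
        for j, outer in enumerate(texts):
            # Empty strings are skipped: they are trivially substrings.
            if i != j and inner and inner in outer:
                overlaps.append((i, j))
    return overlaps


disjoint = ["Intro text.", "Methods text.", "Results text."]
accumulating = ["A", "A B", "A B C"]  # each chunk repeats the previous ones

print(find_substring_overlaps(disjoint))      # prints []
print(find_substring_overlaps(accumulating))  # prints [(0, 1), (0, 2), (1, 2)]
```

The accumulating input mimics the kind of output an accumulation bug would produce, where each chunk contains all earlier text. The pairwise comparison is quadratic in the number of chunks, which is fine for a test but not for a hot path.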
Source
Part of oxidize-pdf-integrations, the ecosystem of integrations around oxidize-pdf. The Rust core and Python bridge live in oxidize-python.
License
MIT
File details
Details for the file langchain_oxidize_pdf-0.1.0.tar.gz.
File metadata
- Download URL: langchain_oxidize_pdf-0.1.0.tar.gz
- Upload date:
- Size: 4.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 67c7e834e4b109714f5e3e807120733643532f75588b58f616d5ef431588b20b |
| MD5 | 71b34c5c9d9c7ce1afebf9dbed4ec5e4 |
| BLAKE2b-256 | b154c1acd610b9b0e77d36dd61f58d15e32c6d558d2abdd90d9790e0f34ca531 |
Provenance
The following attestation bundles were made for langchain_oxidize_pdf-0.1.0.tar.gz:
Publisher: release-langchain.yml on bzsanti/oxidize-pdf-integrations
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langchain_oxidize_pdf-0.1.0.tar.gz
- Subject digest: 67c7e834e4b109714f5e3e807120733643532f75588b58f616d5ef431588b20b
- Sigstore transparency entry: 1575147840
- Sigstore integration time:
- Permalink: bzsanti/oxidize-pdf-integrations@213ce38bdc4c3e57beab7716c007c82070fa10bf
- Branch / Tag: refs/tags/langchain-v0.1.0
- Owner: https://github.com/bzsanti
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-langchain.yml@213ce38bdc4c3e57beab7716c007c82070fa10bf
- Trigger Event: push
File details
Details for the file langchain_oxidize_pdf-0.1.0-py3-none-any.whl.
File metadata
- Download URL: langchain_oxidize_pdf-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 520c8e7df95690acaa242433ed04e50c20873e221b1962d938a6c11603567125 |
| MD5 | f7e2519d5bddf7732b2db6c332a4f6b4 |
| BLAKE2b-256 | 309e115cadf2ea9dd89268e7f54fc3b647b195ecec128c18385ef45820d61b9e |
Provenance
The following attestation bundles were made for langchain_oxidize_pdf-0.1.0-py3-none-any.whl:
Publisher: release-langchain.yml on bzsanti/oxidize-pdf-integrations
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langchain_oxidize_pdf-0.1.0-py3-none-any.whl
- Subject digest: 520c8e7df95690acaa242433ed04e50c20873e221b1962d938a6c11603567125
- Sigstore transparency entry: 1575147876
- Sigstore integration time:
- Permalink: bzsanti/oxidize-pdf-integrations@213ce38bdc4c3e57beab7716c007c82070fa10bf
- Branch / Tag: refs/tags/langchain-v0.1.0
- Owner: https://github.com/bzsanti
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-langchain.yml@213ce38bdc4c3e57beab7716c007c82070fa10bf
- Trigger Event: push