Haystack converter for oxidize-pdf — fast, Rust-powered PDF parsing with element-disjoint RAG chunking, ByteStream-aware
haystack-oxidize-pdf
Haystack converter backed by oxidize-pdf, a fast Rust-powered PDF engine with first-class RAG chunking.
0.1.0 (2026-05-19): requires oxidize-pdf>=0.4.3 (oxidize-pdf-core 2.5.5+). Ships from day one with the full semantic regression suite (test_converter_disjoint.py, 6 tests) that guarantees the RAG-chunk disjointness contract end-to-end. Mirrors the discipline applied to langchain-oxidize-pdf 0.1.0 after the llama-index-readers-oxidize-pdf 0.1.0 → 0.1.1 incident, where shape-only tests missed a quadratic accumulation bug in the underlying chunker.
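
To illustrate why a shape-only test can miss an accumulation bug, here is a minimal sketch. It is not the package's test suite: both chunkers, `shape_ok`, and `is_disjoint` are hypothetical toy functions over a list of unique element strings, and "disjoint" is assumed here to mean that no element is emitted by more than one chunk.

```python
from collections import Counter


def chunk_disjoint(elements, size):
    """Toy chunker: each element lands in exactly one chunk."""
    return [elements[i:i + size] for i in range(0, len(elements), size)]


def chunk_accumulating(elements, size):
    """Toy buggy chunker: every chunk also carries all earlier elements."""
    return [elements[:i + size] for i in range(0, len(elements), size)]


def shape_ok(chunks):
    """Shape-only check: chunks exist and each is a non-empty list of strings."""
    return bool(chunks) and all(
        chunk and all(isinstance(e, str) for e in chunk) for chunk in chunks
    )


def is_disjoint(chunks):
    """Semantic check: no element appears in more than one chunk.

    Assumes the input elements are unique, as they are in this toy example.
    """
    counts = Counter(e for chunk in chunks for e in chunk)
    return all(n == 1 for n in counts.values())


elements = [f"element-{i}" for i in range(6)]
good = chunk_disjoint(elements, 2)
bad = chunk_accumulating(elements, 2)

print(shape_ok(good), shape_ok(bad))        # both pass the shape check
print(is_disjoint(good), is_disjoint(bad))  # only the first is disjoint
```

The buggy chunker emits 2 + 4 + 6 elements for a 6-element input, which is the quadratic growth a shape-only test does not look at.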
Install
pip install haystack-oxidize-pdf
Usage
The converter is a Haystack @component and can be dropped straight into a Pipeline. Sources accept paths (str / pathlib.Path) and ByteStream objects interchangeably.
Three modes
| Mode | Output | Use case |
|---|---|---|
| rag (default) | one Document per RAG chunk with chunk_index, page_numbers, element_types, heading_context, token_estimate | Vector-store ingestion for RAG |
| pages | one Document per page (plain text) with page_number | Page-level indexing or compatibility with PyPDFToDocument-style pipelines |
| markdown | one Document per source containing the whole PDF as markdown | Single-document export, no chunking |
RAG chunks (default)
from haystack_oxidize_pdf import OxidizePdfConverter

converter = OxidizePdfConverter()  # mode="rag"
result = converter.run(sources=["paper.pdf"])
for doc in result["documents"]:
    print(doc.meta["chunk_index"], doc.meta["heading_context"])
    print(doc.content[:200])
Each Document.meta for mode="rag" carries:
| Field | Description |
|---|---|
| chunk_index | 0-based index within the source (resets per source in batch mode) |
| page_numbers | list of 1-indexed pages covered by the chunk |
| element_types | list of semantic types detected (e.g. title, paragraph) |
| heading_context | "Section title > Subsection" string, prepended to content |
| token_estimate | conservative token-count estimate for chunk sizing |
| file_path, file_name, total_pages, pdf_version | source-level fields |
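
As one way to use token_estimate for chunk sizing, the sketch below keeps chunks in order until a token budget is reached. It is illustrative only: `Doc` is a hypothetical stand-in for a Haystack Document so the example runs without any dependencies, `take_within_budget` is not part of this package, and the metadata values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (content + meta only)."""
    content: str
    meta: dict = field(default_factory=dict)


def take_within_budget(docs, max_tokens):
    """Keep chunks in order until the summed token_estimate would exceed max_tokens."""
    selected, used = [], 0
    for doc in docs:
        cost = doc.meta["token_estimate"]
        if used + cost > max_tokens:
            break
        selected.append(doc)
        used += cost
    return selected, used


# Made-up chunks shaped like the mode="rag" metadata described above.
docs = [
    Doc("Intro text", {"chunk_index": 0, "page_numbers": [1], "token_estimate": 120}),
    Doc("Method text", {"chunk_index": 1, "page_numbers": [1, 2], "token_estimate": 200}),
    Doc("Results text", {"chunk_index": 2, "page_numbers": [3], "token_estimate": 150}),
]
selected, used = take_within_budget(docs, max_tokens=350)
print([d.meta["chunk_index"] for d in selected], used)
```

Because the estimate is described as conservative, a budget built on it should err on the side of leaving headroom rather than overflowing a context window.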
Pipeline integration
from haystack import Pipeline
from haystack_oxidize_pdf import OxidizePdfConverter
pipeline = Pipeline()
pipeline.add_component("converter", OxidizePdfConverter(mode="rag"))
# ...add embedder, writer, etc.
result = pipeline.run({"converter": {"sources": ["paper.pdf"]}})
ByteStream input
Unlike the LangChain and LlamaIndex loaders, which take only file paths, the Haystack converter accepts ByteStream objects natively, using PdfReader.from_bytes under the hood:
from haystack.dataclasses import ByteStream
from haystack_oxidize_pdf import OxidizePdfConverter

with open("paper.pdf", "rb") as f:
    stream = ByteStream(
        data=f.read(),
        mime_type="application/pdf",
        meta={"upstream_origin": "s3://bucket/key"},
    )
docs = OxidizePdfConverter().run(sources=[stream])["documents"]
# stream.meta is merged into each Document.meta (here: upstream_origin)
Batch sources with per-source metadata
docs = OxidizePdfConverter(mode="markdown").run(
    sources=["doc-a.pdf", "doc-b.pdf"],
    meta=[{"tag": "first"}, {"tag": "second"}],
)["documents"]
# docs[0].meta["tag"] == "first"
# docs[1].meta["tag"] == "second"
Or broadcast a single dict to every output document:
docs = OxidizePdfConverter().run(
    sources=["a.pdf", "b.pdf"],
    meta={"source_tag": "batch-A"},
)["documents"]
# all docs carry source_tag == "batch-A"
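
The two calling styles above (a list with one dict per source, or a single dict broadcast to all sources) can be reduced to one shape before conversion. The sketch below shows one plausible way to do that; `normalize_meta` is a hypothetical helper, not this package's code, and raising `ValueError` on a length mismatch is an assumption, since the behaviour for mismatched lists is not stated here.

```python
def normalize_meta(meta, sources_count):
    """Return one metadata dict per source (illustrative sketch only)."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # Broadcast: every source gets its own copy of the same dict.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumed behaviour: a per-source list must line up with the sources.
        raise ValueError("meta list must have one entry per source")
    return [dict(m) for m in meta]


per_source = normalize_meta([{"tag": "first"}, {"tag": "second"}], 2)
broadcast = normalize_meta({"source_tag": "batch-A"}, 2)
print(per_source)
print(broadcast)
```

Copying each dict keeps later per-document updates from leaking between sources that were given the same broadcast dict.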
Metadata precedence
Three layers, applied in this order; on a key collision the later layer wins:
- Base file-level fields (file_path, file_name, total_pages, pdf_version).
- Caller-supplied meta (overrides base fields by design, which lets callers re-label).
- Per-document fields (chunk_index, page_numbers, page_number), applied last and never overwritten.
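
The layering above behaves like a left-to-right dict merge. The following is a minimal sketch of that idea, not the package's implementation; `build_meta` and all values are hypothetical.

```python
def build_meta(base, caller, per_document):
    """Merge the three layers; later arguments override earlier ones."""
    return {**base, **caller, **per_document}


base = {
    "file_path": "/data/paper.pdf",
    "file_name": "paper.pdf",
    "total_pages": 12,
    "pdf_version": "1.7",
}
# The caller re-labels file_name and (pointlessly) tries to set chunk_index.
caller = {"file_name": "relabelled.pdf", "chunk_index": 99, "tag": "first"}
per_document = {"chunk_index": 0, "page_numbers": [1, 2]}

meta = build_meta(base, caller, per_document)
print(meta["file_name"], meta["chunk_index"], meta["tag"])
```

Here the caller's file_name replaces the base value, while the caller's chunk_index is discarded in favour of the per-document one.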
Related packages
- langchain-oxidize-pdf: same engine, LangChain BaseLoader interface.
- llama-index-readers-oxidize-pdf: same engine, LlamaIndex BaseReader interface.
- oxidize-pdf: the underlying PyO3 bridge (also ships the oxidize-mcp MCP server entry point).
- OxidizePdf.NET: .NET bindings.
- oxidize-pdf core (Rust): the Rust engine, 99.3% parse success on 9k+ real-world PDFs.
License
MIT.