Haystack converter for oxidize-pdf — fast, Rust-powered PDF parsing with element-disjoint RAG chunking, ByteStream-aware

Maintainers

bytaro

haystack-oxidize-pdf

Haystack converter backed by oxidize-pdf, a fast Rust-powered PDF engine with first-class RAG chunking.

0.1.0 (2026-05-19) — Requires oxidize-pdf>=0.4.3 (oxidize-pdf-core 2.5.5+). Ships from day one with the full semantic regression suite (test_converter_disjoint.py, 6 tests) that guarantees the RAG-chunk disjointness contract end-to-end. Mirrors the discipline applied to langchain-oxidize-pdf 0.1.0 after the llama-index-readers-oxidize-pdf 0.1.0 → 0.1.1 incident, where shape-only tests missed a quadratic accumulation bug in the underlying chunker.
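To make the difference between shape-only and semantic tests concrete, here is a minimal sketch of a semantic check. It assumes one possible reading of the disjointness contract (no chunk repeats text that an earlier chunk already emitted) and one possible form of an accumulation bug. The function name and sample strings are invented for illustration; this is not the code in test_converter_disjoint.py.

```python
def accumulates(chunks):
    """Return True if any chunk contains the full text of an earlier one.

    Assumed failure shape: a buggy chunker emits chunk k as the
    concatenation of chunks 0..k, so total output grows quadratically.
    """
    for i, later in enumerate(chunks):
        for earlier in chunks[:i]:
            if earlier and earlier in later:
                return True
    return False


# Invented sample outputs: same number of chunks, same types.
healthy = ["Intro text.", "Methods text.", "Results text."]
buggy = [
    "Intro text.",
    "Intro text. Methods text.",
    "Intro text. Methods text. Results text.",
]

# A shape-only check (count and type) cannot tell the two lists apart.
shape_ok = len(healthy) == len(buggy) and all(
    isinstance(c, str) for c in healthy + buggy
)

# A semantic check on the content does.
print(shape_ok, accumulates(healthy), accumulates(buggy))
```

Both lists pass the shape check; only the content comparison separates them, which is the kind of gap a semantic regression suite is meant to close.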

Install

pip install haystack-oxidize-pdf

Usage

The converter is a Haystack @component and can be dropped straight into a Pipeline. Sources accept paths (str / pathlib.Path) and ByteStream objects interchangeably.

Three modes

| Mode | Output | Use case |
| --- | --- | --- |
| `rag` (default) | one `Document` per RAG chunk with `chunk_index`, `page_numbers`, `element_types`, `heading_context`, `token_estimate` | Vector-store ingestion for RAG |
| `pages` | one `Document` per page (plain text) with `page_number` | Page-level indexing or compatibility with PyPDFToDocument-style pipelines |
| `markdown` | one `Document` per source containing the whole PDF as markdown | Single-document export, no chunking |

RAG chunks (default)

from haystack_oxidize_pdf import OxidizePdfConverter

converter = OxidizePdfConverter()  # mode="rag"
result = converter.run(sources=["paper.pdf"])

for doc in result["documents"]:
    print(doc.meta["chunk_index"], doc.meta["heading_context"])
    print(doc.content[:200])

Each Document.meta for mode="rag" carries:

| Field | Description |
| --- | --- |
| `chunk_index` | 0-based index within the source (resets per source in batch mode) |
| `page_numbers` | list of 1-indexed pages covered by the chunk |
| `element_types` | list of semantic types detected (e.g. `title`, `paragraph`) |
| `heading_context` | `"Section title > Subsection"` string, prepended to `content` |
| `token_estimate` | conservative token-count estimate for chunk sizing |
| `file_path`, `file_name`, `total_pages`, `pdf_version` | source-level fields |
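As an illustration of how these fields can be used downstream, the sketch below builds a page-to-chunk index from `page_numbers` and `chunk_index`. The sample dictionaries only stand in for `Document.meta` and their values are invented; `chunks_by_page` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# Invented sample metadata, shaped like Document.meta in mode="rag".
sample_metas = [
    {"chunk_index": 0, "page_numbers": [1], "token_estimate": 120},
    {"chunk_index": 1, "page_numbers": [1, 2], "token_estimate": 310},
    {"chunk_index": 2, "page_numbers": [2], "token_estimate": 95},
]


def chunks_by_page(metas):
    """Map each 1-indexed page number to the chunk indices that cover it."""
    index = defaultdict(list)
    for meta in metas:
        for page in meta["page_numbers"]:
            index[page].append(meta["chunk_index"])
    return dict(index)


page_index = chunks_by_page(sample_metas)
print(page_index)
```

A chunk that spans a page break appears under both pages, so a page-level lookup still finds it.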

Pipeline integration

from haystack import Pipeline
from haystack_oxidize_pdf import OxidizePdfConverter

pipeline = Pipeline()
pipeline.add_component("converter", OxidizePdfConverter(mode="rag"))
# ...add embedder, writer, etc.

result = pipeline.run({"converter": {"sources": ["paper.pdf"]}})

ByteStream input

Unlike the LangChain and LlamaIndex loaders, which take only file paths, the Haystack converter accepts ByteStream objects natively, using PdfReader.from_bytes under the hood:

from haystack.dataclasses import ByteStream
from haystack_oxidize_pdf import OxidizePdfConverter

with open("paper.pdf", "rb") as f:
    stream = ByteStream(data=f.read(), mime_type="application/pdf",
                        meta={"upstream_origin": "s3://bucket/key"})

docs = OxidizePdfConverter().run(sources=[stream])["documents"]
# stream.meta is merged into each Document.meta (here: upstream_origin)

Batch sources with per-source metadata

from haystack_oxidize_pdf import OxidizePdfConverter

# One meta dict per source, matched by position.
docs = OxidizePdfConverter(mode="markdown").run(
    sources=["doc-a.pdf", "doc-b.pdf"],
    meta=[{"tag": "first"}, {"tag": "second"}],
)["documents"]
# docs[0].meta["tag"] == "first"
# docs[1].meta["tag"] == "second"

Or broadcast a single dict to every output document:

from haystack_oxidize_pdf import OxidizePdfConverter

docs = OxidizePdfConverter().run(
    sources=["a.pdf", "b.pdf"], meta={"source_tag": "batch-A"}
)["documents"]
# all docs carry source_tag == "batch-A"
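The two call shapes above (a list with one dict per source, or a single dict broadcast to every source) can be pictured with the following sketch. `expand_meta` is a hypothetical helper written for illustration; it is not the converter's actual implementation, and the length check is an assumption about reasonable behavior.

```python
def expand_meta(meta, n_sources):
    """Return one metadata dict per source.

    Hypothetical helper: accepts None, a single dict (broadcast), or a
    list with exactly one dict per source.
    """
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # Copy so that later per-source edits do not leak across sources.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        raise ValueError("meta list must have one entry per source")
    return [dict(m) for m in meta]


per_source = expand_meta([{"tag": "first"}, {"tag": "second"}], 2)
broadcast = expand_meta({"source_tag": "batch-A"}, 2)
print(per_source, broadcast)
```

Copying the broadcast dict per source keeps the outputs independent, so changing one document's metadata later does not affect the others.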

Metadata precedence

Metadata is merged in three layers, and a later layer overrides an earlier one:

1. Base file-level fields (file_path, file_name, total_pages, pdf_version).
2. Caller-supplied meta (overrides base fields by design, so callers can re-label).
3. Per-document fields (chunk_index, page_numbers, page_number), applied last and never overwritten.
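The three layers behave like a left-to-right dictionary merge. The sketch below uses invented values and plain dicts to show the effect; it illustrates the precedence rule and is not taken from the converter's source.

```python
# Invented values for illustration.
base = {
    "file_path": "/data/paper.pdf",
    "file_name": "paper.pdf",
    "total_pages": 12,
    "pdf_version": "1.7",
}
caller = {"file_name": "relabelled.pdf", "chunk_index": 99, "tag": "first"}
per_document = {"chunk_index": 0, "page_numbers": [1]}

# Later mappings win on key collisions, so the unpacking order encodes
# the precedence: base < caller-supplied < per-document.
merged = {**base, **caller, **per_document}
print(merged)
```

Here the caller's `file_name` replaces the base value, while the caller's attempt to set `chunk_index` is overridden by the per-document value.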

Related packages

- langchain-oxidize-pdf: same engine, LangChain BaseLoader interface.
- llama-index-readers-oxidize-pdf: same engine, LlamaIndex BaseReader interface.
- oxidize-pdf: the underlying PyO3 bridge (also ships the oxidize-mcp MCP server entry point).
- OxidizePdf.NET: .NET bindings.
- oxidize-pdf core (Rust): the Rust engine, 99.3% parse success on 9k+ real-world PDFs.

License

MIT.

Download files

Source Distribution

haystack_oxidize_pdf-0.1.0.tar.gz (5.0 kB)

Uploaded May 19, 2026 Source

Built Distribution

haystack_oxidize_pdf-0.1.0-py3-none-any.whl (5.5 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file haystack_oxidize_pdf-0.1.0.tar.gz.

File metadata

Download URL: haystack_oxidize_pdf-0.1.0.tar.gz
Upload date: May 19, 2026
Size: 5.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for haystack_oxidize_pdf-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e6b3c79ccef19f27e65ff32d9365a1f8bf798437a6c6efc97b8918fedc9cdd5b` |
| MD5 | `97db2f82d8dd3844373fed8264d7f542` |
| BLAKE2b-256 | `ffd7cc90d6e4eb4b1e653b483c71c02d7c77b4b2c45849cbfdcc865e908b4fbc` |
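A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is one way to do it; the helper names are invented for illustration, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Digest published above for haystack_oxidize_pdf-0.1.0.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "e6b3c79ccef19f27e65ff32d9365a1f8bf798437a6c6efc97b8918fedc9cdd5b"
)


def matches_published_digest(path, expected=EXPECTED_SDIST_SHA256):
    """True if the file at `path` hashes to the expected SHA-256 digest."""
    return sha256_of(path) == expected
```

Calling `matches_published_digest("haystack_oxidize_pdf-0.1.0.tar.gz")` on the downloaded archive should return True if the file is intact. Reading in chunks keeps memory use flat regardless of file size.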

Provenance

The following attestation bundles were made for haystack_oxidize_pdf-0.1.0.tar.gz:

Publisher: release-haystack.yml on bzsanti/oxidize-pdf-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: haystack_oxidize_pdf-0.1.0.tar.gz
- Subject digest: e6b3c79ccef19f27e65ff32d9365a1f8bf798437a6c6efc97b8918fedc9cdd5b
- Sigstore transparency entry: 1575245774
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: bzsanti/oxidize-pdf-integrations@c96a1d697342437ceb1ccfad44b057b80542576d
- Branch / Tag: refs/tags/haystack-v0.1.0
- Owner: https://github.com/bzsanti
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-haystack.yml@c96a1d697342437ceb1ccfad44b057b80542576d
- Trigger Event: push

File details

Details for the file haystack_oxidize_pdf-0.1.0-py3-none-any.whl.

File metadata

Download URL: haystack_oxidize_pdf-0.1.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 5.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for haystack_oxidize_pdf-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `9ea0f0fe3ae762e71e88106cea7f6670ad0ac4149c96574e4e61916b47eb7f47` |
| MD5 | `e6c732d3c1627158779904b0ec9cd74c` |
| BLAKE2b-256 | `d3dca9faf9842e151b4812e17cdeeb0503fd30804caec89b9aa7be18bb06cbb7` |

Provenance

The following attestation bundles were made for haystack_oxidize_pdf-0.1.0-py3-none-any.whl:

Publisher: release-haystack.yml on bzsanti/oxidize-pdf-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: haystack_oxidize_pdf-0.1.0-py3-none-any.whl
- Subject digest: 9ea0f0fe3ae762e71e88106cea7f6670ad0ac4149c96574e4e61916b47eb7f47
- Sigstore transparency entry: 1575245796
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: bzsanti/oxidize-pdf-integrations@c96a1d697342437ceb1ccfad44b057b80542576d
- Branch / Tag: refs/tags/haystack-v0.1.0
- Owner: https://github.com/bzsanti
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-haystack.yml@c96a1d697342437ceb1ccfad44b057b80542576d
- Trigger Event: push
