Haystack converter for oxidize-pdf — fast, Rust-powered PDF parsing with element-disjoint RAG chunking, ByteStream-aware

Project description

haystack-oxidize-pdf

Haystack converter backed by oxidize-pdf, a fast Rust-powered PDF engine with first-class RAG chunking.

0.1.0 (2026-05-19) — Requires oxidize-pdf>=0.4.3 (oxidize-pdf-core 2.5.5+). Ships from day one with the full semantic regression suite (test_converter_disjoint.py, 6 tests) that guarantees the RAG-chunk disjointness contract end-to-end. Mirrors the discipline applied to langchain-oxidize-pdf 0.1.0 after the llama-index-readers-oxidize-pdf 0.1.0 → 0.1.1 incident, where shape-only tests missed a quadratic accumulation bug in the underlying chunker.

Install

pip install haystack-oxidize-pdf

Usage

The converter is a Haystack @component and can be dropped straight into a Pipeline. Sources accept paths (str / pathlib.Path) and ByteStream objects interchangeably.

Three modes

Mode | Output | Use case
rag (default) | one Document per RAG chunk with chunk_index, page_numbers, element_types, heading_context, token_estimate | Vector-store ingestion for RAG
pages | one Document per page (plain text) with page_number | Page-level indexing or compatibility with PyPDFToDocument-style pipelines
markdown | one Document per source containing the whole PDF as markdown | Single-document export, no chunking
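
The pages mode has no example elsewhere on this page, so here is a minimal sketch. It relies only on the constructor argument and the page_number field listed in the table; the helper name pages_as_text is hypothetical, and the import is deferred so the function can be defined even where the package is not installed.

```python
from pathlib import Path


def pages_as_text(pdf_path):
    """Return a {page_number: text} mapping built with mode="pages".

    Hypothetical helper: only OxidizePdfConverter, its mode argument and
    the page_number meta field come from the package description.
    """
    # Deferred import: the package is only needed when the helper runs.
    from haystack_oxidize_pdf import OxidizePdfConverter

    converter = OxidizePdfConverter(mode="pages")
    documents = converter.run(sources=[Path(pdf_path)])["documents"]
    return {doc.meta["page_number"]: doc.content for doc in documents}
```

Calling pages_as_text("paper.pdf") should give one entry per page, which is convenient when a downstream step expects page-level text rather than chunks.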

RAG chunks (default)

from haystack_oxidize_pdf import OxidizePdfConverter

converter = OxidizePdfConverter()  # mode="rag"
result = converter.run(sources=["paper.pdf"])

for doc in result["documents"]:
    print(doc.meta["chunk_index"], doc.meta["heading_context"])
    print(doc.content[:200])

Each Document.meta for mode="rag" carries:

Field | Description
chunk_index | 0-based index within the source (resets per source in batch mode)
page_numbers | list of 1-indexed pages covered by the chunk
element_types | list of semantic types detected (e.g. title, paragraph)
heading_context | "Section title > Subsection" string, prepended to content
token_estimate | conservative token-count estimate for chunk sizing
file_path, file_name, total_pages, pdf_version | source-level fields
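
As an illustration of how these fields can be consumed, the sketch below builds a page-to-chunk lookup from page_numbers and chunk_index. It works on plain dicts shaped like Document.meta; the function name and the sample values are made up for the example and do not come from the package.

```python
from collections import defaultdict


def chunks_by_page(metas):
    """Map each 1-indexed page number to the chunk_index values covering it."""
    index = defaultdict(list)
    for meta in metas:
        for page in meta["page_numbers"]:
            index[page].append(meta["chunk_index"])
    return dict(index)


# Hypothetical metadata for three chunks of a single source;
# the middle chunk spans a page break.
sample_metas = [
    {"chunk_index": 0, "page_numbers": [1]},
    {"chunk_index": 1, "page_numbers": [1, 2]},
    {"chunk_index": 2, "page_numbers": [2]},
]
page_index = chunks_by_page(sample_metas)
```

Because chunk_index resets per source in batch mode, a lookup spanning several PDFs would need to key on file_path as well.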

Pipeline integration

from haystack import Pipeline
from haystack_oxidize_pdf import OxidizePdfConverter

pipeline = Pipeline()
pipeline.add_component("converter", OxidizePdfConverter(mode="rag"))
# ...add embedder, writer, etc.

result = pipeline.run({"converter": {"sources": ["paper.pdf"]}})

ByteStream input

Unlike the LangChain / LlamaIndex loaders, which take only file paths, the Haystack converter accepts ByteStream objects natively, using PdfReader.from_bytes under the hood:

from haystack.dataclasses import ByteStream
from haystack_oxidize_pdf import OxidizePdfConverter

with open("paper.pdf", "rb") as f:
    stream = ByteStream(data=f.read(), mime_type="application/pdf",
                        meta={"upstream_origin": "s3://bucket/key"})

docs = OxidizePdfConverter().run(sources=[stream])["documents"]
# stream.meta is merged into each Document.meta (here: upstream_origin)

Batch sources with per-source metadata

docs = OxidizePdfConverter(mode="markdown").run(
    sources=["doc-a.pdf", "doc-b.pdf"],
    meta=[{"tag": "first"}, {"tag": "second"}],
)["documents"]
# docs[0].meta["tag"] == "first"
# docs[1].meta["tag"] == "second"

Or broadcast a single dict to every output document:

docs = OxidizePdfConverter().run(
    sources=["a.pdf", "b.pdf"], meta={"source_tag": "batch-A"}
)["documents"]
# all docs carry source_tag == "batch-A"
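
The two meta shapes above (one dict per source, or a single dict broadcast to all sources) can be pictured with the following pure-Python sketch. It is an assumption about the semantics, not the converter's actual code: the name expand_meta is hypothetical, and raising ValueError on a length mismatch is a guess at reasonable behaviour.

```python
def expand_meta(meta, n_sources):
    """Return one metadata dict per source.

    Illustrative only: a single dict is copied for every source, a list
    is used as-is, and None yields empty dicts.
    """
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # Broadcast: each source gets its own copy to avoid shared state.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        # Assumed behaviour: a per-source list must line up with sources.
        raise ValueError("meta list length must match the number of sources")
    return [dict(item) for item in meta]


broadcast = expand_meta({"source_tag": "batch-A"}, 2)
per_source = expand_meta([{"tag": "first"}, {"tag": "second"}], 2)
```

Copying each dict keeps a later mutation of one document's metadata from leaking into the others.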

Metadata precedence

Three layers, applied in order; when keys collide, the later layer wins:

  1. Base file-level fields (file_path, file_name, total_pages, pdf_version).
  2. Caller-supplied meta (overrides base fields by design, lets callers re-label).
  3. Per-document fields (chunk_index, page_numbers, page_number) applied last and never overwritten.
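
The layering above behaves like successive dictionary updates. The sketch below shows that ordering with plain dicts; merge_meta and the sample values are hypothetical and only illustrate the precedence rule, not the converter's internals.

```python
def merge_meta(base, caller, per_document):
    """Combine the three layers so that later layers override earlier ones."""
    merged = dict(base)          # 1. file-level fields
    merged.update(caller)        # 2. caller-supplied meta overrides base
    merged.update(per_document)  # 3. per-document fields are applied last
    return merged


merged = merge_meta(
    base={"file_name": "paper.pdf", "total_pages": 12},
    caller={"file_name": "relabelled.pdf", "chunk_index": 99},
    per_document={"chunk_index": 0, "page_numbers": [1]},
)
```

Here the caller successfully relabels file_name, while the caller's chunk_index is replaced by the per-document value, matching the rule that per-document fields are never overwritten.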

License

MIT.

Download files

Source Distribution

haystack_oxidize_pdf-0.1.0.tar.gz (5.0 kB view details)

Uploaded Source

Built Distribution

haystack_oxidize_pdf-0.1.0-py3-none-any.whl (5.5 kB view details)

Uploaded Python 3

File details

Details for the file haystack_oxidize_pdf-0.1.0.tar.gz.

File metadata

  • Download URL: haystack_oxidize_pdf-0.1.0.tar.gz
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for haystack_oxidize_pdf-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e6b3c79ccef19f27e65ff32d9365a1f8bf798437a6c6efc97b8918fedc9cdd5b
MD5 97db2f82d8dd3844373fed8264d7f542
BLAKE2b-256 ffd7cc90d6e4eb4b1e653b483c71c02d7c77b4b2c45849cbfdcc865e908b4fbc

Provenance

The following attestation bundles were made for haystack_oxidize_pdf-0.1.0.tar.gz:

Publisher: release-haystack.yml on bzsanti/oxidize-pdf-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file haystack_oxidize_pdf-0.1.0-py3-none-any.whl.

File hashes

Hashes for haystack_oxidize_pdf-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9ea0f0fe3ae762e71e88106cea7f6670ad0ac4149c96574e4e61916b47eb7f47
MD5 e6c732d3c1627158779904b0ec9cd74c
BLAKE2b-256 d3dca9faf9842e151b4812e17cdeeb0503fd30804caec89b9aa7be18bb06cbb7

Provenance

The following attestation bundles were made for haystack_oxidize_pdf-0.1.0-py3-none-any.whl:

Publisher: release-haystack.yml on bzsanti/oxidize-pdf-integrations
