
LlamaIndex Readers Kreuzberg


LlamaIndex reader for 88+ document formats powered by kreuzberg's Rust extraction engine.

Installation

pip install llama-index-readers-kreuzberg

Requires kreuzberg>=4.4.6 and llama-index-core>=0.13.0,<0.15.

Features

  • 88+ formats -- PDF, DOCX, PPTX, XLSX, HTML, images, and more (full list)
  • Rich metadata -- quality scores, language detection, keywords, annotations
  • Element extraction -- structural elements for structure-aware RAG pipelines
  • Image extraction -- base64-encoded image data with position, format, and OCR metadata
  • Per-page splitting -- one Document per page for fine-grained retrieval
  • Batch processing -- multiple files in a single call
  • Raw bytes input -- extract from in-memory bytes with a MIME type
  • Native async -- true async via kreuzberg's Rust tokio runtime
  • Error tolerance -- skip failed files with warnings, or raise on failure
  • Full serialization -- custom ExtractionConfig round-trips through to_dict()/from_dict() for pipeline caching

Usage

Basic Extraction

from llama_index.readers.kreuzberg import KreuzbergReader

reader = KreuzbergReader()
documents = reader.load_data("report.pdf")

# Each document carries rich metadata
print(documents[0].metadata["file_name"])       # "report.pdf"
print(documents[0].metadata["file_type"])        # "application/pdf"
print(documents[0].metadata["total_pages"])      # 12
print(documents[0].metadata["quality_score"])    # 0.95
print(documents[0].metadata["detected_languages"])  # ["en"]

OCR Configuration

force_ocr is a top-level ExtractionConfig option. Language and backend are set on OcrConfig.

from kreuzberg import ExtractionConfig, OcrConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(language="deu", backend="tesseract"),
    )
)
documents = reader.load_data("scanned.pdf")

Per-Page Splitting

PageConfig is nested inside ExtractionConfig. Each page becomes its own Document with a page_number metadata field.

from kreuzberg import ExtractionConfig, PageConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        pages=PageConfig(extract_pages=True),
    )
)
documents = reader.load_data("multi_page.pdf")  # One Document per page

for doc in documents:
    print(f"Page {doc.metadata['page_number']}: {doc.text[:80]}...")

Element Extraction

Setting result_format="element_based" populates _kreuzberg_elements in document metadata for structure-aware processing.

from kreuzberg import ExtractionConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")

# Structural elements available for downstream node parsers
elements = documents[0].metadata["_kreuzberg_elements"]

Batch Processing

Pass a list of file paths to extract multiple files in one call.

reader = KreuzbergReader()
documents = reader.load_data(["report.pdf", "slides.pptx", "data.xlsx"])

Raw Bytes

Use data= and mime_type= keyword arguments to extract from in-memory bytes.

reader = KreuzbergReader()

# Single bytes input
documents = reader.load_data(data=pdf_bytes, mime_type="application/pdf")

# Batch bytes input -- parallel lists of data and MIME types
documents = reader.load_data(
    data=[pdf_bytes, docx_bytes],
    mime_type=["application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
)

Async

aload_data provides native async extraction backed by kreuzberg's Rust runtime.

documents = await reader.aload_data(["file1.pdf", "file2.pdf"])

SimpleDirectoryReader Integration

Register KreuzbergReader as a file extractor for any supported extension.

from llama_index.core import SimpleDirectoryReader

reader = KreuzbergReader()
sdr = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": reader, ".docx": reader, ".html": reader},
)
documents = sdr.load_data()

# Async variant works too
documents = await sdr.aload_data()

Behavior Notes

  • Error tolerance: By default, raise_on_error=False -- the reader logs warnings and skips files that fail extraction.
  • Strict mode: Set raise_on_error=True to propagate extraction exceptions immediately.
  • Deterministic IDs: Document IDs are SHA-256 hashes of the file path (or byte content) and page number, enabling stable deduplication across pipeline runs.
  • Metadata exclusion: Large metadata fields (_kreuzberg_elements, images) are automatically excluded from LLM and embedding metadata keys to keep prompt sizes manageable.
  • Table handling: Tables extracted by kreuzberg are appended as markdown to the document text when they are not already present in the content.
  • Serialization: The reader fully supports to_dict()/from_dict() round-tripping, including ExtractionConfig with nested OcrConfig and PageConfig. This enables pipeline caching and persistence with IngestionPipeline.

Metadata Reference

Each Document produced by KreuzbergReader includes these metadata fields (when available from the source document):

Field                Type        Description
file_name            str         Source file name, or "bytes" for raw bytes input
file_path            str         Absolute path to the source file
file_type            str         MIME type of the source document
total_pages          int         Total page count of the source document
page_number          int         Page number (present only with per-page splitting)
quality_score        float       Extraction quality score (0.0 -- 1.0)
detected_languages   list[str]   ISO language codes detected in the text
output_format        str         Format of the extracted content ("text", "markdown", etc.)
extracted_keywords   list[dict]  Keywords with text, score, and algorithm
annotations          list[dict]  Document annotations (comments, highlights)
processing_warnings  list[dict]  Warnings encountered during extraction
_kreuzberg_elements  list        Structural elements (with result_format="element_based")
images               list[dict]  Base64-encoded images with position and format metadata
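Since several of these fields appear only when the source document provides them, reading them with dict.get() avoids KeyError on absent keys. A minimal sketch, using a stand-in dict in place of a real documents[0].metadata (the values shown are illustrative):

```python
# Stand-in for documents[0].metadata; real values come from extraction
metadata = {
    "file_name": "report.pdf",
    "file_type": "application/pdf",
    "quality_score": 0.95,
    "detected_languages": ["en"],
}

# Optional fields may be absent, so read them with sensible defaults
page = metadata.get("page_number")            # None without per-page splitting
langs = metadata.get("detected_languages", [])
score = metadata.get("quality_score", 0.0)

assert page is None
assert langs == ["en"]
```

The same pattern applies to extracted_keywords, annotations, and _kreuzberg_elements, which are present only when the corresponding extraction options produce data.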
