LlamaIndex Readers Kreuzberg
LlamaIndex reader for 88+ document formats powered by kreuzberg's Rust extraction engine.
Installation
pip install llama-index-readers-kreuzberg
Requires kreuzberg>=4.4.6 and llama-index-core>=0.13.0,<0.15.
Features
- 88+ formats -- PDF, DOCX, PPTX, XLSX, HTML, images, and more (full list)
- Rich metadata -- quality scores, language detection, keywords, annotations
- Element extraction -- structural elements for structure-aware RAG pipelines
- Image extraction -- base64-encoded image data with position, format, and OCR metadata
- Per-page splitting -- one Document per page for fine-grained retrieval
- Batch processing -- multiple files in a single call
- Raw bytes input -- extract from in-memory bytes with a MIME type
- Native async -- true async via kreuzberg's Rust tokio runtime
- Error tolerance -- skip failed files with warnings, or raise on failure
- Full serialization -- custom ExtractionConfig round-trips through to_dict()/from_dict() for pipeline caching
Usage
Basic Extraction
from llama_index.readers.kreuzberg import KreuzbergReader
reader = KreuzbergReader()
documents = reader.load_data("report.pdf")
# Each document carries rich metadata
print(documents[0].metadata["file_name"]) # "report.pdf"
print(documents[0].metadata["file_type"]) # "application/pdf"
print(documents[0].metadata["total_pages"]) # 12
print(documents[0].metadata["quality_score"]) # 0.95
print(documents[0].metadata["detected_languages"]) # ["en"]
OCR Configuration
force_ocr is a top-level ExtractionConfig option. Language and backend are set on OcrConfig.
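from kreuzberg import ExtractionConfig, OcrConfig
from llama_index.readers.kreuzberg import KreuzbergReader
reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        force_ocr=True,  # top-level option: always run OCR
        ocr=OcrConfig(language="deu", backend="tesseract"),  # German, Tesseract backend
    )
)
documents = reader.load_data("scanned.pdf")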
Per-Page Splitting
PageConfig is nested inside ExtractionConfig. Each page becomes its own Document
with a page_number metadata field.
from kreuzberg import ExtractionConfig, PageConfig
from llama_index.readers.kreuzberg import KreuzbergReader
reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        pages=PageConfig(extract_pages=True),
    )
)
documents = reader.load_data("multi_page.pdf")  # One Document per page
for doc in documents:
    print(f"Page {doc.metadata['page_number']}: {doc.text[:80]}...")
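When several files are split per page, the result is one flat list, so it can help to regroup pages by source file. The following is a minimal sketch that relies only on the file_path and page_number metadata fields described in this README. The helper name group_pages is hypothetical, and SimpleNamespace objects stand in for real Document instances so the snippet runs without kreuzberg installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_pages(documents):
    """Group per-page documents by source file, ordered by page number."""
    grouped = defaultdict(list)
    for doc in documents:
        # "<bytes>" is an illustrative fallback key, not a value the reader emits.
        grouped[doc.metadata.get("file_path", "<bytes>")].append(doc)
    for pages in grouped.values():
        pages.sort(key=lambda d: d.metadata.get("page_number", 0))
    return dict(grouped)


# Stand-ins with the same .text / .metadata shape as a Document.
docs = [
    SimpleNamespace(text="second", metadata={"file_path": "/tmp/a.pdf", "page_number": 2}),
    SimpleNamespace(text="first", metadata={"file_path": "/tmp/a.pdf", "page_number": 1}),
    SimpleNamespace(text="only", metadata={"file_path": "/tmp/b.pdf", "page_number": 1}),
]
by_file = group_pages(docs)
```

With real output, pass the list returned by reader.load_data(...) instead of docs.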
Element Extraction
Setting result_format="element_based" populates _kreuzberg_elements in document
metadata for structure-aware processing.
from kreuzberg import ExtractionConfig
from llama_index.readers.kreuzberg import KreuzbergReader
reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")
# Structural elements available for downstream node parsers
elements = documents[0].metadata["_kreuzberg_elements"]
Batch Processing
Pass a list of file paths to extract multiple files in one call.
reader = KreuzbergReader()
documents = reader.load_data(["report.pdf", "slides.pptx", "data.xlsx"])
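The list of paths is often built from a directory rather than typed by hand. The sketch below uses only the standard library to collect candidate files. The helper collect_files and its default suffix tuple are assumptions for illustration, not part of this package.

```python
from pathlib import Path


def collect_files(root, suffixes=(".pdf", ".docx", ".pptx", ".xlsx")):
    """Return sorted file paths under root whose suffix is in suffixes."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


# Usage with the reader from the example above (commented out so the
# snippet runs without kreuzberg installed):
# documents = reader.load_data(collect_files("./documents"))
```

Sorting the paths keeps the input order stable between runs.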
Raw Bytes
Use data= and mime_type= keyword arguments to extract from in-memory bytes.
reader = KreuzbergReader()
# Single bytes input
documents = reader.load_data(data=pdf_bytes, mime_type="application/pdf")
# Batch bytes input -- parallel lists of data and MIME types
documents = reader.load_data(
    data=[pdf_bytes, docx_bytes],
    mime_type=["application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
)
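Because the batch form takes two parallel lists, the data and MIME type entries must stay aligned. One way to keep them in step, sketched here with the standard library's mimetypes module, is to derive both lists from (file name, bytes) pairs. The helper build_bytes_batch is hypothetical, and guessing the MIME type from the file name is an assumption that only holds when the original names are known.

```python
import mimetypes


def build_bytes_batch(named_blobs):
    """Split (file_name, bytes) pairs into parallel data and MIME-type lists."""
    data, mime_types = [], []
    for name, blob in named_blobs:
        mime, _ = mimetypes.guess_type(name)
        if mime is None:
            raise ValueError(f"cannot infer a MIME type for {name!r}")
        data.append(blob)
        mime_types.append(mime)
    return data, mime_types


data, mime_types = build_bytes_batch(
    [("report.pdf", b"%PDF-1.7"), ("page.html", b"<html></html>")]
)
# documents = reader.load_data(data=data, mime_type=mime_types)
```

Raising on an unknown extension avoids silently passing a mismatched or empty MIME type to the reader.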
Async
aload_data provides native async extraction backed by kreuzberg's Rust runtime.
documents = await reader.aload_data(["file1.pdf", "file2.pdf"])
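A bare await only works where an event loop is already running, such as a notebook or another coroutine. In a plain script the call has to be wrapped in a coroutine and driven with asyncio.run. The sketch below shows that shape. _StubReader is a stand-in that only mimics the aload_data call so the snippet runs without kreuzberg; replace it with KreuzbergReader() in real use.

```python
import asyncio


async def load(reader, paths):
    """Await the reader's async extraction and return its documents."""
    return await reader.aload_data(paths)


class _StubReader:
    """Stand-in with the same aload_data shape; not part of the package."""

    async def aload_data(self, paths):
        return [f"doc:{p}" for p in paths]


documents = asyncio.run(load(_StubReader(), ["file1.pdf", "file2.pdf"]))
```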
SimpleDirectoryReader Integration
Register KreuzbergReader as a file extractor for any supported extension.
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.kreuzberg import KreuzbergReader
reader = KreuzbergReader()
sdr = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": reader, ".docx": reader, ".html": reader},
)
documents = sdr.load_data()
# Async variant works too (inside a coroutine or a notebook)
documents = await sdr.aload_data()
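When many extensions should go through the same reader, the file_extractor mapping can be generated instead of written out. This is a small sketch. build_file_extractor is a hypothetical helper, and a plain object() stands in for the reader so the snippet runs on its own.

```python
def build_file_extractor(reader, extensions):
    """Map each extension (lowercased, with a leading dot) to the same reader."""
    mapping = {}
    for ext in extensions:
        ext = ext.lower()
        if not ext.startswith("."):
            ext = "." + ext
        mapping[ext] = reader
    return mapping


shared_reader = object()  # stand-in; use KreuzbergReader() in real code
file_extractor = build_file_extractor(shared_reader, ["pdf", ".DOCX", ".html"])
```

The resulting dict can be passed as the file_extractor argument shown above.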
Behavior Notes
- Error tolerance: By default, raise_on_error=False -- the reader logs warnings and skips files that fail extraction.
- Strict mode: Set raise_on_error=True to propagate extraction exceptions immediately.
- Deterministic IDs: Document IDs are SHA-256 hashes of the file path (or byte content) and page number, enabling stable deduplication across pipeline runs.
- Metadata exclusion: Large metadata fields (_kreuzberg_elements, images) are automatically excluded from LLM and embedding metadata keys to keep prompt sizes manageable.
- Table handling: Tables extracted by kreuzberg are appended as markdown to the document text when they are not already present in the content.
- Serialization: The reader fully supports to_dict()/from_dict() round-tripping, including ExtractionConfig with nested OcrConfig and PageConfig. This enables pipeline caching and persistence with IngestionPipeline.
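To illustrate why hashed IDs are stable across runs, here is a minimal sketch of a deterministic ID built from a source and a page number with hashlib. It is not the reader's actual implementation: the function name stable_doc_id, the null-byte separator, and the way the page number is encoded are all assumptions.

```python
import hashlib


def stable_doc_id(source, page_number=None):
    """Hash a file path (str) or raw content (bytes) plus an optional page number."""
    hasher = hashlib.sha256()
    hasher.update(source if isinstance(source, bytes) else source.encode("utf-8"))
    # Assumed separator and page encoding; the reader's real scheme may differ.
    page = "" if page_number is None else str(page_number)
    hasher.update(b"\x00" + page.encode("ascii"))
    return hasher.hexdigest()


doc_id = stable_doc_id("/data/report.pdf", page_number=3)
```

The same inputs always yield the same 64-character hex digest, which is what makes deduplication by ID possible.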
Metadata Reference
Each Document produced by KreuzbergReader includes these metadata fields (when available from the source document):
| Field | Type | Description |
|---|---|---|
| file_name | str | Source file name or "bytes" for raw bytes input |
| file_path | str | Absolute path to the source file |
| file_type | str | MIME type of the source document |
| total_pages | int | Total page count of the source document |
| page_number | int | Page number (present only with per-page splitting) |
| quality_score | float | Extraction quality score (0.0 to 1.0) |
| detected_languages | list[str] | ISO language codes detected in the text |
| output_format | str | Format of the extracted content ("text", "markdown", etc.) |
| extracted_keywords | list[dict] | Keywords with text, score, and algorithm |
| annotations | list[dict] | Document annotations (comments, highlights) |
| processing_warnings | list[dict] | Warnings encountered during extraction |
| _kreuzberg_elements | list | Structural elements (with result_format="element_based") |
| images | list[dict] | Base64-encoded images with position and format metadata |
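Since these fields are present only when available, downstream code should not index them unconditionally. The sketch below builds a short summary from a metadata dict while tolerating missing keys. summarize_metadata is a hypothetical helper, and the input is a plain dict shaped like the fields listed above.

```python
def summarize_metadata(metadata):
    """Build a one-line summary, tolerating fields that are absent."""
    parts = [metadata.get("file_name", "<unknown>")]
    if "page_number" in metadata and "total_pages" in metadata:
        parts.append(f"page {metadata['page_number']}/{metadata['total_pages']}")
    elif "total_pages" in metadata:
        parts.append(f"{metadata['total_pages']} pages")
    if "quality_score" in metadata:
        parts.append(f"quality {metadata['quality_score']:.2f}")
    languages = metadata.get("detected_languages") or []
    if languages:
        parts.append("lang " + ",".join(languages))
    return " | ".join(parts)


summary = summarize_metadata(
    {
        "file_name": "report.pdf",
        "total_pages": 12,
        "quality_score": 0.95,
        "detected_languages": ["en"],
    }
)
# summary == "report.pdf | 12 pages | quality 0.95 | lang en"
```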