
LlamaIndex Readers Kreuzberg


LlamaIndex reader for 88+ document formats powered by kreuzberg's Rust extraction engine.

Installation

pip install llama-index-readers-kreuzberg

Requires kreuzberg>=4.4.6 and llama-index-core>=0.13.0,<0.15.

Features

  • 88+ formats -- PDF, DOCX, PPTX, XLSX, HTML, images, and more (full list)
  • Rich metadata -- quality scores, language detection, keywords, annotations
  • Element extraction -- structural elements for structure-aware RAG pipelines
  • Image extraction -- base64-encoded image data with position, format, and OCR metadata
  • Per-page splitting -- one Document per page for fine-grained retrieval
  • Batch processing -- multiple files in a single call
  • Raw bytes input -- extract from in-memory bytes with a MIME type
  • Native async -- true async via kreuzberg's Rust tokio runtime
  • Error tolerance -- skip failed files with warnings, or raise on failure
  • Full serialization -- custom ExtractionConfig round-trips through to_dict()/from_dict() for pipeline caching

Usage

Basic Extraction

from llama_index.readers.kreuzberg import KreuzbergReader

reader = KreuzbergReader()
documents = reader.load_data("report.pdf")

# Each document carries rich metadata
print(documents[0].metadata["file_name"])       # "report.pdf"
print(documents[0].metadata["file_type"])        # "application/pdf"
print(documents[0].metadata["total_pages"])      # 12
print(documents[0].metadata["quality_score"])    # 0.95
print(documents[0].metadata["detected_languages"])  # ["en"]

OCR Configuration

force_ocr is a top-level ExtractionConfig option. Language and backend are set on OcrConfig.

from kreuzberg import ExtractionConfig, OcrConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(language="deu", backend="tesseract"),
    )
)
documents = reader.load_data("scanned.pdf")

Per-Page Splitting

PageConfig is nested inside ExtractionConfig. Each page becomes its own Document with a page_number metadata field.

from kreuzberg import ExtractionConfig, PageConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        pages=PageConfig(extract_pages=True),
    )
)
documents = reader.load_data("multi_page.pdf")  # One Document per page

for doc in documents:
    print(f"Page {doc.metadata['page_number']}: {doc.text[:80]}...")

Element Extraction

Setting result_format="element_based" populates _kreuzberg_elements in document metadata for structure-aware processing.

from kreuzberg import ExtractionConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")

# Structural elements available for downstream node parsers
elements = documents[0].metadata["_kreuzberg_elements"]

Batch Processing

Pass a list of file paths to extract multiple files in one call.

reader = KreuzbergReader()
documents = reader.load_data(["report.pdf", "slides.pptx", "data.xlsx"])

Raw Bytes

Use data= and mime_type= keyword arguments to extract from in-memory bytes.

reader = KreuzbergReader()

# Single bytes input
documents = reader.load_data(data=pdf_bytes, mime_type="application/pdf")

# Batch bytes input -- parallel lists of data and MIME types
documents = reader.load_data(
    data=[pdf_bytes, docx_bytes],
    mime_type=["application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
)

Async

aload_data provides native async extraction backed by kreuzberg's Rust runtime.

documents = await reader.aload_data(["file1.pdf", "file2.pdf"])

SimpleDirectoryReader Integration

Register KreuzbergReader as a file extractor for any supported extension.

from llama_index.core import SimpleDirectoryReader

reader = KreuzbergReader()
sdr = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": reader, ".docx": reader, ".html": reader},
)
documents = sdr.load_data()

# Async variant works too
documents = await sdr.aload_data()

Behavior Notes

  • Error tolerance: By default, raise_on_error=False -- the reader logs warnings and skips files that fail extraction.
  • Strict mode: Set raise_on_error=True to propagate extraction exceptions immediately.
  • Deterministic IDs: Document IDs are SHA-256 hashes of the file path (or byte content) and page number, enabling stable deduplication across pipeline runs.
  • Metadata exclusion: Large metadata fields (_kreuzberg_elements, images) are automatically excluded from LLM and embedding metadata keys to keep prompt sizes manageable.
  • Table handling: Tables extracted by kreuzberg are appended as markdown to the document text when they are not already present in the content.
  • Serialization: The reader fully supports to_dict()/from_dict() round-tripping, including ExtractionConfig with nested OcrConfig and PageConfig. This enables pipeline caching and persistence with IngestionPipeline.

Metadata Reference

Each Document produced by KreuzbergReader includes these metadata fields (when available from the source document):

Field                Type        Description
file_name            str         Source file name, or "bytes" for raw bytes input
file_path            str         Absolute path to the source file
file_type            str         MIME type of the source document
total_pages          int         Total page count of the source document
page_number          int         Page number (present only with per-page splitting)
quality_score        float       Extraction quality score (0.0 -- 1.0)
detected_languages   list[str]   ISO language codes detected in the text
output_format        str         Format of the extracted content ("text", "markdown", etc.)
extracted_keywords   list[dict]  Keywords with text, score, and algorithm
annotations          list[dict]  Document annotations (comments, highlights)
processing_warnings  list[dict]  Warnings encountered during extraction
_kreuzberg_elements  list        Structural elements (with result_format="element_based")
images               list[dict]  Base64-encoded images with position and format metadata
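Since several of these fields appear only when the source document provides them, reading them with dict.get() avoids KeyError on absent keys. A minimal sketch, using a stand-in dict in place of a real documents[0].metadata (the values shown are illustrative):

```python
# Stand-in for documents[0].metadata; real values come from extraction
metadata = {
    "file_name": "report.pdf",
    "file_type": "application/pdf",
    "quality_score": 0.95,
    "detected_languages": ["en"],
}

# Optional fields may be absent, so read them with sensible defaults
page = metadata.get("page_number")            # None without per-page splitting
langs = metadata.get("detected_languages", [])
score = metadata.get("quality_score", 0.0)

assert page is None
assert langs == ["en"]
```

The same pattern applies to extracted_keywords, annotations, and _kreuzberg_elements, which are present only when the corresponding extraction options produce data.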
