
Project description

LlamaIndex Readers Kreuzberg

Kreuzberg Banner

LlamaIndex reader for 88+ document formats powered by kreuzberg's Rust extraction engine.

Installation

pip install llama-index-readers-kreuzberg

Requires kreuzberg>=4.4.6 and llama-index-core>=0.13.0,<0.15.

Features

  • 88+ formats -- PDF, DOCX, PPTX, XLSX, HTML, images, and more
  • Rich metadata -- quality scores, language detection, keywords, annotations
  • Element extraction -- structural elements for structure-aware RAG pipelines
  • Image extraction -- base64-encoded image data with position, format, and OCR metadata
  • Per-page splitting -- one Document per page for fine-grained retrieval
  • Batch processing -- multiple files in a single call
  • Raw bytes input -- extract from in-memory bytes with a MIME type
  • Native async -- true async via kreuzberg's Rust tokio runtime
  • Error tolerance -- skip failed files with warnings, or raise on failure
  • Full serialization -- custom ExtractionConfig round-trips through to_dict()/from_dict() for pipeline caching

Usage

Basic Extraction

from llama_index.readers.kreuzberg import KreuzbergReader

reader = KreuzbergReader()
documents = reader.load_data("report.pdf")

# Each document carries rich metadata
print(documents[0].metadata["file_name"])           # "report.pdf"
print(documents[0].metadata["file_type"])           # "application/pdf"
print(documents[0].metadata["total_pages"])         # 12
print(documents[0].metadata["quality_score"])       # 0.95
print(documents[0].metadata["detected_languages"])  # ["en"]

OCR Configuration

force_ocr is a top-level ExtractionConfig option. Language and backend are set on OcrConfig.

from kreuzberg import ExtractionConfig, OcrConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(language="deu", backend="tesseract"),
    )
)
documents = reader.load_data("scanned.pdf")

Per-Page Splitting

PageConfig is nested inside ExtractionConfig. Each page becomes its own Document with a page_number metadata field.

from kreuzberg import ExtractionConfig, PageConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(
        pages=PageConfig(extract_pages=True),
    )
)
documents = reader.load_data("multi_page.pdf")  # One Document per page

for doc in documents:
    print(f"Page {doc.metadata['page_number']}: {doc.text[:80]}...")

Element Extraction

Setting result_format="element_based" populates _kreuzberg_elements in document metadata for structure-aware processing.

from kreuzberg import ExtractionConfig

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")

# Structural elements available for downstream node parsers
elements = documents[0].metadata["_kreuzberg_elements"]

Batch Processing

Pass a list of file paths to extract multiple files in one call.

reader = KreuzbergReader()
documents = reader.load_data(["report.pdf", "slides.pptx", "data.xlsx"])

Raw Bytes

Use data= and mime_type= keyword arguments to extract from in-memory bytes.

reader = KreuzbergReader()

# Single bytes input
documents = reader.load_data(data=pdf_bytes, mime_type="application/pdf")

# Batch bytes input -- parallel lists of data and MIME types
documents = reader.load_data(
    data=[pdf_bytes, docx_bytes],
    mime_type=["application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
)

Async

aload_data provides native async extraction backed by kreuzberg's Rust runtime.

documents = await reader.aload_data(["file1.pdf", "file2.pdf"])

SimpleDirectoryReader Integration

Register KreuzbergReader as a file extractor for any supported extension.

from llama_index.core import SimpleDirectoryReader

reader = KreuzbergReader()
sdr = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": reader, ".docx": reader, ".html": reader},
)
documents = sdr.load_data()

# Async variant works too
documents = await sdr.aload_data()

Behavior Notes

  • Error tolerance: By default, raise_on_error=False -- the reader logs warnings and skips files that fail extraction.
  • Strict mode: Set raise_on_error=True to propagate extraction exceptions immediately.
  • Deterministic IDs: Document IDs are SHA-256 hashes of the file path (or byte content) and page number, enabling stable deduplication across pipeline runs.
  • Metadata exclusion: Large metadata fields (_kreuzberg_elements, images) are automatically excluded from the metadata passed to LLMs and embedding models, keeping prompt and embedding inputs manageable.
  • Table handling: Tables extracted by kreuzberg are appended as markdown to the document text when they are not already present in the content.
  • Serialization: The reader fully supports to_dict()/from_dict() round-tripping, including ExtractionConfig with nested OcrConfig and PageConfig. This enables pipeline caching and persistence with IngestionPipeline.
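The deterministic-ID behavior above can be illustrated with a small stdlib-only sketch. Note that `stable_doc_id` is a hypothetical helper written for illustration; the reader's actual hashing scheme is internal to the package, and this only mirrors the stated idea (SHA-256 over the path or byte content plus the page number).

```python
import hashlib

def stable_doc_id(source, page_number=None):
    """Illustrative only: derive a deterministic ID from a file path (or raw
    bytes) plus an optional page number, in the spirit of the reader's
    SHA-256-based document IDs."""
    h = hashlib.sha256()
    # Accept either a path string or raw bytes as the identity source.
    h.update(source if isinstance(source, bytes) else source.encode("utf-8"))
    if page_number is not None:
        h.update(str(page_number).encode("utf-8"))
    return h.hexdigest()

# The same inputs always hash to the same ID, so re-running a pipeline
# yields stable document identities that deduplication can rely on.
assert stable_doc_id("report.pdf", 1) == stable_doc_id("report.pdf", 1)
# Different pages of the same file get distinct IDs.
assert stable_doc_id("report.pdf", 1) != stable_doc_id("report.pdf", 2)
```

Stable IDs matter mainly for IngestionPipeline caching: if a document's ID changes between runs, it is re-embedded even when its content did not change.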

Metadata Reference

Each Document produced by KreuzbergReader includes these metadata fields (when available from the source document):

  • file_name (str) -- Source file name, or "bytes" for raw bytes input
  • file_path (str) -- Absolute path to the source file
  • file_type (str) -- MIME type of the source document
  • total_pages (int) -- Total page count of the source document
  • page_number (int) -- Page number (present only with per-page splitting)
  • quality_score (float) -- Extraction quality score (0.0 -- 1.0)
  • detected_languages (list[str]) -- ISO language codes detected in the text
  • output_format (str) -- Format of the extracted content ("text", "markdown", etc.)
  • extracted_keywords (list[dict]) -- Keywords with text, score, and algorithm
  • annotations (list[dict]) -- Document annotations (comments, highlights)
  • processing_warnings (list[dict]) -- Warnings encountered during extraction
  • _kreuzberg_elements (list) -- Structural elements (with result_format="element_based")
  • images (list[dict]) -- Base64-encoded images with position and format metadata
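Since several of these fields are only present when available from the source document, reading them defensively with dict.get avoids KeyError. A minimal sketch, using a stand-in dict in place of documents[0].metadata (field names follow the reference above):

```python
# Stand-in for documents[0].metadata on a non-split PDF extraction.
doc_metadata = {
    "file_name": "report.pdf",
    "file_type": "application/pdf",
    "total_pages": 12,
    "quality_score": 0.95,
}

# Optional fields (page_number, detected_languages, ...) may be absent,
# so prefer .get() with a default over direct indexing.
page = doc_metadata.get("page_number")                 # None without per-page splitting
languages = doc_metadata.get("detected_languages", []) # empty list when not detected
```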

Download files

Source Distribution: llama_index_readers_kreuzberg-0.1.1.tar.gz (10.7 kB)

Built Distribution: llama_index_readers_kreuzberg-0.1.1-py3-none-any.whl (12.5 kB)

File details

Details for the file llama_index_readers_kreuzberg-0.1.1.tar.gz:

  • SHA256: bce80572687cde7fc1f286a1737a9ed1e7d83955d6ba0cf359b5a8293255c03d
  • MD5: 38341116af2fab27be2c30f939467820
  • BLAKE2b-256: 6b158354e9a4fe40d43a92a73c18b9161af4e2b8b905f3ae6a6b10256dc20303

Provenance: attestation bundle published by publish-reader.yaml on kreuzberg-dev/llama-index-kreuzberg. Values shown reflect the state when the release was signed and may no longer be current.

Details for the file llama_index_readers_kreuzberg-0.1.1-py3-none-any.whl:

  • SHA256: 3072d750eb512c91a0ac605e0371a88222862e4fc319686224d3f6fbd5e109a2
  • MD5: e0b29586c3330af4d4ea8c193e49b87f
  • BLAKE2b-256: 1c9070326293bb01e1a073e22018f965e6558400280be0711c358c73d127c1cc

Provenance: attestation bundle published by publish-reader.yaml on kreuzberg-dev/llama-index-kreuzberg. Values shown reflect the state when the release was signed and may no longer be current.
