
LlamaIndex Node Parser Kreuzberg


Element-aware LlamaIndex node parser for kreuzberg-extracted documents.

Installation

pip install llama-index-node-parser-kreuzberg

Requires llama-index-core>=0.13.0,<0.15. This package does not depend on kreuzberg directly; kreuzberg is instead a dependency of the reader package (llama-index-readers-kreuzberg), which produces the documents with element metadata that this parser consumes.

Prerequisites

This parser requires documents with _kreuzberg_elements metadata. These are produced by KreuzbergReader configured with element-based extraction. Install llama-index-readers-kreuzberg (which brings in kreuzberg) to use the full workflow.

from kreuzberg import ExtractionConfig
from llama_index.readers.kreuzberg import KreuzbergReader

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")
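The parser reads per-element metadata from each document. A hedged sketch of what that metadata might look like, using plain Python dicts: the exact schema of _kreuzberg_elements is an assumption here; only the key name itself and the field names element_type, page_number, and element_index (from the feature list below) come from this page.

```python
# Hypothetical sketch of per-element metadata; the real schema produced
# by KreuzbergReader may differ.
elements = [
    {"element_type": "heading", "page_number": 1, "element_index": 0, "text": "Q3 Report"},
    {"element_type": "paragraph", "page_number": 1, "element_index": 1, "text": "Revenue grew 12%."},
    {"element_type": "table", "page_number": 2, "element_index": 2, "text": "| region | total |"},
]

def has_element_metadata(metadata: dict) -> bool:
    """True if a document's metadata carries extracted elements."""
    return bool(metadata.get("_kreuzberg_elements"))

document_metadata = {"_kreuzberg_elements": elements}
print(has_element_metadata(document_metadata))  # True
```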

Features

  • Element-aware splitting — headings, paragraphs, tables, and code blocks each become a node
  • Element type metadata preserved on each node (element_type, page_number, element_index)
  • Source document relationships tracked via NodeRelationship.SOURCE
  • Graceful degradation — documents without elements pass through with a warning
  • Composes with other transformations (e.g., SentenceSplitter for further chunking)
  • Async support via aget_nodes_from_documents
  • Serialization support (to_dict / from_dict)
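The element-aware splitting and empty-element skipping described above can be modeled in a few lines of plain Python. This is a simplified sketch of the behavior, not the package's actual implementation:

```python
def split_elements(elements: list[dict]) -> list[dict]:
    """Simplified model: one node per non-empty element, with the
    element's type, page, and index preserved as node metadata."""
    nodes = []
    for i, element in enumerate(elements):
        text = element.get("text", "")
        if not text.strip():
            continue  # empty or whitespace-only elements are skipped
        nodes.append({
            "text": text,
            "metadata": {
                "element_type": element.get("element_type"),
                "page_number": element.get("page_number"),
                "element_index": i,
            },
        })
    return nodes

nodes = split_elements([
    {"element_type": "heading", "page_number": 1, "text": "Overview"},
    {"element_type": "paragraph", "page_number": 1, "text": "   "},  # skipped
    {"element_type": "table", "page_number": 2, "text": "| a | b |"},
])
print(len(nodes))  # 2
```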

Usage

Basic

Full reader-to-nodes flow:

from kreuzberg import ExtractionConfig
from llama_index.readers.kreuzberg import KreuzbergReader
from llama_index.node_parser.kreuzberg import KreuzbergNodeParser

reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")

parser = KreuzbergNodeParser()
nodes = parser.get_nodes_from_documents(documents)
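Each resulting node carries the element metadata listed in Features, which can drive downstream filtering. A sketch using a minimal stand-in class (real LlamaIndex TextNodes expose a metadata dict the same way; the filter helper is illustrative, not part of this package):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a LlamaIndex TextNode."""
    text: str
    metadata: dict = field(default_factory=dict)

def filter_by_element_type(nodes, element_type):
    """Keep only nodes whose element_type metadata matches."""
    return [n for n in nodes if n.metadata.get("element_type") == element_type]

parsed = [
    Node("Quarterly results", {"element_type": "heading", "page_number": 1}),
    Node("| region | total |", {"element_type": "table", "page_number": 2}),
]
tables = filter_by_element_type(parsed, "table")
print(len(tables))  # 1
```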

IngestionPipeline

Chain with SentenceSplitter for further chunking of large elements:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        KreuzbergNodeParser(),
        SentenceSplitter(chunk_size=512),  # Further split large elements
    ]
)
nodes = pipeline.run(documents=documents)

VectorStoreIndex

Using the transformations parameter:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    transformations=[KreuzbergNodeParser()],
)

Async

Inside an async context (for example, within an async def coroutine):

nodes = await parser.aget_nodes_from_documents(documents)

Behavior Notes

  • Documents without _kreuzberg_elements metadata pass through unchanged with a warning. This is intentional — silently falling back would prevent users from noticing they are not getting element-aware splitting.
  • Empty or whitespace-only elements are automatically skipped.
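Both behaviors can be sketched together in plain Python. This is a simplified model of the pass-through and skipping rules described above, using plain dicts in place of LlamaIndex documents:

```python
import warnings

def parse(documents: list[dict]) -> list[dict]:
    """Simplified model: split documents that carry element metadata,
    and pass others through unchanged with a warning."""
    out = []
    for doc in documents:
        elements = doc.get("metadata", {}).get("_kreuzberg_elements")
        if not elements:
            warnings.warn("document lacks _kreuzberg_elements; passing through unchanged")
            out.append(doc)
            continue
        for element in elements:
            text = element.get("text", "")
            if text.strip():  # empty or whitespace-only elements are skipped
                out.append({
                    "text": text,
                    "metadata": {"element_type": element.get("element_type")},
                })
    return out
```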
