LlamaIndex Node Parser Kreuzberg

Element-aware LlamaIndex node parser for kreuzberg-extracted documents.
Installation
pip install llama-index-node-parser-kreuzberg
Requires llama-index-core>=0.13.0,<0.15. This package does not depend on
kreuzberg directly — the kreuzberg package is a dependency of the reader
(llama-index-readers-kreuzberg), which is needed for producing documents with
element metadata.
Prerequisites
This parser requires documents with `_kreuzberg_elements` metadata. These are produced by `KreuzbergReader` configured with element-based extraction. Install `llama-index-readers-kreuzberg` (which brings in `kreuzberg`) to use the full workflow.
from kreuzberg import ExtractionConfig
from llama_index.readers.kreuzberg import KreuzbergReader
reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")
Features
- Element-aware splitting — headings, paragraphs, tables, and code blocks each become a node
- Element type metadata preserved on each node (`element_type`, `page_number`, `element_index`)
- Source document relationships tracked via `NodeRelationship.SOURCE`
- Graceful degradation — documents without elements pass through with a warning
- Composes with other transformations (e.g., `SentenceSplitter` for further chunking)
- Async support via `aget_nodes_from_documents`
- Serialization support (`to_dict` / `from_dict`)
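To make the element-aware splitting concrete, here is a minimal pure-Python sketch of the idea. The dict shapes and helper name are illustrative, not the package's actual internals (the real parser consumes LlamaIndex `Document` objects and emits `TextNode`s): each extracted element becomes one chunk, whitespace-only elements are skipped, and element metadata travels with the text.

```python
# Illustrative sketch only; the real parser works on LlamaIndex
# Document/TextNode objects, not plain dicts.
def split_elements(elements):
    """Turn extracted elements into chunk dicts, one per element."""
    nodes = []
    for index, element in enumerate(elements):
        text = element.get("text", "")
        if not text.strip():  # whitespace-only elements are skipped
            continue
        nodes.append({
            "text": text,
            "metadata": {
                "element_type": element.get("type"),
                "page_number": element.get("page"),
                "element_index": index,
            },
        })
    return nodes

elements = [
    {"type": "heading", "page": 1, "text": "Quarterly Report"},
    {"type": "paragraph", "page": 1, "text": "Revenue grew 12%."},
    {"type": "paragraph", "page": 2, "text": "   "},  # skipped
]
nodes = split_elements(elements)
```

Because each chunk keeps its `element_type` and `page_number`, downstream retrieval can filter or cite by element kind and page.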
Usage
Basic
Full reader-to-nodes flow:
from kreuzberg import ExtractionConfig
from llama_index.readers.kreuzberg import KreuzbergReader
from llama_index.node_parser.kreuzberg import KreuzbergNodeParser
reader = KreuzbergReader(
    extraction_config=ExtractionConfig(result_format="element_based")
)
documents = reader.load_data("report.pdf")
parser = KreuzbergNodeParser()
nodes = parser.get_nodes_from_documents(documents)
IngestionPipeline
Chain with SentenceSplitter for further chunking of large elements:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
pipeline = IngestionPipeline(
    transformations=[
        KreuzbergNodeParser(),
        SentenceSplitter(chunk_size=512),  # Further split large elements
    ]
)
nodes = pipeline.run(documents=documents)
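The pipeline above is essentially function composition over node lists: each transformation consumes the previous stage's output. A stdlib-only sketch of that contract (the function names and stand-in splitters here are illustrative, not the library's implementations):

```python
def run_pipeline(transformations, items):
    """Apply each transformation to the output of the previous one."""
    for transform in transformations:
        items = transform(items)
    return items

# Stage 1: one chunk per element (stand-in for KreuzbergNodeParser)
def element_split(docs):
    return [chunk for doc in docs for chunk in doc.split("\n\n")]

# Stage 2: cap chunk size (stand-in for SentenceSplitter)
def size_split(chunks, limit=20):
    return [c[i:i + limit] for c in chunks for i in range(0, len(c), limit)]

docs = ["Heading\n\nA paragraph that is fairly long."]
nodes = run_pipeline([element_split, size_split], docs)
```

Ordering matters: element splitting runs first so that the size-based splitter only subdivides individual elements, never merges text across element boundaries.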
VectorStoreIndex
Using the transformations parameter:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[KreuzbergNodeParser()],
)
Async
nodes = await parser.aget_nodes_from_documents(documents)
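`aget_nodes_from_documents` follows the usual asyncio pattern, so parsing can be awaited alongside other I/O. A stdlib-only sketch of that shape (the coroutine here is a stand-in, not the parser's implementation):

```python
import asyncio

async def aparse(doc):
    """Stand-in for an async parse step; yields one node per non-empty line."""
    await asyncio.sleep(0)  # cooperative yield, as real I/O would
    return [line for line in doc.splitlines() if line.strip()]

async def main(docs):
    # Parse all documents concurrently and flatten the results in order.
    results = await asyncio.gather(*(aparse(d) for d in docs))
    return [node for nodes in results for node in nodes]

nodes = asyncio.run(main(["a\nb", "c"]))
```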
Behavior Notes
- Documents without `_kreuzberg_elements` metadata pass through unchanged with a warning. This is intentional — silently falling back would prevent users from noticing they are not getting element-aware splitting.
- Empty or whitespace-only elements are automatically skipped.
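That pass-through-with-warning behavior can be sketched in plain Python (the function name and metadata layout here are illustrative; only the `_kreuzberg_elements` key comes from the package's documented behavior):

```python
import warnings

def parse_documents(documents):
    """Split documents that carry element metadata; warn and pass through otherwise."""
    nodes = []
    for doc in documents:
        elements = doc.get("metadata", {}).get("_kreuzberg_elements")
        if not elements:
            warnings.warn("document has no _kreuzberg_elements metadata; passing through")
            nodes.append(doc)  # unchanged pass-through
            continue
        for element in elements:
            if not element["text"].strip():
                continue  # whitespace-only elements are skipped
            nodes.append({"text": element["text"]})
    return nodes

docs = [
    {"text": "plain", "metadata": {}},
    {"text": "rich", "metadata": {"_kreuzberg_elements": [{"text": "h1"}, {"text": " "}]}},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    nodes = parse_documents(docs)
```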
File details
Details for the file llama_index_node_parser_kreuzberg-0.1.0.tar.gz.
File metadata
- Download URL: llama_index_node_parser_kreuzberg-0.1.0.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6941aacb7e44ff8119ab6c8566984623b0eb5e1abff8b47b975bf5050ac39b86 |
| MD5 | c3af80166a425e9aa736604b7eb64849 |
| BLAKE2b-256 | 1be8debc70a91133ffaf5218a5c439c7a65188c5063134631ce94a79d88e50ed |
Provenance

The following attestation bundles were made for llama_index_node_parser_kreuzberg-0.1.0.tar.gz:

Publisher: publish-node-parser.yaml on kreuzberg-dev/llama-index-kreuzberg

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llama_index_node_parser_kreuzberg-0.1.0.tar.gz
- Subject digest: 6941aacb7e44ff8119ab6c8566984623b0eb5e1abff8b47b975bf5050ac39b86
- Sigstore transparency entry: 1154401024
- Sigstore integration time:
- Permalink: kreuzberg-dev/llama-index-kreuzberg@7084e3b62befa664f4c159f1951430c03d3b744f
- Branch / Tag: refs/tags/node-parser-v0.1.0
- Owner: https://github.com/kreuzberg-dev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-node-parser.yaml@7084e3b62befa664f4c159f1951430c03d3b744f
- Trigger Event: push
File details
Details for the file llama_index_node_parser_kreuzberg-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llama_index_node_parser_kreuzberg-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b768fe93b9fd0f456a780dedb733e89c01441753e815ed55017af981367a8fa |
| MD5 | de3521d6100ad6662afdc131cafb3c06 |
| BLAKE2b-256 | e2b2584f46cbafd0abaa99fbeb83559d9266398a6ad11543745ec903fbcaa84a |
Provenance

The following attestation bundles were made for llama_index_node_parser_kreuzberg-0.1.0-py3-none-any.whl:

Publisher: publish-node-parser.yaml on kreuzberg-dev/llama-index-kreuzberg

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llama_index_node_parser_kreuzberg-0.1.0-py3-none-any.whl
- Subject digest: 3b768fe93b9fd0f456a780dedb733e89c01441753e815ed55017af981367a8fa
- Sigstore transparency entry: 1154401026
- Sigstore integration time:
- Permalink: kreuzberg-dev/llama-index-kreuzberg@7084e3b62befa664f4c159f1951430c03d3b744f
- Branch / Tag: refs/tags/node-parser-v0.1.0
- Owner: https://github.com/kreuzberg-dev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-node-parser.yaml@7084e3b62befa664f4c159f1951430c03d3b744f
- Trigger Event: push