Utility functions for managing and processing chonkie Chunk objects

chonkie-chunk-utils

Need sorting, merging, and formatting for your RAG chunks?

chonkie-chunk-utils is a chunk management utility library for RAG (Retrieval-Augmented Generation) systems. It provides functions for sorting, merging, formatting, and rendering document chunks to create optimal context formats that are easy for LLMs to understand.

⚠️ Research Purpose: This library is currently intended for research purposes. While it is functional and tested, it may undergo significant changes as research progresses. Use with caution in production environments.

Core Feature: Formatting & Rendering

This is what we provide. Transform raw chunks into LLM-friendly, structured formats that enhance understanding and reduce token waste.

Why Formatting & Rendering Matter

Unstructured chunks lack the organization and formatting that LLMs need to effectively understand relationships, boundaries, and metadata. Our formatting and rendering functions transform unstructured chunks into structured formats (XML, JSON, TOON) that LLMs can easily parse and reason about.

Quick Example

from chonkie import Chunk
from chonkie_chunk_utils import render_chunks, jsonify

chunks = [
    Chunk(start_index=0, end_index=5, text="Hello", context="doc1"),
    Chunk(start_index=20, end_index=25, text="world", context="doc1"),
]

# Default: XML format
result = render_chunks(chunks)
# Output:
# <chunk start_index="0" end_index="5" context="doc1">Hello</chunk>
# [...]
# <chunk start_index="20" end_index="25" context="doc1">world</chunk>

# JSON format
result = render_chunks(chunks, format_fn=jsonify)
# Output:
# {"start_index": 0, "context": "doc1", "end_index": 5, "content": "Hello"}
# [...]
# {"start_index": 20, "context": "doc1", "end_index": 25, "content": "world"}
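
To make the rendering step concrete, the following is a minimal standalone sketch of the idea, not the library's implementation. `SimpleChunk`, `to_xml`, and `render` are hypothetical names used only for illustration, and the newline default separator is an assumption. The sketch formats each chunk as an XML element, escapes special characters, and joins the results into one string.

```python
from dataclasses import dataclass
from typing import Callable, Iterable
from xml.sax.saxutils import escape, quoteattr


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; this is not chonkie's Chunk class."""
    start_index: int
    end_index: int
    text: str
    context: str


def to_xml(chunk: SimpleChunk) -> str:
    """Format one chunk as an XML element, escaping special characters."""
    attrs = (
        f"start_index={quoteattr(str(chunk.start_index))} "
        f"end_index={quoteattr(str(chunk.end_index))} "
        f"context={quoteattr(chunk.context)}"
    )
    return f"<chunk {attrs}>{escape(chunk.text)}</chunk>"


def render(
    chunks: Iterable[SimpleChunk],
    format_fn: Callable[[SimpleChunk], str] = to_xml,
    separator: str = "\n",  # assumed default, chosen for this sketch
) -> str:
    """Format every chunk with format_fn and join the pieces into one string."""
    return separator.join(format_fn(c) for c in chunks)


demo = [
    SimpleChunk(0, 5, "Hello", "doc1"),
    SimpleChunk(20, 25, "a < b", "doc1"),
]
rendered = render(demo)
print(rendered)
```

Passing the formatter as an argument keeps the joining logic independent of the output format, which mirrors how `format_fn=jsonify` is swapped in above.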

Use Case: Reranking Pipeline

This is how you can use it. While reranking itself is not provided by this library, we provide the essential preprocessing and formatting functions that make reranking pipelines work seamlessly.

Why This Approach May Work Better

📝 Research Hypothesis: When reranking is performed on sorted, merged chunks (organized chunk sets in which overlapping and adjacent chunks have been combined), embedding models and reranking models can capture semantic meaning more effectively than with raw, unprocessed chunks.

Note: This is currently a research hypothesis that we are investigating. The benefits described below are theoretical and require empirical validation.

Theoretical benefits:

Better semantic coherence: Merged chunks represent continuous, coherent text segments rather than fragmented pieces
Improved context understanding: Sorting ensures chunks follow document order, preserving logical flow
Enhanced embedding quality: Clean, organized chunks allow models to better understand relationships and boundaries
More accurate reranking: Reranking models receive well-structured input, leading to better relevance judgments

Complete Workflow

from chonkie import Chunk
from chonkie_chunk_utils import sort_chunks, merge_adjacent_chunks, render_chunks

# Step 1: Retrieve chunks from vector search
chunks = [Chunk(...), ...]  # Retrieved from your vector DB

# Step 2: Sort by document position
sorted_chunks = sort_chunks(chunks)

# Step 3: Merge adjacent/overlapping chunks
merged_chunks = merge_adjacent_chunks(sorted_chunks)

# Step 4: Rerank (your custom logic)
reranked_chunks = your_reranking_function(merged_chunks)

# Step 5: Render for LLM (default: XML format)
result = render_chunks(reranked_chunks)
# Ready to send to your LLM!
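
Step 4 is left to you. As a purely illustrative placeholder, the sketch below ranks chunks by how many query words they contain. `SimpleChunk` and `keyword_overlap_rerank` are hypothetical names, not part of chonkie-chunk-utils, and a real pipeline would more likely call a cross-encoder or a hosted reranking model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; this is not chonkie's Chunk class."""
    start_index: int
    end_index: int
    text: str


def keyword_overlap_rerank(
    chunks: List[SimpleChunk],
    query: str,
    top_k: Optional[int] = None,
) -> List[SimpleChunk]:
    """Order chunks by the number of distinct query words found in their text."""
    query_terms = set(query.lower().split())

    def score(chunk: SimpleChunk) -> int:
        # Whitespace tokenization only; punctuation stays attached to words.
        return len(query_terms & set(chunk.text.lower().split()))

    # sorted() is stable, so chunks with equal scores keep their input order.
    ranked = sorted(chunks, key=score, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


candidates = [
    SimpleChunk(0, 24, "Chunks are sorted first."),
    SimpleChunk(30, 62, "Reranking scores merged chunks."),
    SimpleChunk(70, 95, "Rendering comes last."),
]
top = keyword_overlap_rerank(candidates, "reranking merged chunks", top_k=2)
print([c.start_index for c in top])
```

To match the single-argument call shown in the workflow, the query can be bound ahead of time, for example with `functools.partial(keyword_overlap_rerank, query="...")`.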

Features

Sorting: Sort chunks by start_index to ensure proper document order
Merging: Automatically merge adjacent or overlapping chunks to remove duplicates
Formatting: Convert chunks to various formats (XML, JSON, TOON)
Rendering: Render chunks into LLM-friendly single strings with customizable separators
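
The sorting and merging behavior described above can be sketched without the library. This is an illustration under stated assumptions, not the actual code behind `sort_chunks` or `merge_adjacent_chunks`: `SimpleChunk`, `sort_by_start`, and `merge_touching` are hypothetical names, all chunks are assumed to come from the same document, `end_index` is assumed to be exclusive, and each chunk's text length is assumed to equal `end_index - start_index`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; this is not chonkie's Chunk class."""
    start_index: int
    end_index: int  # assumed exclusive
    text: str


def sort_by_start(chunks: List[SimpleChunk]) -> List[SimpleChunk]:
    """Return the chunks in document order."""
    return sorted(chunks, key=lambda c: c.start_index)


def merge_touching(sorted_chunks: List[SimpleChunk]) -> List[SimpleChunk]:
    """Merge chunks that overlap or touch; input must be sorted by start_index."""
    merged: List[SimpleChunk] = []
    for chunk in sorted_chunks:
        if merged and chunk.start_index <= merged[-1].end_index:
            last = merged[-1]
            if chunk.end_index > last.end_index:
                # Skip the characters already covered by the previous chunk.
                overlap = last.end_index - chunk.start_index
                merged[-1] = SimpleChunk(
                    last.start_index,
                    chunk.end_index,
                    last.text + chunk.text[overlap:],
                )
            # A chunk fully contained in the previous one adds nothing new.
        else:
            merged.append(SimpleChunk(chunk.start_index, chunk.end_index, chunk.text))
    return merged


retrieved = [
    SimpleChunk(20, 25, "again"),
    SimpleChunk(3, 11, "lo world"),
    SimpleChunk(0, 5, "Hello"),
]
merged = merge_touching(sort_by_start(retrieved))
print(merged)
```

Sorting first is what lets the merge compare each chunk only with the most recently merged one, so the whole pass stays linear after the sort.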

Installation

pip install chonkie-chunk-utils

Or using rye:

rye add chonkie-chunk-utils

Documentation

📚 Full documentation is available at: https://devcomfort.github.io/chonkie-chunk-utils/

Requirements

Python >= 3.8
chonkie >= 1.4.1
typing-extensions >= 4.15.0
loguru >= 0.7.3
toolz >= 1.1.0
toon-python >= 0.1.2

License

Please check the LICENSE file in the repository.

Author

devcomfort (im@devcomfort.me)

Project details

Release history

This version

0.2.0

Nov 6, 2025

Download files

Source Distribution

chonkie_chunk_utils-0.2.0.tar.gz (58.1 kB)

Uploaded Nov 6, 2025 Source

Built Distribution

chonkie_chunk_utils-0.2.0-py3-none-any.whl (29.9 kB)

Uploaded Nov 6, 2025 Python 3

File details

Details for the file chonkie_chunk_utils-0.2.0.tar.gz.

File metadata

Download URL: chonkie_chunk_utils-0.2.0.tar.gz
Upload date: Nov 6, 2025
Size: 58.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for chonkie_chunk_utils-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a05aef13bc0e6181f5da6c629e94d6fe86f89fb6500c8fdb4c3ed441e7a5dac5`
MD5	`9e7d3a664104918cc103b9123ad499d3`
BLAKE2b-256	`9fa4d30c46798d9fde2969080fc1f2fc04422c202f4541e84d299ae1dac76999`

File details

Details for the file chonkie_chunk_utils-0.2.0-py3-none-any.whl.

File metadata

Download URL: chonkie_chunk_utils-0.2.0-py3-none-any.whl
Upload date: Nov 6, 2025
Size: 29.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for chonkie_chunk_utils-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`900510c7b9bf5f605874873bf08f6d14487011732b1d53783d3ecb61cc8860cc`
MD5	`80a36e45925f68aa4cdff420558a24c8`
BLAKE2b-256	`b8b084ed0f6e543e5ac39ab64c07b83fda6cfccddc2a7861c73ecc6cb665d3cd`
