Utility functions for managing and processing chonkie Chunk objects

chonkie-chunk-utils

Need sorting, merging, and formatting for your RAG pipeline?

chonkie-chunk-utils is a chunk management utility library for RAG (Retrieval-Augmented Generation) systems. It provides functions for sorting, merging, formatting, and rendering document chunks to create optimal context formats that are easy for LLMs to understand.

⚠️ Research Purpose: This library is currently intended for research purposes. While it is functional and tested, it may undergo significant changes as research progresses. Use with caution in production environments.

Core Feature: Formatting & Rendering

This is the core of the library: transforming raw chunks into LLM-friendly, structured formats that aid understanding and reduce token waste.

Why Formatting & Rendering Matter

Unstructured chunks lack the organization and formatting that LLMs need to effectively understand relationships, boundaries, and metadata. Our formatting and rendering functions transform unstructured chunks into structured formats (XML, JSON, TOON) that LLMs can easily parse and reason about.

Quick Example

from chonkie import Chunk
from chonkie_chunk_utils import render_chunks, jsonify

chunks = [
    Chunk(start_index=0, end_index=5, text="Hello", context="doc1"),
    Chunk(start_index=20, end_index=25, text="world", context="doc1"),
]

# Default: XML format
result = render_chunks(chunks)
# Output:
# <chunk start_index="0" end_index="5" context="doc1">Hello</chunk>
# [...]
# <chunk start_index="20" end_index="25" context="doc1">world</chunk>

# JSON format
result = render_chunks(chunks, format_fn=jsonify)
# Output:
# {"start_index": 0, "context": "doc1", "end_index": 5, "content": "Hello"}
# [...]
# {"start_index": 20, "context": "doc1", "end_index": 25, "content": "world"}
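
The output shapes above can be approximated with the standard library alone, which may help when reasoning about what a formatter has to handle (attribute quoting, escaping of text content). The following is a minimal sketch, not the library's implementation: `to_xml` and `to_json` are hypothetical helper names, and the attribute order is simply the order in which keyword arguments are passed.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def to_xml(text, **attrs):
    """Render one chunk as a single XML-like element (illustrative only)."""
    # quoteattr adds the surrounding quotes and escapes special characters.
    attr_str = " ".join(f"{key}={quoteattr(str(value))}" for key, value in attrs.items())
    # escape() protects characters such as < and & inside the element body.
    return f"<chunk {attr_str}>{escape(text)}</chunk>"


def to_json(text, **attrs):
    """Render one chunk as a single-line JSON object (illustrative only)."""
    return json.dumps({**attrs, "content": text})


xml_line = to_xml("a < b", start_index=0, end_index=5, context="doc1")
json_line = to_json("a < b", start_index=0, end_index=5, context="doc1")
print(xml_line)
print(json_line)
```

Escaping matters as soon as chunk text contains markup-like characters; without it, an LLM (or a parser) could misread where a chunk ends.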

Use Case: Reranking Pipeline

A typical use case is a reranking pipeline. The library does not provide reranking itself, but it supplies the preprocessing and formatting steps that sit on either side of a reranker.

Why This Approach Works Better

📝 Research Hypothesis: When reranking is performed on sorted, merged chunks (organized chunk sets in which overlapping and adjacent chunks have been resolved), embedding models and reranking models can capture semantic meaning more effectively than when they are given raw, unprocessed chunks.

Note: This is currently a research hypothesis that we are investigating. The benefits described below are theoretical and require empirical validation.

Theoretical benefits:

  • Better semantic coherence: Merged chunks represent continuous, coherent text segments rather than fragmented pieces
  • Improved context understanding: Sorting ensures chunks follow document order, preserving logical flow
  • Enhanced embedding quality: Clean, organized chunks allow models to better understand relationships and boundaries
  • More accurate reranking: Reranking models receive well-structured input, leading to better relevance judgments

Complete Workflow

from chonkie import Chunk
from chonkie_chunk_utils import sort_chunks, merge_adjacent_chunks, render_chunks

# Step 1: Retrieve chunks from vector search
chunks = [Chunk(...), ...]  # Retrieved from your vector DB

# Step 2: Sort by document position
sorted_chunks = sort_chunks(chunks)

# Step 3: Merge adjacent/overlapping chunks
merged_chunks = merge_adjacent_chunks(sorted_chunks)

# Step 4: Rerank (your custom logic)
reranked_chunks = your_reranking_function(merged_chunks)

# Step 5: Render for LLM (default: XML format)
result = render_chunks(reranked_chunks)
# Ready to send to your LLM!
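
Step 4 is left to the user. As a placeholder while wiring up the pipeline, a toy reranker could look like the sketch below. Everything in it is an assumption for illustration: `FakeChunk` is a hypothetical stand-in for a chunk object, `keyword_overlap_rerank` is not part of this library, and a real pipeline would more likely call a cross-encoder or a reranking API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeChunk:
    """Hypothetical stand-in exposing only the fields this sketch needs."""
    text: str
    start_index: int
    end_index: int


def keyword_overlap_rerank(query: str, chunks: List[FakeChunk], top_k: int = 3) -> List[FakeChunk]:
    """Order chunks by how many lowercase query words they share, keep the top_k."""
    query_terms = set(query.lower().split())

    def score(chunk: FakeChunk) -> int:
        return len(query_terms & set(chunk.text.lower().split()))

    # sorted() is stable, so chunks with equal scores keep their input order.
    return sorted(chunks, key=score, reverse=True)[:top_k]


candidates = [
    FakeChunk("cats sleep a lot", 0, 16),
    FakeChunk("dogs chase cats daily", 40, 61),
    FakeChunk("the weather is mild", 90, 109),
]
top = keyword_overlap_rerank("do dogs chase cats", candidates, top_k=2)
print([chunk.text for chunk in top])
```

Because the workflow above passes a single argument to the reranker, a function like this would need the query bound first, for example with `functools.partial`. Note also that reranking orders chunks by relevance rather than by document position, so whether to sort again before rendering is a design choice.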

Features

  • Sorting: Sort chunks by start_index to ensure proper document order
  • Merging: Automatically merge adjacent or overlapping chunks to remove duplicates
  • Formatting: Convert chunks to various formats (XML, JSON, TOON)
  • Rendering: Render chunks into LLM-friendly single strings with customizable separators
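
To illustrate what sorting and merging mean for index-based chunks, here is a minimal sketch on plain intervals. It is not the library's implementation: `Span`, `sort_spans`, and `merge_spans` are hypothetical names, and the sketch assumes that `end_index` is exclusive and that every span's text is exactly the source slice `doc[start_index:end_index]`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a chunk: a half-open [start, end) slice of a document."""
    start_index: int
    end_index: int
    text: str


def sort_spans(spans: List[Span]) -> List[Span]:
    """Return spans ordered by their position in the document."""
    return sorted(spans, key=lambda span: span.start_index)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge spans that touch or overlap, without duplicating the shared text."""
    merged: List[Span] = []
    for span in sort_spans(spans):
        if merged and span.start_index <= merged[-1].end_index:
            last = merged[-1]
            if span.end_index > last.end_index:
                # Append only the part that extends past the previous span.
                overlap = last.end_index - span.start_index
                merged[-1] = Span(last.start_index, span.end_index, last.text + span.text[overlap:])
        else:
            merged.append(Span(span.start_index, span.end_index, span.text))
    return merged


doc = "Hello wonderful world. Goodbye."


def make(start: int, end: int) -> Span:
    return Span(start, end, doc[start:end])


merged = merge_spans([make(6, 15), make(23, 31), make(0, 8), make(15, 21)])
for span in merged:
    print(span.start_index, span.end_index, repr(span.text))
```

In this example the first three spans overlap or touch and collapse into one span covering `doc[0:21]`, while the last span stays separate because a gap precedes it.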

Installation

pip install chonkie-chunk-utils

Or using rye:

rye add chonkie-chunk-utils

Documentation

📚 Full documentation is available at: https://devcomfort.github.io/chonkie-chunk-utils/

Requirements

  • Python >= 3.8
  • chonkie >= 1.4.1
  • typing-extensions >= 4.15.0
  • loguru >= 0.7.3
  • toolz >= 1.1.0
  • toon-python >= 0.1.2
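
To see which of these dependencies are present in the current environment, a small check with the standard library's `importlib.metadata` (available from Python 3.8) can be used. This is only a convenience sketch: `installed_version` is a hypothetical helper, and it reports versions without comparing them against the minimums listed above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string for a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names taken from the requirements list above.
REQUIRED = ["chonkie", "typing-extensions", "loguru", "toolz", "toon-python"]

report = {name: installed_version(name) for name in REQUIRED}
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```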

License

Please check the LICENSE file in the repository.
