# 🤝 Chunkmate

Chunkmate is your friendly companion for splitting text into manageable chunks!

A Python library for intelligently splitting text and documents into token-aware chunks. Perfect for processing different types of documents for RAG (Retrieval-Augmented Generation) systems, LLM applications, and any text processing pipeline that needs smart chunking.
## 🚀 Simple. Powerful. Just 2 Lines.

```python
chunker = Chunker()
chunks = chunker.split(Path("document.md"))  # Works with .md, .json, .html, .txt files or raw text!
```

That's it! Split any document (Markdown, JSON, HTML, or plain text) with automatic format detection and intelligent chunking. Pass a file Path or raw text string. No configuration needed. No complexity. Just results.
## Features

- 🎯 Smart Text Splitting: Intelligently splits text while respecting natural boundaries (paragraphs, sentences, etc.)
- 📝 Multiple Format Support: Built-in splitters for common document types:
  - Markdown (`.md`, `.mdx`, `.markdown`) - Header-aware splitting that respects document structure
  - JSON (`.json`) - Recursive splitting while preserving valid JSON structure
  - HTML (`.html`, `.htm`) - Converts to Markdown and splits by headers (h2, h3)
  - Plain Text (`.txt`) - Recursive character-based splitting with smart boundaries
- 🤖 Auto-Detection: Automatically detects text format (JSON, Markdown, HTML, plain text) when no file extension is available
- 🔢 Token-Aware: Counts tokens using tiktoken to ensure chunks fit within model limits
- 🧩 Extensible: Easy to add custom splitters for different document types
- 🎨 Metadata Generation: Optional metadata enrichment for each chunk (custom or AI-powered) with support for multiple generators to improve RAG retrieval with context, keywords, summaries, etc.
- ⚡ Async Support: `AsyncChunker` for concurrent metadata generation with rate limiting
- 📁 File or String: Works with both file paths and raw text strings
- 🎛️ Configurable: Customize chunk size and splitting behavior
- ✅ Fully Tested: 100% test coverage with comprehensive unit tests
## Installation

```bash
pip install chunkmate
```

Or using uv:

```bash
uv add chunkmate
```

Optional Dependencies:

For framework integrations, install with optional dependencies:

```bash
# With LangChain support
pip install chunkmate[langchain]

# With LlamaIndex support
pip install chunkmate[llamaindex]

# With all optional dependencies
pip install chunkmate[all]
```
## Quick Start

### Basic Usage

```python
from pathlib import Path

from chunkmate import Chunker

# Initialize the chunker
chunker = Chunker()

# Split text directly
text = "Your long document text here..."
chunks = chunker.split(text)

# Or split from a file
chunks = chunker.split(Path("document.md"))

# Add custom metadata to all chunks
chunks = chunker.split(text, metadata={"source": "user_input", "category": "docs"})

# Access chunk data
for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
    print(f"Metadata: {chunk.metadata}")
```
### Metadata Generation

Enrich chunks with AI-generated metadata for better retrieval in RAG systems:

```python
from chunkmate import Chunker, MetadataGenerator

# Implement your metadata generator
class MyMetadataGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Use your LLM to generate metadata for this chunk
        # text: the full document text
        # chunk_text: the specific chunk text
        # Return a dictionary with any metadata fields you want
        return {
            "context": f"This chunk discusses {chunk_text[:50]}...",
            "doc_topic": text[:50],
        }

# Create chunker with metadata generation
chunker = Chunker(metadata_generators=MyMetadataGenerator())

# Split text - metadata is automatically added to each chunk
chunks = chunker.split("Your document text")

# Each chunk now has generated metadata
for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Context: {chunk.metadata['context']}")
    print(f"Topic: {chunk.metadata['doc_topic']}")

# Use multiple generators to add different types of metadata
class KeywordGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Extract keywords from chunk (extract_keywords is your own helper)
        keywords = extract_keywords(chunk_text)
        return {"keywords": keywords, "keyword_count": len(keywords)}

chunker = Chunker(metadata_generators=[MyMetadataGenerator(), KeywordGenerator()])
chunks = chunker.split("Your document text")

# Each chunk has metadata from both generators
for chunk in chunks:
    print(f"Context: {chunk.metadata['context']}")
    print(f"Keywords: {chunk.metadata['keywords']}")

# Combine with custom metadata
chunks = chunker.split(
    "Your document text",
    metadata={"source": "user_input", "category": "docs"},
)
# Each chunk will have custom metadata + generated metadata
```
### Async Metadata Generation (Concurrent Processing)

For high-performance applications, use AsyncChunker to generate metadata for all chunks concurrently:

```python
from pathlib import Path

from chunkmate import AsyncChunker, AsyncMetadataGenerator

# Implement your async metadata generator
class MyAsyncMetadataGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Limit to 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Use your async LLM client to generate metadata
        # AsyncChunker will call this method concurrently for all chunks
        response = await your_llm_client.generate(
            prompt=f"Summarize this chunk: {chunk_text}"
        )
        return {
            "context": response.text,
            "summary": response.summary,
        }

# Create async chunker with concurrent metadata generation
chunker = AsyncChunker(metadata_generators=MyAsyncMetadataGenerator())

# Split text - metadata generation happens concurrently for all chunks
# This provides a significant speedup when calling I/O-bound APIs
async def process_document():
    chunks = await chunker.split("Your document text")
    return chunks

# With rate limiting to avoid API throttling (recommended):
# control concurrency via get_max_concurrency() in your metadata generator
class RateLimitedGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Max 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        summary = await call_llm_api(chunk_text)
        return {"summary": summary}

chunker = AsyncChunker(metadata_generators=RateLimitedGenerator())
# This ensures you don't hit rate limits on LLM APIs
chunks = await chunker.split(Path("large_document.md"))

# Use multiple generators - each processes all chunks concurrently
class AsyncKeywordGenerator:
    def get_max_concurrency(self) -> int | None:
        return 10  # Can handle more concurrent calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        keywords = await extract_keywords_async(chunk_text)
        return {"keywords": keywords, "keyword_count": len(keywords)}

chunker = AsyncChunker(
    metadata_generators=[MyAsyncMetadataGenerator(), AsyncKeywordGenerator()]
)

# Generators run sequentially, but each processes all chunks concurrently
# Each generator's concurrency is controlled by its get_max_concurrency() method
chunks = await chunker.split("Your document text")

# Access all generated metadata
for chunk in chunks:
    print(f"Context: {chunk.metadata['context']}")
    print(f"Keywords: {chunk.metadata['keywords']}")
```
Why AsyncChunker?

- Concurrent Processing: Metadata generation for all chunks happens concurrently
- Significant Speedup: Process multiple chunks simultaneously, typically 5-10× faster for I/O-bound operations like LLM API calls
- Rate Limiting: Built-in `max_concurrency` control to prevent API throttling
- Multiple Generators: Support for multiple metadata generators, each processing all chunks concurrently
- Error Handling: Robust error handling with `ExceptionGroup` for failed chunks (see the sketch below)
- Perfect for LLM APIs: Ideal when calling OpenAI, Anthropic, or other async LLM services
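Because failures from concurrently processed chunks are aggregated, here is a minimal sketch of catching them, assuming `AsyncChunker.split` surfaces metadata failures as an `ExceptionGroup` as described above (the `except*` syntax requires Python 3.11+):

```python
async def split_with_error_reporting(chunker, text):
    try:
        return await chunker.split(text)
    except* Exception as eg:
        # eg.exceptions holds the error from every failed chunk
        for exc in eg.exceptions:
            print(f"Metadata generation failed: {exc!r}")
        raise
```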
### Real-World Example: OpenAI Context Generator

Here's a complete, production-ready example using OpenAI's API to generate contextual descriptions for chunks:
```python
import asyncio
import logging
import os
from typing import Any

from openai import APIError, AsyncOpenAI, OpenAI, RateLimitError

from chunkmate import AsyncChunker, AsyncMetadataGenerator, Chunker

class AsyncOpenAIContextGenerator:
    def __init__(
        self, api_key: str | None = None, max_concurrency: int | None = 5
    ) -> None:
        self.api_key: str = api_key if api_key else os.getenv("OPENAI_API_KEY", "")
        if not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable is not set")
        # Configure with more retries for rate limit handling (default is 2)
        self.client: AsyncOpenAI = AsyncOpenAI(api_key=self.api_key, max_retries=5)
        self.model: str = "gpt-5-nano"
        self.max_concurrency: int | None = max_concurrency

    def get_max_concurrency(self) -> int | None:
        return self.max_concurrency

    async def generate_metadata(self, text: str, chunk_text: str) -> dict[str, Any]:
        """
        Generate context metadata for a chunk using the OpenAI API.

        Uses prompt caching by passing the complete document text, which is cached
        across multiple requests for the same document, reducing costs significantly.

        Automatically retries up to 5 times on rate limits and API errors with
        exponential backoff. Returns an empty context payload on failure for
        graceful degradation.

        Args:
            text: The full document text.
            chunk_text: The specific chunk text to generate context for.

        Returns:
            dict[str, Any]: A dictionary with the 'context' key containing a brief
                2-3 sentence description, or an empty string if generation fails.
        """
        # Structure prompt to maximize caching: put the full document first (cached),
        # then the variable part (the specific chunk) last
        system_message = """You are a helpful assistant that generates brief, contextual descriptions for text chunks. Given a complete document and a specific chunk from it, provide a brief, succinct context (2-3 sentences max) that describes what this chunk is about and how it relates to the overall document."""
        user_message = f"""Complete Document:
{text}
---
For the following chunk from the above document, provide a brief context:
Chunk:
{chunk_text}
Context:"""
        try:
            response = await self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
            )
            content = response.choices[0].message.content
            if content is None:
                logging.warning("OpenAI returned None content")
                return {"context": ""}
            return {"context": content.strip()}
        except RateLimitError as e:
            logging.warning(f"Rate limit hit: {e}")
            return {"context": ""}
        except APIError as e:
            logging.error(f"OpenAI API error: {e}")
            return {"context": ""}
        except Exception as e:
            logging.error(f"Unexpected error generating context: {e}", exc_info=True)
            return {"context": ""}
class OpenAIContextGenerator:
    def __init__(self) -> None:
        self.api_key: str = os.getenv("OPENAI_API_KEY", "")
        if not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable is not set")
        # Configure with more retries for rate limit handling (default is 2)
        self.client: OpenAI = OpenAI(api_key=self.api_key, max_retries=5)
        self.model: str = "gpt-5-nano"

    def generate_metadata(self, text: str, chunk_text: str) -> dict[str, Any]:
        """
        Generate context metadata for a chunk using the OpenAI API.

        Uses prompt caching by passing the complete document text, which is cached
        across multiple requests for the same document, reducing costs significantly.

        Automatically retries up to 5 times on rate limits and API errors with
        exponential backoff. Returns an empty context payload on failure for
        graceful degradation.

        Args:
            text: The full document text.
            chunk_text: The specific chunk text to generate context for.

        Returns:
            dict[str, Any]: A dictionary with the 'context' key containing a brief
                2-3 sentence description, or an empty string if generation fails.
        """
        # Structure prompt to maximize caching: put the full document first (cached),
        # then the variable part (the specific chunk) last
        system_message = """You are a helpful assistant that generates brief, contextual descriptions for text chunks. Given a complete document and a specific chunk from it, provide a brief, succinct context (2-3 sentences max) that describes what this chunk is about and how it relates to the overall document."""
        user_message = f"""Complete Document:
{text}
---
For the following chunk from the above document, provide a brief context:
Chunk:
{chunk_text}
Context:"""
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
            )
            content = response.choices[0].message.content
            if content is None:
                logging.warning("OpenAI returned None content")
                return {"context": ""}
            return {"context": content.strip()}
        except RateLimitError as e:
            logging.warning(f"Rate limit hit: {e}")
            return {"context": ""}
        except APIError as e:
            logging.error(f"OpenAI API error: {e}")
            return {"context": ""}
        except Exception as e:
            logging.error(f"Unexpected error generating context: {e}", exc_info=True)
            return {"context": ""}
# Usage example (Async)
async def process_document_with_openai_async():
    # Create the context generator
    # Rate limiting is controlled via get_max_concurrency() in the generator
    openai_generator = AsyncOpenAIContextGenerator()

    # Create chunker with OpenAI metadata generation
    chunker = AsyncChunker(metadata_generators=openai_generator)

    # Process your document
    chunks = await chunker.split("Your long document text here...")

    # Each chunk now has AI-generated context
    for chunk in chunks:
        print(f"Chunk: {chunk.text[:100]}...")
        print(f"Context: {chunk.metadata['context']}")
        print("---")
    return chunks

# Run the async function
asyncio.run(process_document_with_openai_async())

# Usage example (Sync)
def process_document_with_openai_sync():
    # Create the synchronous context generator
    openai_generator = OpenAIContextGenerator()

    # Create chunker with OpenAI metadata generation
    chunker = Chunker(metadata_generators=openai_generator)

    # Process your document
    chunks = chunker.split("Your long document text here...")

    # Each chunk now has AI-generated context
    for chunk in chunks:
        print(f"Chunk: {chunk.text[:100]}...")
        print(f"Context: {chunk.metadata['context']}")
        print("---")
    return chunks

# Run the sync function
process_document_with_openai_sync()
```
Key Features:

- Prompt Caching: Structures prompts to maximize OpenAI's caching benefits by putting the full document first (cached across requests)
- Automatic Retries: Configures 5 retries with exponential backoff for rate limits and API errors
- Graceful Degradation: Returns empty context on failure instead of crashing
- Cost Efficient: Uses `gpt-5-nano` for cost-effective context generation
- Rate Limit Protection: The async version uses `max_concurrency=5` to prevent overwhelming the API
- Production Ready: Includes comprehensive error handling and logging
- Flexible: Choose async for concurrent processing or sync for simpler sequential processing
Cost Optimization Tips:

- The prompt structure maximizes caching: the full document is cached, reducing costs for subsequent chunks
- Adjust `max_concurrency` based on your OpenAI rate limits (async version only)
### Auto-Detection of Text Format

When splitting text directly (without a file extension), Chunkmate automatically detects the format:

```python
from chunkmate import Chunker

chunker = Chunker()

# JSON content is automatically detected
json_text = '{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}'
chunks = chunker.split(json_text)
# Uses JSONSplitter automatically

# HTML content is automatically detected
html_text = """
<html>
<body>
<h1>Welcome</h1>
<p>This is HTML content.</p>
</body>
</html>
"""
chunks = chunker.split(html_text)
# Uses HTMLSplitter automatically

# Markdown content is automatically detected
markdown_text = """
## Introduction
This is a markdown document.

### Features
- Feature 1
- Feature 2
"""
chunks = chunker.split(markdown_text)
# Uses MarkdownSplitter automatically

# Plain text is handled by TextSplitter
plain_text = "This is just plain text without any special formatting."
chunks = chunker.split(plain_text)
# Uses TextSplitter automatically
```
### Custom Splitters

Create your own splitter for specific document types:

```python
import csv
from io import StringIO
from pathlib import Path

from chunkmate import Chunker, Splitter, Chunk

class CSVSplitter(Splitter):
    def __init__(self, max_chunk_size: int = 512, rows_per_chunk: int = 100):
        self.max_chunk_size = max_chunk_size
        self.rows_per_chunk = rows_per_chunk

    def split(self, text: str) -> list[Chunk]:
        """Split CSV into chunks, each containing a batch of rows with header."""
        reader = csv.reader(StringIO(text))
        rows = list(reader)
        if not rows:
            return []

        header = rows[0]
        data_rows = rows[1:]
        chunks = []

        # Split data into batches, each with the header
        for i in range(0, len(data_rows), self.rows_per_chunk):
            batch = [header] + data_rows[i:i + self.rows_per_chunk]
            chunk_text = '\n'.join(','.join(row) for row in batch)
            chunks.append(Chunk(
                text=chunk_text,
                token_count=len(chunk_text.split()),  # rough word-based count
                metadata={"rows": len(batch) - 1, "has_header": True}
            ))
        return chunks

    def can_handle(self, text: str) -> bool:
        """Detect if text is CSV format by checking for comma-separated structure."""
        try:
            lines = text.strip().split('\n')
            if len(lines) < 2:
                return False
            # Check if at least 80% of lines have the same number of commas
            comma_counts = [line.count(',') for line in lines[:10]]
            if not comma_counts or comma_counts[0] == 0:
                return False
            consistent = sum(1 for c in comma_counts if c == comma_counts[0])
            return consistent / len(comma_counts) >= 0.8
        except Exception:
            return False

# Register your custom splitter
chunker = Chunker(
    splitters={"csv": CSVSplitter(rows_per_chunk=50)},
    extension_map={"csv": "csv", "tsv": "csv"}
)

# Now .csv files will use your custom splitter
chunks = chunker.split(Path("data.csv"))

# CSV text is auto-detected too!
csv_text = """name,age,city
Alice,30,NYC
Bob,25,LA
Charlie,35,Chicago"""

chunks = chunker.split(csv_text)  # Automatically uses CSVSplitter
print(f"Split into {len(chunks)} chunks")
print(f"First chunk has {chunks[0].metadata['rows']} rows")
```
Important: The order of splitters matters for auto-detection. When registering custom splitters, more specialized splitters should be added before general-purpose ones. TextSplitter should always be last as it accepts any text format.
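To see why order matters, note that more than one splitter may claim the same text. Reusing the `CSVSplitter` above alongside the built-in `TextSplitter` (whose `can_handle` always returns `True`):

```python
from chunkmate import TextSplitter

sample = "name,age\nAlice,30\nBob,25"

# Both splitters accept this text, so the specialized one must be
# registered before the universal fallback to win auto-detection.
print(CSVSplitter().can_handle(sample))   # True - consistent comma structure
print(TextSplitter().can_handle(sample))  # True - TextSplitter accepts anything
```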
### Configure Chunk Size

```python
from chunkmate import TextSplitter, MarkdownSplitter, Chunker

# Create splitters with custom chunk sizes
text_splitter = TextSplitter(max_chunk_size=1024)
markdown_splitter = MarkdownSplitter(max_chunk_size=1024)

# Use in chunker
chunker = Chunker(
    splitters={
        "text": text_splitter,
        "markdown": markdown_splitter,
    }
)
```
### Custom Token Counting

By default, Chunkmate uses the `o200k_base` encoding (for GPT-4o/GPT-4o-mini and newer models). For older models, provide a custom token counter:

```python
import tiktoken

from chunkmate import TextSplitter

# For GPT-4, GPT-3.5-turbo, or text-embedding models
def cl100k_counter(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

splitter = TextSplitter(
    max_chunk_size=512,
    count_tokens=cl100k_counter
)

# Or use a simple word-based counter
def word_counter(text: str) -> int:
    return len(text.split())

splitter = TextSplitter(count_tokens=word_counter)
```

Encoding Compatibility:

- `o200k_base` (default): GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini, GPT-5-nano
- `cl100k_base`: GPT-4, GPT-4-turbo, GPT-3.5-turbo, text-embedding-3-*
- `p50k_base`: GPT-3 models (Davinci, Curie, etc.)
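If you'd rather not hard-code an encoding name, tiktoken can also resolve one from a model name (the model string below is just an example):

```python
import tiktoken

from chunkmate import TextSplitter

# Resolve the encoding from a model name instead of hard-coding it
# (e.g. "gpt-4" resolves to cl100k_base).
encoding = tiktoken.encoding_for_model("gpt-4")

def model_counter(text: str) -> int:
    return len(encoding.encode(text))

splitter = TextSplitter(max_chunk_size=512, count_tokens=model_counter)
```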
### Convert to Other Formats

Chunkmate provides converters to integrate with popular frameworks:

```python
from chunkmate import Chunker, chunks_to_langchain, chunks_to_llamaindex, chunks_to_dict

chunker = Chunker()
chunks = chunker.split("Your document text here...")

# Convert to LangChain documents
langchain_docs = chunks_to_langchain(chunks)
# Returns: list[langchain_core.documents.Document]

# Convert to LlamaIndex documents
llamaindex_docs = chunks_to_llamaindex(chunks)
# Returns: list[llama_index.core.schema.Document]

# Convert to plain dictionaries (for JSON serialization, databases, etc.)
chunk_dicts = chunks_to_dict(chunks)
# Returns: list[dict] with keys: text, token_count, metadata
# Example output: [
#   {'text': 'Hello', 'token_count': 1, 'metadata': {'section': 'intro'}},
#   {'text': 'World', 'token_count': 1, 'metadata': {'section': 'body'}}
# ]
```
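Because `chunks_to_dict` returns plain dictionaries, persisting chunks needs only the standard library; for example (the file name is arbitrary):

```python
import json

# Write the converted chunks to disk, e.g. for later indexing or inspection
with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks_to_dict(chunks), f, indent=2, ensure_ascii=False)
```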
Installation with Optional Dependencies:

```bash
# Install with LangChain support
pip install chunkmate[langchain]

# Install with LlamaIndex support
pip install chunkmate[llamaindex]

# Install with both LangChain and LlamaIndex
pip install chunkmate[all]

# Using uv
uv add chunkmate[langchain]
uv add chunkmate[llamaindex]
uv add chunkmate[all]
```
## Supported File Types

By default, Chunkmate supports:

- Markdown: `.md`, `.mdx`, `.markdown` - Uses header-aware splitting
- JSON: `.json` - Uses recursive JSON splitting while preserving structure
- HTML: `.html`, `.htm` - Uses header-aware HTML splitting
- Text: `.txt` - Uses recursive character splitting
### Auto-Detection

When processing raw text strings (no file extension), Chunkmate automatically detects the format:

- JSON Detection: Checks if text is valid JSON (objects or arrays)
- Markdown Detection: Checks for headers (`#`), code blocks (`` ``` ``), links, lists, and other markdown patterns (requires 2+ indicators)
- HTML Detection: Checks for HTML tags, attributes, and document structure (requires 3+ indicators)
- Text Fallback: If no specific format is detected, falls back to plain text splitting

The detection uses a scoring system that requires multiple format indicators to avoid false positives. The order matters: JSON is checked first, then Markdown, then HTML, and finally Text as a universal fallback. This ensures Markdown files with embedded HTML are correctly detected as Markdown.
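For instance, a Markdown document with embedded HTML still routes to the MarkdownSplitter, since Markdown is scored before HTML:

```python
from chunkmate import Chunker

chunker = Chunker()

# Markdown indicators (header, list) outweigh the embedded HTML tag,
# so this text is detected and split as Markdown per the order above.
mixed = """# Release Notes

Some text with <strong>inline HTML</strong>.

- item 1
- item 2
"""
chunks = chunker.split(mixed)
```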
You can easily add support for more file types using custom splitters and extension mappings. When implementing custom splitters, make sure to implement the `can_handle()` method for automatic format detection.
## API Reference

### Chunker

Main class for chunking documents (synchronous).

```python
Chunker(
    splitters: dict[str, Splitter] | None = None,
    extension_map: dict[str, str] | None = None,
    logger: logging.Logger | None = None,
    metadata_generators: list[MetadataGenerator] | MetadataGenerator | None = None
)
```

Parameters:

- `splitters`: Optional dict of custom splitters to add or override defaults
- `extension_map`: Optional dict mapping file extensions to splitter types
- `logger`: Optional logger for debugging and operation tracking
- `metadata_generators`: Optional MetadataGenerator instance or list of MetadataGenerator instances for enriching chunks with metadata. Multiple generators are applied sequentially, with each adding different metadata fields to the chunks

Methods:

- `split(text: str | Path, metadata: dict | None = None) -> list[Chunk]`: Split text or file into chunks with optional metadata generation
Example:

```python
from pathlib import Path

from chunkmate import Chunker, MetadataGenerator

class ContextGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        return {"context": f"Summary: {chunk_text[:100]}"}

# Single generator
chunker = Chunker(metadata_generators=ContextGenerator())
chunks = chunker.split(Path("document.md"))

# Multiple generators
chunker = Chunker(metadata_generators=[
    ContextGenerator(),
    KeywordGenerator()  # defined in the examples above
])
chunks = chunker.split("document text")
# Each chunk has metadata from all generators
```
### AsyncChunker

Async version of Chunker with concurrent metadata generation and rate limiting.

```python
AsyncChunker(
    splitters: dict[str, Splitter] | None = None,
    extension_map: dict[str, str] | None = None,
    logger: logging.Logger | None = None,
    metadata_generators: list[AsyncMetadataGenerator] | AsyncMetadataGenerator | None = None,
    max_concurrency: int | None = None
)
```

Parameters:

- `splitters`: Optional dict of custom splitters to add or override defaults
- `extension_map`: Optional dict mapping file extensions to splitter types
- `logger`: Optional logger for debugging and operation tracking
- `metadata_generators`: Optional AsyncMetadataGenerator instance or list of AsyncMetadataGenerator instances for concurrent metadata enrichment. Multiple generators are applied sequentially, with each processing all chunks concurrently
- `max_concurrency`: Optional maximum number of concurrent metadata generation tasks per generator (recommended: 3-10 for LLM APIs)

Methods:

- `async split(text: str | Path, metadata: dict | None = None) -> list[Chunk]`: Split text or file into chunks with concurrent metadata generation
Key Benefits:

- Concurrent Processing: All chunks are processed concurrently for metadata generation, providing a significant speedup when calling I/O-bound services (like LLM APIs)
- Rate Limiting: Use `max_concurrency` to avoid API throttling (applied per generator)
- Multiple Generators: Support for multiple metadata generators, each adding different metadata fields
- Error Handling: Uses `ExceptionGroup` to report all metadata generation failures
Example:

```python
from pathlib import Path

from chunkmate import AsyncChunker, AsyncMetadataGenerator

class ContextGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # llm is your own async LLM client
        context = await llm.generate(f"Summarize: {chunk_text}")
        return {"context": context}

class KeywordGenerator:
    def get_max_concurrency(self) -> int | None:
        return 10

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # extract_keywords is your own async helper
        keywords = await extract_keywords(chunk_text)
        return {"keywords": keywords}

# Use multiple generators with rate limiting
chunker = AsyncChunker(
    metadata_generators=[ContextGenerator(), KeywordGenerator()],
    max_concurrency=5  # Max 5 concurrent API calls per generator
)
chunks = await chunker.split(Path("document.md"))
# Each chunk has both 'context' and 'keywords' in metadata
```
### Chunk

Data class representing a text chunk.

Attributes:

- `text: str`: The chunk text
- `token_count: int`: Number of tokens in the chunk
- `metadata: dict`: Additional metadata (e.g., context, headers, source)
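For illustration, a chunk can be constructed directly with these fields, mirroring the custom `CSVSplitter` example above:

```python
from chunkmate import Chunk

chunk = Chunk(text="Hello world", token_count=2, metadata={"source": "example"})
print(chunk.text, chunk.token_count, chunk.metadata)
```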
### MarkdownSplitter

Splits markdown documents respecting header structure. Automatically detects markdown content by looking for headers, code blocks, lists, and other markdown patterns.

```python
MarkdownSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)
```

Methods:

- `split(text: str) -> list[Chunk]`: Split markdown text into chunks
- `can_handle(text: str) -> bool`: Returns `True` if text contains 2+ markdown patterns
### JSONSplitter

Splits JSON documents recursively while preserving valid JSON structure in each chunk. Ensures each chunk contains valid, parseable JSON.

```python
JSONSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)
```

Methods:

- `split(text: str) -> list[Chunk]`: Split JSON text into chunks with valid JSON in each
- `can_handle(text: str) -> bool`: Returns `True` if text is valid JSON
### HTMLSplitter

Splits HTML documents by headers (h2, h3) while preserving document structure. Automatically detects HTML content by checking for tags, attributes, and document structure.

```python
HTMLSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)
```

Methods:

- `split(text: str) -> list[Chunk]`: Split HTML text into chunks by header boundaries
- `can_handle(text: str) -> bool`: Returns `True` if text contains 3+ HTML indicators
### TextSplitter

Splits plain text using recursive character splitting. Acts as a universal fallback that can handle any text.

```python
TextSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)
```

Methods:

- `split(text: str) -> list[Chunk]`: Split text into chunks
- `can_handle(text: str) -> bool`: Always returns `True` (accepts any text)
### Splitter Protocol

All splitters must implement this protocol:

Methods:

- `split(text: str) -> list[Chunk]`: Split text into chunks
- `can_handle(text: str) -> bool`: Determine if the splitter can handle the given text format
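As a minimal illustration of the protocol, here is a sketch of a no-op splitter that conforms (a toy, with a naive word count standing in for real token counting):

```python
from chunkmate import Chunk

class PassthroughSplitter:
    """Toy splitter that satisfies the Splitter protocol without real splitting logic."""

    def split(self, text: str) -> list[Chunk]:
        # Return the whole text as one chunk; real splitters break text
        # at format-aware boundaries and respect max_chunk_size.
        return [Chunk(text=text, token_count=len(text.split()), metadata={})]

    def can_handle(self, text: str) -> bool:
        # Accept everything, like the built-in TextSplitter fallback
        return True
```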
### MetadataGenerator Protocol

Protocol for implementing synchronous metadata generators that enrich chunks with additional information.

Attributes:

- `name`: A unique identifier for this generator (e.g., "context", "keywords", "summary")

Methods:

- `generate_metadata(text: str, chunk_text: str) -> dict`: Generate metadata for a chunk. Returns a dictionary that will be merged into the chunk's metadata
Example Implementation:

```python
from chunkmate import MetadataGenerator

class MyMetadataGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Your custom logic here - typically calling an LLM
        context = f"This chunk is from a document about: {text[:100]}..."
        summary = chunk_text[:50]
        return {
            "context": context,
            "summary": summary,
            "length": len(chunk_text)
        }
```

Using Multiple Generators:

```python
class ContextGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        return {"context": generate_context(chunk_text)}

class KeywordGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        keywords = extract_keywords(chunk_text)
        return {"keywords": keywords, "keyword_count": len(keywords)}

# Apply both generators
chunker = Chunker(metadata_generators=[ContextGenerator(), KeywordGenerator()])
chunks = chunker.split("document text")
# Each chunk will have 'context', 'keywords', and 'keyword_count' in metadata
```
### AsyncMetadataGenerator Protocol

Protocol for implementing async metadata generators for use with AsyncChunker.

Attributes:

- `name`: A unique identifier for this generator (e.g., "context", "keywords", "summary")
- `max_concurrency` (exposed via `get_max_concurrency()`): Optional maximum number of concurrent calls for this specific generator. If set, limits concurrent metadata generation. If `None` or not set, all chunks are processed concurrently without limits.

Methods:

- `async generate_metadata(text: str, chunk_text: str) -> dict`: Asynchronously generate metadata for a chunk. Returns a dictionary that will be merged into the chunk's metadata
Example Implementation:

```python
from chunkmate import AsyncMetadataGenerator

class MyAsyncMetadataGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Limit to 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Use async LLM client for metadata generation
        # AsyncChunker calls this method once per chunk, processing all chunks concurrently
        response = await your_llm_client.generate(
            prompt=f"Summarize: {chunk_text[:200]}"
        )
        keywords = await extract_keywords_async(chunk_text)
        return {
            "context": response.text,
            "keywords": keywords,
            "summary": response.summary
        }
```
Using Multiple Async Generators with Rate Limiting:

```python
class AsyncContextGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Limit to 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        context = await llm.generate(f"Context for: {chunk_text}")
        return {"context": context}

class AsyncEmbeddingGenerator:
    def get_max_concurrency(self) -> int | None:
        return 10  # Can handle more concurrent calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        embedding = await embed_text(chunk_text)
        return {"embedding": embedding}

# Apply both generators, each with its own concurrency limit
chunker = AsyncChunker(
    metadata_generators=[AsyncContextGenerator(), AsyncEmbeddingGenerator()]
)
chunks = await chunker.split("document text")
# Each chunk will have 'context' and 'embedding' in metadata
```
Controlling Concurrency Per Generator:

Each generator defines its own rate limit via its `get_max_concurrency()` method. This is crucial when calling external APIs with different rate limits:

```python
class SlowAPIGenerator:
    def get_max_concurrency(self) -> int | None:
        return 3  # This API has strict rate limits

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Calls to expensive/rate-limited API
        result = await slow_llm_api.generate(chunk_text)
        return {"context": result}

class FastLocalGenerator:
    def get_max_concurrency(self) -> int | None:
        return 50  # Local processing can handle more concurrency

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Fast local embedding generation
        embedding = await local_embedder.embed(chunk_text)
        return {"embedding": embedding}

# Each generator uses its own concurrency limit
chunker = AsyncChunker(
    metadata_generators=[SlowAPIGenerator(), FastLocalGenerator()]
)
chunks = await chunker.split("document text")
# SlowAPIGenerator processes with max 3 concurrent calls
# FastLocalGenerator processes with max 50 concurrent calls

# Generators can return None for unlimited concurrency
class UnlimitedGenerator:
    def get_max_concurrency(self) -> int | None:
        return None  # No rate limiting

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        return {"data": await fast_local_operation(chunk_text)}
```
### Converter Functions

#### `chunks_to_langchain(chunks: list[Chunk]) -> list[Document]`

Converts Chunkmate chunks to LangChain documents. Requires `langchain-core` to be installed.

#### `chunks_to_llamaindex(chunks: list[Chunk]) -> list[Document]`

Converts Chunkmate chunks to LlamaIndex documents. Requires `llama-index-core` to be installed.

#### `chunks_to_dict(chunks: list[Chunk]) -> list[dict]`

Converts Chunkmate chunks to plain dictionaries. Useful for JSON serialization, database storage, API integration, or working with frameworks that aren't directly supported. Each dictionary contains `text`, `token_count`, and `metadata` keys. No additional dependencies required.
## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/gvre/chunkmate.git
cd chunkmate

# Install dependencies
uv sync

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=src/chunkmate --cov-report=html

# Run linter
uv run ruff check .

# Format code
uv run ruff format .
```

### Running Tests

```bash
# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/unit/test_chunker.py

# Run with verbose output
uv run pytest -v

# Run with coverage
uv run pytest --cov

# Run with coverage report
uv run pytest --cov=src/chunkmate --cov-report=html
# Open htmlcov/index.html to view detailed coverage report
```
Test Coverage: The project maintains 100% test coverage for core functionality, including:

- All chunker implementations (`Chunker`, `AsyncChunker`)
- All splitters (Markdown, JSON, HTML, Text)
- Context generation (sync and async)
- Rate limiting and concurrent processing
- Error handling and edge cases
## Why Chunkmate?
When you need to split text, you need a reliable companion. Someone who can handle any document and split it into perfect chunks with ease. That's Chunkmate - your friendly text chunking mate.
Chunkmate makes text splitting simple, intelligent, and effortless.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Author
Giannis Vrentzos - gvre@gvre.gr
## Acknowledgments
- Built on top of LangChain Text Splitters
- Token counting powered by tiktoken
- Note: The core library was designed and hand-written by the author. Tests, documentation (docstrings), and this README were AI-generated.
Remember: With Chunkmate, you're never alone in your text processing journey.