Skip to main content

Intelligent text and document chunking library with automatic format detection and AI-powered metadata generation for RAG, LLMs, and text processing pipelines

Project description

🤝 Chunkmate

Chunkmate is your friendly companion for splitting text into manageable chunks!

A Python library for intelligently splitting text and documents into token-aware chunks. Perfect for processing different type of documents for RAG (Retrieval-Augmented Generation) systems, LLM applications, and any text processing pipeline that needs smart chunking.

🚀 Simple. Powerful. Just 2 Lines.

chunker = Chunker()
chunks = chunker.split(Path("document.md"))  # Works with .md, .json, .html, .txt files or raw text!

That's it! Split any document (Markdown, JSON, HTML, or plain text) with automatic format detection and intelligent chunking. Pass a file Path or raw text string. No configuration needed. No complexity. Just results.

Features

  • 🎯 Smart Text Splitting: Intelligently splits text while respecting natural boundaries (paragraphs, sentences, etc.)
  • 📝 Multiple Format Support: Built-in splitters for common document types:
    • Markdown (.md, .mdx, .markdown) - Header-aware splitting that respects document structure
    • JSON (.json) - Recursive splitting while preserving valid JSON structure
    • HTML (.html, .htm) - Converts to Markdown and splits by headers (h2, h3)
    • Plain Text (.txt) - Recursive character-based splitting with smart boundaries
  • 🤖 Auto-Detection: Automatically detects text format (JSON, Markdown, HTML, plain text) when no file extension is available
  • 🔢 Token-Aware: Counts tokens using tiktoken to ensure chunks fit within model limits
  • 🧩 Extensible: Easy to add custom splitters for different document types
  • 🎨 Metadata Generation: Optional metadata enrichment for each chunk (custom or AI-powered) with support for multiple generators to improve RAG retrieval with context, keywords, summaries, etc.
  • Async Support: AsyncChunker for concurrent metadata generation with rate limiting
  • 📁 File or String: Works with both file paths and raw text strings
  • 🎛️ Configurable: Customize chunk size and splitting behavior
  • Fully Tested: 100% test coverage with comprehensive unit tests

Installation

pip install chunkmate

Or using uv:

uv add chunkmate

Optional Dependencies:

For framework integrations, install with optional dependencies:

# With LangChain support
pip install chunkmate[langchain]

# With LlamaIndex support
pip install chunkmate[llamaindex]

# With all optional dependencies
pip install chunkmate[all]

Quick Start

Basic Usage

from chunkmate import Chunker

# Initialize the chunker
chunker = Chunker()

# Split text directly
text = "Your long document text here..."
chunks = chunker.split(text)

# Or split from a file
from pathlib import Path
chunks = chunker.split(Path("document.md"))

# Add custom metadata to all chunks
chunks = chunker.split(text, metadata={"source": "user_input", "category": "docs"})

# Access chunk data
for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
    print(f"Metadata: {chunk.metadata}")

Metadata Generation

Enrich chunks with AI-generated metadata for better retrieval in RAG systems:

from chunkmate import Chunker, MetadataGenerator

# Implement your metadata generator
class MyMetadataGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Use your LLM to generate metadata for this chunk
        # text: the full document text
        # chunk_text: the specific chunk text
        # Return a dictionary with any metadata fields you want
        return {
            "context": f"This chunk discusses {chunk_text[:50]}...",
            "doc_topic": text[:50]
        }

# Create chunker with metadata generation
chunker = Chunker(metadata_generators=MyMetadataGenerator())

# Split text - metadata is automatically added to each chunk
chunks = chunker.split("Your document text")

# Each chunk now has generated metadata
for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Context: {chunk.metadata['context']}")
    print(f"Topic: {chunk.metadata['doc_topic']}")

# Use multiple generators to add different types of metadata
class KeywordGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Extract keywords from chunk
        keywords = extract_keywords(chunk_text)
        return {"keywords": keywords, "keyword_count": len(keywords)}

chunker = Chunker(metadata_generators=[MyMetadataGenerator(), KeywordGenerator()])
chunks = chunker.split("Your document text")

# Each chunk has metadata from both generators
for chunk in chunks:
    print(f"Context: {chunk.metadata['context']}")
    print(f"Keywords: {chunk.metadata['keywords']}")

# Combine with custom metadata
chunks = chunker.split(
    "Your document text",
    metadata={"source": "user_input", "category": "docs"}
)
# Each chunk will have custom metadata + generated metadata

Async Metadata Generation (Concurrent Processing)

For high-performance applications, use AsyncChunker to generate metadata for all chunks concurrently:

from chunkmate import AsyncChunker, AsyncMetadataGenerator

# Implement your async metadata generator
class MyAsyncMetadataGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Limit to 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Use your async LLM client to generate metadata
        # AsyncChunker will call this method concurrently for all chunks
        response = await your_llm_client.generate(
            prompt=f"Summarize this chunk: {chunk_text}"
        )
        return {
            "context": response.text,
            "summary": response.summary
        }

# Create async chunker with concurrent metadata generation
chunker = AsyncChunker(metadata_generators=MyAsyncMetadataGenerator())

# Split text - metadata generation happens concurrently for all chunks
# This provides significant speedup for N chunks when calling I/O-bound APIs
async def process_document():
    chunks = await chunker.split("Your document text")
    return chunks

# With rate limiting to avoid API throttling (recommended)
# Control concurrency via get_max_concurrency() in your metadata generator
class RateLimitedGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Max 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        summary = await call_llm_api(chunk_text)
        return {"summary": summary}

chunker = AsyncChunker(metadata_generators=RateLimitedGenerator())

# This ensures you don't hit rate limits on LLM APIs
chunks = await chunker.split(Path("large_document.md"))

# Use multiple generators - each processes all chunks concurrently
class AsyncKeywordGenerator:
    def get_max_concurrency(self) -> int | None:
        return 10  # Can handle more concurrent calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        keywords = await extract_keywords_async(chunk_text)
        return {"keywords": keywords, "keyword_count": len(keywords)}

chunker = AsyncChunker(
    metadata_generators=[MyAsyncMetadataGenerator(), AsyncKeywordGenerator()]
)

# All generators run sequentially, but each processes all chunks concurrently
# Each generator's max_concurrency is controlled by its get_max_concurrency() method
chunks = await chunker.split("Your document text")

# Access all generated metadata
for chunk in chunks:
    print(f"Context: {chunk.metadata['context']}")
    print(f"Keywords: {chunk.metadata['keywords']}")

Why AsyncChunker?

  • Concurrent Processing: Metadata generation for all chunks happens concurrently
  • Significant Speedup: Process multiple chunks simultaneously—typically 5-10× faster for I/O-bound operations like LLM API calls
  • Rate Limiting: Built-in max_concurrency parameter to prevent API throttling
  • Multiple Generators: Support for multiple metadata generators, each processing all chunks concurrently
  • Error Handling: Robust error handling with ExceptionGroup for failed chunks
  • Perfect for LLM APIs: Ideal when calling OpenAI, Anthropic, or other async LLM services

Real-World Example: OpenAI Context Generator

Here's a complete, production-ready example using OpenAI's API to generate contextual descriptions for chunks:

import os
import logging
from typing import Any
from openai import AsyncOpenAI
from openai import RateLimitError, APIError
from chunkmate import AsyncChunker, AsyncMetadataGenerator

class AsyncOpenAIContextGenerator:
    def get_max_concurrency(self) -> int | None:
        return self.max_concurrency

    def __init__(
        self, api_key: str | None = None, max_concurrency: int | None = 5
    ) -> None:
        self.api_key: str = api_key if api_key else os.getenv("OPENAI_API_KEY", "")
        if not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable is not set")
        # Configure with more retries for rate limit handling (default is 2)
        self.client: AsyncOpenAI = AsyncOpenAI(api_key=self.api_key, max_retries=5)
        self.model: str = "gpt-5-nano"
        self.max_concurrency: int | None = max_concurrency

    async def generate_metadata(self, text: str, chunk_text: str) -> dict[str, Any]:
        """
        Generate context metadata for a chunk using the OpenAI API.

        Uses prompt caching by passing the complete document text, which is cached
        across multiple requests for the same document, reducing costs significantly.

        Automatically retries up to 5 times on rate limits and API errors with
        exponential backoff. Returns an empty context payload on failure for
        graceful degradation.

        Args:
            text: The full document text.
            chunk_text: The specific chunk text to generate context for.

        Returns:
            dict[str, Any]: A dictionary with the 'context' key containing a brief
            2-3 sentence description, or an empty string if generation fails.
        """
        # Structure prompt to maximize caching: put the full document first (cached),
        # then the variable part (the specific chunk) last
        system_message = """You are a helpful assistant that generates brief, contextual descriptions for text chunks. Given a complete document and a specific chunk from it, provide a brief, succinct context (2-3 sentences max) that describes what this chunk is about and how it relates to the overall document."""

        user_message = f"""Complete Document:
{text}

---

For the following chunk from the above document, provide a brief context:

Chunk:
{chunk_text}

Context:"""
        try:
            response = await self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
            )

            content = response.choices[0].message.content
            if content is None:
                logging.warning("OpenAI returned None content")
                return {"context": ""}
            return {"context": content.strip()}
        except RateLimitError as e:
            logging.warning(f"Rate limit hit: {e}")
            return {"context": ""}
        except APIError as e:
            logging.error(f"OpenAI API error: {e}")
            return {"context": ""}
        except Exception as e:
            logging.error(f"Unexpected error generating context: {e}", exc_info=True)
            return {"context": ""}


class OpenAIContextGenerator:
    def __init__(self) -> None:
        self.api_key: str = os.getenv("OPENAI_API_KEY", "")
        if not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable is not set")
        # Configure with more retries for rate limit handling (default is 2)
        self.client: OpenAI = OpenAI(api_key=self.api_key, max_retries=5)
        self.model: str = "gpt-5-nano"

    def generate_metadata(self, text: str, chunk_text: str) -> dict[str, Any]:
        """
        Generate context metadata for a chunk using the OpenAI API.

        Uses prompt caching by passing the complete document text, which is cached
        across multiple requests for the same document, reducing costs significantly.

        Automatically retries up to 5 times on rate limits and API errors with
        exponential backoff. Returns an empty context payload on failure for
        graceful degradation.

        Args:
            text: The full document text.
            chunk_text: The specific chunk text to generate context for.

        Returns:
            dict[str, Any]: A dictionary with the 'context' key containing a brief
            2-3 sentence description, or an empty string if generation fails.
        """
        # Structure prompt to maximize caching: put the full document first (cached),
        # then the variable part (the specific chunk) last
        system_message = """You are a helpful assistant that generates brief, contextual descriptions for text chunks. Given a complete document and a specific chunk from it, provide a brief, succinct context (2-3 sentences max) that describes what this chunk is about and how it relates to the overall document."""

        user_message = f"""Complete Document:
{text}

---

For the following chunk from the above document, provide a brief context:

Chunk:
{chunk_text}

Context:"""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
            )

            content = response.choices[0].message.content
            if content is None:
                logging.warning("OpenAI returned None content")
                return {"context": ""}
            return {"context": content.strip()}
        except RateLimitError as e:
            logging.warning(f"Rate limit hit: {e}")
            return {"context": ""}
        except APIError as e:
            logging.error(f"OpenAI API error: {e}")
            return {"context": ""}
        except Exception as e:
            logging.error(f"Unexpected error generating context: {e}", exc_info=True)
            return {"context": ""}

# Usage example (Async)
async def process_document_with_openai_async():
    # Create the context generator
    # Rate limiting is controlled via get_max_concurrency() in the generator
    openai_generator = AsyncOpenAIContextGenerator()

    # Create chunker with OpenAI metadata generation
    chunker = AsyncChunker(
        metadata_generators=openai_generator
    )

    # Process your document
    chunks = await chunker.split("Your long document text here...")

    # Each chunk now has AI-generated context
    for chunk in chunks:
        print(f"Chunk: {chunk.text[:100]}...")
        print(f"Context: {chunk.metadata['context']}")
        print("---")

    return chunks

# Run the async function
import asyncio
asyncio.run(process_document_with_openai_async())

# Usage example (Sync)
def process_document_with_openai_sync():
    # Create the synchronous context generator
    openai_generator = OpenAIContextGenerator()

    # Create chunker with OpenAI metadata generation
    chunker = Chunker(
        metadata_generators=openai_generator
    )

    # Process your document
    chunks = chunker.split("Your long document text here...")

    # Each chunk now has AI-generated context
    for chunk in chunks:
        print(f"Chunk: {chunk.text[:100]}...")
        print(f"Context: {chunk.metadata['context']}")
        print("---")

    return chunks

# Run the sync function
process_document_with_openai_sync()

Key Features:

  • Prompt Caching: Structures prompts to maximize OpenAI's caching benefits by putting the full document first (cached across requests)
  • Automatic Retries: Configures 5 retries with exponential backoff for rate limits and API errors
  • Graceful Degradation: Returns empty context on failure instead of crashing
  • Cost Efficient: Using gpt-5-nano for cost-effective context generation
  • Rate Limit Protection: Async version uses max_concurrency=5 to prevent overwhelming the API
  • Production Ready: Includes comprehensive error handling and logging
  • Flexible: Choose async for concurrent processing or sync for simpler sequential processing

Cost Optimization Tips:

  • The prompt structure maximizes caching - the full document is cached, reducing costs for subsequent chunks
  • Adjust max_concurrency based on your OpenAI rate limits (async version only)

Auto-Detection of Text Format

When splitting text directly (without a file extension), Chunkmate automatically detects the format:

from chunkmate import Chunker

chunker = Chunker()

# JSON content is automatically detected
json_text = '{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}'
chunks = chunker.split(json_text)
# Uses JSONSplitter automatically

# HTML content is automatically detected
html_text = """
<html>
<body>
    <h1>Welcome</h1>
    <p>This is HTML content.</p>
</body>
</html>
"""
chunks = chunker.split(html_text)
# Uses HTMLSplitter automatically

# Markdown content is automatically detected
markdown_text = """
## Introduction
This is a markdown document.

### Features
- Feature 1
- Feature 2
"""
chunks = chunker.split(markdown_text)
# Uses MarkdownSplitter automatically

# Plain text is handled by TextSplitter
plain_text = "This is just plain text without any special formatting."
chunks = chunker.split(plain_text)
# Uses TextSplitter automatically

Custom Splitters

Create your own splitter for specific document types:

from chunkmate import Chunker, Splitter, Chunk
import csv
from io import StringIO

class CSVSplitter(Splitter):
    def __init__(self, max_chunk_size: int = 512, rows_per_chunk: int = 100):
        self.max_chunk_size = max_chunk_size
        self.rows_per_chunk = rows_per_chunk

    def split(self, text: str) -> list[Chunk]:
        """Split CSV into chunks, each containing a batch of rows with header."""
        reader = csv.reader(StringIO(text))
        rows = list(reader)

        if not rows:
            return []

        header = rows[0]
        data_rows = rows[1:]
        chunks = []

        # Split data into batches, each with the header
        for i in range(0, len(data_rows), self.rows_per_chunk):
            batch = [header] + data_rows[i:i + self.rows_per_chunk]
            chunk_text = '\n'.join(','.join(row) for row in batch)
            chunks.append(Chunk(
                text=chunk_text,
                token_count=len(chunk_text.split()),
                metadata={"rows": len(batch) - 1, "has_header": True}
            ))

        return chunks

    def can_handle(self, text: str) -> bool:
        """Detect if text is CSV format by checking for comma-separated structure."""
        try:
            lines = text.strip().split('\n')
            if len(lines) < 2:
                return False

            # Check if at least 80% of lines have the same number of commas
            comma_counts = [line.count(',') for line in lines[:10]]
            if not comma_counts or comma_counts[0] == 0:
                return False

            consistent = sum(1 for c in comma_counts if c == comma_counts[0])
            return consistent / len(comma_counts) >= 0.8
        except:
            return False

# Register your custom splitter
chunker = Chunker(
    splitters={"csv": CSVSplitter(rows_per_chunk=50)},
    extension_map={"csv": "csv", "tsv": "csv"}
)

# Now .csv files will use your custom splitter
chunks = chunker.split(Path("data.csv"))

# CSV text is auto-detected too!
csv_text = """name,age,city
Alice,30,NYC
Bob,25,LA
Charlie,35,Chicago"""
chunks = chunker.split(csv_text)  # Automatically uses CSVSplitter
print(f"Split into {len(chunks)} chunks")
print(f"First chunk has {chunks[0].metadata['rows']} rows")

Important: The order of splitters matters for auto-detection. When registering custom splitters, more specialized splitters should be added before general-purpose ones. TextSplitter should always be last as it accepts any text format.

Configure Chunk Size

from chunkmate import TextSplitter, MarkdownSplitter, Chunker

# Create splitters with custom chunk sizes
text_splitter = TextSplitter(max_chunk_size=1024)
markdown_splitter = MarkdownSplitter(max_chunk_size=1024)

# Use in chunker
chunker = Chunker(
    splitters={
        "text": text_splitter,
        "markdown": markdown_splitter,
    }
)

Custom Token Counting

By default, Chunkmate uses o200k_base encoding (for GPT-4o/GPT-4o-mini or newer models). For older models, provide a custom token counter:

import tiktoken
from chunkmate import TextSplitter

# For GPT-4, GPT-3.5-turbo, or text-embedding models
def cl100k_counter(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

splitter = TextSplitter(
    max_chunk_size=512,
    count_tokens=cl100k_counter
)

# Or use a simple word-based counter
def word_counter(text: str) -> int:
    return len(text.split())

splitter = TextSplitter(count_tokens=word_counter)

Encoding Compatibility:

  • o200k_base (default): GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini, GPT-5-nano
  • cl100k_base: GPT-4, GPT-4-turbo, GPT-3.5-turbo, text-embedding-3-*
  • p50k_base: GPT-3 models (Davinci, Curie, etc.)

Convert to Other Formats

Chunkmate provides converters to integrate with popular frameworks:

from chunkmate import Chunker, chunks_to_langchain, chunks_to_llamaindex, chunks_to_dict

chunker = Chunker()
chunks = chunker.split("Your document text here...")

# Convert to LangChain documents
langchain_docs = chunks_to_langchain(chunks)
# Returns: list[langchain_core.documents.Document]

# Convert to LlamaIndex documents
llamaindex_docs = chunks_to_llamaindex(chunks)
# Returns: list[llama_index.core.schema.Document]

# Convert to plain dictionaries (for JSON serialization, databases, etc.)
chunk_dicts = chunks_to_dict(chunks)
# Returns: list[dict] with keys: text, token_count, metadata
# Example output: [
#     {'text': 'Hello', 'token_count': 1, 'metadata': {'section': 'intro'}},
#     {'text': 'World', 'token_count': 1, 'metadata': {'section': 'body'}}
# ]

Installation with Optional Dependencies:

# Install with LangChain support
pip install chunkmate[langchain]

# Install with LlamaIndex support
pip install chunkmate[llamaindex]

# Install with both LangChain and LlamaIndex
pip install chunkmate[all]

# Using uv
uv add chunkmate[langchain]
uv add chunkmate[llamaindex]
uv add chunkmate[all]

Supported File Types

By default, Chunkmate supports:

  • Markdown: .md, .mdx, .markdown - Uses header-aware splitting
  • JSON: .json - Uses recursive JSON splitting while preserving structure
  • HTML: .html, .htm - Uses header-aware HTML splitting
  • Text: .txt - Uses recursive character splitting

Auto-Detection

When processing raw text strings (no file extension), Chunkmate automatically detects the format:

  • JSON Detection: Checks if text is valid JSON (objects or arrays)
  • Markdown Detection: Checks for headers (#), code blocks (```), links, lists, and other markdown patterns (requires 2+ indicators)
  • HTML Detection: Checks for HTML tags, attributes, document structure (requires 3+ indicators)
  • Text Fallback: If no specific format is detected, falls back to plain text splitting

The detection uses a scoring system that requires multiple format indicators to avoid false positives. The order matters: JSON is checked first, then Markdown, then HTML, and finally Text as a universal fallback. This ensures Markdown files with embedded HTML are correctly detected as Markdown.

You can easily add support for more file types using custom splitters and extension mappings. When implementing custom splitters, make sure to implement the can_handle() method for automatic format detection.

API Reference

Chunker

Main class for chunking documents (synchronous).

Chunker(
    splitters: dict[str, Splitter] | None = None,
    extension_map: dict[str, str] | None = None,
    logger: logging.Logger | None = None,
    metadata_generators: list[MetadataGenerator] | MetadataGenerator | None = None
)

Parameters:

  • splitters: Optional dict of custom splitters to add or override defaults
  • extension_map: Optional dict mapping file extensions to splitter types
  • logger: Optional logger for debugging and operation tracking
  • metadata_generators: Optional MetadataGenerator instance or list of MetadataGenerator instances for enriching chunks with metadata. Multiple generators are applied sequentially, with each adding different metadata fields to the chunks

Methods:

  • split(text: str | Path, metadata: dict | None = None) -> list[Chunk]: Split text or file into chunks with optional metadata generation

Example:

from chunkmate import Chunker, MetadataGenerator

class ContextGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        return {"context": f"Summary: {chunk_text[:100]}"}

# Single generator
chunker = Chunker(metadata_generators=ContextGenerator())
chunks = chunker.split("document.md")

# Multiple generators
chunker = Chunker(metadata_generators=[
    ContextGenerator(),
    KeywordGenerator()
])
chunks = chunker.split("document text")
# Each chunk has metadata from all generators

AsyncChunker

Async version of Chunker with concurrent metadata generation and rate limiting.

AsyncChunker(
    splitters: dict[str, Splitter] | None = None,
    extension_map: dict[str, str] | None = None,
    logger: logging.Logger | None = None,
    metadata_generators: list[AsyncMetadataGenerator] | AsyncMetadataGenerator | None = None,
    max_concurrency: int | None = None
)

Parameters:

  • splitters: Optional dict of custom splitters to add or override defaults
  • extension_map: Optional dict mapping file extensions to splitter types
  • logger: Optional logger for debugging and operation tracking
  • metadata_generators: Optional AsyncMetadataGenerator instance or list of AsyncMetadataGenerator instances for concurrent metadata enrichment. Multiple generators are applied sequentially, with each processing all chunks concurrently
  • max_concurrency: Optional maximum number of concurrent metadata generation tasks per generator (recommended: 3-10 for LLM APIs)

Methods:

  • async split(text: str | Path, metadata: dict | None = None) -> list[Chunk]: Split text or file into chunks with concurrent metadata generation

Key Benefits:

  • Concurrent Processing: All chunks are processed concurrently for metadata generation, providing significant speedup when calling I/O-bound services (like LLM APIs)
  • Rate Limiting: Use max_concurrency to avoid API throttling (applied per generator)
  • Multiple Generators: Support for multiple metadata generators, each adding different metadata fields
  • Error Handling: Uses ExceptionGroup to report all metadata generation failures

Example:

from chunkmate import AsyncChunker, AsyncMetadataGenerator

class ContextGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        context = await llm.generate(f"Summarize: {chunk_text}")
        return {"context": context}

class KeywordGenerator:
    def get_max_concurrency(self) -> int | None:
        return 10

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        keywords = await extract_keywords(chunk_text)
        return {"keywords": keywords}

# Use multiple generators with rate limiting
chunker = AsyncChunker(
    metadata_generators=[ContextGenerator(), KeywordGenerator()],
    max_concurrency=5  # Max 5 concurrent API calls per generator
)

chunks = await chunker.split("document.md")
# Each chunk has both 'context' and 'keywords' in metadata

Chunk

Data class representing a text chunk.

Attributes:

  • text: str: The chunk text
  • token_count: int: Number of tokens in the chunk
  • metadata: dict: Additional metadata (e.g., context, headers, source)

MarkdownSplitter

Splits markdown documents respecting header structure. Automatically detects markdown content by looking for headers, code blocks, lists, and other markdown patterns.

MarkdownSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)

Methods:

  • split(text: str) -> list[Chunk]: Split markdown text into chunks
  • can_handle(text: str) -> bool: Returns True if text contains 2+ markdown patterns

JSONSplitter

Splits JSON documents recursively while preserving valid JSON structure in each chunk. Ensures each chunk contains valid, parseable JSON.

JSONSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)

Methods:

  • split(text: str) -> list[Chunk]: Split JSON text into chunks with valid JSON in each
  • can_handle(text: str) -> bool: Returns True if text is valid JSON

HTMLSplitter

Splits HTML documents by headers (h2, h3) while preserving document structure. Automatically detects HTML content by checking for tags, attributes, and document structure.

HTMLSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)

Methods:

  • split(text: str) -> list[Chunk]: Split HTML text into chunks by header boundaries
  • can_handle(text: str) -> bool: Returns True if text contains 3+ HTML indicators

TextSplitter

Splits plain text using recursive character splitting. Acts as a universal fallback that can handle any text.

TextSplitter(
    max_chunk_size: int = 512,
    count_tokens: Callable[[str], int] = default_token_counter
)

Methods:

  • split(text: str) -> list[Chunk]: Split text into chunks
  • can_handle(text: str) -> bool: Always returns True (accepts any text)

Splitter Protocol

All splitters must implement this protocol:

Methods:

  • split(text: str) -> list[Chunk]: Split text into chunks
  • can_handle(text: str) -> bool: Determine if the splitter can handle the given text format

MetadataGenerator Protocol

Protocol for implementing synchronous metadata generators that enrich chunks with additional information.

Attributes:

  • name: A unique identifier for this generator (e.g., "context", "keywords", "summary")

Methods:

  • generate_metadata(text: str, chunk_text: str) -> dict: Generate metadata for a chunk. Returns a dictionary that will be merged into the chunk's metadata

Example Implementation:

from chunkmate import MetadataGenerator

class MyMetadataGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Your custom logic here - typically calling an LLM
        context = f"This chunk is from a document about: {text[:100]}..."
        summary = chunk_text[:50]
        return {
            "context": context,
            "summary": summary,
            "length": len(chunk_text)
        }

Using Multiple Generators:

class ContextGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        return {"context": generate_context(chunk_text)}

class KeywordGenerator:
    def generate_metadata(self, text: str, chunk_text: str) -> dict:
        keywords = extract_keywords(chunk_text)
        return {"keywords": keywords, "keyword_count": len(keywords)}

# Apply both generators
chunker = Chunker(metadata_generators=[ContextGenerator(), KeywordGenerator()])
chunks = chunker.split("document text")
# Each chunk will have 'context', 'keywords', and 'keyword_count' in metadata

AsyncMetadataGenerator Protocol

Protocol for implementing async metadata generators for use with AsyncChunker.

Attributes:

  • name: A unique identifier for this generator (e.g., "context", "keywords", "summary")
  • max_concurrency: Optional maximum number of concurrent calls for this specific generator. If set, limits concurrent metadata generation. If None or not set, all chunks are processed concurrently without limits.

Methods:

  • async generate_metadata(text: str, chunk_text: str) -> dict: Asynchronously generate metadata for a chunk. Returns a dictionary that will be merged into the chunk's metadata

Example Implementation:

from chunkmate import AsyncMetadataGenerator

class MyAsyncMetadataGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Limit to 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Use async LLM client for metadata generation
        # AsyncChunker calls this method once per chunk, processing all chunks concurrently
        response = await your_llm_client.generate(
            prompt=f"Summarize: {chunk_text[:200]}"
        )
        keywords = await extract_keywords_async(chunk_text)
        return {
            "context": response.text,
            "keywords": keywords,
            "summary": response.summary
        }

Using Multiple Async Generators with Rate Limiting:

class AsyncContextGenerator:
    def get_max_concurrency(self) -> int | None:
        return 5  # Limit to 5 concurrent API calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        context = await llm.generate(f"Context for: {chunk_text}")
        return {"context": context}

class AsyncEmbeddingGenerator:
    def get_max_concurrency(self) -> int | None:
        return 10  # Can handle more concurrent calls

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        embedding = await embed_text(chunk_text)
        return {"embedding": embedding}

# Apply both generators, each with its own concurrency limit
chunker = AsyncChunker(
    metadata_generators=[AsyncContextGenerator(), AsyncEmbeddingGenerator()]
)

chunks = await chunker.split("document text")
# Each chunk will have 'context' and 'embedding' in metadata

Controlling Concurrency Per Generator:

Each generator defines its own rate limit via the max_concurrency attribute. This is crucial when calling external APIs with different rate limits:

class SlowAPIGenerator:
    def get_max_concurrency(self) -> int | None:
        return 3  # This API has strict rate limits

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Calls to expensive/rate-limited API
        result = await slow_llm_api.generate(chunk_text)
        return {"context": result}

class FastLocalGenerator:
    def get_max_concurrency(self) -> int | None:
        return 50  # Local processing can handle more concurrency

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        # Fast local embedding generation
        embedding = await local_embedder.embed(chunk_text)
        return {"embedding": embedding}

# Each generator uses its own concurrency limit
chunker = AsyncChunker(
    metadata_generators=[SlowAPIGenerator(), FastLocalGenerator()]
)

chunks = await chunker.split("document text")
# SlowAPIGenerator processes with max 3 concurrent calls
# FastLocalGenerator processes with max 50 concurrent calls

# Generators can return None for unlimited concurrency
class UnlimitedGenerator:
    def get_max_concurrency(self) -> int | None:
        return None  # No rate limiting

    async def generate_metadata(self, text: str, chunk_text: str) -> dict:
        return {"data": await fast_local_operation(chunk_text)}

Converter Functions

chunks_to_langchain(chunks: list[Chunk]) -> list[Document]

Converts Chunkmate chunks to LangChain documents. Requires langchain-core to be installed.

chunks_to_llamaindex(chunks: list[Chunk]) -> list[Document]

Converts Chunkmate chunks to LlamaIndex documents. Requires llama-index-core to be installed.

chunks_to_dict(chunks: list[Chunk]) -> list[dict]

Converts Chunkmate chunks to plain dictionaries. Useful for JSON serialization, database storage, API integration, or working with frameworks that aren't directly supported. Each dictionary contains text, token_count, and metadata keys. No additional dependencies required.

Development

Setup

# Clone the repository
git clone https://github.com/gvre/chunkmate.git
cd chunkmate

# Install dependencies
uv sync

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=src/chunkmate --cov-report=html

# Run linter
uv run ruff check .

# Format code
uv run ruff format .

Running Tests

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/unit/test_chunker.py

# Run with verbose output
uv run pytest -v

# Run with coverage
uv run pytest --cov

# Run with coverage report
uv run pytest --cov=src/chunkmate --cov-report=html
# Open htmlcov/index.html to view detailed coverage report

Test Coverage: The project maintains 100% test coverage for core functionality, including:

  • All chunker implementations (Chunker, AsyncChunker)
  • All splitters (Markdown, JSON, HTML, Text)
  • Context generation (sync and async)
  • Rate limiting and concurrent processing
  • Error handling and edge cases

Why Chunkmate?

When you need to split text, you need a reliable companion. Someone who can handle any document and split it into perfect chunks with ease. That's Chunkmate - your friendly text chunking mate.

Chunkmate makes text splitting simple, intelligent, and effortless.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Giannis Vrentzos - gvre@gvre.gr

Acknowledgments

  • Built on top of LangChain Text Splitters
  • Token counting powered by tiktoken
  • Note: The core library was designed and hand-written by the author. Tests, documentation (docstrings), and this README were AI-generated.

Remember: With Chunkmate, you're never alone in your text processing journey.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunkmate-0.1.0.tar.gz (36.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chunkmate-0.1.0-py3-none-any.whl (40.4 kB view details)

Uploaded Python 3

File details

Details for the file chunkmate-0.1.0.tar.gz.

File metadata

  • Download URL: chunkmate-0.1.0.tar.gz
  • Upload date:
  • Size: 36.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for chunkmate-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7ced51eb248859e32554de2b26a09c70ac5e089e7d7327231c8b13771588ecf4
MD5 9eb2157889c3a72274cf07cdf3e9117d
BLAKE2b-256 b24122933739894cb27ae0c235281d5d2a56c9e34de06478c33b5abd74201ab4

See more details on using hashes here.

File details

Details for the file chunkmate-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: chunkmate-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for chunkmate-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 33981b4bbbe96d3e68119df2d77b004088f127a795e4d598a0dbecab0036b73b
MD5 0d87286278d1ab6e6ad5ae83d4d6181a
BLAKE2b-256 3000e25b93f2c506b0fac560dd4ad3d3f27d249dc27b770eb6be8e75ab3a12d3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page