A highly performant Python library for recursive semantic text chunking that respects token limits.
TikChunk: Semantic Text Chunker
A Python library for quickly chunking text documents at semantic boundaries while respecting token limits. Designed for RAG (Retrieval-Augmented Generation) applications that need to split documents into meaningful, token-constrained segments.
Performance: an extremely fast Python-based semantic chunker. It chunks the entire NLTK Gutenberg corpus in ~2.8 seconds on an M1 Mac with max_tokens=512.
Features
- Semantic boundary detection: Splits text at natural breakpoints (paragraphs, sentences, clauses) rather than arbitrary character counts
- Token-aware chunking: Uses tiktoken to ensure chunks stay within specified token limits
- Hierarchical splitting: Progressively splits at different semantic levels (paragraphs → sentences → clauses → words)
- Delimiter preservation: Maintains delimiters in the output for natural text flow
- Intelligent merging: Combines smaller chunks to maximize token usage without exceeding limits
- Flexible output: Returns either text chunks or interval boundaries
Quick Start
```shell
pip install tikchunk
```

```python
import tiktoken
from tikchunk import Chunker

# Initialize with your text and encoding
encoding = tiktoken.get_encoding("cl100k_base")
text = "Your long document text here..."

# Create a chunker with the desired max tokens per chunk
chunker = Chunker(
    encoding=encoding,
    text=text,
    max_tokens=512,
)

# Generate chunks
for chunk in chunker.chunk():
    print(chunk)
    print("---")
```
How It Works
The chunker uses a priority-based splitting strategy:
- Priority 0: Paragraph breaks (`\n\n\n`, `\r\n\r\n\r\n`, `\n\n`, `\r\n\r\n`)
- Priority 1: Line breaks and dividers (`\n---\n`, `\n===\n`, `\n***\n`, `\r\n`, `\n`, `\r`)
- Priority 2: Sentence endings (`.`, `!`, `?`, plus full-width forms)
- Priority 3: Clause separators (`;`, `:`, `--`, `—`, `–`, plus full-width forms)
- Priority 4: Phrase separators (`,`, `...`, `…`, plus full-width forms)
- Priority 5: Word boundaries (whitespace)
When a chunk exceeds max_tokens, the algorithm:
- Splits at the current priority level
- Merges adjacent segments up to the token limit
- Recursively processes any remaining oversized chunks at the next priority level
This ensures text is split at the most semantically meaningful boundaries possible while staying within token constraints.
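The split-then-merge recursion described above can be sketched in a few lines. This is an illustrative reimplementation, not tikchunk's actual source: token counting is reduced to word counts, the delimiter table is abbreviated, and all names are hypothetical.

```python
import re

# Abbreviated priority table, highest-level boundaries first
DELIMITERS = [r"\n\n", r"[.!?]", r"[;:]", r",", r"\s"]

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tiktoken count

def chunk(text: str, max_tokens: int, priority: int = 0) -> list[str]:
    if count_tokens(text) <= max_tokens or priority >= len(DELIMITERS):
        return [text]
    # 1. Split at the current priority level; the capture group keeps
    #    delimiters so they can be re-attached to the preceding segment
    parts = re.split(f"({DELIMITERS[priority]})", text)
    segments = ["".join(parts[i:i + 2]) for i in range(0, len(parts), 2)]
    # 2. Merge adjacent segments up to the token limit
    merged, current = [], ""
    for seg in segments:
        if current and count_tokens(current + seg) > max_tokens:
            merged.append(current)
            current = seg
        else:
            current += seg
    if current:
        merged.append(current)
    # 3. Recurse on any still-oversized chunk at the next priority level
    out = []
    for m in merged:
        if count_tokens(m) > max_tokens:
            out.extend(chunk(m, max_tokens, priority + 1))
        else:
            out.append(m)
    return out
```

Because delimiters are kept attached to their segments, concatenating the returned chunks reproduces the original text exactly.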
API Reference
Chunker
Parameters:
- `encoding` (`tiktoken.Encoding`): The tokenizer encoding to use
- `text` (`str`): The text to chunk
- `max_tokens` (`int`, optional): Maximum tokens per chunk. Default: 512
- `as_text` (`bool`, optional): Return text chunks (`True`) or `Interval` objects (`False`). Default: `True`
Methods:
chunk(): Returns a generator yielding text chunks or Intervals
chunk(text, tok_prefix_sum, max_tokens)
Low-level function for custom chunking workflows.
Parameters:
- `text` (`str`): Text to chunk
- `tok_prefix_sum` (`np.ndarray`): Prefix sum array of token positions
- `max_tokens` (`int`): Maximum tokens per chunk
Returns:
list[Interval]: List of text intervals representing chunks
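A token prefix sum is what makes interval token counts cheap: if entry `i` holds the number of tokens ending at or before character `i`, the count for any interval is a single subtraction. The exact array layout tikchunk expects is an assumption here; this small illustration uses a whitespace tokenizer standing in for tiktoken.

```python
import numpy as np

text = "alpha beta gamma delta"
# Stand-in tokenizer: one token per word (tiktoken would produce real tokens).
# tok_prefix_sum[i] = number of tokens that end at or before character i.
tok_prefix_sum = np.zeros(len(text) + 1, dtype=np.int64)
pos = 0
for word in text.split(" "):
    pos += len(word)
    tok_prefix_sum[pos:] += 1  # this token is counted from its end onward
    pos += 1                   # skip the separating space

# The token count of any [start, end) interval is a single subtraction:
start, end = 6, 16             # text[6:16] == "beta gamma"
interval_tokens = tok_prefix_sum[end] - tok_prefix_sum[start]  # 2
```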
Use Cases
- RAG pipelines: Split documents for vector database ingestion
- Long-context processing: Break documents into manageable segments for LLM processing
- Document analysis: Create semantically coherent text segments for analysis
- Context window management: Ensure text fits within model token limits
Advanced Usage
Getting Interval Boundaries
```python
chunker = Chunker(
    encoding=encoding,
    text=text,
    max_tokens=512,
    as_text=False,  # Return Interval objects
)

for interval in chunker.chunk():
    print(f"Chunk from {interval.start} to {interval.end}")
    print(text[interval.start:interval.end])
```
Implementation Details
- Uses regex-based pattern matching for efficient delimiter detection
- Employs numpy for fast token prefix sum calculations
- Implements a stack-based iterative approach to avoid recursion performance costs
- Preserves delimiters to maintain natural text readability
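The recursive descent through priority levels maps naturally onto an explicit work stack. A minimal sketch of that iterative shape, with hypothetical names, word counts standing in for token counts, and merging/delimiter handling omitted for brevity:

```python
import re

def chunk_iterative(text, max_tokens, splitters):
    """splitters: one callable per priority level, each mapping text -> segments."""
    results = []
    stack = [(text, 0)]  # (segment, priority level to try next)
    while stack:
        seg, level = stack.pop()
        if len(seg.split()) <= max_tokens or level >= len(splitters):
            results.append(seg)
            continue
        # Push sub-segments in reverse so they pop back off in document order
        for sub in reversed(splitters[level](seg)):
            stack.append((sub, level + 1))
    return results

splitters = [
    lambda t: t.split("\n\n"),              # priority 0: paragraphs
    lambda t: re.split(r"(?<=[.!?]) ", t),  # priority 1: sentences
]
```

Each oversized segment is re-pushed with the next priority level, so no Python call stack grows with document depth.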
License
MIT License
Contributing
Contributions welcome! Please ensure any changes preserve the semantic splitting behavior and include appropriate tests. As this project is open source, ensure the minimum coverage constraints are met, and include relevant property-based tests written with Hypothesis.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file tikchunk-0.0.5.tar.gz.
File metadata
- Download URL: tikchunk-0.0.5.tar.gz
- Upload date:
- Size: 46.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e1578fb7ac7015a7d6d76fa71a223f0400e99cb9d394822d32b0a688fac23bba` |
| MD5 | `d592101bd8a30b583cbbc0926c04cb01` |
| BLAKE2b-256 | `93c2095020853143674fa53827bb0a7ef254b5a839edda7a88a3ed87303dbd2e` |
File details
Details for the file tikchunk-0.0.5-py3-none-any.whl.
File metadata
- Download URL: tikchunk-0.0.5-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `651c24ca89455dc8be4b606d7aee95aaa20cf9336e76839fe9ce65b30a1ff83a` |
| MD5 | `8848f4260030c250028e754d6be76557` |
| BLAKE2b-256 | `50c5a183421db7e881231ae7cedcabc81e907a7fa76defde7c84f49fe1d2c673` |