A high-performance Python library for recursive semantic text chunking that respects token limits.


TikChunk: Semantic Text Chunker

A Python library for quickly chunking text documents at semantic boundaries while respecting token limits. Designed for RAG (Retrieval-Augmented Generation) applications that need to split documents into meaningful, token-constrained segments.

Performance: chunks the entire NLTK Gutenberg corpus in ~2.8 seconds on an M1 Mac with max_tokens=512.

Features

  • Semantic boundary detection: Splits text at natural breakpoints (paragraphs, sentences, clauses) rather than arbitrary character counts
  • Token-aware chunking: Uses tiktoken to ensure chunks stay within specified token limits
  • Hierarchical splitting: Progressively splits at different semantic levels (paragraphs → sentences → clauses → words)
  • Delimiter preservation: Maintains delimiters in the output for natural text flow
  • Intelligent merging: Combines smaller chunks to maximize token usage without exceeding limits
  • Flexible output: Returns either text chunks or interval boundaries

Quick Start

pip install tikchunk

import tiktoken
from tikchunk import Chunker

# Initialize with your text and encoding
encoding = tiktoken.get_encoding("cl100k_base")
text = "Your long document text here..."

# Create chunker with desired max tokens per chunk
chunker = Chunker(
    encoding=encoding,
    text=text,
    max_tokens=512
)

# Generate chunks
for chunk in chunker.chunk():
    print(chunk)
    print("---")

How It Works

The chunker uses a priority-based splitting strategy:

  1. Priority 0: Paragraph breaks ("\n\n\n", "\r\n\r\n\r\n", "\n\n", "\r\n\r\n")
  2. Priority 1: Line breaks and dividers ("\n---\n", "\n===\n", "\n***\n", "\r\n", "\n", "\r")
  3. Priority 2: Sentence endings (". ", "! ", "? ", ".", "!", "?")
  4. Priority 3: Clause separators ("; ", ": ", ";", ":", " -- ", "--")
  5. Priority 4: Phrase separators (", ", ",", "...")
  6. Priority 5: Word boundaries (" ")

When a chunk exceeds max_tokens, the algorithm:

  1. Splits at the current priority level
  2. Merges adjacent segments up to the token limit
  3. Recursively processes any remaining oversized chunks at the next priority level

This ensures text is split at the most semantically meaningful boundaries possible while staying within token constraints.
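The split-merge-recurse strategy can be sketched in a few lines. This is an illustrative stand-alone implementation, not tikchunk's actual code: it uses a toy word-count tokenizer in place of tiktoken, a simplified set of regex priorities, and rejoins merged segments with a space rather than preserving the original delimiters as tikchunk does.

```python
import re

# Illustrative priority ladder: paragraph breaks, line breaks,
# sentence endings, clause separators, phrase separators, words.
# Lookbehinds keep the punctuation attached to the preceding segment.
PRIORITIES = [r"\n\n+", r"\n", r"(?<=[.!?]) ", r"(?<=[;:]) ", r"(?<=,) ", r" "]

def count_tokens(s):
    return len(s.split())  # stand-in for a real tiktoken count

def chunk(text, max_tokens, level=0):
    if count_tokens(text) <= max_tokens or level >= len(PRIORITIES):
        return [text]
    # 1. Split at the current priority level.
    parts = [p for p in re.split(PRIORITIES[level], text) if p]
    # 2. Merge adjacent segments greedily up to the token limit.
    #    (tikchunk preserves delimiters; this sketch rejoins with a space.)
    merged, buf = [], ""
    for part in parts:
        candidate = (buf + " " + part).strip() if buf else part
        if count_tokens(candidate) <= max_tokens:
            buf = candidate
        else:
            if buf:
                merged.append(buf)
            buf = part
    if buf:
        merged.append(buf)
    # 3. Recurse into any segment still over the limit at the next level.
    out = []
    for seg in merged:
        out.extend(chunk(seg, max_tokens, level + 1))
    return out
```

For example, chunking "One two three. Four five six. Seven eight nine." with max_tokens=4 falls through to the sentence-ending level and yields three chunks of three tokens each.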

API Reference

Chunker

Parameters:

  • encoding (tiktoken.Encoding): The tokenizer encoding to use
  • text (str): The text to chunk
  • max_tokens (int, optional): Maximum tokens per chunk. Default: 512
  • as_text (bool, optional): Return text chunks (True) or Interval objects (False). Default: True

Methods:

  • chunk(): Returns a generator yielding text chunks or Intervals

chunk(text, tok_prefix_sum, max_tokens)

Low-level function for custom chunking workflows.

Parameters:

  • text (str): Text to chunk
  • tok_prefix_sum (np.ndarray): Prefix sum array of token positions
  • max_tokens (int): Maximum tokens per chunk

Returns:

  • list[Interval]: List of text intervals representing chunks
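The construction of tok_prefix_sum is not documented above; the following is one hypothetical way to build such an array with numpy, using a stand-in word tokenizer in place of tiktoken. The assumed convention is that tok_prefix_sum[i] holds the token count of text[:i], so ps[b] - ps[a] gives the token count of text[a:b].

```python
import numpy as np

def toy_prefix_sum(text):
    # Hypothetical construction: counts[i] approximates the number of
    # tokens in text[:i]. Here every whitespace-delimited word counts
    # as one token; a real pipeline would derive per-character token
    # counts from the tiktoken encoding instead.
    counts = np.zeros(len(text) + 1, dtype=np.int64)
    tokens = 0
    in_word = False
    for i, ch in enumerate(text):
        if ch.isspace():
            if in_word:
                tokens += 1
            in_word = False
        else:
            in_word = True
        # A partially seen word counts as an in-progress token.
        counts[i + 1] = tokens + (1 if in_word else 0)
    return counts

ps = toy_prefix_sum("hello world foo")
# ps[b] - ps[a] approximates the token count of text[a:b]
```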

Use Cases

  • RAG pipelines: Split documents for vector database ingestion
  • Long-context processing: Break documents into manageable segments for LLM processing
  • Document analysis: Create semantically coherent text segments for analysis
  • Context window management: Ensure text fits within model token limits

Advanced Usage

Getting Interval Boundaries

chunker = Chunker(
    encoding=encoding,
    text=text,
    max_tokens=512,
    as_text=False  # Return Interval objects
)

for interval in chunker.chunk():
    print(f"Chunk from {interval.start} to {interval.end}")
    print(text[interval.start:interval.end])
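If, as delimiter preservation suggests, the intervals tile the text contiguously, concatenating the slices reconstructs the original document. A minimal sketch with a stand-in Interval namedtuple (tikchunk's own Interval type is assumed to expose .start and .end, as in the example above):

```python
from collections import namedtuple

# Stand-in for tikchunk's Interval, assumed to expose .start and .end
Interval = namedtuple("Interval", ["start", "end"])

def reassemble(text, intervals):
    # Concatenating the interval slices should reproduce the original
    # text when chunks tile it contiguously with delimiters preserved.
    return "".join(text[iv.start:iv.end] for iv in intervals)

text = "alpha. beta. gamma."
intervals = [Interval(0, 6), Interval(6, 12), Interval(12, 19)]
restored = reassemble(text, intervals)
```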

Implementation Details

  • Uses regex-based pattern matching for efficient delimiter detection
  • Employs numpy for fast token prefix sum calculations
  • Implements a stack-based iterative approach to avoid recursion performance costs
  • Preserves delimiters to maintain natural text readability

License

MIT License

Contributing

Contributions welcome! Please ensure any changes maintain the semantic splitting behavior and include appropriate tests. Ensure minimum coverage constraints are met, and include relevant property-based tests written with Hypothesis.
