A high-performance Python library for recursive semantic text chunking that respects token limits.


TikChunk: Semantic Text Chunker

A Python library for quickly chunking text documents at semantic boundaries while respecting token limits. Designed for RAG (Retrieval-Augmented Generation) applications that need to split documents into meaningful, token-constrained segments.

Performance: chunks the entire NLTK Gutenberg corpus in ~2.8 seconds on an M1 Mac with max_tokens=512.

Features

  • Semantic boundary detection: Splits text at natural breakpoints (paragraphs, sentences, clauses) rather than arbitrary character counts
  • Token-aware chunking: Uses tiktoken to ensure chunks stay within specified token limits
  • Hierarchical splitting: Progressively splits at different semantic levels (paragraphs → sentences → clauses → words)
  • Delimiter preservation: Maintains delimiters in the output for natural text flow
  • Intelligent merging: Combines smaller chunks to maximize token usage without exceeding limits
  • Flexible output: Returns either text chunks or interval boundaries

Quick Start

pip install tikchunk

import tiktoken
from tikchunk import Chunker

# Initialize with your text and encoding
encoding = tiktoken.get_encoding("cl100k_base")
text = "Your long document text here..."

# Create chunker with desired max tokens per chunk
chunker = Chunker(
    encoding=encoding,
    text=text,
    max_tokens=512
)

# Generate chunks
for chunk in chunker.chunk():
    print(chunk)
    print("---")

How It Works

The chunker uses a priority-based splitting strategy:

  1. Priority 0: Paragraph breaks ("\n\n\n", "\r\n\r\n\r\n", "\n\n", "\r\n\r\n")
  2. Priority 1: Line breaks and dividers ("\n---\n", "\n===\n", "\n***\n", "\r\n", "\n", "\r")
  3. Priority 2: Sentence endings (". ", "! ", "? ", ".", "!", "?")
  4. Priority 3: Clause separators ("; ", ": ", ";", ":", " -- ", "--")
  5. Priority 4: Phrase separators (", ", ",", "...")
  6. Priority 5: Word boundaries (" ")

When a chunk exceeds max_tokens, the algorithm:

  1. Splits at the current priority level
  2. Merges adjacent segments up to the token limit
  3. Recursively processes any remaining oversized chunks at the next priority level

This ensures text is split at the most semantically meaningful boundaries possible while staying within token constraints.
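The split-merge-recurse strategy can be sketched in a few lines. This is an illustrative stand-alone implementation, not tikchunk's actual code: it uses a toy word-count tokenizer in place of tiktoken, a simplified set of regex priorities, and rejoins merged segments with a space rather than preserving the original delimiters as tikchunk does.

```python
import re

# Illustrative priority ladder: paragraph breaks, line breaks,
# sentence endings, clause separators, phrase separators, words.
# Lookbehinds keep the punctuation attached to the preceding segment.
PRIORITIES = [r"\n\n+", r"\n", r"(?<=[.!?]) ", r"(?<=[;:]) ", r"(?<=,) ", r" "]

def count_tokens(s):
    return len(s.split())  # stand-in for a real tiktoken count

def chunk(text, max_tokens, level=0):
    if count_tokens(text) <= max_tokens or level >= len(PRIORITIES):
        return [text]
    # 1. Split at the current priority level.
    parts = [p for p in re.split(PRIORITIES[level], text) if p]
    # 2. Merge adjacent segments greedily up to the token limit.
    #    (tikchunk preserves delimiters; this sketch rejoins with a space.)
    merged, buf = [], ""
    for part in parts:
        candidate = (buf + " " + part).strip() if buf else part
        if count_tokens(candidate) <= max_tokens:
            buf = candidate
        else:
            if buf:
                merged.append(buf)
            buf = part
    if buf:
        merged.append(buf)
    # 3. Recurse into any segment still over the limit at the next level.
    out = []
    for seg in merged:
        out.extend(chunk(seg, max_tokens, level + 1))
    return out
```

For example, chunking "One two three. Four five six. Seven eight nine." with max_tokens=4 falls through to the sentence-ending level and yields three chunks of three tokens each.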

API Reference

Chunker

Parameters:

  • encoding (tiktoken.Encoding): The tokenizer encoding to use
  • text (str): The text to chunk
  • max_tokens (int, optional): Maximum tokens per chunk. Default: 512
  • as_text (bool, optional): Return text chunks (True) or Interval objects (False). Default: True

Methods:

  • chunk(): Returns a generator yielding text chunks or Intervals

chunk(text, tok_prefix_sum, max_tokens)

Low-level function for custom chunking workflows.

Parameters:

  • text (str): Text to chunk
  • tok_prefix_sum (np.ndarray): Prefix sum array of token positions
  • max_tokens (int): Maximum tokens per chunk

Returns:

  • list[Interval]: List of text intervals representing chunks
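The construction of tok_prefix_sum is not documented above; the following is one hypothetical way to build such an array with numpy, using a stand-in word tokenizer in place of tiktoken. The assumed convention is that tok_prefix_sum[i] holds the token count of text[:i], so ps[b] - ps[a] gives the token count of text[a:b].

```python
import numpy as np

def toy_prefix_sum(text):
    # Hypothetical construction: counts[i] approximates the number of
    # tokens in text[:i]. Here every whitespace-delimited word counts
    # as one token; a real pipeline would derive per-character token
    # counts from the tiktoken encoding instead.
    counts = np.zeros(len(text) + 1, dtype=np.int64)
    tokens = 0
    in_word = False
    for i, ch in enumerate(text):
        if ch.isspace():
            if in_word:
                tokens += 1
            in_word = False
        else:
            in_word = True
        # A partially seen word counts as an in-progress token.
        counts[i + 1] = tokens + (1 if in_word else 0)
    return counts

ps = toy_prefix_sum("hello world foo")
# ps[b] - ps[a] approximates the token count of text[a:b]
```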

Use Cases

  • RAG pipelines: Split documents for vector database ingestion
  • Long-context processing: Break documents into manageable segments for LLM processing
  • Document analysis: Create semantically coherent text segments for analysis
  • Context window management: Ensure text fits within model token limits

Advanced Usage

Getting Interval Boundaries

chunker = Chunker(
    encoding=encoding,
    text=text,
    max_tokens=512,
    as_text=False  # Return Interval objects
)

for interval in chunker.chunk():
    print(f"Chunk from {interval.start} to {interval.end}")
    print(text[interval.start:interval.end])
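If, as delimiter preservation suggests, the intervals tile the text contiguously, concatenating the slices reconstructs the original document. A minimal sketch with a stand-in Interval namedtuple (tikchunk's own Interval type is assumed to expose .start and .end, as in the example above):

```python
from collections import namedtuple

# Stand-in for tikchunk's Interval, assumed to expose .start and .end
Interval = namedtuple("Interval", ["start", "end"])

def reassemble(text, intervals):
    # Concatenating the interval slices should reproduce the original
    # text when chunks tile it contiguously with delimiters preserved.
    return "".join(text[iv.start:iv.end] for iv in intervals)

text = "alpha. beta. gamma."
intervals = [Interval(0, 6), Interval(6, 12), Interval(12, 19)]
restored = reassemble(text, intervals)
```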

Implementation Details

  • Uses regex-based pattern matching for efficient delimiter detection
  • Employs numpy for fast token prefix sum calculations
  • Implements a stack-based iterative approach to avoid recursion performance costs
  • Preserves delimiters to maintain natural text readability

License

MIT License

Contributing

Contributions welcome! Please ensure any changes maintain the semantic splitting behavior and include appropriate tests. Ensure minimum coverage constraints are met, and include relevant property-based tests written with Hypothesis.
