A smart multilingual text chunker for LLMs, RAG, and beyond.
📦 Chunklet: Smart Multilingual Text Chunker
Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: speedyk_005
Version: 1.1.0
License: MIT
📌 What’s New in v1.1.0
- 🔄 Primary sentence splitter replaced: Replaced sentsplit with pysbd for improved sentence boundary detection.
- ⚡ Language Detection Upgrade: Migrated from langid to py3langid, delivering identical accuracy but ~40× faster classification speeds in benchmarks, significantly reducing multilingual processing latency.
- 🧵 Parallel Processing Optimization: Replaced mpire.WorkerPool with Python's built-in concurrent.futures.ThreadPoolExecutor for lower overhead and improved performance on small to medium-sized batches.
- 🔧 Multiple Refactor Steps: Core code reorganized for clarity, maintainability, and performance.
🔥 Why Chunklet?
Feature | Why it’s elite |
---|---|
⛓️ Hybrid Mode | Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks. |
🌐 Multilingual Fallbacks | Pysbd > SentenceSplitter > Regex, with dynamic confidence detection. |
➿ Clause-Level Overlap | overlap_percent operates at the clause level, splitting on clause punctuation (, ; …) to preserve semantic flow across chunks. |
⚡ Parallel Batch Processing | Efficient parallel processing with ThreadPoolExecutor, optimized for low overhead on small batches. |
♻️ LRU Caching | Smart memoization via functools.lru_cache. |
🪄 Pluggable Token Counters | Swap in GPT-2, BPE, or your own tokenizer. |
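To illustrate the LRU caching row above, here is a minimal sketch of memoizing a token counter with functools.lru_cache. The function name count_tokens and the cache size are assumptions for illustration, not Chunklet's internals.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def count_tokens(text: str) -> int:
    """Hypothetical whitespace token counter; results are memoized per input string."""
    return len(text.split())


count_tokens("She loves cooking.")  # computed on the first call
count_tokens("She loves cooking.")  # served from the cache on the second call

info = count_tokens.cache_info()
print(f"hits={info.hits} misses={info.misses}")
```

Caching like this only pays off when the same strings are counted repeatedly, for example when overlapping clauses are re-measured across chunks.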
🧩 Chunking Modes
Pick your flavor:
- "sentence": chunk by sentence count only
- "token": chunk by token count only
- "hybrid": sentence + token thresholds respected with guaranteed overlap
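As a rough illustration of how a hybrid mode can respect both thresholds, the sketch below starts a new chunk whenever adding the next sentence would exceed either limit. The function group_hybrid and its default whitespace counter are hypothetical, and overlap is deliberately left out; this is not Chunklet's actual implementation.

```python
from typing import Callable, List


def group_hybrid(
    sentences: List[str],
    max_sentences: int,
    max_tokens: int,
    token_counter: Callable[[str], int] = lambda s: len(s.split()),
) -> List[List[str]]:
    """Group sentences so no chunk exceeds the sentence or token limit.

    A single sentence longer than max_tokens still gets its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = token_counter(sentence)
        over_limit = (
            len(current) + 1 > max_sentences or current_tokens + n > max_tokens
        )
        if current and over_limit:
            # Close the current chunk before it breaks a limit.
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "She loves cooking.",
    "He studies AI.",
    "The weather is great.",
    "We play chess.",
]
groups = group_hybrid(sentences, max_sentences=3, max_tokens=7)
for group in groups:
    print(group)
```

Here the token limit, not the sentence limit, closes the first chunk: the third sentence would push the running total from 6 to 10 tokens.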
🌍 Language Support (36+)
- Primary (Pysbd): Supports a wide range of languages for highly accurate sentence boundary detection. (e.g., ar, pl, ja, da, zh, hy, my, ur, fr, it, fa, bg, el, mr, ru, nl, es, am, kk, en, hi, de)
- Secondary (SentenceSplitter): Provides support for additional languages not covered by Pysbd. (e.g., pt, no, cs, sk, lv, ro, ca, sl, sv, fi, lt, tr, hu, is)
- Fallback (Smart Regex): For any language not explicitly supported by the above, a smart regex-based splitter is used as a reliable fallback.
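The priority order above can be sketched as a simple lookup with a regex fallback. Everything here is an assumption for illustration: the registries are plain dictionaries of hypothetical splitter callables, and regex_split is a deliberately naive pattern, not the "smart" regex Chunklet ships.

```python
import re
from typing import Callable, Dict, List

Splitter = Callable[[str], List[str]]


def regex_split(text: str) -> List[str]:
    """Naive fallback: split on whitespace that follows ., !, ? or …"""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def pick_splitter(
    lang: str,
    primary: Dict[str, Splitter],
    secondary: Dict[str, Splitter],
) -> Splitter:
    """Return the first splitter that supports lang: primary, secondary, then regex."""
    if lang in primary:
        return primary[lang]
    if lang in secondary:
        return secondary[lang]
    return regex_split


# No registered splitter for "xx", so the regex fallback is used.
splitter = pick_splitter("xx", primary={}, secondary={})
fallback_sentences = splitter("Hello there. How are you? Fine!")
print(fallback_sentences)
```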
🌊 Internal Workflow
Here's a high-level overview of Chunklet's internal processing flow:
graph TD
A1["Chunk"]
A2["Batch (threaded)"]
A3["Preview Sentences"]
A1 --> B["Process Text"]
A2 --> B
A3 --> D["Split Text into Sentences"]
B --> E{"Language == Auto?"}
E -- Yes --> F["Detect Text Language"]
E -- No --> G
F --> G["Split Text into Sentences"]
G --> H["Group Sentences into Chunks"]
H --> I["Apply Overlap Between Chunks"]
I --> H
H --> J["Return Final Chunks"]
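The "Apply Overlap Between Chunks" step in the flow above could look roughly like the following. This is a sketch under stated assumptions: the helper names, the clause-boundary pattern, and the rounding rule (ceiling of the clause count times the percentage) are all hypothetical and may differ from what Chunklet actually does.

```python
import math
import re
from typing import List

# Assumed clause boundaries: whitespace following a comma, semicolon, or ellipsis.
CLAUSE_BOUNDARY = re.compile(r"(?<=[,;…])\s+")


def overlap_tail(previous_chunk: str, overlap_percent: float) -> str:
    """Return the trailing clauses of previous_chunk covering overlap_percent of them."""
    clauses = [c for c in CLAUSE_BOUNDARY.split(previous_chunk) if c]
    keep = math.ceil(len(clauses) * overlap_percent / 100)
    return " ".join(clauses[-keep:]) if keep else ""


def apply_overlap(chunks: List[str], overlap_percent: float) -> List[str]:
    """Prefix every chunk after the first with the tail of its predecessor."""
    result = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        tail = overlap_tail(prev, overlap_percent)
        result.append(f"{tail} {cur}" if tail else cur)
    return result


chunks = [
    "First we load the text, then we split it; finally we group it.",
    "Each group becomes a chunk.",
]
overlapped = apply_overlap(chunks, overlap_percent=30)
for chunk in overlapped:
    print(chunk)
```

With three clauses in the first chunk and 30% overlap, one clause is carried over, so the second chunk begins with "finally we group it."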
📦 Installation
Install chunklet easily from PyPI:
pip install chunklet
To install from source for development:
git clone https://github.com/Speedyk-005/chunklet.git
cd chunklet
pip install -e .
✨ Getting started
Get started with chunklet in just a few lines of code. Here's a basic example of how to chunk a text by sentences:
from chunklet import Chunklet
# Sample text
text = (
"She loves cooking. He studies AI. The weather is great. "
"We play chess. Books are fun. Robots are learning."
)
# Initialize Chunklet
chunker = Chunklet()
# 1. Preview the sentences
sentences = chunker.preview_sentences(text)
print("Sentences to be chunked:")
for s in sentences:
    print(f"- {s}")
# 2. Chunk the text by sentences
chunks = chunker.chunk(text, mode="sentence", max_sentences=2)
# Print the chunks
print("\nChunks:")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
This will output:
Sentences to be chunked:
- She loves cooking.
- He studies AI.
- The weather is great.
- We play chess.
- Books are fun.
- Robots are learning.
Chunks:
--- Chunk 1 ---
She loves cooking.
He studies AI.
--- Chunk 2 ---
The weather is great.
We play chess.
--- Chunk 3 ---
Books are fun.
Robots are learning.
Advanced Usage
Custom Token Counter
This example shows how to use a custom function to count tokens, which is essential for token-based chunking.
from chunklet import Chunklet
# Define a custom token counter
def simple_token_counter(text: str) -> int:
    # Count whitespace-separated words as tokens
    return len(text.split())
# Initialize Chunklet with the custom counter
chunker = Chunklet(token_counter=simple_token_counter)
text = "This is a sample text to demonstrate custom token counting."
# Chunk by tokens
chunks = chunker.chunk(text, mode="token", max_tokens=5)
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
Hybrid Mode with Overlap
Combine sentence and token limits with overlap to maintain context between chunks.
from chunklet import Chunklet
def simple_token_counter(text: str) -> int:
    # Count whitespace-separated words as tokens
    return len(text.split())

chunker = Chunklet(token_counter=simple_token_counter)
text = (
    "This is a long text to demonstrate hybrid chunking. "
    "It combines both sentence and token limits for flexible chunking. "
    "Overlap helps maintain context between chunks by repeating some clauses."
)
# Chunk with both sentence and token limits, and 20% overlap
chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20,
)
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
Batch Processing
Process multiple documents in parallel for improved performance.
from chunklet import Chunklet
texts = [
    "First document. It has two sentences.",
    "Second document. This one is slightly longer.",
    "Third document. A final one to make a batch.",
]
chunker = Chunklet()
# Process texts in parallel
results = chunker.batch_chunk(texts, mode="sentence", max_sentences=1, n_jobs=2)
for i, doc_chunks in enumerate(results):
    print(f"--- Document {i+1} ---")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}: {chunk}")
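For readers curious how thread-based batching works in general, here is a minimal standard-library sketch using concurrent.futures.ThreadPoolExecutor. The functions chunk_one and batch are hypothetical stand-ins, not Chunklet's batch_chunk; the n_jobs name is borrowed from the example above only for readability.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunk_one(text: str) -> List[str]:
    """Hypothetical per-document chunker: one chunk per period-terminated sentence."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def batch(texts: List[str], n_jobs: int = 2) -> List[List[str]]:
    """Chunk every text on a thread pool; executor.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        return list(executor.map(chunk_one, texts))


docs = [
    "First document. It has two sentences.",
    "Second document. This one is slightly longer.",
    "Third document. A final one to make a batch.",
]
batch_results = batch(docs, n_jobs=2)
for i, doc_chunks in enumerate(batch_results):
    print(f"Document {i+1}: {doc_chunks}")
```

Threads avoid the process start-up and pickling costs of a process pool, which is consistent with the low-overhead goal stated for small batches.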
📊 Benchmarks
Performance metrics for various chunking modes and language processing.
Chunk Modes
Mode | Avg. time (s) |
---|---|
sentence | 0.0173 |
token | 0.0177 |
hybrid | 0.0179 |
Batch Chunking
Metric | Value |
---|---|
Iterations | 256 |
Number of texts | 3 |
Total text length (chars) | 81175 |
Avg. time (s) | 0.1846 |
For the detailed benchmark implementation, refer to the bench.py script.
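As a rough idea of how an average time per call can be measured, the sketch below uses timeit with 256 iterations, matching the iteration count in the table above. It is not the project's bench.py: naive_sentence_chunker is a hypothetical stand-in for the function under test, and the sample text is made up.

```python
import timeit
from typing import Callable, List


def naive_sentence_chunker(text: str, max_sentences: int = 2) -> List[str]:
    """Hypothetical stand-in for the chunker being benchmarked."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    return [
        " ".join(sentences[i : i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def average_seconds(fn: Callable[[], object], iterations: int = 256) -> float:
    """Call fn `iterations` times and return the mean time per call in seconds."""
    return timeit.timeit(fn, number=iterations) / iterations


sample = "She loves cooking. He studies AI. The weather is great. " * 100
avg = average_seconds(lambda: naive_sentence_chunker(sample))
print(f"Avg. time (s): {avg:.4f}")
```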
🧪 Planned Features
- CLI interface with --file, --mode, --overlap, etc.
- Named chunking presets (conceptually "all", "random_gap") for downstream control
- Code splitting based on points of interest
- PDF splitter with metadata
💡 Projects that inspire me
Tool | Description |
---|---|
Semchunk | Semantic-aware chunking using transformer embeddings. |
CintraAI Code Chunker | AST-based code chunker for intelligent code splitting. |
🤝 Contributing
- Fork this repo
- Create a new feature branch
- Code like a star
- Submit a pull request
📜 Changelog
See the CHANGELOG.md for a history of changes.
📜 License
MIT License. Use freely, modify boldly, and credit the legend (me... just kidding!).