
A smart multilingual text chunker for LLMs, RAG, and beyond.

Project description

📦 Chunklet: Smart Multilingual Text Chunker


Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: speedyk_005
Version: 1.1.0
License: MIT

📌 What’s New in v1.1.0

  • 🔄 Primary sentence splitter replaced: Swapped sentsplit for pysbd for more accurate sentence boundary detection.
  • 🌐 Language Detection Upgrade: Migrated from langid to py3langid: identical accuracy with ~40× faster classification in benchmarks, cutting multilingual processing latency.
  • 🧵 Parallel Processing Optimization: Replaced mpire.WorkerPool with Python’s built-in concurrent.futures.ThreadPoolExecutor for lower overhead and improved performance on small to medium-sized batches.
  • 🔧 Multiple Refactor Steps: Core code reorganized for clarity, maintainability, and performance.
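
The `ThreadPoolExecutor` approach can be sketched roughly as follows. This is a simplified stand-in, not Chunklet's actual code; `chunk_one` here is a placeholder splitter:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_one(text: str) -> list[str]:
    # Stand-in for Chunklet's per-text chunking; naively splits on periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def batch_chunk(texts: list[str], n_jobs: int = 2) -> list[list[str]]:
    # executor.map preserves input order, so results line up with texts.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(chunk_one, texts))

results = batch_chunk(["One. Two.", "Three."])
print(results)  # [['One', 'Two'], ['Three']]
```

Threads suit this workload because the per-text work is short and the pool avoids process start-up costs on small batches.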


🔥 Why Chunklet?

| Feature | Why it’s elite |
| --- | --- |
| ⛓️ Hybrid Mode | Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks. |
| 🌐 Multilingual Fallbacks | Pysbd → SentenceSplitter → Regex, with dynamic confidence detection. |
| 🔗 Clause-Level Overlap | `overlap_percent` now operates at the clause level, preserving semantic flow across chunks using clause delimiters (`,`, `;`, `…`). |
| 🧵 Parallel Batch Processing | Efficient parallel processing with `ThreadPoolExecutor`, optimized for low overhead on small batches. |
| ♻️ LRU Caching | Smart memoization via `functools.lru_cache`. |
| 🪄 Pluggable Token Counters | Swap in GPT-2, BPE, or your own tokenizer. |
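
Clause-level overlap can be illustrated with a small sketch (illustrative only, not Chunklet's implementation): the trailing share of the previous chunk's clauses is carried into the next chunk.

```python
import re

def clause_overlap(prev_chunk: str, overlap_percent: float) -> str:
    # Split the previous chunk on clause delimiters (",", ";", "…").
    clauses = [c.strip() for c in re.split(r"[,;…]", prev_chunk) if c.strip()]
    # Carry the trailing share of clauses into the next chunk (at least one).
    n = max(1, round(len(clauses) * overlap_percent / 100))
    return ", ".join(clauses[-n:])

prev = "We cleaned the data, trained the model; then we evaluated it"
print(clause_overlap(prev, 34))  # then we evaluated it
```

Overlapping at clause rather than sentence granularity keeps context without repeating whole sentences.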

🧩 Chunking Modes

Pick your flavor:

  • "sentence" — chunk by sentence count only
  • "token" — chunk by token count only
  • "hybrid" — sentence + token thresholds respected with guaranteed overlap
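
The hybrid rule can be sketched in a few lines (a simplified illustration, not Chunklet's implementation): a chunk closes as soon as either the sentence cap or the token cap would be exceeded.

```python
def hybrid_chunks(sentences, max_sentences, max_tokens,
                  count=lambda s: len(s.split())):
    # Close the current chunk when adding a sentence would break either cap.
    chunks, current, tokens = [], [], 0
    for sent in sentences:
        t = count(sent)
        if current and (len(current) >= max_sentences or tokens + t > max_tokens):
            chunks.append(" ".join(current))
            current, tokens = [], 0
        current.append(sent)
        tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks

sents = ["A b c.", "D e.", "F g h i.", "J."]
print(hybrid_chunks(sents, max_sentences=2, max_tokens=5))
# ['A b c. D e.', 'F g h i. J.']
```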

🌊 Internal Workflow

Here's a high-level overview of Chunklet's internal processing flow:

```mermaid
graph TD
    A1["Chunk"]
    A2["Batch (threaded)"]
    A3["Preview Sentences"]

    A1 --> B["Process Text"]
    A2 --> B
    A3 --> D["Split Text into Sentences"]

    B --> E{"Language == Auto?"}
    E -- Yes --> F["Detect Text Language"]
    E -- No --> G

    F --> G["Split Text into Sentences"]
    G --> H["Group Sentences into Chunks"]
    H --> I["Apply Overlap Between Chunks"]
    I --> H
    H --> J["Return Final Chunks"]
```

📦 Installation

Install chunklet easily from PyPI:

pip install chunklet

To install from source for development:

git clone https://github.com/Speedyk-005/chunklet.git
cd chunklet
pip install -e .

✨ Getting started

Get started with chunklet in just a few lines of code. Here’s a basic example of how to chunk a text by sentences:

from chunklet import Chunklet

# Sample text
text = (
    "She loves cooking. He studies AI. The weather is great. "
    "We play chess. Books are fun. Robots are learning."
)

# Initialize Chunklet
chunker = Chunklet()

# 1. Preview the sentences
sentences = chunker.preview_sentences(text)
print("Sentences to be chunked:")
for s in sentences:
    print(f"- {s}")

# 2. Chunk the text by sentences
chunks = chunker.chunk(text, mode="sentence", max_sentences=2)

# Print the chunks
print("\nChunks:")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

This will output:

Sentences to be chunked:
- She loves cooking.
- He studies AI.
- The weather is great.
- We play chess.
- Books are fun.
- Robots are learning.

Chunks:
--- Chunk 1 ---
She loves cooking.
He studies AI.
--- Chunk 2 ---
The weather is great.
We play chess.
--- Chunk 3 ---
Books are fun.
Robots are learning.

Advanced Usage

Custom Token Counter

This example shows how to use a custom function to count tokens, which is essential for token-based chunking.

from chunklet import Chunklet

# Define a custom token counter
def simple_token_counter(text: str) -> int:
    return len(text.split())

# Initialize Chunklet with the custom counter
chunker = Chunklet(token_counter=simple_token_counter)

text = "This is a sample text to demonstrate custom token counting."

# Chunk by tokens
chunks = chunker.chunk(text, mode="token", max_tokens=5)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Hybrid Mode with Overlap

Combine sentence and token limits with overlap to maintain context between chunks.

from chunklet import Chunklet

def simple_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(token_counter=simple_token_counter)

text = (
    "This is a long text to demonstrate hybrid chunking. "
    "It combines both sentence and token limits for flexible chunking. "
    "Overlap helps maintain context between chunks by repeating some clauses."
)

# Chunk with both sentence and token limits, and 20% overlap
chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Batch Processing

Process multiple documents in parallel for improved performance.

from chunklet import Chunklet

texts = [
    "First document. It has two sentences.",
    "Second document. This one is slightly longer.",
    "Third document. A final one to make a batch.",
]

chunker = Chunklet()

# Process texts in parallel
results = chunker.batch_chunk(texts, mode="sentence", max_sentences=1, n_jobs=2)

for i, doc_chunks in enumerate(results):
    print(f"--- Document {i+1} ---")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}: {chunk}")

📊 Benchmarks

Performance metrics for various chunking modes and language processing.

Chunk Modes

| Mode | Time (s) |
| --- | --- |
| sentence | 0.0173 |
| token | 0.0177 |
| hybrid | 0.0179 |

Various Languages

| Language | Time (s) |
| --- | --- |
| English (pysbd) | 0.0167 |
| Catalan (SentenceSplitter) | 0.0189 |
| Haitian Creole (Regex fallback) | 0.0158 |

Batch Chunking

| Metric | Value |
| --- | --- |
| Iterations | 256 |
| Number of texts | 3 |
| Total text length (chars) | 81175 |
| Time (s) | 0.1846 |

For detailed benchmark implementation, refer to the bench.py script.


🧪 Planned Features

  • CLI interface with --file, --mode, --overlap, etc.
  • Named chunking presets (e.g., "all", "random_gap") for downstream control
  • Code splitting based on points of interest
  • PDF splitter with metadata

🌍 Language Support (36+)

  • Primary (Pysbd): Supports a wide range of languages for highly accurate sentence boundary detection. (e.g., ar, pl, ja, da, zh, hy, my, ur, fr, it, fa, bg, el, mr, ru, nl, es, am, kk, en, hi, de)
  • Secondary (SentenceSplitter): Provides support for additional languages not covered by Pysbd. (e.g., pt, no, cs, sk, lv, ro, ca, sl, sv, fi, lt, tr, hu, is)
  • Fallback (Smart Regex): For any language not explicitly supported by the above, a smart regex-based splitter is used as a reliable fallback.
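
For languages that fall through to the regex tier, the fallback can be approximated like this (an illustrative sketch, not Chunklet's actual splitter):

```python
import re

def regex_split(text: str) -> list[str]:
    # Split on whitespace that follows ".", "!", or "?", and on newlines.
    parts = re.split(r'(?<=[.!?])\s+|\n+', text)
    return [p.strip() for p in parts if p.strip()]

# Haitian Creole, one of the languages served by the regex fallback.
print(regex_split("Bonjou! Koman ou ye? Mwen byen."))
# ['Bonjou!', 'Koman ou ye?', 'Mwen byen.']
```

A real fallback would also need to handle abbreviations and decimal numbers, which is why the library prefers pysbd or SentenceSplitter when the language is supported.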

💡Projects that inspire me

| Tool | Description |
| --- | --- |
| Semchunk | Semantic-aware chunking using transformer embeddings. |
| CintraAI Code Chunker | AST-based code chunker for intelligent code splitting. |

🤝 Contributing

  1. Fork this repo
  2. Create a new feature branch
  3. Code like a star
  4. Submit a pull request

📜 Changelog

See the CHANGELOG.md for a history of changes.


📜 License

MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunklet-1.1.0.tar.gz (18.7 kB)

Uploaded Source

Built Distribution

chunklet-1.1.0-py3-none-any.whl (13.7 kB)

Uploaded Python 3

File details

Details for the file chunklet-1.1.0.tar.gz.

File metadata

  • Download URL: chunklet-1.1.0.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Hashes for chunklet-1.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 21f4012bc35b869a49a463c26fe4bc3aec45d87a6229267943668a2a6bd4cf3e |
| MD5 | bb3a667f52727313b2c9a223b8b2975d |
| BLAKE2b-256 | 43ede9c2d615b69c1b6ee9273c4c9c8e27bd088cf7acf6407fe7c81382338909 |

See more details on using hashes here.

File details

Details for the file chunklet-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: chunklet-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Hashes for chunklet-1.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | aa26ecb8154ca4a7b5da08bc59a0cfce9c50957149bf722af6f74959a50a689c |
| MD5 | 94923a2de5c59f9f87efad82dbfc7f85 |
| BLAKE2b-256 | a2e71db20fd310a28b856465d40e90f2a21960ef73f993b98239ddfb077a1b60 |

See more details on using hashes here.
