
A smart multilingual text chunker for LLMs, RAG, and beyond.

Project description

📦 Chunklet: Smart Multilingual Text Chunker


Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: speedyk_005
Version: 1.1.0
License: MIT

📌 What’s New in v1.1.0

  • 🔄 Primary sentence splitter replaced: Swapped sentsplit for pysbd for more accurate sentence boundary detection.
  • 🌐 Language Detection Upgrade: Migrated from langid to py3langid: identical accuracy with ~40× faster classification in benchmarks, cutting multilingual processing latency.
  • 🧵 Parallel Processing Optimization: Replaced mpire.WorkerPool with Python’s built-in concurrent.futures.ThreadPoolExecutor for lower overhead and improved performance on small to medium-sized batches.
  • 🔧 Multiple Refactor Steps: Core code reorganized for clarity, maintainability, and performance.
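
The `ThreadPoolExecutor` approach can be sketched roughly as follows. This is a simplified stand-in, not Chunklet's actual code; `chunk_one` here is a placeholder splitter:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_one(text: str) -> list[str]:
    # Stand-in for Chunklet's per-text chunking; naively splits on periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def batch_chunk(texts: list[str], n_jobs: int = 2) -> list[list[str]]:
    # executor.map preserves input order, so results line up with texts.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(chunk_one, texts))

results = batch_chunk(["One. Two.", "Three."])
print(results)  # [['One', 'Two'], ['Three']]
```

Threads suit this workload because the per-text work is short and the pool avoids process start-up costs on small batches.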


🔥 Why Chunklet?

| Feature | Why it’s elite |
| --- | --- |
| ⛓️ Hybrid Mode | Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks. |
| 🌐 Multilingual Fallbacks | Pysbd → SentenceSplitter → Regex, with dynamic confidence detection. |
| 🔗 Clause-Level Overlap | `overlap_percent` now operates at the clause level, preserving semantic flow across chunks using clause delimiters (`,`, `;`, `…`). |
| 🧵 Parallel Batch Processing | Efficient parallel processing with `ThreadPoolExecutor`, optimized for low overhead on small batches. |
| ♻️ LRU Caching | Smart memoization via `functools.lru_cache`. |
| 🪄 Pluggable Token Counters | Swap in GPT-2, BPE, or your own tokenizer. |
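
Clause-level overlap can be illustrated with a small sketch (illustrative only, not Chunklet's implementation): the trailing share of the previous chunk's clauses is carried into the next chunk.

```python
import re

def clause_overlap(prev_chunk: str, overlap_percent: float) -> str:
    # Split the previous chunk on clause delimiters (",", ";", "…").
    clauses = [c.strip() for c in re.split(r"[,;…]", prev_chunk) if c.strip()]
    # Carry the trailing share of clauses into the next chunk (at least one).
    n = max(1, round(len(clauses) * overlap_percent / 100))
    return ", ".join(clauses[-n:])

prev = "We cleaned the data, trained the model; then we evaluated it"
print(clause_overlap(prev, 34))  # then we evaluated it
```

Overlapping at clause rather than sentence granularity keeps context without repeating whole sentences.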

🧩 Chunking Modes

Pick your flavor:

  • "sentence" — chunk by sentence count only
  • "token" — chunk by token count only
  • "hybrid" — sentence + token thresholds respected with guaranteed overlap
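
The hybrid rule can be sketched in a few lines (a simplified illustration, not Chunklet's implementation): a chunk closes as soon as either the sentence cap or the token cap would be exceeded.

```python
def hybrid_chunks(sentences, max_sentences, max_tokens,
                  count=lambda s: len(s.split())):
    # Close the current chunk when adding a sentence would break either cap.
    chunks, current, tokens = [], [], 0
    for sent in sentences:
        t = count(sent)
        if current and (len(current) >= max_sentences or tokens + t > max_tokens):
            chunks.append(" ".join(current))
            current, tokens = [], 0
        current.append(sent)
        tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks

sents = ["A b c.", "D e.", "F g h i.", "J."]
print(hybrid_chunks(sents, max_sentences=2, max_tokens=5))
# ['A b c. D e.', 'F g h i. J.']
```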

🌊 Internal Workflow

Here's a high-level overview of Chunklet's internal processing flow:

```mermaid
graph TD
    A1["Chunk"]
    A2["Batch (threaded)"]
    A3["Preview Sentences"]

    A1 --> B["Process Text"]
    A2 --> B
    A3 --> D["Split Text into Sentences"]

    B --> E{"Language == Auto?"}
    E -- Yes --> F["Detect Text Language"]
    E -- No --> G

    F --> G["Split Text into Sentences"]
    G --> H["Group Sentences into Chunks"]
    H --> I["Apply Overlap Between Chunks"]
    I --> H
    H --> J["Return Final Chunks"]
```

📦 Installation

Install chunklet easily from PyPI:

pip install chunklet

To install from source for development:

git clone https://github.com/Speedyk-005/chunklet.git
cd chunklet
pip install -e .

✨ Getting started

Get started with chunklet in just a few lines of code. Here’s a basic example of how to chunk a text by sentences:

from chunklet import Chunklet

# Sample text
text = (
    "She loves cooking. He studies AI. The weather is great. "
    "We play chess. Books are fun. Robots are learning."
)

# Initialize Chunklet
chunker = Chunklet()

# 1. Preview the sentences
sentences = chunker.preview_sentences(text)
print("Sentences to be chunked:")
for s in sentences:
    print(f"- {s}")

# 2. Chunk the text by sentences
chunks = chunker.chunk(text, mode="sentence", max_sentences=2)

# Print the chunks
print("\nChunks:")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

This will output:

Sentences to be chunked:
- She loves cooking.
- He studies AI.
- The weather is great.
- We play chess.
- Books are fun.
- Robots are learning.

Chunks:
--- Chunk 1 ---
She loves cooking.
He studies AI.
--- Chunk 2 ---
The weather is great.
We play chess.
--- Chunk 3 ---
Books are fun.
Robots are learning.

Advanced Usage

Custom Token Counter

This example shows how to use a custom function to count tokens, which is essential for token-based chunking.

from chunklet import Chunklet

# Define a custom token counter
def simple_token_counter(text: str) -> int:
    return len(text.split())

# Initialize Chunklet with the custom counter
chunker = Chunklet(token_counter=simple_token_counter)

text = "This is a sample text to demonstrate custom token counting."

# Chunk by tokens
chunks = chunker.chunk(text, mode="token", max_tokens=5)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Hybrid Mode with Overlap

Combine sentence and token limits with overlap to maintain context between chunks.

from chunklet import Chunklet

def simple_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(token_counter=simple_token_counter)

text = (
    "This is a long text to demonstrate hybrid chunking. "
    "It combines both sentence and token limits for flexible chunking. "
    "Overlap helps maintain context between chunks by repeating some clauses."
)

# Chunk with both sentence and token limits, and 20% overlap
chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Batch Processing

Process multiple documents in parallel for improved performance.

from chunklet import Chunklet

texts = [
    "First document. It has two sentences.",
    "Second document. This one is slightly longer.",
    "Third document. A final one to make a batch.",
]

chunker = Chunklet()

# Process texts in parallel
results = chunker.batch_chunk(texts, mode="sentence", max_sentences=1, n_jobs=2)

for i, doc_chunks in enumerate(results):
    print(f"--- Document {i+1} ---")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}: {chunk}")

📊 Benchmarks

Performance metrics for various chunking modes and language processing.

Chunk Modes

| Mode | Time (s) |
| --- | --- |
| sentence | 0.0173 |
| token | 0.0177 |
| hybrid | 0.0179 |

Various Languages

| Language | Time (s) |
| --- | --- |
| English (pysbd) | 0.0167 |
| Catalan (SentenceSplitter) | 0.0189 |
| Haitian Creole (Regex fallback) | 0.0158 |

Batch Chunking

| Metric | Value |
| --- | --- |
| Iterations | 256 |
| Number of texts | 3 |
| Total text length (chars) | 81175 |
| Time (s) | 0.1846 |

For detailed benchmark implementation, refer to the bench.py script.


🧪 Planned Features

  • CLI interface with --file, --mode, --overlap, etc.
  • Named chunking presets (e.g., "all", "random_gap") for downstream control
  • Code splitting based on points of interest
  • PDF splitter with metadata

🌍 Language Support (36+)

  • Primary (Pysbd): Supports a wide range of languages for highly accurate sentence boundary detection. (e.g., ar, pl, ja, da, zh, hy, my, ur, fr, it, fa, bg, el, mr, ru, nl, es, am, kk, en, hi, de)
  • Secondary (SentenceSplitter): Provides support for additional languages not covered by Pysbd. (e.g., pt, no, cs, sk, lv, ro, ca, sl, sv, fi, lt, tr, hu, is)
  • Fallback (Smart Regex): For any language not explicitly supported by the above, a smart regex-based splitter is used as a reliable fallback.
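
For languages that fall through to the regex tier, the fallback can be approximated like this (an illustrative sketch, not Chunklet's actual splitter):

```python
import re

def regex_split(text: str) -> list[str]:
    # Split on whitespace that follows ".", "!", or "?", and on newlines.
    parts = re.split(r'(?<=[.!?])\s+|\n+', text)
    return [p.strip() for p in parts if p.strip()]

# Haitian Creole, one of the languages served by the regex fallback.
print(regex_split("Bonjou! Koman ou ye? Mwen byen."))
# ['Bonjou!', 'Koman ou ye?', 'Mwen byen.']
```

A real fallback would also need to handle abbreviations and decimal numbers, which is why the library prefers pysbd or SentenceSplitter when the language is supported.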

💡Projects that inspire me

| Tool | Description |
| --- | --- |
| Semchunk | Semantic-aware chunking using transformer embeddings. |
| CintraAI Code Chunker | AST-based code chunker for intelligent code splitting. |

🤝 Contributing

  1. Fork this repo
  2. Create a new feature branch
  3. Code like a star
  4. Submit a pull request

📜 Changelog

See the CHANGELOG.md for a history of changes.


📜 License

MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunklet-1.1.0.tar.gz (18.7 kB)

Uploaded Source

Built Distribution

chunklet-1.1.0-py3-none-any.whl (13.7 kB)

Uploaded Python 3

File details

Details for the file chunklet-1.1.0.tar.gz.

File metadata

  • Download URL: chunklet-1.1.0.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Hashes for chunklet-1.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 21f4012bc35b869a49a463c26fe4bc3aec45d87a6229267943668a2a6bd4cf3e |
| MD5 | bb3a667f52727313b2c9a223b8b2975d |
| BLAKE2b-256 | 43ede9c2d615b69c1b6ee9273c4c9c8e27bd088cf7acf6407fe7c81382338909 |

See more details on using hashes here.

File details

Details for the file chunklet-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: chunklet-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Hashes for chunklet-1.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | aa26ecb8154ca4a7b5da08bc59a0cfce9c50957149bf722af6f74959a50a689c |
| MD5 | 94923a2de5c59f9f87efad82dbfc7f85 |
| BLAKE2b-256 | a2e71db20fd310a28b856465d40e90f2a21960ef73f993b98239ddfb077a1b60 |

See more details on using hashes here.
