A smart multilingual text chunker for LLMs, RAG, and beyond.

📦 Chunklet: Smart Multilingual Text Chunker


Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: speedyk_005
Version: 1.1.0
License: MIT

📌 What’s New in v1.1.0

  • 🔄 Primary sentence splitter replaced: pysbd replaces sentsplit for improved sentence boundary detection.
  • 🌐 Language detection upgrade: Migrated from langid to py3langid, which delivers identical accuracy roughly 40× faster in benchmarks, cutting multilingual processing latency.
  • 🧵 Parallel processing optimization: Replaced mpire.WorkerPool with Python’s built-in concurrent.futures.ThreadPoolExecutor for lower overhead and better performance on small to medium-sized batches (see the sketch after this list).
  • 🔧 Multiple refactor steps: Core code reorganized for clarity, maintainability, and performance.
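As a rough illustration of the new batch path (chunk_one and texts below are stand-ins, not Chunklet internals):

from concurrent.futures import ThreadPoolExecutor

def chunk_one(text: str) -> list[str]:
    # Stand-in for the per-text chunking routine.
    return [text]

texts = ["First document.", "Second document."]

# A thread pool keeps per-task overhead low on small and medium batches,
# avoiding the serialization and startup costs of a process pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(chunk_one, texts))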

🔥 Why Chunklet?

  • ⛓️ Hybrid Mode: Combines token + sentence limits with guaranteed overlap, rare even in commercial stacks.
  • 🌐 Multilingual Fallbacks: Pysbd > SentenceSplitter > Regex, with dynamic confidence detection.
  • Clause-Level Overlap: overlap_percent operates at the clause level, preserving semantic flow across chunks by breaking on clause punctuation (commas, semicolons, ellipses).
  • Parallel Batch Processing: Efficient parallel processing with ThreadPoolExecutor, optimized for low overhead on small batches.
  • ♻️ LRU Caching: Smart memoization via functools.lru_cache (see the sketch after this list).
  • 🪄 Pluggable Token Counters: Swap in GPT-2, BPE, or your own tokenizer.
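A minimal sketch of the memoization idea, assuming a pure splitting function (illustrative only, not Chunklet's actual cached internals):

from functools import lru_cache

@lru_cache(maxsize=128)
def split_cached(text: str, lang: str) -> tuple[str, ...]:
    # Return a tuple (hashable) so results can live in the LRU cache;
    # repeated calls with the same text and language cost nothing.
    return tuple(text.split(". "))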

🧩 Chunking Modes

Pick your flavor:

  • "sentence" — chunk by sentence count only
  • "token" — chunk by token count only
  • "hybrid" — sentence + token thresholds respected with guaranteed overlap

🌍 Language Support (36+)

  • Primary (Pysbd): Supports a wide range of languages for highly accurate sentence boundary detection. (e.g., ar, pl, ja, da, zh, hy, my, ur, fr, it, fa, bg, el, mr, ru, nl, es, am, kk, en, hi, de)
  • Secondary (SentenceSplitter): Provides support for additional languages not covered by Pysbd. (e.g., pt, no, cs, sk, lv, ro, ca, sl, sv, fi, lt, tr, hu, is)
  • Fallback (Smart Regex): For any language not explicitly supported by the above, a smart regex-based splitter is used as a reliable fallback.
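The confidence-gated detection behind this fallback chain can be approximated with py3langid directly. The sketch below assumes a simple probability threshold; the 0.7 cutoff is hypothetical, and Chunklet's actual gating may differ:

from py3langid.langid import LanguageIdentifier, MODEL_FILE

# norm_probs=True rescales raw scores into probabilities in [0, 1].
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)

lang, confidence = identifier.classify("Das Wetter ist heute wunderbar.")
if confidence >= 0.7:  # hypothetical threshold
    print(f"Detected '{lang}' ({confidence:.2f}); use pysbd or SentenceSplitter")
else:
    print("Low confidence: fall back to the smart regex splitter")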

🌊 Internal Workflow

Here's a high-level overview of Chunklet's internal processing flow, shown as a Mermaid diagram:

graph TD
    A1["Chunk"]
    A2["Batch (threaded)"]
    A3["Preview Sentences"]

    A1 --> B["Process Text"]
    A2 --> B
    A3 --> D["Split Text into Sentences"]

    B --> E{"Language == Auto?"}
    E -- Yes --> F["Detect Text Language"]
    E -- No --> G

    F --> G["Split Text into Sentences"]
    G --> H["Group Sentences into Chunks"]
    H --> I["Apply Overlap Between Chunks"]
    I --> H
    H --> J["Return Final Chunks"]
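In plain Python, the diagram reads roughly as follows. All four helper functions are illustrative stubs, not Chunklet's real implementations:

def detect_language(text: str) -> str:
    return "en"  # stub for py3langid-based detection

def split_sentences(text: str, lang: str) -> list[str]:
    return [s for s in text.split(". ") if s]  # stub for pysbd > SentenceSplitter > regex

def group_sentences(sentences: list[str], max_sentences: int = 2) -> list[list[str]]:
    return [sentences[i:i + max_sentences] for i in range(0, len(sentences), max_sentences)]

def apply_overlap(chunks: list[list[str]]) -> list[str]:
    return [" ".join(c) for c in chunks]  # stub for the clause-level overlap pass

def process_text(text: str, lang: str = "auto", **limits) -> list[str]:
    # Mirrors the diagram: detect the language only when it is set to auto.
    if lang == "auto":
        lang = detect_language(text)
    sentences = split_sentences(text, lang)
    return apply_overlap(group_sentences(sentences, **limits))

print(process_text("She loves cooking. He studies AI. The weather is great.", max_sentences=2))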

📦 Installation

Install chunklet easily from PyPI:

pip install chunklet

To install from source for development:

git clone https://github.com/Speedyk-005/chunklet.git
cd chunklet
pip install -e .

✨ Getting started

Get started with chunklet in just a few lines of code. Here’s a basic example of how to chunk a text by sentences:

from chunklet import Chunklet

# Sample text
text = (
    "She loves cooking. He studies AI. The weather is great. "
    "We play chess. Books are fun. Robots are learning."
)

# Initialize Chunklet
chunker = Chunklet()

# 1. Preview the sentences
sentences = chunker.preview_sentences(text)
print("Sentences to be chunked:")
for s in sentences:
    print(f"- {s}")

# 2. Chunk the text by sentences
chunks = chunker.chunk(text, mode="sentence", max_sentences=2)

# Print the chunks
print("\nChunks:")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} \")
    print(chunk)

This will output:

Sentences to be chunked:
- She loves cooking.
- He studies AI.
- The weather is great.
- We play chess.
- Books are fun.
- Robots are learning.

Chunks:
--- Chunk 1 ---
She loves cooking.
He studies AI.
--- Chunk 2 ---
The weather is great.
We play chess.
--- Chunk 3 ---
Books are fun.
Robots are learning.

Advanced Usage

Custom Token Counter

This example shows how to use a custom function to count tokens, which is essential for token-based chunking.

from chunklet import Chunklet

# Define a custom token counter
def simple_token_counter(text: str) -> int:
    return len(text.split())

# Initialize Chunklet with the custom counter
chunker = Chunklet(token_counter=simple_token_counter)

text = "This is a sample text to demonstrate custom token counting."

# Chunk by tokens
chunks = chunker.chunk(text, mode="token", max_tokens=5)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
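Any callable that maps a string to an int will do, so a real tokenizer drops in the same way. For instance, counting GPT-2 BPE tokens with tiktoken (assuming tiktoken is installed; it is not a Chunklet dependency):

import tiktoken
from chunklet import Chunklet

enc = tiktoken.get_encoding("gpt2")

def gpt2_token_counter(text: str) -> int:
    # Count real GPT-2 BPE tokens rather than whitespace-separated words.
    return len(enc.encode(text))

chunker = Chunklet(token_counter=gpt2_token_counter)
chunks = chunker.chunk("Sample text for token-based chunking.", mode="token", max_tokens=5)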

Hybrid Mode with Overlap

Combine sentence and token limits with overlap to maintain context between chunks.

from chunklet import Chunklet

def simple_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(token_counter=simple_token_counter)

text = (
    "This is a long text to demonstrate hybrid chunking. "
    "It combines both sentence and token limits for flexible chunking. "
    "Overlap helps maintain context between chunks by repeating some clauses."
)

# Chunk with both sentence and token limits, and 20% overlap
chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Batch Processing

Process multiple documents in parallel for improved performance.

from chunklet import Chunklet

texts = [
    "First document. It has two sentences.",
    "Second document. This one is slightly longer.",
    "Third document. A final one to make a batch.",
]

chunker = Chunklet()

# Process texts in parallel
results = chunker.batch_chunk(texts, mode="sentence", max_sentences=1, n_jobs=2)

for i, doc_chunks in enumerate(results):
    print(f"--- Document {i+1} ---")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}: {chunk}")

📊 Benchmarks

Performance metrics for various chunking modes and language processing.

Chunk Modes

Mode       Avg. time (s)
sentence   0.0173
token      0.0177
hybrid     0.0179

Batch Chunking

Metric                       Value
Iterations                   256
Number of texts              3
Total text length (chars)    81175
Avg. time (s)                0.1846

For detailed benchmark implementation, refer to the bench.py script.
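A comparable measurement can be reproduced with a plain timing loop; this is a sketch using time.perf_counter, and bench.py may measure differently:

import time
from chunklet import Chunklet

chunker = Chunklet()
text = "She loves cooking. He studies AI. " * 200

runs = 100
start = time.perf_counter()
for _ in range(runs):
    chunker.chunk(text, mode="sentence", max_sentences=2)
print(f"Avg. time (s): {(time.perf_counter() - start) / runs:.4f}")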


🧪 Planned Features

  • CLI interface with --file, --mode, --overlap, etc. (a hypothetical sketch follows this list)
  • Named chunking presets (conceptually "all", "random_gap") for downstream control
  • Code splitting based on points of interest
  • PDF splitter with metadata
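Since the CLI is only planned, the argparse sketch below is hypothetical: the flag names come from the first bullet above, and everything else is an assumption.

import argparse

# Hypothetical front-end for the planned CLI; not yet part of chunklet.
parser = argparse.ArgumentParser(prog="chunklet")
parser.add_argument("--file", required=True, help="path to the input text file")
parser.add_argument("--mode", choices=["sentence", "token", "hybrid"], default="sentence")
parser.add_argument("--overlap", type=float, default=0.0, help="overlap percent between chunks")
args = parser.parse_args()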

💡 Projects that inspire me

  • Semchunk: Semantic-aware chunking using transformer embeddings.
  • CintraAI Code Chunker: AST-based code chunker for intelligent code splitting.

🤝 Contributing

  1. Fork this repo
  2. Create a new feature branch
  3. Code like a star
  4. Submit a pull request

📜 Changelog

See the CHANGELOG.md for a history of changes.


📜 License

MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)

