A smart multilingual text chunker for LLMs, RAG, and beyond.

📦 Chunklet: Smart Multilingual Text Chunker

Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: Speed k.
Version: 1.0.4 (🎉 first stable release)
License: MIT

🚀 What’s New in v1.0.4 (Stable)

✅ Stable Release: v1.0.4 marks the first fully stable version after extensive refactoring.
🔄 Multiple Refactor Steps: Core code reorganized for clarity, maintainability, and performance.
➿ True Clause-Level Overlap: Overlap now occurs on natural clause boundaries (commas, semicolons, etc.) instead of just sentences, preserving semantic flow better.
🛠️ Improved Chunking Logic: Enhanced fallback splitters and overlap calculations to handle edge cases gracefully.
⚡ Optimized Batch Processing: Parallel chunking now consistently respects token counters and offsets.
🧪 Expanded Test Suite: Comprehensive tests added for multilingual support, caching, and chunk correctness.
🧹 Cleaner Output: Logging filters and redundant docstrings removed to reduce noise during runs.
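
To make the clause-level overlap idea concrete, here is a minimal sketch. It is not Chunklet's actual implementation: the helper name `clause_overlap`, the punctuation set, and the round-up rule are all assumptions chosen for illustration. It splits the previous chunk on clause punctuation and carries the trailing share of clauses into the next chunk.

```python
import math
import re

# Split after clause-ending punctuation followed by whitespace.
# The punctuation set is an assumption for this sketch.
_CLAUSE_BOUNDARY = re.compile(r"(?<=[,;:.!?…])\s+")


def clause_overlap(previous_chunk: str, overlap_percent: float) -> str:
    """Return the trailing clauses of `previous_chunk` to carry into the next chunk.

    Hypothetical helper: rounding up to a whole clause is an assumption.
    """
    clauses = [c for c in _CLAUSE_BOUNDARY.split(previous_chunk.strip()) if c]
    if not clauses or overlap_percent <= 0:
        return ""
    keep = min(len(clauses), math.ceil(len(clauses) * overlap_percent / 100))
    return " ".join(clauses[-keep:])


previous = (
    "The model is large, but it trains quickly; "
    "results improve with data. Ethics must be considered."
)
print(clause_overlap(previous, 30))
```

Because the cut falls on a clause boundary, the carried-over text never starts mid-phrase, which is the property the release note describes.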

🔥 Why Chunklet?

Feature	Why it’s elite
⛓️ Hybrid Mode	Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks.
🌐 Multilingual Fallbacks	CRF > Moses > Regex, with dynamic confidence detection.
➿ Clause-Level Overlap	`overlap_percent` operates at the clause level: overlap is cut at clause punctuation (`,`, `;`, `…`), which preserves semantic flow across chunks.
⚡ Parallel Batch Processing	Multi-core acceleration with `mpire`.
♻️ LRU Caching	Smart memoization via `functools.lru_cache`.
🪄 Pluggable Token Counters	Swap in GPT-2, BPE, or your own tokenizer.
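
The caching and pluggable-counter rows can be illustrated together. The sketch below shows how `functools.lru_cache` memoizes a token counter. It says nothing about how Chunklet applies caching internally; the function name `cached_word_count` and the cache size are assumptions.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)  # cache size chosen arbitrarily for this sketch
def cached_word_count(text: str) -> int:
    """Count whitespace-separated words, memoized per input string."""
    return len(text.split())


cached_word_count("chunk smarter, not harder")  # computed on the first call
cached_word_count("chunk smarter, not harder")  # answered from the cache
info = cached_word_count.cache_info()
print(info.hits, info.misses)
```

Any callable that takes a string and returns an integer, like this one, has the same shape as the `token_counter` functions used in the examples in this README.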

🧩 Chunking Modes

Pick your flavor:

"sentence" — chunk by sentence count only
"token" — chunk by token count only
"hybrid" — sentence + token thresholds respected with guaranteed overlap
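
The difference between the three modes can be sketched with a toy grouper. This is an illustration only: `toy_chunk` and `split_sentences` are hypothetical names, the sentence splitter is a naive regex, and overlap is left out entirely. It is not how Chunklet itself is implemented.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_chunk(
    text: str,
    mode: str,
    max_sentences: int = 2,
    max_tokens: int = 8,
    token_counter: Callable[[str], int] = lambda t: len(t.split()),
) -> List[str]:
    """Group sentences greedily under the limits that `mode` enables (no overlap)."""
    if mode not in {"sentence", "token", "hybrid"}:
        raise ValueError(f"unknown mode: {mode!r}")
    use_sentences = mode in {"sentence", "hybrid"}
    use_tokens = mode in {"token", "hybrid"}

    chunks: List[str] = []
    current: List[str] = []
    for sentence in split_sentences(text):
        candidate = current + [sentence]
        too_many_sentences = use_sentences and len(candidate) > max_sentences
        too_many_tokens = use_tokens and token_counter(" ".join(candidate)) > max_tokens
        if current and (too_many_sentences or too_many_tokens):
            # Close the current chunk and start a new one with this sentence.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "AI is broad. It covers learning. Ethics matter. Build wisely."
for mode in ("sentence", "token", "hybrid"):
    print(mode, toy_chunk(text, mode))
```

In this sketch, "hybrid" closes a chunk as soon as either limit would be exceeded, so its chunks never break a limit that "sentence" or "token" alone would enforce.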

📦 Installation

Install chunklet easily from PyPI:

pip install chunklet

To install from source for development:

git clone https://github.com/speed40/chunklet.git
cd chunklet
pip install -e .

💡 Example: Hybrid Mode

from chunklet import Chunklet

def word_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(verbose=True, use_cache=True, token_counter=word_token_counter)

sample = """
This is a long document about AI. It discusses neural networks and deep learning.
The future is exciting. Ethics must be considered. Let’s build wisely.
"""

chunks = chunker.chunk(
    text=sample,
    mode="hybrid",
    max_tokens=20,
    max_sentences=5,
    overlap_percent=30
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

🌀 Batch Chunking (Parallel)

texts = [
    "First document sentence. Second sentence.",
    "Another one. Slightly longer. A third one here.",
    "Final doc with multiple lines. Great for testing chunk overlap."
]

results = chunker.batch_chunk(
    texts=texts,
    mode="hybrid",
    max_tokens=15,
    max_sentences=4,
    overlap_percent=20,
    n_jobs=2
)

for i, doc_chunks in enumerate(results):
    print(f"\n## Document {i+1}")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}:\n{chunk}")
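
The shape of the batch result (one list of chunks per input document, in input order) can be reproduced with the standard library. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` threads in place of the `mpire` processes mentioned above, and both `batch_apply` and `split_on_periods` are hypothetical stand-ins, not part of Chunklet.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def batch_apply(
    texts: List[str],
    chunk_fn: Callable[[str], List[str]],
    n_jobs: int = 2,
) -> List[List[str]]:
    """Run `chunk_fn` over `texts` concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(chunk_fn, texts))


def split_on_periods(text: str) -> List[str]:
    """Stand-in chunker: one chunk per period-terminated sentence."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


docs = [
    "First document sentence. Second sentence.",
    "Another one. Slightly longer.",
]
results = batch_apply(docs, split_on_periods)
print(results)
```

`pool.map` yields results in the order of its inputs, so `results[i]` always belongs to `docs[i]` regardless of which worker finishes first.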

⚙️ GPT-2 Token Count Support

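from chunklet import Chunklet
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer once and reuse it for every call.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def gpt2_token_count(text: str) -> int:
    # Count the GPT-2 BPE tokens in the given text.
    return len(tokenizer.encode(text))

chunker = Chunklet(token_counter=gpt2_token_count)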
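
When pulling in `transformers` is not an option, a dependency-free estimate can fill the same `str -> int` slot. The sketch below is an assumption-heavy approximation: `make_char_ratio_counter` is a hypothetical helper, and the default of four characters per token is only a rough rule of thumb for English text, not a measurement of any tokenizer.

```python
import math
from typing import Callable

TokenCounter = Callable[[str], int]


def make_char_ratio_counter(chars_per_token: float = 4.0) -> TokenCounter:
    """Build a counter that estimates tokens from character length.

    The default ratio is a rough rule of thumb, not a measured value.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")

    def count(text: str) -> int:
        # Round up so a short non-empty text never counts as zero tokens.
        return math.ceil(len(text) / chars_per_token)

    return count


approx_count = make_char_ratio_counter()
print(approx_count("Chunk smarter, not harder."))
```

A counter built this way could be passed as `token_counter=approx_count`, in the same position as `gpt2_token_count` above. Estimates like this can drift noticeably for non-English text, so a real tokenizer is the safer choice when limits are tight.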

🧪 Planned Features

[ ] PDF splitter with metadata
[ ] Code splitting based on interest point
[ ] CLI interface with --file, --mode, --overlap, etc.
[ ] Named chunking presets: "all", "random_gap" for downstream control

🌍 Language Support (30+)

CRF-based: en, fr, de, it, ru, zh, ja, ko, pt, tr, etc.
Heuristic-based: es, nl, da, fi, no, sv, cs, hu, el, ro, etc.
Fallback: All other languages via smart regex
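
The tiered lookup described above can be sketched as a simple priority list. Everything here is illustrative: `select_splitter`, the tier names, and the stand-in splitters are assumptions, the language sets only copy the codes listed above, and the dynamic confidence detection mentioned earlier is not modelled.

```python
import re
from typing import Callable, List, Sequence, Set, Tuple

Splitter = Callable[[str], List[str]]


def regex_split(text: str) -> List[str]:
    """Last-resort splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Stand-ins for real models: both tiers simply reuse the regex splitter here.
crf_split: Splitter = regex_split
heuristic_split: Splitter = regex_split

TIERS: Sequence[Tuple[str, Set[str], Splitter]] = (
    ("crf", {"en", "fr", "de", "it", "ru", "zh", "ja", "ko", "pt", "tr"}, crf_split),
    ("heuristic", {"es", "nl", "da", "fi", "no", "sv", "cs", "hu", "el", "ro"}, heuristic_split),
)


def select_splitter(lang: str) -> Tuple[str, Splitter]:
    """Return the first tier that lists `lang`, else the regex fallback."""
    for name, languages, splitter in TIERS:
        if lang in languages:
            return name, splitter
    return "regex", regex_split


for code in ("en", "es", "sw"):
    tier, splitter = select_splitter(code)
    print(code, tier, splitter("One. Two! Three?"))
```

Keeping the tiers in an ordered sequence means adding a language or a new splitter tier is a data change, not a change to the selection logic.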

💡 Projects that inspire me

Tool	Description
Semchunk	Semantic-aware chunking using transformer embeddings.
CintraAI Code Chunker	AST-based code chunker for intelligent code splitting.

🤝 Contributing

Fork this repo
Create a new feature branch
Code like a star
Submit a pull request

📜 License

MIT License. Use freely, modify boldly, and credit the legend (me, just kidding!).

Release history

Version	Released
1.1.0.post1	Aug 13, 2025
1.1.0	Aug 13, 2025
1.0.4.post5	Aug 2, 2025
1.0.4.post4	Jul 25, 2025
1.0.4.post3	Jul 25, 2025
1.0.4.post2	Jul 25, 2025
1.0.4.post1 (this version)	Jul 25, 2025

Download files

No source distribution is available for this release.

Built distribution: chunklet-1.0.4.post1-py3-none-any.whl
Upload date: Jul 25, 2025
Size: 8.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Algorithm	Hash digest
SHA256	`0ea3951adbec72635c23928ed79438688bcc4cc4d64b06d17798710938a0f0a0`
MD5	`950d9476a14c2a07388bef9f00f833fe`
BLAKE2b-256	`a08d55b63c3ecf343b41e9b7869dbaef218729367a7c97e9b8046d94e4f8213d`
