
📦 Chunklet: Smart Multilingual Text Chunker


Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: Speedyk_005
Version: 1.0.4.post4 (🎉 first stable release)
License: MIT

🚀 What’s New in v1.0.4.post4 (with CLI support)

  • Stable Release: v1.0.4 marks the first fully stable version after extensive refactoring.
  • 🔄 Multiple Refactor Steps: Core code reorganized for clarity, maintainability, and performance.
  • True Clause-Level Overlap: Overlap now occurs on natural clause boundaries (commas, semicolons, etc.) instead of just sentences, preserving semantic flow better.
  • 🛠️ Improved Chunking Logic: Enhanced fallback splitters and overlap calculations to handle edge cases gracefully.
  • Optimized Batch Processing: Parallel chunking now consistently respects token counters and offsets.
  • 🧪 Expanded Test Suite: Comprehensive tests added for multilingual support, caching, and chunk correctness.
  • 🧹 Cleaner Output: Logging filters and redundant docstrings removed to reduce noise during runs.
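The clause-level overlap change splits on punctuation such as commas and semicolons rather than full sentences. A minimal sketch of clause splitting (illustrative only — `split_clauses` is a hypothetical helper, not Chunklet's actual implementation):

```python
import re

def split_clauses(sentence: str) -> list[str]:
    # Split after clause delimiters (comma, semicolon, ellipsis),
    # keeping each delimiter attached to the preceding clause.
    parts = re.split(r"(?<=[,;…])\s+", sentence)
    return [p for p in parts if p]

print(split_clauses("Neural nets learn features; ethics matter, too."))
# → ['Neural nets learn features;', 'ethics matter,', 'too.']
```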

🔥 Why Chunklet?

| Feature | Why it’s elite |
| --- | --- |
| ⛓️ Hybrid Mode | Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks. |
| 🌐 Multilingual Fallbacks | CRF > Moses > Regex, with dynamic confidence detection. |
| Clause-Level Overlap | `overlap_percent` now operates at the clause level, preserving semantic flow across chunks via punctuation cues (`,`, `;`, `…`). |
| Parallel Batch Processing | Multi-core acceleration with `mpire`. |
| ♻️ LRU Caching | Smart memoization via `functools.lru_cache`. |
| 🪄 Pluggable Token Counters | Swap in GPT-2, BPE, or your own tokenizer. |
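The LRU caching feature refers to `functools.lru_cache` from the standard library. A quick sketch of how memoizing a token counter pays off on repeated inputs (illustrative; Chunklet's internal cache key and scope may differ):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_token_count(text: str) -> int:
    # Naive word-count tokenizer; repeated inputs hit the cache.
    return len(text.split())

cached_token_count("repeated input is only counted once")
cached_token_count("repeated input is only counted once")
print(cached_token_count.cache_info())  # hits=1, misses=1 after the two calls
```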

🧩 Chunking Modes

Pick your flavor:

  • "sentence" — chunk by sentence count only
  • "token" — chunk by token count only
  • "hybrid" — sentence + token thresholds respected with guaranteed overlap

📦 Installation

Install chunklet easily from PyPI:

pip install chunklet

To install from source for development:

git clone https://github.com/speed40/chunklet.git
cd chunklet
pip install -e .

💡 Example: Hybrid Mode

from chunklet import Chunklet

def word_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(verbose=True, use_cache=True, token_counter=word_token_counter)

sample = """
This is a long document about AI. It discusses neural networks and deep learning.
The future is exciting. Ethics must be considered. Let’s build wisely.
"""

chunks = chunker.chunk(
    text=sample,
    mode="hybrid",
    max_tokens=20,
    max_sentences=5,
    overlap_percent=30
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

🌀 Batch Chunking (Parallel)

texts = [
    "First document sentence. Second sentence.",
    "Another one. Slightly longer. A third one here.",
    "Final doc with multiple lines. Great for testing chunk overlap."
]

results = chunker.batch_chunk(
    texts=texts,
    mode="hybrid",
    max_tokens=15,
    max_sentences=4,
    overlap_percent=20,
    n_jobs=2
)

for i, doc_chunks in enumerate(results):
    print(f"\n## Document {i+1}")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}:\n{chunk}")

⚙️ GPT-2 Token Count Support

from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def gpt2_token_count(text: str) -> int:
    return len(tokenizer.encode(text))

chunker = Chunklet(token_counter=gpt2_token_count)

🧪 Planned Features

[x] CLI interface with --file, --mode, --overlap, etc.
[ ] Code splitting based on interest points
[ ] PDF splitter with metadata
[ ] Named chunking presets (conceptually "all", "random_gap") for downstream control


🌍 Language Support (30+)

  • CRF-based: en, fr, de, it, ru, zh, ja, ko, pt, tr, etc.
  • Heuristic-based: es, nl, da, fi, no, sv, cs, hu, el, ro, etc.
  • Fallback: All other languages via smart regex
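For languages outside the CRF and heuristic tiers, a regex splitter takes over. The idea can be sketched like this (a naive stand-in, not Chunklet's actual fallback logic):

```python
import re

def regex_sentence_split(text: str) -> list[str]:
    # Split after terminal punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(regex_sentence_split("Hello world. How are you? Fine!"))
# → ['Hello world.', 'How are you?', 'Fine!']
```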

💡 Projects that inspire me

| Tool | Description |
| --- | --- |
| Semchunk | Semantic-aware chunking using transformer embeddings. |
| CintraAI Code Chunker | AST-based code chunker for intelligent code splitting. |

🤝 Contributing

  1. Fork this repo
  2. Create a new feature branch
  3. Code like a star
  4. Submit a pull request

📜 License

MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)

