
MSchunker – Smart text chunker for LLM preprocessing

MSchunker – Intelligent Text Chunking for LLMs

MSchunker is a lightweight, structure-aware, deterministic text chunker designed for modern LLM pipelines.

It transforms long documents into LLM-ready chunks while preserving semantic boundaries and natural writing structure.
Optimized for:

  • Retrieval-Augmented Generation (RAG)
  • Question Answering (QA)
  • Summarization
  • Memory systems
  • Any workflow requiring precise text segmentation

MSchunker respects document structure (sections → paragraphs → sentences) and provides rich metadata, task-aware defaults, and optional token overlap for cross-chunk continuity.

🔗 Links
• PyPI: https://pypi.org/project/mschunker/
• GitHub: https://github.com/cspnms/MSchunker


Features

  • Structure-aware splitting
    • Detects headings, sections, paragraphs, and sentences
  • Token / character limits
    • Enforces max_tokens and/or max_chars
  • Hierarchical strategy
    • Paragraphs → sentences → hard-split fallback
  • Optional token overlap
    • Adds context continuity across chunks
  • Rich metadata
    • Section index, paragraph indices, sentence indices, split reasons
  • Deterministic output
    • Same input + same settings → identical chunks
  • Lightweight
    • No heavy NLP / ML dependencies
  • Clean API
    • chunk_text() function
    • Chunker class for stateful use

Installation

From PyPI:

pip install mschunker

Or install the latest version from GitHub:

pip install git+https://github.com/cspnms/MSchunker.git


⸻

## Quick Start

from mschunker import chunk_text

text = "... your long document ..."

chunks = chunk_text(
    text,
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

for ch in chunks:
    print("---- CHUNK ----")
    print(ch.text[:200], "...")
    print(ch.meta)

## API Reference

### chunk_text(...)

Main function:

chunks = chunk_text(
    text: str,
    max_tokens: int | None = 512,
    max_chars: int | None = None,
    overlap_tokens: int = 64,
    strategy: str = "auto",          # or "fixed"
    token_counter: callable | None = None,
    source_id: str | None = None,
    task: str | None = None,         # rag | qa | summarization | memory
)
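
The token_counter parameter lets a caller supply a custom counting function. The sketch below is only an illustration: it assumes the callable receives a string and returns an integer, which the signature above does not spell out, and approx_token_counter is a hypothetical helper rather than part of MSchunker.

```python
import re

# Hypothetical helper, not part of MSchunker.
# Assumption: token_counter is called with a str and returns an int.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def approx_token_counter(text: str) -> int:
    """Count each word run and each punctuation mark as one token."""
    return len(_TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(approx_token_counter("Hello, world!"))  # 4
```

Under that assumption it would be passed as chunk_text(text, token_counter=approx_token_counter). A real pipeline would more likely wrap the tokenizer of the target model so that max_tokens matches what the model actually sees.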

Returns: List[Chunk]

### Chunker: Stateful Wrapper

from mschunker import Chunker

c = Chunker(
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

chunks = c.chunk(text, source_id="doc-1")

## Chunk Data Model

Each Chunk contains:

  • .text: the chunk content
  • .meta: metadata, including:
    • section_index
    • section_heading
    • paragraph_indices
    • sentence_indices
    • split_reason
    • strategy
    • chunk_index
    • overlap_from_prev
    • overlap_tokens
    • source_id
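
To make the shape of this metadata concrete, here is a minimal stand-in for a chunk. It is a sketch under assumptions: ChunkSketch and in_section are hypothetical names, storing meta as a plain dict is a guess, and the field values and types are illustrative rather than taken from the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ChunkSketch:
    """Illustrative stand-in for a chunk; not MSchunker's Chunk class."""
    text: str
    # Assumption: metadata is modelled here as a plain dict.
    meta: Dict[str, Any] = field(default_factory=dict)


example = ChunkSketch(
    text="Introduction paragraph ...",
    meta={
        "section_index": 0,
        "section_heading": "Introduction",
        "paragraph_indices": (0, 1),
        "sentence_indices": None,        # assumed value
        "split_reason": "paragraph_boundary",
        "strategy": "auto",
        "chunk_index": 0,
        "overlap_from_prev": False,      # assumed type
        "overlap_tokens": 0,
        "source_id": "doc-1",
    },
)


def in_section(chunk: ChunkSketch, section_index: int) -> bool:
    """Typical downstream use: keep only chunks from one section."""
    return chunk.meta.get("section_index") == section_index
```

Filtering on fields such as section_index or source_id is the kind of use this metadata supports in a retrieval pipeline, for example to restrict results to one document or one section.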

⸻

##  Utilities

### analyze_chunks(chunks)

from mschunker import analyze_chunks

stats = analyze_chunks(chunks)
print(stats)

Example:

{
  "num_chunks": 12,
  "min_tokens": 118,
  "max_tokens": 482,
  "avg_tokens": 311.9
}

### explain_chunk(chunk)

from mschunker import explain_chunk

print(explain_chunk(chunks[0]))

Example result:

Strategy: auto | Split reason: paragraph_boundary |
Section #0 heading='Introduction' |
Paragraphs: (0, 1) | Chunk index: 0

## How MSchunker Works

MSchunker uses a hierarchical, structure-preserving algorithm:
  1. Sections / Headings
  2. Paragraphs
  3. Sentences
  4. Hard splits (fallback)

Splitting at the coarsest boundary that still fits the size limit keeps each chunk coherent and within the LLM's input budget.
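
The fallback order above can be sketched in a few lines. This is not MSchunker's implementation: it skips heading detection, counts tokens by whitespace, and the names count_tokens, split_hierarchical, and pack are all hypothetical.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    # Whitespace word count as a stand-in for a real tokenizer.
    return len(text.split())


def pack(pieces: List[str], max_tokens: int) -> List[str]:
    """Greedily merge adjacent pieces while staying within the limit."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for piece in pieces:
        n = count_tokens(piece)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(piece)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def split_hierarchical(text: str, max_tokens: int) -> List[str]:
    """Try paragraphs first, then sentences, then hard word-level splits."""
    pieces: List[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if count_tokens(para) <= max_tokens:
            pieces.append(para)
            continue
        # Paragraph too long: fall back to sentence boundaries.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if count_tokens(sent) <= max_tokens:
                pieces.append(sent)
                continue
            # Sentence too long: hard split every max_tokens words.
            words = sent.split()
            for i in range(0, len(words), max_tokens):
                pieces.append(" ".join(words[i:i + max_tokens]))
    return pack(pieces, max_tokens)
```

The sketch is deterministic for a given input and limit, and it only reaches for a hard split when neither a paragraph nor a sentence fits.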

overlap_tokens adds cross-chunk continuity by carrying tokens over from the previous chunk, which is useful for RAG and QA systems.
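
One way to picture token overlap is to prefix each chunk with the tail of the chunk before it. The helper below is a hypothetical illustration using whitespace tokens; how MSchunker selects the overlap, and whether it counts toward max_tokens, is not stated here.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    """Prefix every chunk after the first with the tail of its predecessor."""
    if overlap_tokens <= 0:
        return list(chunks)
    out: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        # Take the tail from the original previous chunk, not the
        # already-overlapped one, so overlap does not accumulate.
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail + [chunk]))
    return out


if __name__ == "__main__":
    print(add_overlap(["a b c", "d e", "f"], 2))
    # ['a b c', 'b c d e', 'd e f']
```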

⸻

##  License

MIT License © 2025 MS

⸻

##  Contributing

Issues and pull requests are welcome.
MSchunker is designed to evolve into a fully intelligent, future-proof chunking engine.
