MSchunker – Smart text chunker for LLM preprocessing

MSchunker – Intelligent Text Chunking for LLMs

MSchunker is a lightweight, structure-aware, and deterministic text chunker designed for modern LLM pipelines.

It transforms long documents into LLM-ready chunks that maintain semantic integrity and are optimized for:

  • Retrieval-Augmented Generation (RAG)
  • Question Answering (QA)
  • Summarization
  • Memory systems
  • Any workflow requiring precise text segmentation

MSchunker respects natural document structure (sections → paragraphs → sentences) and provides rich metadata, task-aware defaults, and optional overlap for cross-chunk context.


Features

  • Structure-aware splitting
    • Detects headings, sections, paragraphs, and sentences
  • Token/character limits
    • Enforces max_tokens and/or max_chars
  • Hierarchical strategy
    • Paragraphs → sentences → hard splits (fallback)
  • Optional token overlap
    • Adds continuity across chunks
  • Rich metadata
    • Section index, paragraph indices, sentence indices, split reasons, offsets
  • Deterministic output
    • Same input + same settings → identical chunks
  • Lightweight
    • Zero heavy NLP / ML dependencies
  • Clean, simple API
    • chunk_text(...) handles everything
    • Chunker for stateful usage
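
The deterministic-output claim can be checked in a test suite by fingerprinting the chunk texts and comparing several runs. The sketch below is illustrative only: `fingerprint`, `is_deterministic`, and `toy_chunker` are hypothetical helpers, and `toy_chunker` is a stand-in for any chunking function, not MSchunker's algorithm.

```python
import hashlib
from typing import Callable, List


def fingerprint(chunk_texts: List[str]) -> str:
    """Return a SHA-256 digest over an ordered list of chunk texts."""
    h = hashlib.sha256()
    for t in chunk_texts:
        data = t.encode("utf-8")
        # Length-prefix each chunk so different boundaries give different digests.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def toy_chunker(text: str, max_chars: int = 40) -> List[str]:
    """Hypothetical stand-in chunker: fixed-width character windows."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def is_deterministic(chunker: Callable[[str], List[str]], text: str, runs: int = 3) -> bool:
    """Run the chunker several times and check that all fingerprints agree."""
    digests = {fingerprint(chunker(text)) for _ in range(runs)}
    return len(digests) == 1


sample = "MSchunker splits long documents. " * 5
print(is_deterministic(toy_chunker, sample))
```

To apply the same check to real output, a wrapper that returns `[ch.text for ch in chunk_text(...)]` could be passed in place of `toy_chunker`.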

Installation

Install from PyPI:

pip install mschunker

Or directly from GitHub:

pip install git+https://github.com/cspnms/MSchunker.git


⸻

Quickstart

from mschunker import chunk_text

text = "... your long document ..."

chunks = chunk_text(
    text,
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

for ch in chunks:
    print("---- CHUNK ----")
    print(ch.text[:200], "...")
    print(ch.meta)


⸻

API Reference

chunk_text(...) – Main function

chunk_text(
    text: str,
    max_tokens: int | None = 512,
    max_chars: int | None = None,
    overlap_tokens: int = 64,
    strategy: str = "auto",
    token_counter: callable | None = None,
    source_id: str | None = None,
    task: str | None = None,   # "rag" | "qa" | "summarization" | "memory"
)

Returns: List[Chunk]
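
The signature above does not spell out what `token_counter` receives or returns. A minimal sketch, assuming it is called with a string and must return an integer token count; `approx_token_count` is a hypothetical helper, not part of MSchunker:

```python
import re

# Assumption: a token counter takes a string and returns an integer count.
# This one counts word-like runs and single punctuation marks as tokens.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def approx_token_count(text: str) -> int:
    """Rough, dependency-free token estimate."""
    return len(_TOKEN_RE.findall(text))


print(approx_token_count("Hello, world!"))  # 4
```

If that assumption holds, the function would be passed as `token_counter=approx_token_count`. A model-specific tokenizer could be wrapped the same way when exact counts matter.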

⸻

Chunker – Stateful wrapper

from mschunker import Chunker

c = Chunker(
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

chunks = c.chunk(text, source_id="doc-1")


⸻

Chunk – Data Model

Each chunk contains:

  • .text – content
  • .meta – dictionary with:
    • section_index
    • section_heading
    • paragraph_indices
    • sentence_indices
    • split_reason
    • strategy
    • chunk_index
    • overlap_from_prev
    • overlap_tokens
    • source_id
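
The metadata keys can drive downstream grouping or filtering. The sketch below groups chunk texts by section heading. `FakeChunk` is a hypothetical stand-in that only mimics the `.text` and `.meta` attributes listed above, and the value types (an integer `chunk_index`, a string or `None` for `section_heading`) are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeChunk:
    """Hypothetical stand-in exposing .text and .meta like the list above."""
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


def group_by_section(chunks: List[FakeChunk]) -> Dict[str, List[str]]:
    """Map each section heading to the texts of its chunks, in chunk order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ch in sorted(chunks, key=lambda c: c.meta["chunk_index"]):
        heading = ch.meta.get("section_heading") or "(no heading)"
        grouped[heading].append(ch.text)
    return dict(grouped)


chunks = [
    FakeChunk("Second part.", {"chunk_index": 1, "section_heading": "Introduction"}),
    FakeChunk("First part.", {"chunk_index": 0, "section_heading": "Introduction"}),
    FakeChunk("Details.", {"chunk_index": 2, "section_heading": None}),
]
print(group_by_section(chunks))
```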

⸻

analyze_chunks(chunks) – Statistics

from mschunker import analyze_chunks

stats = analyze_chunks(chunks)
print(stats)

Example output:

{
  "num_chunks": 12,
  "min_tokens": 118,
  "max_tokens": 482,
  "avg_tokens": 311.9
}
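
For readers who want to see how such figures relate to per-chunk token counts, here is a minimal sketch that derives the same four fields from a list of counts. `summarize_token_counts` is a hypothetical helper, and the one-decimal rounding is an assumption based on the example output; this is not the library's implementation.

```python
from typing import Dict, List, Union


def summarize_token_counts(counts: List[int]) -> Dict[str, Union[int, float]]:
    """Return chunk count plus min, max, and mean tokens per chunk."""
    if not counts:
        # Assumption: an empty input reports zeros rather than raising.
        return {"num_chunks": 0, "min_tokens": 0, "max_tokens": 0, "avg_tokens": 0.0}
    return {
        "num_chunks": len(counts),
        "min_tokens": min(counts),
        "max_tokens": max(counts),
        "avg_tokens": round(sum(counts) / len(counts), 1),
    }


print(summarize_token_counts([118, 300, 482]))
```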


⸻

explain_chunk(chunk) – Human-readable explanation

from mschunker import explain_chunk

print(explain_chunk(chunks[0]))

Example:

Strategy: auto | Split reason: paragraph_boundary |
Section #0 heading='Introduction' |
Paragraphs: (0, 1) | Chunk index: 0


⸻

How MSchunker Works

MSchunker uses a hierarchical, structure-preserving algorithm:

  1. Sections / Headings
  2. Paragraphs
  3. Sentences
  4. Hard splits (when paragraphs or sentences exceed limits)

This ensures chunks are semantically coherent and optimized for LLM input.
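
The paragraph, sentence, and hard-split levels of that hierarchy can be sketched in a few lines. This is a simplified illustration, not MSchunker's code: it skips heading detection, counts tokens by whitespace, uses a naive sentence regex, and all function names are hypothetical.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Whitespace token count (a simplifying assumption)."""
    return len(text.split())


def hard_split(text: str, max_tokens: int) -> List[str]:
    """Last resort: cut into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def split_units(text: str, max_tokens: int) -> List[str]:
    """Break text into units that fit the limit: paragraphs, then sentences, then hard splits."""
    units: List[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # normalize internal whitespace
        if not para:
            continue
        if count_tokens(para) <= max_tokens:
            units.append(para)
            continue
        # Paragraph too long: fall back to sentences.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if count_tokens(sent) <= max_tokens:
                units.append(sent)
            else:
                # Sentence too long: fall back to hard splits.
                units.extend(hard_split(sent, max_tokens))
    return units


def pack(units: List[str], max_tokens: int) -> List[str]:
    """Greedily merge consecutive units while the chunk stays within the limit."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for unit in units:
        n = count_tokens(unit)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(unit)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


doc = (
    "Intro paragraph here.\n\n"
    "One two three four. Five six seven eight nine ten eleven twelve.\n\n"
    "Tail."
)
for chunk in pack(split_units(doc, max_tokens=5), max_tokens=5):
    print(repr(chunk))
```

With `max_tokens=5`, the second paragraph is too long and falls back to sentences, and its second sentence is too long and falls back to a hard split, so every resulting chunk stays within the limit.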

Optional overlap_tokens adds continuity across chunks, which is useful for RAG and QA.
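
As a rough picture of what overlap means, the sketch below prefixes each chunk with the last few whitespace tokens of the previous one. `add_overlap` is a hypothetical helper; how MSchunker tokenizes the overlap, and whether overlap counts toward `max_tokens`, is not stated here and is not assumed.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    """Prefix every chunk except the first with the tail of the previous chunk."""
    if overlap_tokens <= 0:
        return list(chunks)
    out: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        # Take the tail from the original previous chunk, not the overlapped one,
        # so the overlap never compounds from chunk to chunk.
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail + [chunk]))
    return out


print(add_overlap(["a b c d", "e f g", "h i"], 2))
# ['a b c d', 'c d e f g', 'f g h i']
```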

⸻

Design Principles

  • Semantic integrity first
    • Meaning preserved whenever possible.
  • Deterministic and transparent
    • Output + reasoning are reproducible.
  • Lightweight
    • No NLP or transformer dependencies.
  • Extensible foundation
    • Future roadmap:
      • Semantic (embedding-aware) chunking
      • Multi-granularity chunk outputs
      • Benchmark-driven tuning
      • RAG framework adapters

⸻

License

MIT License © 2025 MS

⸻

Contributing

Issues and pull requests are welcome.
MSchunker is designed to evolve into a fully intelligent, future-proof chunking engine.
