
MSchunker – Smart text chunker for LLM preprocessing


MSchunker – Intelligent Text Chunking for LLMs

MSchunker is a lightweight, structure-aware, and deterministic text chunker designed for modern LLM pipelines.

It transforms long documents into LLM-ready chunks that maintain semantic integrity and are optimized for:

  • Retrieval-Augmented Generation (RAG)
  • Question Answering (QA)
  • Summarization
  • Memory systems
  • Any workflow requiring precise text segmentation

MSchunker respects natural document structure (sections → paragraphs → sentences) and provides rich metadata, task-aware defaults, and optional overlap for cross-chunk context.


Features

  • Structure-aware splitting
    • Detects headings, sections, paragraphs, and sentences
  • Token/character limits
    • Enforces max_tokens and/or max_chars
  • Hierarchical strategy
    • Paragraphs → sentences → hard splits (fallback)
  • Optional token overlap
    • Adds continuity across chunks
  • Rich metadata
    • Section index, paragraph indices, sentence indices, split reasons, offsets
  • Deterministic output
    • Same input + same settings → identical chunks
  • Lightweight
    • Zero heavy NLP / ML dependencies
  • Clean, simple API
    • chunk_text(...) handles everything
    • Chunker for stateful usage
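
As a rough illustration of the "Token/character limits" feature, the sketch below checks a piece of text against max_tokens and/or max_chars. It is not MSchunker's code: the helper names fits_limits and count_tokens_whitespace are hypothetical, and whitespace splitting is only an assumed stand-in for a real token counter.

```python
from typing import Callable, Optional


def count_tokens_whitespace(text: str) -> int:
    """Approximate a token count by splitting on whitespace (an assumption)."""
    return len(text.split())


def fits_limits(
    text: str,
    max_tokens: Optional[int] = None,
    max_chars: Optional[int] = None,
    token_counter: Callable[[str], int] = count_tokens_whitespace,
) -> bool:
    """Return True when text satisfies every limit that is set (None = no limit)."""
    if max_tokens is not None and token_counter(text) > max_tokens:
        return False
    if max_chars is not None and len(text) > max_chars:
        return False
    return True
```

A limit set to None is simply skipped, which matches the "and/or" wording: either limit can be used alone, or both together.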

Installation

Install from PyPI:

pip install mschunker

Or directly from GitHub:

pip install git+https://github.com/cspnms/MSchunker.git


⸻

Quickstart

from smartchunk import chunk_text

text = "... your long document ..."

chunks = chunk_text(
    text,
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

for ch in chunks:
    print("---- CHUNK ----")
    print(ch.text[:200], "...")
    print(ch.meta)


⸻

API Reference

chunk_text(...): Main function

from typing import Callable

# Signature only; the body is omitted here.
def chunk_text(
    text: str,
    max_tokens: int | None = 512,
    max_chars: int | None = None,
    overlap_tokens: int = 64,
    strategy: str = "auto",
    token_counter: Callable | None = None,
    source_id: str | None = None,
    task: str | None = None,   # "rag" | "qa" | "summarization" | "memory"
): ...

Returns: List[Chunk]
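
The token_counter parameter suggests that token counting can be customized. The sketch below shows one possible counter, assuming (this is not confirmed by the text above) that token_counter is called with a string and returns an integer. The function name regex_token_counter is hypothetical.

```python
import re

# Words and individual punctuation marks each count as one token.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def regex_token_counter(text: str) -> int:
    """Count word runs and punctuation marks as separate tokens."""
    return len(_TOKEN_PATTERN.findall(text))


# Assumed usage, if token_counter accepts a str -> int callable:
# chunks = chunk_text(text, max_tokens=256, token_counter=regex_token_counter)
```

A counter like this tends to sit closer to subword tokenizer counts than plain whitespace splitting, but it is still an approximation. For exact limits, a counter backed by the target model's tokenizer would be the safer choice.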

⸻

Chunker: Stateful wrapper

from smartchunk import Chunker

c = Chunker(
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

chunks = c.chunk(text, source_id="doc-1")


⸻

Chunk: Data Model

Each chunk contains:

  • .text: the chunk content
  • .meta: a dictionary with:
    • section_index
    • section_heading
    • paragraph_indices
    • sentence_indices
    • split_reason
    • strategy
    • chunk_index
    • overlap_from_prev
    • overlap_tokens
    • source_id
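
To make the shape of a chunk concrete, here is a minimal stand-in built from the field names listed above. The dataclass itself, the value types, and the in_section helper are assumptions for illustration; the library's actual Chunk class may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk: text plus a metadata dict."""
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


# Example values are illustrative; the real value types are not documented here.
example = Chunk(
    text="MSchunker splits long documents into chunks.",
    meta={
        "section_index": 0,
        "section_heading": "Introduction",
        "paragraph_indices": (0, 1),
        "sentence_indices": (0, 3),
        "split_reason": "paragraph_boundary",
        "strategy": "auto",
        "chunk_index": 0,
        "overlap_from_prev": False,
        "overlap_tokens": 0,
        "source_id": "doc-1",
    },
)


def in_section(chunk: Chunk, section_index: int) -> bool:
    """Typical downstream use: filter chunks by the section they came from."""
    return chunk.meta.get("section_index") == section_index
```

Metadata like this is what lets a retrieval pipeline cite the source document and section for each retrieved chunk.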

⸻

analyze_chunks(chunks): Statistics

from smartchunk import analyze_chunks

stats = analyze_chunks(chunks)
print(stats)

Example output:

{
  "num_chunks": 12,
  "min_tokens": 118,
  "max_tokens": 482,
  "avg_tokens": 311.9
}
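
Statistics of this shape can be derived from per-chunk token counts alone. The sketch below is an assumed reconstruction, not the library's implementation: summarize_token_counts is a hypothetical name, and rounding the average to one decimal is a guess based on the example output.

```python
from typing import Dict, List, Union


def summarize_token_counts(token_counts: List[int]) -> Dict[str, Union[int, float]]:
    """Summarize a list of per-chunk token counts."""
    if not token_counts:
        # Avoid min()/max() errors and division by zero on empty input.
        return {"num_chunks": 0, "min_tokens": 0, "max_tokens": 0, "avg_tokens": 0.0}
    return {
        "num_chunks": len(token_counts),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
        "avg_tokens": round(sum(token_counts) / len(token_counts), 1),
    }
```

Comparing max_tokens in the result against the configured limit is a quick sanity check that no chunk exceeded it.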


⸻

explain_chunk(chunk): Human-readable explanation

from smartchunk import explain_chunk

print(explain_chunk(chunks[0]))

Example:

Strategy: auto | Split reason: paragraph_boundary |
Section #0 heading='Introduction' |
Paragraphs: (0, 1) | Chunk index: 0
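
An explanation in this format can be assembled directly from a chunk's metadata dictionary. The sketch below is an assumption about how such a string could be built. format_explanation is a hypothetical helper, and it joins everything on one line rather than reproducing the line breaks shown above.

```python
from typing import Any, Dict


def format_explanation(meta: Dict[str, Any]) -> str:
    """Render selected metadata fields as a single pipe-separated line."""
    parts = [
        f"Strategy: {meta.get('strategy')}",
        f"Split reason: {meta.get('split_reason')}",
        f"Section #{meta.get('section_index')} heading={meta.get('section_heading')!r}",
        f"Paragraphs: {meta.get('paragraph_indices')}",
        f"Chunk index: {meta.get('chunk_index')}",
    ]
    return " | ".join(parts)
```

Missing keys render as None instead of raising, which keeps the helper usable on partially populated metadata.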


⸻

How MSchunker Works

MSchunker uses a hierarchical, structure-preserving algorithm:
  1. Sections / Headings
  2. Paragraphs
  3. Sentences
  4. Hard splits (when paragraphs or sentences exceed limits)

Because splits fall on the largest natural boundary that fits within the limits, chunks stay semantically coherent wherever possible and remain well suited to LLM input.
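
The paragraph → sentence → hard-split fallback can be sketched as follows. This is a simplified illustration under stated assumptions, not MSchunker's actual algorithm: it ignores section headings, counts tokens by whitespace, detects sentences with a naive regex, and all function names are hypothetical.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())


def hard_split(text: str, max_tokens: int) -> List[str]:
    """Last resort: cut every max_tokens whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def pack(units: List[str], max_tokens: int, joiner: str) -> List[str]:
    """Greedily merge consecutive units while the merged text stays within the limit."""
    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = unit if not current else current + joiner + unit
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks


def split_hierarchical(text: str, max_tokens: int) -> List[str]:
    """Paragraphs first, then sentences, then hard splits for oversized pieces."""
    pieces: List[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if count_tokens(para) <= max_tokens:
            pieces.append(para)
            continue
        # Paragraph too long: fall back to sentence boundaries.
        sentences: List[str] = []
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if count_tokens(sent) <= max_tokens:
                sentences.append(sent)
            else:
                # Sentence too long: fall back to a hard split.
                sentences.extend(hard_split(sent, max_tokens))
        pieces.extend(pack(sentences, max_tokens, " "))
    # Merge small neighbouring pieces back together up to the limit.
    return pack(pieces, max_tokens, "\n\n")
```

Each level is tried only when the level above cannot satisfy the limit, so a document whose paragraphs all fit is never cut mid-paragraph. The output depends only on the input text and max_tokens, which is what makes this style of chunking deterministic.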

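The optional overlap_tokens setting repeats a few tokens from the end of one chunk at the start of the next, adding continuity across chunks. This is especially useful for RAG and QA.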
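
One simple way to realize overlap is to prefix each chunk with the trailing tokens of its predecessor. The sketch below assumes whitespace tokens and a hypothetical add_overlap helper. Whether MSchunker counts the overlap toward max_tokens, or takes it from the previous chunk's original or overlapped text, is not stated here.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    """Prefix each chunk after the first with the last overlap_tokens words
    of the previous (un-overlapped) chunk."""
    if overlap_tokens <= 0:
        return list(chunks)
    result: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            # The first chunk has no predecessor to borrow from.
            result.append(chunk)
            continue
        tail = chunks[i - 1].split()[-overlap_tokens:]
        result.append(" ".join(tail + [chunk]))
    return result
```

With overlap in place, a question whose answer straddles a chunk boundary has a better chance of matching a single retrieved chunk, at the cost of some duplicated tokens in the index.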

⸻

Design Principles

  • Semantic integrity first
    • Meaning preserved whenever possible.
  • Deterministic and transparent
    • Output + reasoning are reproducible.
  • Lightweight
    • No NLP or transformer dependencies.
  • Extensible foundation

Future roadmap:

  • Semantic (embedding-aware) chunking
  • Multi-granularity chunk outputs
  • Benchmark-driven tuning
  • RAG framework adapters

⸻

License

MIT License © 2025 MS

⸻

Contributing

Issues and pull requests are welcome.
MSchunker is designed to evolve into a fully intelligent, future-proof chunking engine.

