
MSchunker – Smart text chunker for LLM preprocessing


MSchunker – Intelligent Text Chunking for LLMs

MSchunker is a lightweight, structure-aware, and deterministic text chunker designed for modern LLM pipelines.

It transforms long documents into LLM-ready chunks that maintain semantic integrity and are optimized for:

  • Retrieval-Augmented Generation (RAG)
  • Question Answering (QA)
  • Summarization
  • Memory systems
  • Any workflow requiring precise text segmentation

MSchunker respects natural document structure (sections → paragraphs → sentences) and provides rich metadata, task-aware defaults, and optional overlap for cross-chunk context.


Features

  • Structure-aware splitting
    • Detects headings, sections, paragraphs, and sentences
  • Token / character limits
    • Enforces max_tokens and/or max_chars
  • Hierarchical strategy
    • Paragraphs → sentences → hard splits (fallback)
  • Optional token overlap
    • Adds continuity across consecutive chunks
  • Rich metadata
    • Section index, paragraph indices, sentence indices, split reasons, offsets
  • Deterministic output
    • Same input + same settings → identical chunks
  • Lightweight
    • Zero heavy NLP / ML dependencies
  • Clean, simple API
    • chunk_text(...) handles everything
    • Chunker for stateful usage
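
To make the structure-aware splitting feature concrete, here is a minimal, self-contained sketch of detecting sections and paragraphs. It is not MSchunker's actual implementation: the function names (split_sections, _paragraphs) are hypothetical, and it assumes headings are Markdown-style lines beginning with "#", which the library may or may not use as its rule.

```python
import re

# Assumption: a heading is a line starting with 1-6 '#' characters (Markdown style).
HEADING_RE = re.compile(r"^#{1,6}\s+(.*)$")


def _paragraphs(lines):
    """Group lines into paragraphs: runs of text separated by blank lines."""
    blocks = re.split(r"\n\s*\n", "\n".join(lines))
    return [b.strip() for b in blocks if b.strip()]


def split_sections(text):
    """Return a list of (heading, paragraphs) tuples in document order.

    Text that appears before the first heading is returned with heading None.
    """
    sections = []
    heading, lines = None, []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the section collected so far, if it has any content.
            if heading is not None or any(l.strip() for l in lines):
                sections.append((heading, _paragraphs(lines)))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if heading is not None or any(l.strip() for l in lines):
        sections.append((heading, _paragraphs(lines)))
    return sections


doc = "# Intro\n\nFirst para.\n\nSecond para.\n\n# Usage\n\nOnly para."
sections = split_sections(doc)
```

Because the sketch relies only on regular expressions and fixed rules, the same input always yields the same sections, which is the property the "Deterministic output" feature describes.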

Installation

Install directly from GitHub:

pip install git+https://github.com/cspnms/MSchunker.git

Or install from PyPI:

pip install mschunker


⸻

Quickstart

from mschunker import chunk_text

text = "... your long document ..."

chunks = chunk_text(
    text,
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

for ch in chunks:
    print("---- CHUNK ----")
    print(ch.text[:200], "...")
    print(ch.meta)

⸻

API Reference

chunk_text(...): Main function

chunks = chunk_text(
    text: str,
    max_tokens: int | None = 512,
    max_chars: int | None = None,
    overlap_tokens: int = 64,
    strategy: str = "auto",      # or "fixed"
    token_counter: callable | None = None,
    source_id: str | None = None,
    task: str | None = None,     # "rag" | "qa" | "summarization" | "memory"
)

Returns: List[Chunk]
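
The token_counter parameter is not documented further here. As a hedged illustration, the sketch below shows two simple counters that could serve as one, assuming the callable receives a string and returns an integer count (that contract is an assumption, and both function names are hypothetical).

```python
import re


def whitespace_token_counter(text: str) -> int:
    """Approximate the token count as the number of whitespace-separated words."""
    return len(text.split())


def regex_token_counter(text: str) -> int:
    """Count words and punctuation marks as separate tokens.

    This tends to land closer to subword tokenizer counts than plain
    whitespace splitting, but it is still only an approximation.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


# Hypothetical usage, assuming token_counter accepts a str -> int callable:
#   chunks = chunk_text(text, max_tokens=512, token_counter=regex_token_counter)
```

For limits that must match a specific model, a counter backed by that model's tokenizer would be the more accurate choice; these stand-ins only keep the example dependency-free.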

⸻

Chunker: Stateful wrapper

from mschunker import Chunker

c = Chunker(
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

chunks = c.chunk(text, source_id="doc-1")


⸻

Chunk: Data Model

Each chunk contains:

  • .text: the chunk’s content
  • .meta: a dictionary with:
    • section_index
    • section_heading
    • paragraph_indices
    • sentence_indices
    • split_reason
    • strategy
    • chunk_index
    • overlap_from_prev
    • overlap_tokens
    • source_id
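
As a rough picture of this data model, the sketch below mirrors it with a dataclass. ChunkSketch is a hypothetical stand-in, not the library's Chunk class, and the value types shown for each metadata key are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for a chunk: text plus a metadata dictionary."""

    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


# Illustrative values only; the real types and values may differ.
example = ChunkSketch(
    text="MSchunker splits documents along their natural structure.",
    meta={
        "section_index": 0,
        "section_heading": "Introduction",
        "paragraph_indices": (0, 1),
        "sentence_indices": (0, 3),
        "split_reason": "paragraph_boundary",
        "strategy": "auto",
        "chunk_index": 0,
        "overlap_from_prev": False,
        "overlap_tokens": 0,
        "source_id": "doc-1",
    },
)

# Metadata like this makes it easy to filter chunks, e.g. by section heading.
intro_chunks = [c for c in [example] if c.meta["section_heading"] == "Introduction"]
```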

⸻

analyze_chunks(chunks): Chunk statistics

from mschunker import analyze_chunks

stats = analyze_chunks(chunks)
print(stats)

Example:

{
  "num_chunks": 12,
  "min_tokens": 118,
  "max_tokens": 482,
  "avg_tokens": 311.9
}
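
For intuition, statistics of this shape can be derived from a list of per-chunk token counts. The function below is a hypothetical sketch (summarize_token_counts is not part of MSchunker), and rounding the average to one decimal place is an assumption based on the example output above.

```python
def summarize_token_counts(token_counts):
    """Summarize per-chunk token counts into min, max, and average."""
    if not token_counts:
        # Avoid division by zero and min()/max() errors on empty input.
        return {"num_chunks": 0, "min_tokens": 0, "max_tokens": 0, "avg_tokens": 0.0}
    return {
        "num_chunks": len(token_counts),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
        "avg_tokens": round(sum(token_counts) / len(token_counts), 1),
    }


summary = summarize_token_counts([100, 200, 300])
```

A large gap between min_tokens and max_tokens is a useful signal that many chunks end early at structural boundaries, which may or may not be desirable for a given retrieval setup.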


⸻

explain_chunk(chunk): Human-readable explanation

from mschunker import explain_chunk

print(explain_chunk(chunks[0]))

Possible output:

Strategy: auto | Split reason: paragraph_boundary |
Section #0 heading='Introduction' |
Paragraphs: (0, 1) | Chunk index: 0

⸻

How MSchunker Works

MSchunker uses a hierarchical, structure-preserving algorithm:

  1. Sections / Headings
  2. Paragraphs
  3. Sentences
  4. Hard splits (when paragraphs or sentences exceed limits)

This design mirrors how humans write and helps keep each chunk semantically coherent.
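
The hierarchy can be sketched in a few dozen lines. The code below is a simplified illustration under stated assumptions, not the library's algorithm: tokens are approximated by whitespace-separated words, sentences are detected with a naive punctuation rule, section handling is omitted, and every function name is hypothetical.

```python
import re


def count_tokens(text):
    # Assumption: one token per whitespace-separated word.
    return len(text.split())


def split_sentences(paragraph):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def hard_split(text, max_tokens):
    # Last resort: cut every max_tokens words, ignoring structure.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def pack(units, max_tokens, sep):
    """Greedily merge consecutive units while the merged piece stays within the limit."""
    chunks, current = [], []
    for unit in units:
        if current and count_tokens(sep.join(current + [unit])) > max_tokens:
            chunks.append(sep.join(current))
            current = []
        current.append(unit)
    if current:
        chunks.append(sep.join(current))
    return chunks


def hierarchical_chunks(text, max_tokens):
    """Split by paragraphs, then sentences, then hard splits as a fallback."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces = []
    for para in paragraphs:
        if count_tokens(para) <= max_tokens:
            pieces.append(para)  # the paragraph fits as a unit
            continue
        sentences = []
        for sent in split_sentences(para):
            if count_tokens(sent) <= max_tokens:
                sentences.append(sent)
            else:
                sentences.extend(hard_split(sent, max_tokens))
        pieces.extend(pack(sentences, max_tokens, " "))
    return pack(pieces, max_tokens, "\n\n")


sample = "One two three. Four five six seven.\n\nEight nine.\n\na b c d e f g"
result = hierarchical_chunks(sample, max_tokens=5)
```

With max_tokens=5, the first paragraph is too long and breaks at its sentence boundary, the second fits whole, and the third has no sentence boundary, so it falls through to a hard split.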

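Optional overlap (overlap_tokens) repeats the tail of each chunk at the start of the next, giving context continuity across chunks. This suits RAG retrieval and QA workflows, where a relevant passage may straddle a chunk boundary.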
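
One plausible reading of token overlap is sketched below: each chunk after the first is prefixed with the last overlap_tokens words of its predecessor. This is an assumption about the behavior, not a description of MSchunker's internals. In particular, the sketch does not say whether the library counts the overlap against max_tokens, and add_overlap is a hypothetical name.

```python
def add_overlap(chunks, overlap_tokens):
    """Prefix each chunk (except the first) with the tail of the previous chunk.

    Tokens are approximated by whitespace-separated words.
    """
    if overlap_tokens <= 0:
        return list(chunks)
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        # Take the tail from the ORIGINAL previous chunk so overlap does not compound.
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail) + " " + chunk)
    return out


overlapped = add_overlap(["a b c d", "e f g", "h i"], overlap_tokens=2)
```

Taking the tail from the original chunks rather than the already-extended ones keeps the overlap bounded at overlap_tokens per boundary.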

⸻

Design Principles

  • Semantic integrity first
    • Meaning is preserved whenever possible.
  • Deterministic and transparent
    • Output and split reasoning are reproducible and explainable.
  • Lightweight
    • No dependencies on NLP or transformer libraries.
  • Extensible foundation
    • Future roadmap:
      • Semantic chunking (embedding-aware)
      • Multi-granularity chunk outputs
      • Benchmark-driven tuning
      • Integration helpers for RAG frameworks

⸻

License

MIT License © 2025 MS

⸻

Contributing

Issues and pull requests are welcome.
MSchunker is designed to evolve into a more capable chunking engine; see the roadmap under Design Principles.

