
Smart text chunker for LLM preprocessing (sections → paragraphs → sentences → hard splits).

SmartChunk – Intelligent Text Chunking for LLMs

SmartChunk is a lightweight, structure-aware, and deterministic text chunker designed for modern LLM pipelines.

It transforms long documents into LLM-ready chunks that maintain semantic integrity and are optimized for:

  • Retrieval-Augmented Generation (RAG)
  • Question Answering (QA)
  • Summarization
  • Memory systems
  • Any workflow requiring precise text segmentation

SmartChunk respects natural document structure (sections → paragraphs → sentences) and provides rich metadata, task-aware defaults, and optional overlap for cross-chunk context.


Features

  • Structure-aware splitting
    • Detects headings, sections, paragraphs, and sentences
  • Token / character limits
    • Enforces max_tokens and/or max_chars
  • Hierarchical strategy
    • Paragraphs → sentences → hard splits (fallback)
  • Optional token overlap
    • Adds continuity across consecutive chunks
  • Rich metadata
    • Section index, paragraph indices, sentence indices, split reasons, offsets
  • Deterministic output
    • Same input + same settings → identical chunks
  • Lightweight
    • Zero heavy NLP / ML dependencies
  • Clean, simple API
    • chunk_text(...) handles everything
    • Chunker for stateful usage
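
The determinism claim above can be checked in a test suite by fingerprinting chunk output. The following is a minimal sketch; `chunk_fingerprint` is a hypothetical helper and not part of SmartChunk. It hashes a list of chunk texts so that two runs with the same input and settings can be compared.

```python
import hashlib
import json
from typing import Iterable


def chunk_fingerprint(texts: Iterable[str]) -> str:
    """Return a stable SHA-256 digest for an ordered sequence of chunk texts.

    Hypothetical helper for illustration; not a SmartChunk function.
    JSON encoding keeps chunk boundaries visible, so ["a", "b"] and ["ab"]
    produce different digests.
    """
    payload = json.dumps(list(texts), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Assumed usage with the library (not executed here):
# first = chunk_fingerprint(ch.text for ch in chunk_text(text, max_tokens=512))
# second = chunk_fingerprint(ch.text for ch in chunk_text(text, max_tokens=512))
# assert first == second
```

Comparing digests rather than full chunk lists keeps regression tests short and makes an unintended change in splitting behaviour easy to spot.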

Installation

Install directly from GitHub:

pip install git+https://github.com/cspnms/MSchunker.git

(Once published to PyPI:)

pip install smartchunk


⸻

Quickstart

from smartchunk import chunk_text

text = "... your long document ..."

chunks = chunk_text(
    text,
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

for ch in chunks:
    print("---- CHUNK ----")
    print(ch.text[:200], "...")
    print(ch.meta)

API Reference

chunk_text(...): Main function

chunks = chunk_text(
    text: str,
    max_tokens: int | None = 512,
    max_chars: int | None = None,
    overlap_tokens: int = 64,
    strategy: str = "auto",      # or "fixed"
    token_counter: callable | None = None,
    source_id: str | None = None,
    task: str | None = None,     # "rag" | "qa" | "summarization" | "memory"
)

Returns: List[Chunk]
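
The reference above does not state what `token_counter` must look like. The sketch below assumes it is a callable that takes a string and returns an integer token count; both counters are illustrative stand-ins, not functions shipped with SmartChunk.

```python
import re


def whitespace_token_counter(text: str) -> int:
    """Count tokens as whitespace-separated words (a rough proxy)."""
    return len(text.split())


def word_punct_token_counter(text: str) -> int:
    """Count words and punctuation marks as separate tokens.

    This tends to sit closer to subword tokenizers than plain
    whitespace splitting, but it is still only an approximation.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


# Assumed usage (the str -> int signature is an assumption, not documented):
# chunks = chunk_text(text, max_tokens=512, token_counter=word_punct_token_counter)
```

If a model-specific tokenizer is available, wrapping its encode call in a function with the same shape would give limits that match the model more closely.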

⸻

Chunker: Stateful wrapper

from smartchunk import Chunker

c = Chunker(
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

chunks = c.chunk(text, source_id="doc-1")


⸻

Chunk: Data Model

Each chunk contains:
  • .text: the chunk's content
  • .meta: dictionary with:
    • section_index
    • section_heading
    • paragraph_indices
    • sentence_indices
    • split_reason
    • strategy
    • chunk_index
    • overlap_from_prev
    • overlap_tokens
    • source_id
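
To make the shape of a chunk concrete, here is a minimal sketch of a stand-in data class carrying the fields listed above. `ChunkSketch` is hypothetical and is not the library's `Chunk` class. The values for `section_heading`, `paragraph_indices`, `split_reason`, `strategy`, and `chunk_index` follow the sample output shown elsewhere in this README; the remaining value types are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for smartchunk's Chunk; for illustration only."""
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


example = ChunkSketch(
    text="SmartChunk respects natural document structure.",
    meta={
        "section_index": 0,
        "section_heading": "Introduction",
        "paragraph_indices": (0, 1),
        "sentence_indices": (0, 3),      # assumed shape
        "split_reason": "paragraph_boundary",
        "strategy": "auto",
        "chunk_index": 0,
        "overlap_from_prev": False,      # assumed type
        "overlap_tokens": 0,             # assumed meaning: tokens carried over
        "source_id": "doc-1",
    },
)


def chunks_for_source(chunks: List[ChunkSketch], source_id: str) -> List[ChunkSketch]:
    """Filter chunks by the source_id stored in their metadata."""
    return [c for c in chunks if c.meta.get("source_id") == source_id]
```

Keeping provenance such as `source_id` and `chunk_index` in the metadata makes it possible to trace a retrieved chunk back to its document and position.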

⸻

analyze_chunks(chunks): Chunk statistics

from smartchunk import analyze_chunks

stats = analyze_chunks(chunks)
print(stats)

Example:

{
  "num_chunks": 12,
  "min_tokens": 118,
  "max_tokens": 482,
  "avg_tokens": 311.9
}
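
Statistics of this shape can be computed from per-chunk token counts alone. The sketch below is an assumption about how such a summary might be derived, not the library's implementation; `summarize_token_counts` is a hypothetical name, and tokens are approximated by whitespace-separated words.

```python
from typing import Callable, Dict, List, Union


def summarize_token_counts(
    texts: List[str],
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),
) -> Dict[str, Union[int, float]]:
    """Return num_chunks, min_tokens, max_tokens and avg_tokens for chunk texts."""
    counts = [count_tokens(t) for t in texts]
    if not counts:
        # Avoid min()/max() on an empty list and division by zero.
        return {"num_chunks": 0, "min_tokens": 0, "max_tokens": 0, "avg_tokens": 0.0}
    return {
        "num_chunks": len(counts),
        "min_tokens": min(counts),
        "max_tokens": max(counts),
        "avg_tokens": round(sum(counts) / len(counts), 1),
    }
```

A `max_tokens` value close to the configured limit together with a much lower `min_tokens` usually points to short trailing chunks at section boundaries.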


⸻

explain_chunk(chunk): Human-readable explanation

from smartchunk import explain_chunk

print(explain_chunk(chunks[0]))

Possible output:

Strategy: auto | Split reason: paragraph_boundary |
Section #0 heading='Introduction' |
Paragraphs: (0, 1) | Chunk index: 0

How SmartChunk Works

SmartChunk uses a hierarchical, structure-preserving algorithm:
  1. Sections / Headings
  2. Paragraphs
  3. Sentences
  4. Hard splits (when paragraphs or sentences exceed limits)

This design mirrors how humans write and ensures chunks are semantically coherent.
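
The hierarchy above can be sketched in a few lines. This is a simplified illustration, not SmartChunk's actual code: it skips heading detection, approximates tokens by whitespace-separated words, and uses a naive regex for sentence boundaries. All function names are hypothetical.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Approximate tokens by whitespace-separated words (an assumption)."""
    return len(text.split())


def split_sentences(paragraph: str) -> List[str]:
    """Naive rule: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def hard_split(text: str, max_tokens: int) -> List[str]:
    """Last resort: cut a long sentence into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def pack(units: List[str], max_tokens: int, joiner: str) -> List[str]:
    """Greedily merge consecutive units while the merged text fits the budget."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for unit in units:
        n = count_tokens(unit)
        if current and used + n > max_tokens:
            chunks.append(joiner.join(current))
            current, used = [], 0
        current.append(unit)
        used += n
    if current:
        chunks.append(joiner.join(current))
    return chunks


def hierarchical_chunks(text: str, max_tokens: int) -> List[str]:
    """Paragraphs first, then sentences, then hard splits as a fallback."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units: List[str] = []
    for para in paragraphs:
        if count_tokens(para) <= max_tokens:
            units.append(para)
            continue
        # Paragraph is too long: fall back to sentences.
        pieces: List[str] = []
        for sentence in split_sentences(para):
            if count_tokens(sentence) <= max_tokens:
                pieces.append(sentence)
            else:
                # Sentence is too long: fall back to a hard split.
                pieces.extend(hard_split(sentence, max_tokens))
        units.extend(pack(pieces, max_tokens, " "))
    return pack(units, max_tokens, "\n\n")
```

Because every unit handed to `pack` already fits the budget and units are merged only while the running total stays within it, no chunk produced by this sketch exceeds `max_tokens`.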

Optional overlap (overlap_tokens) adds context continuity across chunks, which is useful for RAG retrieval and QA workflows.
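
One common way to implement overlap is to prepend the last few tokens of the previous chunk to the current one. The sketch below assumes that approach and whitespace tokens; it is not taken from SmartChunk's source, and `add_overlap` is a hypothetical name.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    """Prefix each chunk (except the first) with the tail of the previous chunk.

    The tail is taken from the original previous chunk, so overlap does not
    accumulate from one chunk to the next.
    """
    if overlap_tokens <= 0:
        return list(chunks)
    out: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail) + " " + chunk if tail else chunk)
    return out
```

Note that in this sketch the overlap is added on top of each chunk, so a caller that needs a strict limit would have to reserve `overlap_tokens` from the budget before splitting.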

⸻

Design Principles

  • Semantic integrity first
    • Meaning is preserved whenever possible.
  • Deterministic and transparent
    • Output and split reasoning are reproducible and explainable.
  • Lightweight
    • No dependencies on NLP or transformer libraries.
  • Extensible foundation
    • Future roadmap:
      • Semantic chunking (embedding-aware)
      • Multi-granularity chunk outputs
      • Benchmark-driven tuning
      • Integration helpers for RAG frameworks

⸻

License

MIT License © 2025 MS

⸻

Contributing

Issues and pull requests are welcome.
SmartChunk is designed to evolve into a fully intelligent, future-proof chunking engine.
