snipsplit
Token-aware text chunker for RAG ingestion. Sentence-respecting, overlap-friendly. Rust core, Python frontend.
The problem
Your RAG ingestion pipeline needs to split long documents into chunks that fit a context budget. Naive token-window chunking cuts mid-sentence and degrades retrieval. Sentence-only splitting blows the token budget on legal/medical text. The right thing is to greedy-pack sentences into a token-budgeted window, with optional overlap, and fall back to token-level slicing only when a single sentence is genuinely too long.
snipsplit does exactly that, in Rust, fast enough that bulk ingestion
of 100k documents runs in seconds rather than minutes.
Install
pip install snipsplit
30-second quickstart
from snipsplit import Chunker
chunker = Chunker(max_tokens=512, overlap_tokens=64, encoding="cl100k_base")
text = open("long_document.txt").read()
for chunk in chunker.split(text):
    print(chunk.token_count, chunk.start, chunk.end, chunk.text[:60])
For batch ingestion across many docs:
texts = [open(p).read() for p in paths]
all_chunks = chunker.split_many(texts, parallel=True) # list[list[Chunk]]
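When indexing the batch result, it usually helps to flatten the nested list while keeping track of which document each chunk came from. The sketch below assumes the inner lists come back in the same order as the input texts, which the signature suggests but this README does not state. `flatten_with_source` is a hypothetical helper, not part of snipsplit, and plain strings stand in for `Chunk` objects so the example runs without the package.

```python
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def flatten_with_source(
    paths: Sequence[str], per_doc: Iterable[Sequence[T]]
) -> Iterator[tuple[str, int, T]]:
    """Yield (path, chunk_index, chunk) for every chunk of every document."""
    # Assumption: per_doc[i] holds the chunks for paths[i].
    for path, chunks in zip(paths, per_doc):
        for index, chunk in enumerate(chunks):
            yield path, index, chunk


# Stand-in data: strings play the role of Chunk objects here.
paths = ["a.txt", "b.txt"]
per_doc = [["a-0", "a-1"], ["b-0"]]
records = list(flatten_with_source(paths, per_doc))
```

With real data, `per_doc` would be the value returned by `chunker.split_many(texts)`.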
API
class Chunker:
    def __init__(
        self,
        *,
        max_tokens: int = 512,
        overlap_tokens: int = 0,
        min_tokens: int = 1,
        encoding: str = "cl100k_base",  # or "o200k_base"
    ) -> None: ...

    def split(self, text: str) -> list[Chunk]: ...

    def split_many(self, texts: Sequence[str], *, parallel: bool = False) -> list[list[Chunk]]: ...

class Chunk:
    text: str
    start: int        # byte offset in the original text
    end: int          # byte offset (exclusive)
    token_count: int  # exact BPE token count
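Because `start` and `end` are byte offsets, they do not line up with Python string indices once the text contains non-ASCII characters. The sketch below shows one way to recover a span, assuming the offsets index the UTF-8 encoding of the original text (likely for a Rust core, but not stated here). `slice_by_bytes` is a hypothetical helper, not part of snipsplit.

```python
def slice_by_bytes(original: str, start: int, end: int) -> str:
    """Recover a span from byte offsets, assuming they index the UTF-8 encoding."""
    return original.encode("utf-8")[start:end].decode("utf-8")


original = "Café au lait. Très bien."

# "Café au lait." is 13 characters but 14 UTF-8 bytes, because "é" takes two bytes.
by_bytes = slice_by_bytes(original, 0, 14)

# Slicing the str directly with the same numbers picks up one character too many.
by_chars = original[0:14]
```

For ASCII-only text the two slices agree, so the difference only shows up on accented or non-Latin input.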
Algorithm
- Split into paragraphs on \n{2,}, then sentences on [.!?]\s+ plus a handful of abbreviations (Mr., Dr., e.g., etc.).
- Greedy-pack sentences into a chunk while the running token count is <= max_tokens.
- If a single sentence exceeds max_tokens on its own, slice it at token boundaries (BPE) instead.
- Apply overlap_tokens by re-prepending the last N tokens of each chunk to the next.
- Drop chunks shorter than min_tokens.
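The steps above can be sketched in pure Python. This is an illustration under stated assumptions, not the Rust implementation: whitespace-separated words stand in for BPE tokens, abbreviation handling is omitted, and the packing budget is reduced by `overlap_tokens` so that chunks still fit `max_tokens` after the overlap is prepended (the README does not say whether snipsplit does this). The names `tokenize`, `split_sentences`, and `greedy_chunks` are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: whitespace-separated words stand in for BPE tokens.
    return text.split()


def split_sentences(text: str) -> list[str]:
    # Paragraphs on blank lines, then sentences after ., ! or ? plus whitespace.
    # Abbreviation handling (Mr., Dr., e.g.) is deliberately left out of this sketch.
    sentences = []
    for paragraph in re.split(r"\n{2,}", text):
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if sentence:
                sentences.append(sentence)
    return sentences


def greedy_chunks(text, max_tokens, overlap_tokens=0, min_tokens=1):
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    # Assumption: reserve room for the overlap so final chunks stay within max_tokens.
    budget = max_tokens - overlap_tokens

    # Turn sentences into packable units, slicing oversized ones at token boundaries.
    units = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if len(tokens) <= budget:
            units.append(tokens)
        else:
            units.extend(tokens[i:i + budget] for i in range(0, len(tokens), budget))

    # Greedy packing: keep adding units until the next one would exceed the budget.
    chunks, current = [], []
    for unit in units:
        if current and len(current) + len(unit) > budget:
            chunks.append(current)
            current = []
        current.extend(unit)
    if current:
        chunks.append(current)

    # Overlap: prepend the last N tokens of the previous (pre-overlap) chunk.
    if overlap_tokens and chunks:
        chunks = [chunks[0]] + [
            prev[-overlap_tokens:] + cur for prev, cur in zip(chunks, chunks[1:])
        ]

    # Drop chunks below the minimum size and join tokens back into text.
    return [" ".join(c) for c in chunks if len(c) >= min_tokens]


sample = (
    "One two three. Four five six seven. "
    "Eight nine ten eleven twelve thirteen fourteen fifteen sixteen.\n\n"
    "Last one here."
)
result = greedy_chunks(sample, max_tokens=6, overlap_tokens=2)
```

Joining tokens with spaces loses the original whitespace and offsets, which is why a real chunker tracks byte positions alongside the tokens.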
License
Dual-licensed under MIT or Apache-2.0 at your option.