Text chunking for retrieval-augmented generation.

fancychunk

Markdown chunking for RAG that attempts to craft artisanal, meaningful chunks while remaining reasonably fast and efficient.

pip install 'fancychunk[torch]'     # most users: qwen3, bge_m3 via torch + transformers
pip install 'fancychunk[mlx]'       # macOS arm64: same models via Apple MLX (~2-4× faster)
pip install 'fancychunk[all]'       # both backends
pip install fancychunk              # no backend: structural-only chunking via noop()

The base install is ~180 MB; [torch] adds ~750 MB on CPU Linux, ~2.5 GB on CUDA Linux, ~80 MB on macOS; [mlx] adds ~40 MB on Apple Silicon (no-op elsewhere). Pick the backend you need.

How it compares

Traditional chunkers split at character or token counts, possibly including a recursive separator list to dodge the worst cuts. This is fast and efficient, but can lead to awkward breaks and chunks that don't capture a particular idea well. Other chunkers use an LLM to find meaningful semantic boundaries, but this is slow and expensive, and can be inconsistent.

fancychunk aims for a middle ground: meaningful chunks, produced reasonably quickly. It combines Markdown structure with several small, local models to produce correctly sized chunks that capture the semantics of the underlying text.

[insert benchmark results: MB/sec throughput and example NDCG@10/Recall@10/MRR@10 stats from ragkit. compare:

  • simple token-count splitter from langchain
  • chonkie's recursive splitter
  • chonkie semantic splitter
  • fancychunk]

Quick start

fancychunk is async-first — the entry points that touch an embedder are async def. Sync callers wrap with asyncio.run(...).

The recommended pipeline is split, then embed each chunk in isolation: split_sentences → split_chunklets → split_chunks, then embed_chunklets on the final chunks for your storage vectors.

import asyncio
from fancychunk import split_sentences, split_chunklets, split_chunks
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()                                 # probably the right pick
    doc = open("my-document.md").read()

    sentences = split_sentences(doc, max_len=2048)
    chunklets = split_chunklets(sentences, max_size=2048)
    chunks    = await split_chunks(chunklets, embedder, max_size=2048)
    vectors   = await embedder.embed_chunklets([c.text for c in chunks])
    # chunks[i].text is the chunk content; chunks[i].start / .end are
    # character offsets into the original document; vectors[i] is the
    # isolated storage embedding for chunks[i]. Drop straight into your
    # store.

asyncio.run(main())

Prefer one call? chunk_document(doc, embedder) runs structure-first chunking (see Structure-first chunking): it splits on the document's heading structure, only reaching for the slow models on sections too big to emit whole. Its final embedding pass uses late chunking (see Late chunking) — an enrichment that benchmarked below the plain isolated-chunk path, so if you want isolated embeddings, run the structural split and embed the chunks yourself (shown in Building blocks).

Each chunk is a Chunk — a frozen dataclass with text (always present) plus optional metadata:

  • start / end — half-open character offsets, so document[chunk.start:chunk.end] == chunk.text.
  • heading_path — tuple of full Markdown heading lines in scope at the chunk's start, e.g. ("# Top", "## **Bold** Sub"). Marker count encodes level; inline formatting preserved. Useful for filter-by-section in your vector store, breadcrumb rendering, or attaching as metadata. () means no heading in scope.

More optional fields may be added over time without breaking existing code.

Building blocks

The primitives compose directly — that's the plain pipeline shown in Quick start. Reach for them when you want more control: different embedders per stage, different max_size per stage, a structural-only split, storage-time heading breadcrumbs, or to swap the final embedding pass for the experimental late-chunking enrichment.

split_sentences and split_chunklets are sync (no await points); split_chunks, embed_chunklets, and embed_with_late_chunking are async (they call the embedder). To add storage-time heading breadcrumbs and embed each chunk in isolation:

import asyncio
from fancychunk import (
    split_sentences,
    split_chunklets,
    split_chunks,
    enrich_with_headings,
)
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()
    doc = open("my-document.md").read()

    sentences = split_sentences(doc, max_len=2048)
    chunklets = split_chunklets(sentences, max_size=2048)
    chunks    = await split_chunks(chunklets, embedder, max_size=2048)
    chunks    = enrich_with_headings(chunks)               # optional breadcrumbs
    vectors   = await embedder.embed_chunklets([c.text for c in chunks])

asyncio.run(main())

To experiment with late chunking instead, swap the last line for vectors = await embed_with_late_chunking(chunks, embedder) — but see the caveat first.

What it does

fancychunk treats chunking as three separable problems, each solved by its own optimization against its own signal:

document  →  split_sentences  →  split_chunklets  →  split_chunks  →  chunks
              (punctuation +     (Markdown headings,    (cosine of
               SaT segmenter)     paragraphs, lists)     adjacent chunklets,
                                                         discourse-corrected)

Stage 1 — split_sentences. Punctuation alone misses too many real-world cases (missing terminals, multilingual text, technical abbreviations like "e.g."), so the default segmenter is SaT (Frohmann et al., 2024) from wtpsplit-lite — a learned model that produces per-character boundary probabilities. A sliding-window dynamic-programming pass (O(N) amortised) then picks boundary positions to maximise total score subject to a configurable min/max sentence length.
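
To make the boundary-selection idea concrete, here is a minimal sketch of a dynamic program that picks sentence ends from per-character scores under min/max length limits. It is a plain O(N × max_len) version, not the sliding-window O(N) pass the library describes, and the function name and the "probability minus 0.5" scoring are assumptions made for illustration.

```python
def pick_boundaries(scores, min_len, max_len):
    """Choose sentence end positions that maximise the total boundary score.

    scores[i] is the score for ending a sentence right after character i.
    Returns exclusive end offsets; the last one is always len(scores).
    """
    n = len(scores)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[j]: best total for segmenting text[:j]
    prev = [-1] * (n + 1)        # prev[j]: start of the last sentence ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j - min_len + 1):
            if best[i] == neg_inf:
                continue
            # The boundary at the very end of the text is forced, so it scores 0.
            gain = scores[j - 1] if j < n else 0.0
            if best[i] + gain > best[j]:
                best[j] = best[i] + gain
                prev[j] = i
    if best[n] == neg_inf:
        raise ValueError("no segmentation satisfies the length limits")
    cuts = []
    j = n
    while j > 0:
        cuts.append(j)
        j = prev[j]
    return cuts[::-1]


# Assumed scoring: probability minus 0.5, so unlikely boundaries cost something.
probs = [0.1] * 12
probs[4] = 0.9
probs[9] = 0.95
cuts = pick_boundaries([p - 0.5 for p in probs], min_len=2, max_len=8)
```

Centering the scores matters: with raw probabilities every extra cut adds a positive amount, so the optimum would always split at the minimum length.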

Stage 2 — split_chunklets. Sentences are grouped into chunklets — paragraph-sized units targeting roughly three "statements" of information content each. The signal is Markdown block-level structure and a document-relative statement density measure. A 1-D dynamic-programming pass picks chunklet boundaries big enough to embed meaningfully but small enough that each one stays topically coherent.

Stage 3 — split_chunks. Adjacent chunklets are compared by cosine similarity, then discourse-corrected — the mean of typical chunklets' embeddings is projected out so similarity reflects local topic shifts rather than the document's overall theme (Arora et al., 2017). A third dynamic-programming pass picks split points where adjacent chunklets are least similar (this is "level 4" in Greg Kamradt's 5 Levels of Text Splitting taxonomy), subject to a hard max-size covering constraint.
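
The discourse correction can be sketched in a few lines. The version below is an assumption-laden simplification: it projects out the plain mean of all chunklet embeddings (the library talks about "typical" chunklets, which may be a more selective set), then computes cosine similarity between neighbours. Function names are illustrative only.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def adjacent_similarities(embeddings, correct=True):
    """Cosine similarity between each chunklet and the next one.

    With correct=True, the component along the document's mean embedding
    is removed first, so shared "overall theme" no longer dominates.
    """
    vecs = [list(v) for v in embeddings]
    if correct:
        dim = len(vecs[0])
        mean = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
        d = _unit(mean)
        vecs = [[x - _dot(v, d) * dk for x, dk in zip(v, d)] for v in vecs]
    units = [_unit(v) for v in vecs]
    return [_dot(units[i], units[i + 1]) for i in range(len(units) - 1)]


# Two topics sharing a large common component along the first axis.
embeddings = [[10, 1, 0], [10, 1, 0], [10, 0, 1], [10, 0, 1]]
raw = adjacent_similarities(embeddings, correct=False)
corrected = adjacent_similarities(embeddings, correct=True)
```

In this toy input the raw similarities are all close to 1, so no split point stands out; after the correction the middle pair drops sharply, which is the position a split-point search would favour.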

Structure-first chunking

chunk_document doesn't run the three stages over the whole document. It runs structure-first chunking (split_chunks_structure_first): the document's heading tree is the primary unit, and a section whose whole subtree already fits max_size is emitted as one chunk directly — no sentence segmenter, no embedder call. Only a section that overflows max_size falls back to the three-stage semantic split (split_sentences → split_chunklets → split_chunks) on that span alone.

Two payoffs:

  • Headings land at chunk starts. Because the section is the primary unit, a whole-section-that-fits keeps its heading at the top of the chunk instead of leaving it stranded mid-chunk.
  • The slow models skip already-fitting sections. On a well-sectioned document most text never touches SaT or the embedder, so latency tracks the fraction of the document that overflows.
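
The control flow can be sketched as a recursion over heading levels. This is a simplified illustration, not split_chunks_structure_first itself: sizes are counted in characters, only ATX headings at line start are recognised (a "#" comment inside a code block would be misread as a heading), and the hypothetical hard_wrap fallback stands in for the three-stage semantic split.

```python
import re


def hard_wrap(text, max_size):
    """Hypothetical fallback: fixed-width slices instead of the semantic split."""
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]


def structure_first(text, max_size, fallback=hard_wrap, level=1):
    """Emit sections that fit whole; descend one heading level when they overflow."""
    if not text.strip():
        return []
    if len(text) <= max_size:
        return [text]                      # fits: no segmenter, no embedder
    if level > 6:
        return fallback(text, max_size)    # no headings left to split on
    starts = [m.start() for m in re.finditer(rf"^#{{{level}}} ", text, re.MULTILINE)]
    cuts = sorted({0, len(text), *starts})
    chunks = []
    for a, b in zip(cuts, cuts[1:]):
        chunks.extend(structure_first(text[a:b], max_size, fallback, level + 1))
    return chunks


doc = "# A\nintro\n# B\n## B1\n" + "b" * 30 + "\n## B2\nshort\n"
chunks = structure_first(doc, max_size=40)
```

Section A fits and is emitted untouched; section B overflows, so only B is split further at its subheadings. The four-character "# B" stub this sketch produces is the kind of thin unit a size floor is meant to absorb.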

Tunables beyond max_size:

  • min_size (default 0.35 × max_size) — a chunk-size floor. A unit below it is merged into a neighbor so the partition has no thin stubs; the merge fires only to clear the floor, so distinct sections above it are never packed up to the cap. Pass min_size=0 to keep every heading boundary.
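
One way to picture the floor is a single pass that folds any unit shorter than min_size into its neighbour and otherwise leaves units alone. The sketch below is an assumption about the behaviour, not the library's merge rule: it works on character counts, always prefers the following neighbour, and refuses merges that would exceed max_size.

```python
def merge_stubs(units, min_size, max_size):
    """Fold units shorter than min_size into a neighbour; leave the rest alone."""
    out = []
    for unit in units:
        if out and len(out[-1]) < min_size and len(out[-1]) + len(unit) <= max_size:
            out[-1] += unit              # previous unit was a stub: absorb this one
        else:
            out.append(unit)
    # A stub at the very end has no following neighbour, so fold it backwards.
    if len(out) >= 2 and len(out[-1]) < min_size and len(out[-2]) + len(out[-1]) <= max_size:
        tail = out.pop()
        out[-1] += tail
    return out


max_size = 60
min_size = int(0.35 * max_size)          # the default floor described above
merged = merge_stubs(["# B\n", "x" * 37, "y" * 12], min_size, max_size)
untouched = merge_stubs(["a" * 30, "b" * 30], min_size, max_size)
```

Two units that both clear the floor stay separate even though they would fit under the cap together, which mirrors the "merge fires only to clear the floor" rule.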

For an isolated-embedding (non-late-chunking) pipeline, run the structural split and embed the chunks yourself:

import asyncio
from fancychunk import split_chunks_structure_first
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()
    doc = open("my-document.md").read()
    chunks   = await split_chunks_structure_first(doc, embedder, max_size=2048)
    vectors  = await embedder.embed_chunklets([c.text for c in chunks])

asyncio.run(main())

To reach the older whole-document semantic split (no structural pass at all), compose the primitives directly — that's the Quick start pipeline.

Enrichment

fancychunk exposes two enrichment steps that pull document context into each chunk's output: heading-path enrichment (recommended) and late chunking (experimental). chunk_document applies both; the plain pipeline applies neither unless you ask for it.

Late chunking (experimental)

Heads up. Late chunking is kept for experimentation, not as a recommended default. In downstream RAG benchmarking it did not beat plain isolated-chunk embedding: on BEIR/scifact (short abstracts) it gave only ~+2.85% NDCG@10 with a 0.6B embedder and hurt the 8B model (−1.91%); on Qasper (long papers) it lost to isolated embedding at 0.6B and collapsed at 8B, roughly halving chunk-level evidence recall. The cause is vector homogenization — pooling each chunk's tokens out of one shared forward pass pushes every chunk vector to point the same way (within-paper median cosine reached 0.96 vs 0.67 for the healthy isolated baseline), so cosine ranking degrades to noise. The effect is worst on the bundled causal, last-token-pooled Qwen3 models; late chunking was designed for bidirectional, mean-pooled encoders, which is why jina_v3() is now bundled as the fair test. Until that holds up, prefer the plain embed_chunklets path.

embed_with_late_chunking(chunks, embedder) produces one context-aware vector per chunk. Instead of embedding each chunk in isolation, the embedder sees windows of adjacent chunks together so attention can resolve anaphora ("the algorithm" picks up its real referent), and the in-scope Markdown heading stack is prepended once per segment as additional preamble (controlled by include_headings=True, on by default). Jina AI's paper reports an MTEB win on bidirectional mean-pooled encoders; the caveat above describes when, and on which architectures, that win failed to reproduce.

Because the heading stack is already in the embedder's input, the embedding already incorporates heading context — there's no need to also prepend headings to the chunk text before embedding. enrich_with_headings is for the stored text only (see below).
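
The pooling step at the heart of late chunking can be sketched as follows. This assumes the mean-pooling variant from the original late-chunking method and a hypothetical input shape (one vector per token for the whole segment, plus a token count per chunk); it is not a description of what embed_segment returns or of how the bundled last-token-pooled models are handled.

```python
def late_chunk_pool(token_vectors, token_counts):
    """Mean-pool one shared forward pass into one vector per chunk.

    token_vectors: one vector per token, for the whole segment, in order.
    token_counts:  number of tokens belonging to each chunk, in order.
    """
    if any(count <= 0 for count in token_counts):
        raise ValueError("every chunk needs at least one token")
    if sum(token_counts) != len(token_vectors):
        raise ValueError("token counts must cover the segment exactly")
    pooled = []
    pos = 0
    for count in token_counts:
        span = token_vectors[pos:pos + count]
        dim = len(span[0])
        pooled.append([sum(vec[k] for vec in span) / count for k in range(dim)])
        pos += count
    return pooled


token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
chunk_vectors = late_chunk_pool(token_vectors, token_counts=[2, 3])
```

Every token vector here comes from the same forward pass, so each one already carries context from the neighbouring chunks. That shared context is the source of both the intended benefit and the homogenization risk described in the caveat.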

Heading-path enrichment

enrich_with_headings(chunks) returns each chunk with the Markdown heading stack in scope at its start prepended (e.g. "# Top\n## Sub\n\n<chunk text>"). This is useful to add context to chunks that might otherwise lack it; for more information, see Out-of-Context Chunk Problem. Note that the late chunking mode already includes this context; only use this method if you're not using late chunking.
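
For intuition, the heading stack and the prepended text can be reproduced with a small helper pair. Both functions are hypothetical illustrations rather than the library's implementation: they recognise only ATX headings at line start and treat a heading as in scope when it begins strictly before the given offset.

```python
import re

_HEADING = re.compile(r"^(#{1,6})[ \t]+\S.*$", re.MULTILINE)


def heading_path_at(document, offset):
    """Heading lines in scope at `offset`, outermost first."""
    stack = []
    for match in _HEADING.finditer(document):
        if match.start() >= offset:
            break
        level = len(match.group(1))
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(0).rstrip()))
    return tuple(line for _, line in stack)


def prepend_headings(text, heading_path):
    """Return the text with its heading breadcrumbs on top, if any."""
    if not heading_path:
        return text
    return "\n".join(heading_path) + "\n\n" + text


doc = "# Top\n\n## One\nalpha\n\n## **Bold** Sub\nbeta text\n"
path = heading_path_at(doc, doc.index("beta"))
enriched = prepend_headings("beta text", path)
```

Keeping the full heading line, markers included, preserves the level information, so the breadcrumb can be rendered or filtered on later without re-parsing the document.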

Models

fancychunk uses two kinds of model: a sentence segmenter (Stage 1) and an embedder (Stage 3 + late chunking). Both are lazy-loaded on first use — importing fancychunk itself is cheap and triggers no network calls, and constructing an embedder (e.g. qwen3_600m()) doesn't load the weights either. Weights cache under ~/.cache/huggingface/ so the download happens once per machine; subsequent process runs hit the cache.

Sentence segmenter — SaT. The default is sat-9l-sm from Segment Any Text (Frohmann et al., 2024) via wtpsplit-lite, shipped as ONNX, run with weighting="hat" inference (which de-weights low-context sliding-window edges). Multilingual, punctuation-agnostic, and exposes per-character boundary probabilities directly — exactly the SPEC-CHUNK-106 contract Stage 1 wants. Three checkpoints are bundled as factories in fancychunk.segmenters, trading speed for scientific-prose quality (see benchmarks/sat-model-selection.md):

from fancychunk import segmenters, split_sentences

# doc is your Markdown text (a str)
split_sentences(doc, segmenter=segmenters.sat_3l())   # fastest; mis-splits "Tab. TABREF21", "SemEval-2014 Task"
split_sentences(doc, segmenter=segmenters.sat_9l())   # default: artifact-free, ~1.3× faster than 12l
split_sentences(doc, segmenter=segmenters.sat_12l())  # highest quality, slowest
split_sentences(doc, segmenter=segmenters.punctuation())  # ~50-line rule-based fallback, no download

SaT only runs on a section that overflows max_size (see Structure-first chunking), so a well-sectioned corpus skips it on most of its text. The sections that do fall back still pay for segmentation, and for corpora of many long sections SaT can dominate. Install onnxruntime-gpu on a CUDA box and the defaults do the right thing — no code changes:

import asyncio
from fancychunk import chunk_documents
from fancychunk.embedders import qwen3_600m

async def main(docs):
    # Picks CUDAExecutionProvider automatically if onnxruntime-gpu is
    # installed (else falls back to CPU).
    return await chunk_documents(docs, qwen3_600m())

results = asyncio.run(main([open("my-document.md").read()]))

Under the hood:

  • SaTSegmenter() defaults to device="auto", which defers to wtpsplit-lite's GPU-first provider auto-detect. Pass an explicit SaTSegmenter(device="cuda"/"cpu") to override.
  • About half of SaT's wall-clock time on GPU was spent in a per-document Python loop in wtpsplit-lite's token_to_char_probs, which SaTSegmenter monkey-patches with a vectorised replacement on first load. Set FANCYCHUNK_DISABLE_SAT_FAST_POSTPROCESS=1 to opt out.

Embedders. Five bundled models trade quality for latency. You pick one explicitly — there's no hidden default — and pass it through to chunk_document (or to the individual primitives). The recommended choice for most workloads is qwen3_600m(): good quality (MTEB Multilingual 64.33, the sub-1B leader), modest memory (~0.5 GB on MLX-mxfp8, ~1 GB on torch), and fast enough to keep interactive workflows responsive.

jina_v3() is the one bidirectional, mean-pooled option: jina-embeddings-v3 (570M, 8192 context, native 1024-dim). The three Qwen3 models are causal and last-token-pooled, and bge_m3() is a bidirectional encoder but CLS-pooled rather than mean-pooled. If you want to experiment with late chunking (see the Late chunking caveat), jina_v3() has the architecture it was designed for. Two strings attached: the weights are CC BY-NC 4.0 (non-commercial, whereas the others are Apache-2.0 / MIT), and the model ships custom architecture code, so the factory sets trust_remote_code=True and runs on torch only (no MLX build).

The factories live in fancychunk.embedders and require one of the install extras above ([torch] or [mlx]). Calling qwen3_600m() without the right backend installed raises an ImportError with the install hint baked in. The MLX backend is auto-selected on Apple Silicon when mlx_embeddings is importable; elsewhere the factories fall back to torch (which requires [torch]). MTEB scores are from each model's published tables; throughput is measured on the hardware named above each table.

Note on CPU-only torch (Linux). pip install 'fancychunk[torch]' pulls the default torch wheel, which on Linux is the CUDA-bundled build (~2.5 GB) even if you don't have a GPU. If you only need CPU inference, install the CPU wheel first, then add fancychunk:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install 'fancychunk[torch]'  # picks up the already-installed torch

PyPI extras can't express the --index-url redirect, so this two-step is the workaround until upstream torch ships size-tagged variants on standard PyPI. macOS torch wheels are already small (~80 MB, no CUDA bundle) — this is a Linux-only concern.

Apple Silicon, MLX path (M2 MacBook Air):

| Model factory | Backend default | Model | Params | Native dim | Resident | embed_chunklets mean | Tokens/s | MTEB-Multi | MTEB-Eng |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bge_m3() | MLX¹ / torch | BGE-M3 (CLS pooling) | 568M | 1024 | ~1 GB | 139 ms | 890 | 59.50 | 63.50 |
| qwen3_600m() | MLX¹ / torch | Qwen3-Embedding-0.6B | 596M | 1024 | ~0.5 GB | 79 ms | 1,186 | 64.33 | 70.70 |
| qwen3_4b() | MLX¹ / torch | Qwen3-Embedding-4B | 3.6B | 2560 | ~4 GB | 516 ms | 182 | 69.45 | 74.60 |
| qwen3_8b() | MLX¹ / torch | Qwen3-Embedding-8B | 7.6B | 4096 | ~7 GB | 950 ms | 99 | 70.58 | 75.22 |
| jina_v3() | torch³ | jina-embeddings-v3 (mean pooling) | 570M | 1024 | ~1 GB | —³ | —³ | 64.44 | 65.52 |

Linux, torch + CUDA path (RTX 3090)²:

| Model factory | Backend | embed_chunklets mean | Tokens/s |
| --- | --- | --- | --- |
| bge_m3() | torch | 18 ms | 6,843 |
| qwen3_600m() | torch | 32 ms | 2,974 |
| qwen3_4b() | torch | 39 ms | 2,426 |
| qwen3_8b() | torch | 44 ms | 2,162 |
| jina_v3() | torch | —³ | —³ |

qwen3_4b and qwen3_8b accept a dim=N argument to truncate via Matryoshka Representation Learning and re-L2-normalize; the compute cost is unchanged. Pass dim=1024 to keep storage-pin-compatibility with qwen3_600m and bge_m3.
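
What the dim=N truncation amounts to can be shown in a few lines: keep the first N components, then rescale to unit length so cosine and dot-product scores stay comparable. The helper name is hypothetical and the sketch works on a plain list rather than the library's arrays.

```python
import math


def truncate_and_renormalize(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and the native dimension")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]          # toy "native" embedding
short = truncate_and_renormalize(full, dim=2)
```

The model still computes the full-width vector before the slice, which is why the compute cost is unchanged; only storage and similarity-search cost shrink.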

¹ MLX builds: mlx-community/bge-m3-mlx-fp16, mlx-community/Qwen3-Embedding-{0.6B,4B,8B}-mxfp8. The Qwen3 variants use 8-bit microscaling (mxfp8) — small enough to fit comfortably on a 24 GB Mac at every tier and the highest-quality MLX build the community publishes. On non-Apple-Silicon, each factory transparently loads the canonical HuggingFace weights and runs on torch + MPS / CUDA / CPU.

² CUDA numbers measured on an NVIDIA GeForce RTX 3090 (24 GB VRAM, driver 580.159.03) with Intel Core i9-10900KF and 32 GB system RAM, on Linux 6.17 with PyTorch 2.12.0 + bundled CUDA 13.0 wheels (Python 3.13). All factories load canonical HuggingFace weights in fp16; weights live on VRAM. Same 3-chunklet bench_factories.py batch as the Mac measurements.

³ jina_v3() was added without re-running the throughput suite — the ms/tokens-per-second cells (—) are not yet measured on this hardware; run python bench_factories.py to fill them in. Resident is the fp16-weights estimate (570M params). MTEB figures are from the model card: 64.44 is MMTEB (multilingual), 65.52 is the classic MTEB English average — different MTEB vintages than the Qwen3 / BGE-M3 rows, so read across rows with care. jina-embeddings-v3 has no mlx-community build the bundled MLX loader recognizes, so it runs on the torch backend everywhere (MPS on Apple Silicon, CUDA / CPU elsewhere); its weights are CC BY-NC 4.0 (non-commercial) and the factory enables trust_remote_code=True.

Bring your own embedder

fancychunk ships five bundled embedders (see Models). If you need your own — different backend, custom model, remote service — implement the protocol. All three methods are async:

from typing import Protocol

from numpy.typing import NDArray

class Embedder(Protocol):
    n_ctx: int
    async def count_tokens(self, texts: list[str]) -> list[int]: ...
    async def embed_segment(
        self, texts: list[str]
    ) -> tuple[NDArray, list[int]]: ...
    async def embed_chunklets(self, chunklets: list[str]) -> NDArray: ...

embed_chunklets is the pooled per-chunklet path for split_chunks; embed_segment + count_tokens are the token-level path for embed_with_late_chunking. A single class implements both — the bundled embedders do.

For a CPU/GPU embedder (torch, MLX, etc.) wrap your sync forward pass in asyncio.to_thread inside each async method so the call yields control while the device works; for a remote embedder, await your HTTP client directly. The bundled PooledSegmentEmbedder shows the former; examples/embedders/remote_http.py shows the latter against httpx.AsyncClient.
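
The asyncio.to_thread pattern for a local model can be sketched with a toy class. ToyEmbedder is hypothetical and deliberately partial: it hashes text instead of running a model, returns plain lists where a real embedder would return an NDArray, and omits embed_segment because its exact return contract is not spelled out here.

```python
import asyncio
import hashlib
import math
import threading


class ToyEmbedder:
    """Hypothetical embedder that only demonstrates the to_thread pattern."""

    n_ctx = 512

    def __init__(self, dim=8):
        self.dim = dim
        self._lock = threading.Lock()  # one "forward pass" at a time

    def _forward(self, texts):
        # Stand-in for a blocking model call: hash bytes mapped to a unit vector.
        with self._lock:
            out = []
            for text in texts:
                digest = hashlib.sha256(text.encode("utf-8")).digest()
                raw = [byte - 127.5 for byte in digest[: self.dim]]
                norm = math.sqrt(sum(x * x for x in raw))
                out.append([x / norm for x in raw])
            return out

    async def count_tokens(self, texts):
        # Whitespace "tokens"; a real embedder would call its tokenizer here.
        return await asyncio.to_thread(lambda: [len(t.split()) for t in texts])

    async def embed_chunklets(self, chunklets):
        # The event loop stays free while the worker thread does the work.
        return await asyncio.to_thread(self._forward, chunklets)


async def demo():
    embedder = ToyEmbedder()
    counts, vectors = await asyncio.gather(
        embedder.count_tokens(["a b c", ""]),
        embedder.embed_chunklets(["first chunklet", "second chunklet"]),
    )
    return counts, vectors


counts, vectors = asyncio.run(demo())
```

The lock lives inside the blocking function rather than around the await, so it serializes worker threads without ever blocking the event loop itself.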

examples/embedders/ contains three runnable reference adapters: MLX + Qwen3-Embedding, HuggingFace transformers, and an async-HTTP remote client. All three implement both halves of the protocol, so they are drop-in for split_chunks and chunk_document.

Concurrency

The public async API (split_chunks, embed_with_late_chunking, chunk_document) is safe to drive from multiple coroutines concurrently — asyncio.gather(chunk_document(doc1, emb), chunk_document(doc2, emb), ...) works. Inside embed_with_late_chunking, independent segments are themselves embedded via asyncio.gather, so each document overlaps its own segments' embedding calls.

For a batch of documents, chunk_documents wraps that gather with an optional concurrency cap:

import asyncio
from fancychunk import chunk_documents
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()
    paths = ["a.md", "b.md"]                                # your Markdown files
    docs = [open(p).read() for p in paths]
    results = await chunk_documents(docs, embedder, max_concurrency=8)
    # results[i] is (chunks, vectors) for docs[i].

asyncio.run(main())

Pass max_concurrency=N to cap fan-in (sensible for remote embedders so you don't hammer the server). Omit it to gather all documents at once — fine for bundled embedders since the internal lock serializes to device throughput anyway.
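
A concurrency cap of this kind is commonly built from asyncio.gather plus a semaphore. The sketch below shows that general pattern with a hypothetical helper name; it is not taken from chunk_documents, whose internals may differ.

```python
import asyncio


async def gather_capped(worker, items, max_concurrency=None):
    """Run worker(item) for every item, with at most max_concurrency in flight."""
    if max_concurrency is None:
        return await asyncio.gather(*(worker(item) for item in items))
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:          # wait here while the cap is reached
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


in_flight = {"now": 0, "peak": 0}


async def fake_chunk_document(doc_id):
    # Stand-in for a per-document call; records how many run at once.
    in_flight["now"] += 1
    in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
    await asyncio.sleep(0.01)
    in_flight["now"] -= 1
    return doc_id * 2


results = asyncio.run(gather_capped(fake_chunk_document, list(range(10)), max_concurrency=3))
```

asyncio.gather returns results in input order regardless of completion order, so results[i] still lines up with the i-th document even when the cap reorders execution.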

Bundled embedder instances are also safe to drive from multiple threads — internal locking serializes worker-thread access to the underlying model. This covers any code that uses the embedder via asyncio.to_thread or a ThreadPoolExecutor directly. The lock matches what the device can actually deliver — one forward pass at a time — so callers don't need their own synchronization.

For higher throughput, create multiple embedder instances; each loads its own copy of the weights. A remote / true-parallel embedder (examples/embedders/remote_http.py) gets real concurrency from asyncio.gather since it isn't bottlenecked on a single local device.

Observability

Every public function emits an OpenTelemetry span — names like fancychunk.split_sentences, attributes like fancychunk.sentences.count. The library pulls only opentelemetry-api so spans are no-ops until your app configures an SDK. Useful for figuring out which stage just got slow in production.

Status

Alpha (0.1.x). Public API is documented in docs/specs/contracts/public-api.md and locked in by the test suite, but not yet SemVer-stable — that lands at 1.0.0. CI runs pyright strict + pytest on Python 3.12 and 3.13 on every push.

Where the specs live

Behavioral specs in docs/specs/ describe what each function does, not how. Every behavior has a SPEC-CHUNK-NNN ID; every ID has a test. Implementations in other languages are welcome to use the specs verbatim and ignore this Python code entirely.

Acknowledgments

The three-stage pipeline (sentence → chunklet → chunk), the late-chunking strategy, and the contextual-headings helper come from raglite. Specific techniques: the SaT segmenter (Frohmann et al., 2024), Greg Kamradt's 5 Levels of Text Splitting, Arora et al.'s discourse vector (ICLR 2017), the Weaviate / Jina late-chunking work (Günther et al., 2024), and Dan Stites's contextual headings post.

Download files

Source Distribution

fancychunk-0.8.0.tar.gz (197.8 kB)

Uploaded Source

Built Distribution

fancychunk-0.8.0-py3-none-any.whl (70.9 kB)

Uploaded Python 3

File details

Details for the file fancychunk-0.8.0.tar.gz.

File metadata

  • Download URL: fancychunk-0.8.0.tar.gz
  • Upload date:
  • Size: 197.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fancychunk-0.8.0.tar.gz
Algorithm Hash digest
SHA256 9abb41c3232edd170b57b3c63a57a9fb6edf8b4a1fcf645dff6ea99660618fb2
MD5 004e1c3792f259162cfb25f8c30ac693
BLAKE2b-256 7c1783570ca1969e0f626415db42f1da1e7ec3215313b2330ed811115f3ac167

Provenance

The following attestation bundles were made for fancychunk-0.8.0.tar.gz:

Publisher: release.yml on emerose/fancychunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fancychunk-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: fancychunk-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 70.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fancychunk-0.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ac03510ee09730d5eaed5d1e58804393c5b22c04b8a981e55afd6e13c2bf3d2f
MD5 25445f3879643cf888759cae0d901f5f
BLAKE2b-256 961f54f6b218c7c9f682496100832a3ac6a41b3bf9aa8bf1ad06adba6a64f6c5

Provenance

The following attestation bundles were made for fancychunk-0.8.0-py3-none-any.whl:

Publisher: release.yml on emerose/fancychunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
