Text chunking for retrieval-augmented generation.

fancychunk

Markdown chunking for RAG that attempts to craft artisanal, meaningful chunks while remaining reasonably fast and efficient.

pip install 'fancychunk[torch]'     # most users: qwen3, bge_m3 via torch + transformers
pip install 'fancychunk[mlx]'       # macOS arm64: same models via Apple MLX (~2-4× faster)
pip install 'fancychunk[all]'       # both backends
pip install fancychunk              # no backend: structural-only chunking via noop()

The base install is ~180 MB; [torch] adds ~750 MB on CPU Linux, ~2.5 GB on CUDA Linux, ~80 MB on macOS; [mlx] adds ~40 MB on Apple Silicon (no-op elsewhere). Pick the backend you need.

How it compares

Traditional chunkers split at character or token counts, possibly including a recursive separator list to dodge the worst cuts. This is fast and efficient, but can lead to awkward breaks and chunks that don't capture a particular idea well. Other chunkers use an LLM to find meaningful semantic boundaries, but this is slow and expensive, and can be inconsistent.
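For reference, the recursive-separator approach described above can be sketched in a few lines. This is a generic illustration, not the code of any particular library; the function name recursive_split and the default separator list are assumptions made for the example.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    buffer = ""
    for part in text.split(sep):
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= max_chars:
            buffer = candidate          # keep packing parts into the current piece
            continue
        if buffer:
            pieces.append(buffer)       # current piece is full, emit it
        if len(part) <= max_chars:
            buffer = part
        else:
            # A single part is still too long: retry with the finer separators.
            pieces.extend(recursive_split(part, max_chars, finer))
            buffer = ""
    if buffer:
        pieces.append(buffer)
    return pieces


sample = "Alpha beta. Gamma delta.\n\nEpsilon zeta eta theta iota kappa."
pieces = recursive_split(sample, max_chars=25)
```

On the sample, the second paragraph has no finer separator than a space, so it is cut between "theta" and "iota": a size-correct split that ignores meaning, which is the kind of awkward break the paragraph above refers to.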

fancychunk aims for a middle ground: meaningful chunks, produced reasonably quickly. It combines Markdown structure with several small, local models to produce correctly sized chunks that capture the semantic content of the underlying text.

[insert benchmark results: MB/sec throughput and example NDCG@10/Recall@10/MRR@10 stats from ragkit. compare:

  • simple token-count splitter from langchain
  • chonkie's recursive splitter
  • chonkie semantic splitter
  • fancychunk]

Quick start

from fancychunk import chunk_document
from fancychunk.embedders import qwen3_600m

embedder = qwen3_600m()                          # probably the right pick for most uses
chunks, vectors = chunk_document(open("my-document.md").read(), embedder)
# chunks[i] ⇄ vectors[i] — drop straight into your vector store.

Building blocks

chunk_document is sugar over the four primitives. Compose them directly when you want more control — different embedders per stage, different max_size per stage, a structural-only split, or storage-time heading breadcrumbs:

from fancychunk import (
    split_sentences,
    split_chunklets,
    split_chunks,
    embed_with_late_chunking,
    enrich_with_headings,
)
from fancychunk.embedders import qwen3_600m

embedder = qwen3_600m()
doc = open("my-document.md").read()

sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)
chunks    = split_chunks(chunklets, embedder, max_size=2048)
vectors   = embed_with_late_chunking(chunks, embedder)
stored    = enrich_with_headings(chunks)   # optional: heading breadcrumbs for the stored text only

What it does

fancychunk treats chunking as three separable problems, each solved by its own optimization against its own signal:

document  →  split_sentences  →  split_chunklets  →  split_chunks  →  chunks
              (punctuation +     (Markdown headings,    (cosine of
               SaT segmenter)     paragraphs, lists)     adjacent chunklets,
                                                         discourse-corrected)

Stage 1 — split_sentences. Punctuation alone misses too many real-world cases (missing terminals, multilingual text, technical abbreviations like "e.g."), so the default segmenter is SaT (Frohmann et al., 2024) from wtpsplit-lite — a learned model that produces per-character boundary probabilities. A sliding-window dynamic-programming pass (O(N) amortised) then picks boundary positions to maximise total score subject to a configurable min/max sentence length.
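The boundary-selection idea in Stage 1 can be sketched as a small dynamic program. This is not fancychunk's implementation: it uses a plain O(N × max_len) loop rather than the sliding-window O(N) pass, and both the function name pick_boundaries and the per-boundary score (probability minus 0.5) are assumptions chosen for illustration.

```python
def pick_boundaries(probs: list[float], min_len: int, max_len: int) -> list[int]:
    """Return exclusive end offsets of the chosen sentences.

    probs[i] is the (assumed) probability that a sentence ends after character i.
    Every sentence has a length between min_len and max_len characters.
    """
    n = len(probs)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[j]: best total score for the first j characters
    prev = [-1] * (n + 1)        # prev[j]: start of the last sentence in that solution
    best[0] = 0.0

    for j in range(1, n + 1):
        # Assumed score: reward likely boundaries, penalise unlikely ones.
        # The end of the document is always a boundary and costs nothing.
        gain = 0.0 if j == n else probs[j - 1] - 0.5
        for i in range(max(0, j - max_len), j - min_len + 1):
            if best[i] != neg_inf and best[i] + gain > best[j]:
                best[j] = best[i] + gain
                prev[j] = i

    if best[n] == neg_inf:
        raise ValueError("no segmentation satisfies the length limits")

    cuts: list[int] = []
    j = n
    while j > 0:                 # walk the back-pointers from the end
        cuts.append(j)
        j = prev[j]
    return cuts[::-1]


probs = [0.0] * 12
probs[2] = 0.9                   # strong boundary after character 2
probs[6] = 0.9                   # strong boundary after character 6
print(pick_boundaries(probs, min_len=2, max_len=8))   # [3, 7, 12]
```

The max_len limit is what forces a cut even where the segmenter is unsure: with all-zero probabilities the program still returns a valid segmentation, just one that pays the penalty at each forced boundary.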

Stage 2 — split_chunklets. Sentences are grouped into chunklets — paragraph-sized units targeting roughly three "statements" of information content each. The signal is Markdown block-level structure and a document-relative statement density measure. A 1-D dynamic-programming pass picks chunklet boundaries big enough to embed meaningfully but small enough that each one stays topically coherent.
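A minimal sketch of the Stage 2 grouping, under stated assumptions: each sentence carries a "statement" weight, a chunklet costs the squared distance of its total weight from the target of three, and cutting in the middle of a Markdown block costs a fixed penalty. The cost function, the penalty, and the name group_chunklets are illustrative assumptions, not the library's actual objective.

```python
def group_chunklets(
    statements: list[float],
    starts_block: list[bool],
    target: float = 3.0,
    mid_block_penalty: float = 1.0,
) -> list[tuple[int, int]]:
    """Partition sentences into chunklets, returned as (start, end) index spans.

    statements[i]   : assumed information weight of sentence i
    starts_block[i] : True if sentence i opens a new Markdown block
    """
    n = len(statements)
    prefix = [0.0]
    for weight in statements:
        prefix.append(prefix[-1] + weight)

    inf = float("inf")
    cost = [inf] * (n + 1)       # cost[j]: cheapest grouping of the first j sentences
    prev = [-1] * (n + 1)
    cost[0] = 0.0

    for j in range(1, n + 1):
        # Ending a chunklet at a block boundary (or the document end) is free.
        cut_penalty = 0.0 if j == n or starts_block[j] else mid_block_penalty
        for i in range(j):
            size = prefix[j] - prefix[i]
            candidate = cost[i] + (size - target) ** 2 + cut_penalty
            if candidate < cost[j]:
                cost[j] = candidate
                prev[j] = i

    spans: list[tuple[int, int]] = []
    j = n
    while j > 0:
        spans.append((prev[j], j))
        j = prev[j]
    return spans[::-1]


# Five one-statement sentences; a new paragraph starts at sentence 2.
spans = group_chunklets([1, 1, 1, 1, 1], [True, False, True, False, False])
print(spans)   # [(0, 2), (2, 5)]
```

In the example the paragraph break wins over the exact target size: cutting after two sentences costs less than cutting after three in the middle of a paragraph.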

Stage 3 — split_chunks. Adjacent chunklets are compared by cosine similarity, then discourse-corrected — the mean of typical chunklets' embeddings is projected out so similarity reflects local topic shifts rather than the document's overall theme (Arora et al., 2017). A third dynamic-programming pass picks split points where adjacent chunklets are least similar (this is "level 4" in Greg Kamradt's 5 Levels of Text Splitting taxonomy), subject to a hard max-size covering constraint.
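The discourse correction in Stage 3 can be illustrated with plain Python lists. This sketch projects the mean of all embeddings out of each vector before comparing neighbours; using every chunklet for the mean (rather than only "typical" ones) is a simplification, and the function names are assumptions for the example.

```python
import math


def _dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    norm_a, norm_b = math.sqrt(_dot(a, a)), math.sqrt(_dot(b, b))
    return _dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


def remove_common_direction(vectors: list[list[float]]) -> list[list[float]]:
    """Subtract from each vector its projection onto the mean vector."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    norm_sq = _dot(mean, mean)
    if norm_sq == 0.0:
        return [list(v) for v in vectors]
    return [
        [x - (_dot(v, mean) / norm_sq) * m for x, m in zip(v, mean)]
        for v in vectors
    ]


def adjacent_similarities(vectors: list[list[float]]) -> list[float]:
    """Cosine similarity of each chunklet with its successor, after correction."""
    corrected = remove_common_direction(vectors)
    return [cosine(corrected[i], corrected[i + 1]) for i in range(len(corrected) - 1)]


# Four toy embeddings dominated by a shared "document theme" on the first axis.
vectors = [[10.0, 1.0, 0.0], [10.0, 0.9, 0.1], [10.0, -1.0, 0.0], [10.0, -0.9, -0.1]]
raw = [cosine(vectors[i], vectors[i + 1]) for i in range(3)]
corrected = adjacent_similarities(vectors)
weakest_link = corrected.index(min(corrected))   # best candidate split point
```

Before correction all three raw similarities are close to 1, because the shared theme dominates. After correction the middle pair becomes strongly negative, so the topic shift between chunklets 1 and 2 stands out as the split candidate.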

Enrichment

The pipeline includes two enrichment steps that pull document context into each chunk's output. Both are baked into chunk_document with sensible defaults; the building-blocks form exposes them as separate primitives.

Late chunking

embed_with_late_chunking(chunks, embedder) produces one context-aware vector per chunk. Instead of embedding each chunk in isolation, the embedder sees windows of adjacent chunks together so attention can resolve anaphora ("the algorithm" picks up its real referent), and the in-scope Markdown heading stack is prepended once per segment as additional preamble (controlled by include_headings=True, on by default). Typical retrieval-quality win is 4–8 MTEB points (Jina AI's paper has the numbers).

Because the heading stack is already in the embedder's input, the embedding already incorporates heading context — there's no need to also prepend headings to the chunk text before embedding. enrich_with_headings is for the stored text only (see below).
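The pooling step behind late chunking can be sketched as follows. The assumption here is that the embedder has already produced one vector per token for a whole window of adjacent chunks, plus a token count per chunk; the function name pool_late_chunks is hypothetical, and the heading preamble tokens are left out for brevity.

```python
def pool_late_chunks(
    token_vectors: list[list[float]],
    token_counts: list[int],
) -> list[list[float]]:
    """Mean-pool contextualised token vectors into one vector per chunk.

    token_vectors : one vector per token, computed over the whole window at once
    token_counts  : number of tokens belonging to each chunk, in order
    """
    if sum(token_counts) != len(token_vectors):
        raise ValueError("token counts do not add up to the number of token vectors")

    pooled: list[list[float]] = []
    start = 0
    for count in token_counts:
        if count <= 0:
            raise ValueError("every chunk needs at least one token")
        span = token_vectors[start:start + count]
        dim = len(span[0])
        pooled.append([sum(vec[k] for vec in span) / count for k in range(dim)])
        start += count
    return pooled


tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
print(pool_late_chunks(tokens, [2, 3]))   # [[2.0, 0.0], [0.0, 4.0]]
```

The difference from ordinary chunk embedding is entirely in where the token vectors come from: because they were computed with the neighbouring chunks in the same attention window, each pooled vector already reflects its surroundings.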

Heading-path enrichment

enrich_with_headings(chunks) returns each chunk with the Markdown heading stack in scope at its start prepended (e.g. "# Top\n## Sub\n\n<chunk text>"). This adds context to chunks that might otherwise lack it; for more information, see Out-of-Context Chunk Problem. The late-chunking embedding already includes this context, so do not feed the enriched text to embed_with_late_chunking; use enrich_with_headings for the text you store, or for the text you embed when you are not using late chunking.
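To make "heading stack in scope" concrete, here is a rough sketch that computes it for a character offset in a Markdown document. It is not the library's code: heading_stack and with_heading_path are hypothetical helpers, and only ATX headings and backtick fences are handled.

```python
import re

FENCE = "`" * 3                              # Markdown code-fence marker
ATX_HEADING = re.compile(r"(#{1,6})\s+\S")   # "# Title" through "###### Title"


def heading_stack(markdown: str, offset: int) -> list[str]:
    """Return the headings in scope at character position `offset`."""
    stack: list[tuple[int, str]] = []
    in_fence = False
    pos = 0
    for line in markdown.splitlines(keepends=True):
        if pos >= offset:
            break
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence          # ignore "#" lines inside code fences
        elif not in_fence:
            match = ATX_HEADING.match(stripped)
            if match:
                level = len(match.group(1))
                # A new heading closes every heading at the same or a deeper level.
                while stack and stack[-1][0] >= level:
                    stack.pop()
                stack.append((level, stripped))
        pos += len(line)
    return [text for _, text in stack]


def with_heading_path(markdown: str, offset: int, chunk_text: str) -> str:
    """Prepend the in-scope headings to a chunk that starts at `offset`."""
    headings = heading_stack(markdown, offset)
    return "\n".join(headings) + "\n\n" + chunk_text if headings else chunk_text


doc = "# Top\n\nIntro.\n\n## Sub\n\nBody text.\n\n## Other\n\nMore.\n"
enriched = with_heading_path(doc, doc.index("Body text."), "Body text.")
print(repr(enriched))   # '# Top\n## Sub\n\nBody text.'
```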

Models

fancychunk uses two kinds of model: a sentence segmenter (Stage 1) and an embedder (Stage 3 + late chunking). Both are lazy-loaded on first use — importing fancychunk itself is cheap and triggers no network calls, and constructing an embedder (e.g. qwen3_600m()) doesn't load the weights either. Weights cache under ~/.cache/huggingface/ so the download happens once per machine; subsequent process runs hit the cache.
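The lazy-loading behaviour described above follows a common pattern, sketched here with the standard library. The class LazyEmbedder and its loader argument are hypothetical stand-ins; the real factories may be organised differently.

```python
from functools import cached_property
from typing import Any, Callable


class LazyEmbedder:
    """Defers the expensive weight load until the model is first used."""

    def __init__(self, model_name: str, loader: Callable[[str], Any]) -> None:
        self.model_name = model_name
        self._loader = loader            # nothing is downloaded or loaded here

    @cached_property
    def model(self) -> Any:
        # Runs once, on first access; the result is cached on the instance.
        return self._loader(self.model_name)


load_calls: list[str] = []


def fake_loader(name: str) -> object:
    """Stand-in for a real weight loader; records each call."""
    load_calls.append(name)
    return object()


embedder = LazyEmbedder("example-model", fake_loader)
calls_after_construction = list(load_calls)   # still empty: construction is cheap
first = embedder.model                        # triggers the single load
second = embedder.model                       # served from the cache
```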

Sentence segmenter — SaT. The default is sat-3l-sm from Segment Any Text (Frohmann et al., 2024) via wtpsplit-lite, shipped as ONNX. 408 MB download on first call, ~500 MB resident. Multilingual, punctuation-agnostic, and exposes per-character boundary probabilities directly — exactly the SPEC-CHUNK-106 contract Stage 1 wants. For zero-dependency deployments where you can tolerate lower segmentation quality, pass segmenter=punctuation_segmenter instead: a ~50-line rule-based fallback bundled with the library.

Embedders. Four bundled models trade quality for latency. You pick one explicitly — there's no hidden default — and pass it through to chunk_document (or to the individual primitives). The recommended choice for most workloads is qwen3_600m(): good quality (MTEB Multilingual 64.33, the sub-1B leader), modest memory (~0.5 GB on MLX-mxfp8, ~1 GB on torch), and fast enough to keep interactive workflows responsive.

The factories live in fancychunk.embedders and require one of the install extras above ([torch] or [mlx]). Calling qwen3_600m() without the right backend installed raises an ImportError with the install hint baked in. The MLX backend is auto-selected on Apple Silicon when mlx_embeddings is importable; elsewhere the factories fall back to torch (which requires [torch]). MTEB scores are from each model's published tables; throughput was measured on the machines named in the tables and footnotes below.

Note on CPU-only torch (Linux). pip install 'fancychunk[torch]' pulls the default torch wheel, which on Linux is the CUDA-bundled build (~2.5 GB) even if you don't have a GPU. If you only need CPU inference, install the CPU wheel first, then add fancychunk:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install 'fancychunk[torch]'  # picks up the already-installed torch

PyPI extras can't express the --index-url redirect, so this two-step is the workaround until upstream torch ships size-tagged variants on standard PyPI. macOS torch wheels are already small (~80 MB, no CUDA bundle) — this is a Linux-only concern.

Apple Silicon, MLX path (M2 MacBook Air):

| Model factory | Backend default | Model | Params | Native dim | Resident | embed_chunklets mean | Tokens/s | MTEB-Multi | MTEB-Eng |
|---|---|---|---|---|---|---|---|---|---|
| bge_m3() | MLX¹ / torch | BGE-M3 (CLS pooling) | 568M | 1024 | ~1 GB | 139 ms | 890 | 59.50 | 63.50 |
| qwen3_600m() | MLX¹ / torch | Qwen3-Embedding-0.6B | 596M | 1024 | ~0.5 GB | 79 ms | 1,186 | 64.33 | 70.70 |
| qwen3_4b() | MLX¹ / torch | Qwen3-Embedding-4B | 3.6B | 2560 | ~4 GB | 516 ms | 182 | 69.45 | 74.60 |
| qwen3_8b() | MLX¹ / torch | Qwen3-Embedding-8B | 7.6B | 4096 | ~7 GB | 950 ms | 99 | 70.58 | 75.22 |

Linux, torch + CUDA path (RTX 3090)²:

| Model factory | Backend | embed_chunklets mean | Tokens/s |
|---|---|---|---|
| bge_m3() | torch | 18 ms | 6,843 |
| qwen3_600m() | torch | 32 ms | 2,974 |
| qwen3_4b() | torch | 39 ms | 2,426 |
| qwen3_8b() | torch | 44 ms | 2,162 |

qwen3_4b and qwen3_8b accept a dim=N argument that truncates the vector via Matryoshka Representation Learning and re-L2-normalizes it; the compute cost is unchanged. Pass dim=1024 to match the native dimension of qwen3_600m and bge_m3, so an existing 1024-wide vector index keeps its schema.

¹ MLX builds: mlx-community/bge-m3-mlx-fp16, mlx-community/Qwen3-Embedding-{0.6B,4B,8B}-mxfp8. The Qwen3 variants use 8-bit microscaling (mxfp8) — small enough to fit comfortably on a 24 GB Mac at every tier and the highest-quality MLX build the community publishes. On non-Apple-Silicon, each factory transparently loads the canonical HuggingFace weights and runs on torch + MPS / CUDA / CPU.

² CUDA numbers measured on an NVIDIA GeForce RTX 3090 (24 GB VRAM, driver 580.159.03) with Intel Core i9-10900KF and 32 GB system RAM, on Linux 6.17 with PyTorch 2.12.0 + bundled CUDA 13.0 wheels (Python 3.13). All factories load canonical HuggingFace weights in fp16; weights live in VRAM. The benchmark uses the same 3-chunklet bench_factories.py batch as the Mac measurements.
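The dim=N truncation mentioned above amounts to keeping the first N components and rescaling to unit length. A minimal sketch, with truncate_and_normalize as a hypothetical helper name rather than part of the library:

```python
import math


def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components, then rescale to unit L2 norm."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and the vector's length")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)            # an all-zero vector cannot be normalised
    return [x / norm for x in head]


print(truncate_and_normalize([3.0, 4.0, 12.0], dim=2))   # [0.6, 0.8]
```

The re-normalisation matters because truncation shortens the vector; without it, dot-product scores from truncated vectors would no longer behave like cosine similarities. The model still runs its full forward pass, which is why the compute cost does not change.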

Bring your own embedder

fancychunk ships four bundled embedders (see Models). If you need your own — different backend, custom model, remote service — implement the protocol:

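from typing import Protocol

from numpy.typing import NDArray


class Embedder(Protocol):
    n_ctx: int
    # Token-level path, used by embed_with_late_chunking.
    def count_tokens(self, texts: list[str]) -> list[int]: ...
    def embed_segment(
        self, texts: list[str]
    ) -> tuple[NDArray, list[int]]: ...
    # Pooled per-chunklet path, used by split_chunks.
    def embed_chunklets(self, chunklets: list[str]) -> NDArray: ...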

embed_chunklets is the pooled per-chunklet path for split_chunks; embed_segment + count_tokens are the token-level path for embed_with_late_chunking. A single class implements both — the bundled embedders do.

Three runnable reference adapters live in examples/embedders/: MLX + Qwen3-Embedding, HuggingFace transformers, and a remote HTTP client. They currently implement just the late-chunking half of the protocol; add an embed_chunklets method (one batched forward pass over each chunklet, pooled the same way the model was trained) to use them with split_chunks or chunk_document.

Observability

Every public function emits an OpenTelemetry span — names like fancychunk.split_sentences, attributes like fancychunk.sentences.count. The library pulls only opentelemetry-api so spans are no-ops until your app configures an SDK. Useful for figuring out which stage just got slow in production.

Status

Alpha (0.1.x). Public API is documented in docs/specs/contracts/public-api.md and locked in by the test suite, but not yet SemVer-stable — that lands at 1.0.0. CI runs pyright strict + pytest on Python 3.12 and 3.13 on every push.

Where the specs live

Behavioral specs in docs/specs/ describe what each function does, not how. Every behavior has a SPEC-CHUNK-NNN ID; every ID has a test. Implementations in other languages are welcome to use the specs verbatim and ignore this Python code entirely.

Acknowledgments

The three-stage pipeline (sentence → chunklet → chunk), the late-chunking strategy, and the contextual-headings helper come from raglite. Specific techniques: the SaT segmenter (Frohmann et al., 2024), Greg Kamradt's 5 Levels of Text Splitting, Arora et al.'s discourse vector (ICLR 2017), the Weaviate / Jina late-chunking work (Günther et al., 2024), and Dan Stites's contextual headings post.
