Text chunking for retrieval-augmented generation.

fancychunk

Markdown chunking for RAG that attempts to craft artisanal, meaningful chunks while remaining reasonably fast and efficient.

pip install 'fancychunk[torch]'     # most users: qwen3, bge_m3 via torch + transformers
pip install 'fancychunk[mlx]'       # macOS arm64: same models via Apple MLX (~2-4× faster)
pip install 'fancychunk[all]'       # both backends
pip install fancychunk              # no backend: structural-only chunking via noop()

The base install is ~180 MB; [torch] adds ~750 MB on CPU Linux, ~2.5 GB on CUDA Linux, ~80 MB on macOS; [mlx] adds ~40 MB on Apple Silicon (no-op elsewhere). Pick the backend you need.

How it compares

Traditional chunkers split at character or token counts, possibly including a recursive separator list to dodge the worst cuts. This is fast and efficient, but can lead to awkward breaks and chunks that don't capture a particular idea well. Other chunkers use an LLM to find meaningful semantic boundaries, but this is slow and expensive, and can be inconsistent.

fancychunk aims for a middle ground. It combines Markdown structure with several small, local models to produce correctly sized chunks that each capture a coherent idea, and it does so reasonably quickly.

[insert benchmark results: MB/sec throughput and example NDCG@10/Recall@10/MRR@10 stats from ragkit. compare:

  • simple token-count splitter from langchain
  • chonkie's recursive splitter
  • chonkie semantic splitter
  • fancychunk]

Quick start

fancychunk is async-first — the entry points that touch an embedder are async def. Sync callers wrap with asyncio.run(...).

import asyncio
from pathlib import Path

from fancychunk import chunk_document
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()                                 # probably the right pick
    text = Path("my-document.md").read_text(encoding="utf-8")  # read and close the file
    chunks, vectors = await chunk_document(text, embedder)
    # chunks[i] ⇄ vectors[i] — drop straight into your vector store.

asyncio.run(main())

Building blocks

chunk_document is sugar over the four primitives. Compose them directly when you want more control — different embedders per stage, different max_size per stage, a structural-only split, or storage-time heading breadcrumbs.

split_sentences and split_chunklets are sync (no await points); split_chunks and embed_with_late_chunking are async (they call the embedder):

import asyncio
from pathlib import Path

from fancychunk import (
    split_sentences,
    split_chunklets,
    split_chunks,
    embed_with_late_chunking,
    enrich_with_headings,
)
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()
    doc = Path("my-document.md").read_text(encoding="utf-8")  # read and close the file

    sentences = split_sentences(doc, max_len=2048)
    chunklets = split_chunklets(sentences, max_size=2048)
    chunks    = await split_chunks(chunklets, embedder, max_size=2048)
    vectors   = await embed_with_late_chunking(chunks, embedder)

asyncio.run(main())

What it does

fancychunk treats chunking as three separable problems, each solved by its own optimization against its own signal:

document  →  split_sentences  →  split_chunklets  →  split_chunks  →  chunks
              (punctuation +     (Markdown headings,    (cosine of
               SaT segmenter)     paragraphs, lists)     adjacent chunklets,
                                                         discourse-corrected)

Stage 1 — split_sentences. Punctuation alone misses too many real-world cases (missing terminals, multilingual text, technical abbreviations like "e.g."), so the default segmenter is SaT (Frohmann et al., 2024) from wtpsplit-lite — a learned model that produces per-character boundary probabilities. A sliding-window dynamic-programming pass (O(N) amortised) then picks boundary positions to maximise total score subject to a configurable min/max sentence length.
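The boundary-picking step can be illustrated with a small sketch. It is not fancychunk's implementation: the function name pick_boundaries and its parameters are hypothetical, and it uses a plain quadratic loop rather than the sliding-window O(N) pass described above. It assumes probs[i] is the model's probability that a sentence ends after character i.

```python
def pick_boundaries(probs: list[float], min_len: int, max_len: int) -> list[int]:
    """Choose cut positions that maximise the summed boundary score.

    A cut at position p means "a sentence ends just before character p".
    Every sentence must be between min_len and max_len characters long.
    Hypothetical helper for illustration only.
    """
    n = len(probs)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[p]: best total score with a cut at p
    prev = [-1] * (n + 1)        # prev[p]: the cut that precedes p on that path
    best[0] = 0.0
    for p in range(1, n + 1):
        # The previous cut q must leave a sentence of legal length p - q.
        for q in range(max(0, p - max_len), p - min_len + 1):
            if best[q] == neg_inf:
                continue
            candidate = best[q] + probs[p - 1]
            if candidate > best[p]:
                best[p] = candidate
                prev[p] = q
    if best[n] == neg_inf:
        raise ValueError("no segmentation satisfies the length constraints")
    cuts = []
    p = n
    while p > 0:                 # walk back from the mandatory final cut
        cuts.append(p)
        p = prev[p]
    return cuts[::-1]
```

The high-probability position that would create a two-character sentence is skipped when min_len=3, which is the point of optimising under length constraints instead of thresholding probabilities.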

Stage 2 — split_chunklets. Sentences are grouped into chunklets — paragraph-sized units targeting roughly three "statements" of information content each. The signal is Markdown block-level structure and a document-relative statement density measure. A 1-D dynamic-programming pass picks chunklet boundaries big enough to embed meaningfully but small enough that each one stays topically coherent.

Stage 3 — split_chunks. Adjacent chunklets are compared by cosine similarity, then discourse-corrected — the mean of typical chunklets' embeddings is projected out so similarity reflects local topic shifts rather than the document's overall theme (Arora et al., 2017). A third dynamic-programming pass picks split points where adjacent chunklets are least similar (this is "level 4" in Greg Kamradt's 5 Levels of Text Splitting taxonomy), subject to a hard max-size covering constraint.
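As a rough illustration of the discourse correction, the sketch below removes the direction shared by all chunklet embeddings and then compares neighbours. It is a simplification under stated assumptions: it uses the plain mean of all vectors (the text above speaks of typical chunklets), the helper names are hypothetical, and it stops at the similarity scores without the dynamic-programming pass or the max-size constraint.

```python
import math


def _dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def remove_common_direction(vectors: list[list[float]]) -> list[list[float]]:
    """Project the mean direction out of every vector (hypothetical helper)."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    norm = math.sqrt(_dot(mean, mean))
    if norm == 0.0:
        return [list(v) for v in vectors]
    unit = [m / norm for m in mean]
    # v - (v . unit) * unit leaves only the part orthogonal to the shared theme.
    return [[x - _dot(v, unit) * u for x, u in zip(v, unit)] for v in vectors]


def adjacent_cosines(vectors: list[list[float]]) -> list[float]:
    """Cosine similarity between each vector and its successor."""
    sims = []
    for a, b in zip(vectors, vectors[1:]):
        denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
        sims.append(_dot(a, b) / denom if denom else 0.0)
    return sims


# Four toy chunklets that share a strong common component.
vectors = [[1.0, 0.1], [1.0, 0.1], [1.0, -0.1], [1.0, -0.1]]
raw = adjacent_cosines(vectors)
corrected = adjacent_cosines(remove_common_direction(vectors))
```

In this toy input the raw similarities are all close to 1, so no boundary stands out; after the correction the middle pair becomes strongly dissimilar, which marks it as the candidate split point.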

Enrichment

The pipeline includes two enrichment steps that pull document context into each chunk's output. Both are baked into chunk_document with sensible defaults; the building-blocks form exposes them as separate primitives.

Late chunking

embed_with_late_chunking(chunks, embedder) produces one context-aware vector per chunk. Instead of embedding each chunk in isolation, the embedder sees windows of adjacent chunks together so attention can resolve anaphora ("the algorithm" picks up its real referent), and the in-scope Markdown heading stack is prepended once per segment as additional preamble (controlled by include_headings=True, on by default). Typical retrieval-quality win is 4–8 MTEB points (Jina AI's paper has the numbers).

Because the heading stack is already in the embedder's input, the embedding already incorporates heading context — there's no need to also prepend headings to the chunk text before embedding. enrich_with_headings is for the stored text only (see below).
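The pooling half of late chunking can be sketched as follows. This assumes the embedder has already produced one vector per token for a whole segment, plus a token count per chunk, and that chunk vectors are obtained by mean pooling; the function name pool_chunks is hypothetical and the pooling rule is an assumption, not a statement about fancychunk's internals.

```python
def pool_chunks(
    token_vectors: list[list[float]], token_counts: list[int]
) -> list[list[float]]:
    """Mean-pool consecutive token vectors into one vector per chunk.

    token_counts[i] is how many tokens chunk i contributed to the segment,
    so the counts must add up to the number of token vectors.
    """
    if sum(token_counts) != len(token_vectors):
        raise ValueError("token counts do not cover the token vectors")
    pooled = []
    start = 0
    for count in token_counts:
        if count <= 0:
            raise ValueError("every chunk needs at least one token")
        span = token_vectors[start:start + count]
        dim = len(span[0])
        pooled.append([sum(v[d] for v in span) / count for d in range(dim)])
        start += count
    return pooled
```

Because the token vectors come from a forward pass over the whole window, each pooled vector carries context from neighbouring chunks even though it only averages its own tokens.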

Heading-path enrichment

enrich_with_headings(chunks) returns each chunk with the Markdown heading stack in scope at its start prepended (e.g. "# Top\n## Sub\n\n<chunk text>"). This adds context to chunks that might otherwise lack it; for more information, see Out-of-Context Chunk Problem. Note that late chunking already feeds this context to the embedder, so the vectors don't need it; use this method for the text you store, or when you're not using late chunking at all.
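To make "heading stack in scope" concrete, here is a minimal sketch that computes it for a character offset in a Markdown document. The helpers heading_stack and with_headings are hypothetical and are not the library's API. The sketch only recognises ATX headings (lines starting with #) and does not skip fenced code blocks, which a real implementation would need to do.

```python
import re

_HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$")


def heading_stack(markdown: str, offset: int) -> list[str]:
    """Return the heading lines that enclose the text starting at offset."""
    stack: list[tuple[int, str]] = []
    for line in markdown[:offset].splitlines():
        match = _HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # A new heading closes every open heading at the same or deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, line.strip()))
    return [line for _, line in stack]


def with_headings(markdown: str, offset: int, chunk_text: str) -> str:
    """Prepend the in-scope headings to a chunk, separated by a blank line."""
    headings = heading_stack(markdown, offset)
    if not headings:
        return chunk_text
    return "\n".join(headings) + "\n\n" + chunk_text


doc = "# Top\n\nintro\n\n## Sub\n\nbody\n\n## Other\n\ntail\n"
enriched = with_headings(doc, doc.index("body"), "body")
```

Here enriched is "# Top\n## Sub\n\nbody", matching the output shape described above; a chunk starting at "tail" would instead get "# Top" and "## Other".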

Models

fancychunk uses two kinds of model: a sentence segmenter (Stage 1) and an embedder (Stage 3 + late chunking). Both are lazy-loaded on first use — importing fancychunk itself is cheap and triggers no network calls, and constructing an embedder (e.g. qwen3_600m()) doesn't load the weights either. Weights cache under ~/.cache/huggingface/ so the download happens once per machine; subsequent process runs hit the cache.
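The lazy-loading behaviour described here follows a common pattern, sketched below with functools.cached_property. The class LazyModel and its loader argument are hypothetical stand-ins; how fancychunk actually defers loading is not shown in this document.

```python
from functools import cached_property
from typing import Any, Callable


class LazyModel:
    """Defer an expensive load until the model is first used."""

    def __init__(self, model_id: str, loader: Callable[[str], Any]) -> None:
        # Construction only records what to load; nothing is downloaded yet.
        self.model_id = model_id
        self._loader = loader

    @cached_property
    def model(self) -> Any:
        # Runs once on first access; later accesses reuse the cached value.
        return self._loader(self.model_id)


loads: list[str] = []


def fake_loader(model_id: str) -> object:
    loads.append(model_id)      # record each (simulated) weight load
    return object()


lazy = LazyModel("example-model", fake_loader)
constructed_without_loading = loads == []
first = lazy.model
second = lazy.model
```

After both accesses, loads contains a single entry, which mirrors the claim that constructing an embedder is cheap and the weights load once on first use.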

Sentence segmenter — SaT. The default is sat-3l-sm from Segment Any Text (Frohmann et al., 2024) via wtpsplit-lite, shipped as ONNX. 408 MB download on first call, ~500 MB resident. Multilingual, punctuation-agnostic, and exposes per-character boundary probabilities directly — exactly the SPEC-CHUNK-106 contract Stage 1 wants. For zero-dependency deployments where you can tolerate lower segmentation quality, pass segmenter=punctuation_segmenter instead: a ~50-line rule-based fallback bundled with the library.

Embedders. Four bundled models trade quality for latency. You pick one explicitly — there's no hidden default — and pass it through to chunk_document (or to the individual primitives). The recommended choice for most workloads is qwen3_600m(): good quality (MTEB Multilingual 64.33, the sub-1B leader), modest memory (~0.5 GB on MLX-mxfp8, ~1 GB on torch), and fast enough to keep interactive workflows responsive.

The factories live in fancychunk.embedders and require one of the install extras above ([torch] or [mlx]). Calling qwen3_600m() without the right backend installed raises an ImportError with the install hint baked in. The MLX backend is auto-selected on Apple Silicon when mlx_embeddings is importable; elsewhere the factories fall back to torch (which requires [torch]). MTEB scores are from each model's published tables; throughput is measured on the hardware named above each table below.
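The selection rule in the previous paragraph can be sketched as a small function. This is an illustration only: pick_backend and its injectable parameters are hypothetical, the platform checks are an assumption about how Apple Silicon would be detected, and the error text merely imitates the install hint mentioned above.

```python
import importlib.util
import platform
import sys
from typing import Callable


def _importable(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


def pick_backend(
    system: str = sys.platform,
    machine: str = platform.machine(),
    has_module: Callable[[str], bool] = _importable,
) -> str:
    """Return "mlx" or "torch" following the rule described in the text."""
    on_apple_silicon = system == "darwin" and machine == "arm64"
    if on_apple_silicon and has_module("mlx_embeddings"):
        return "mlx"
    if has_module("torch"):
        return "torch"
    raise ImportError(
        "no embedding backend found; install 'fancychunk[torch]' "
        "or 'fancychunk[mlx]'"
    )
```

Passing system, machine, and has_module as parameters keeps the rule testable on any machine, for example pick_backend("linux", "x86_64", lambda name: name == "torch") returns "torch".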

Note on CPU-only torch (Linux). pip install 'fancychunk[torch]' pulls the default torch wheel, which on Linux is the CUDA-bundled build (~2.5 GB) even if you don't have a GPU. If you only need CPU inference, install the CPU wheel first, then add fancychunk:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install 'fancychunk[torch]'  # picks up the already-installed torch

PyPI extras can't express the --index-url redirect, so this two-step is the workaround until upstream torch ships size-tagged variants on standard PyPI. macOS torch wheels are already small (~80 MB, no CUDA bundle) — this is a Linux-only concern.

Apple Silicon, MLX path (M2 MacBook Air):

| Model factory | Backend default | Model | Params | Native dim | Resident | embed_chunklets mean | Tokens/s | MTEB-Multi | MTEB-Eng |
|---|---|---|---|---|---|---|---|---|---|
| bge_m3() | MLX¹ / torch | BGE-M3 (CLS pooling) | 568M | 1024 | ~1 GB | 139 ms | 890 | 59.50 | 63.50 |
| qwen3_600m() | MLX¹ / torch | Qwen3-Embedding-0.6B | 596M | 1024 | ~0.5 GB | 79 ms | 1,186 | 64.33 | 70.70 |
| qwen3_4b() | MLX¹ / torch | Qwen3-Embedding-4B | 3.6B | 2560 | ~4 GB | 516 ms | 182 | 69.45 | 74.60 |
| qwen3_8b() | MLX¹ / torch | Qwen3-Embedding-8B | 7.6B | 4096 | ~7 GB | 950 ms | 99 | 70.58 | 75.22 |

Linux, torch + CUDA path (RTX 3090)²:

| Model factory | Backend | embed_chunklets mean | Tokens/s |
|---|---|---|---|
| bge_m3() | torch | 18 ms | 6,843 |
| qwen3_600m() | torch | 32 ms | 2,974 |
| qwen3_4b() | torch | 39 ms | 2,426 |
| qwen3_8b() | torch | 44 ms | 2,162 |

qwen3_4b and qwen3_8b accept a dim=N argument that truncates the output via Matryoshka Representation Learning and re-L2-normalizes it; the compute cost is unchanged. Pass dim=1024 to match the native dimension of qwen3_600m and bge_m3, so vectors from any of the four models fit the same vector-store schema.
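Truncating and re-normalizing is a small operation, sketched here in plain Python. The function name truncate_and_normalize is hypothetical; the sketch only shows the arithmetic that a dim=N argument implies, not how the bundled factories implement it.

```python
import math


def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    """Keep the first dim components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and the vector length")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)       # nothing to rescale
    return [x / norm for x in head]


short = truncate_and_normalize([3.0, 4.0, 12.0], 2)
```

The re-normalization matters because a truncated vector no longer has unit length, and many vector stores assume unit vectors when they use dot product as cosine similarity. The model still computes the full-width embedding first, which is why the compute cost is unchanged.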

¹ MLX builds: mlx-community/bge-m3-mlx-fp16, mlx-community/Qwen3-Embedding-{0.6B,4B,8B}-mxfp8. The Qwen3 variants use 8-bit microscaling (mxfp8) — small enough to fit comfortably on a 24 GB Mac at every tier and the highest-quality MLX build the community publishes. On non-Apple-Silicon, each factory transparently loads the canonical HuggingFace weights and runs on torch + MPS / CUDA / CPU.

² CUDA numbers measured on an NVIDIA GeForce RTX 3090 (24 GB VRAM, driver 580.159.03) with Intel Core i9-10900KF and 32 GB system RAM, on Linux 6.17 with PyTorch 2.12.0 + bundled CUDA 13.0 wheels (Python 3.13). All factories load canonical HuggingFace weights in fp16; weights live on VRAM. Same 3-chunklet bench_factories.py batch as the Mac measurements.

Bring your own embedder

fancychunk ships four bundled embedders (see Models). If you need your own — different backend, custom model, remote service — implement the protocol. All three methods are async:

class Embedder(Protocol):
    n_ctx: int
    async def count_tokens(self, texts: list[str]) -> list[int]: ...
    async def embed_segment(
        self, texts: list[str]
    ) -> tuple[NDArray, list[int]]: ...
    async def embed_chunklets(self, chunklets: list[str]) -> NDArray: ...

embed_chunklets is the pooled per-chunklet path for split_chunks; embed_segment + count_tokens are the token-level path for embed_with_late_chunking. A single class implements both — the bundled embedders do.

For a CPU/GPU embedder (torch, MLX, etc.) wrap your sync forward pass in asyncio.to_thread inside each async method so the call yields control while the device works; for a remote embedder, await your HTTP client directly. The bundled PooledSegmentEmbedder shows the former; examples/embedders/remote_http.py shows the latter against httpx.AsyncClient.
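The asyncio.to_thread wrapping pattern looks roughly like this. The sketch is deliberately incomplete as an Embedder: it omits embed_segment, returns plain lists where the protocol specifies NumPy arrays, and the two blocking functions are toy stand-ins for a real tokenizer and forward pass. All names in it are hypothetical.

```python
import asyncio


def blocking_token_count(texts: list[str]) -> list[int]:
    # Stand-in for a blocking tokenizer call; whitespace splitting is a placeholder.
    return [len(text.split()) for text in texts]


def blocking_embed(texts: list[str]) -> list[list[float]]:
    # Stand-in for a blocking forward pass: a toy 2-d vector per text.
    return [[float(len(text)), float(text.count(" "))] for text in texts]


class ToyEmbedder:
    """Shows only how sync work is moved off the event loop."""

    n_ctx = 512

    async def count_tokens(self, texts: list[str]) -> list[int]:
        # to_thread runs the sync call in a worker thread and yields control.
        return await asyncio.to_thread(blocking_token_count, texts)

    async def embed_chunklets(self, chunklets: list[str]) -> list[list[float]]:
        return await asyncio.to_thread(blocking_embed, chunklets)


counts = asyncio.run(ToyEmbedder().count_tokens(["a b c", "d"]))
```

While the worker thread is busy, the event loop stays free to schedule other coroutines, which is what lets several documents make progress concurrently.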

examples/embedders/ contains three runnable reference adapters: MLX + Qwen3-Embedding, HuggingFace transformers, and an async-HTTP remote client. All three implement both halves of the protocol, so they are drop-in for split_chunks and chunk_document.

Concurrency

The public async API (split_chunks, embed_with_late_chunking, chunk_document) is safe to drive from multiple coroutines concurrently — asyncio.gather(chunk_document(doc1, emb), chunk_document(doc2, emb), ...) works. Inside embed_with_late_chunking, independent segments are themselves embedded via asyncio.gather, so each document overlaps its own segments' embedding calls.

For a batch of documents, chunk_documents wraps that gather with an optional concurrency cap:

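import asyncio
from pathlib import Path

from fancychunk import chunk_documents
from fancychunk.embedders import qwen3_600m

async def main():
    embedder = qwen3_600m()
    paths = sorted(Path("docs").glob("*.md"))               # placeholder: point this at your Markdown files
    docs = [p.read_text(encoding="utf-8") for p in paths]
    results = await chunk_documents(docs, embedder, max_concurrency=8)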
    # results[i] is (chunks, vectors) for docs[i].

asyncio.run(main())

Pass max_concurrency=N to cap how many documents are in flight at once (sensible for remote embedders so you don't hammer the server). Omit it to gather all documents at once; that is fine for bundled embedders, since the internal lock serializes to device throughput anyway.
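A capped gather of this kind is commonly built from asyncio.Semaphore. The sketch below shows that general pattern; gather_capped is a hypothetical helper, and nothing here claims to be how chunk_documents is implemented.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def gather_capped(
    jobs: list[Callable[[], Awaitable[Any]]],
    max_concurrency: Optional[int] = None,
) -> list[Any]:
    """Run zero-argument async jobs, at most max_concurrency at a time.

    Results come back in the same order as jobs, like asyncio.gather.
    """
    if max_concurrency is None:
        return list(await asyncio.gather(*(job() for job in jobs)))

    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(job: Callable[[], Awaitable[Any]]) -> Any:
        async with semaphore:    # wait here while the cap is reached
            return await job()

    return list(await asyncio.gather(*(run(job) for job in jobs)))


state = {"running": 0, "peak": 0}


async def fake_document_job(index: int) -> int:
    state["running"] += 1
    state["peak"] = max(state["peak"], state["running"])
    await asyncio.sleep(0.01)    # stands in for a remote embedding call
    state["running"] -= 1
    return index * 2


jobs = [lambda i=i: fake_document_job(i) for i in range(6)]
results = asyncio.run(gather_capped(jobs, max_concurrency=2))
```

Jobs are passed as callables rather than coroutine objects so that no work starts before the semaphore admits it; the peak counter confirms that no more than two jobs ran at once.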

Bundled embedder instances are also safe to drive from multiple threads — internal locking serializes worker-thread access to the underlying model. This covers any code that uses the embedder via asyncio.to_thread or a ThreadPoolExecutor directly. The lock matches what the device can actually deliver — one forward pass at a time — so callers don't need their own synchronization.

For higher throughput, create multiple embedder instances; each loads its own copy of the weights. A remote / true-parallel embedder (examples/embedders/remote_http.py) gets real concurrency from asyncio.gather since it isn't bottlenecked on a single local device.

Observability

Every public function emits an OpenTelemetry span — names like fancychunk.split_sentences, attributes like fancychunk.sentences.count. The library pulls only opentelemetry-api so spans are no-ops until your app configures an SDK. Useful for figuring out which stage just got slow in production.

Status

Alpha (0.1.x). Public API is documented in docs/specs/contracts/public-api.md and locked in by the test suite, but not yet SemVer-stable — that lands at 1.0.0. CI runs pyright strict + pytest on Python 3.12 and 3.13 on every push.

Where the specs live

Behavioral specs in docs/specs/ describe what each function does, not how. Every behavior has a SPEC-CHUNK-NNN ID; every ID has a test. Implementations in other languages are welcome to use the specs verbatim and ignore this Python code entirely.

Acknowledgments

The three-stage pipeline (sentence → chunklet → chunk), the late-chunking strategy, and the contextual-headings helper come from raglite. Specific techniques: the SaT segmenter (Frohmann et al., 2024), Greg Kamradt's 5 Levels of Text Splitting, Arora et al.'s discourse vector (ICLR 2017), the Weaviate / Jina late-chunking work (Günther et al., 2024), and Dan Stites's contextual headings post.
