Text chunking for retrieval-augmented generation.
fancychunk
A small, focused library for splitting text documents into semantically coherent chunks suitable for retrieval-augmented generation.
Status: initial implementation. The full specification lives in
docs/specs/; the public API in docs/specs/contracts/public-api.md; the test vectors in docs/specs/test-vectors/. The implementation lives in src/fancychunk/ and covers the three required pipeline stages plus the two optional helpers (embed_with_late_chunking, heading_paths).
Quick start
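from pathlib import Path

import numpy as np
from fancychunk import (
    split_sentences,
    split_chunklets,
    split_chunks,
    heading_paths,
)

# Read the whole document as text; Path.read_text closes the file for us.
doc = Path("README.md").read_text(encoding="utf-8")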
sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)
# Caller supplies the embedding matrix; embedding is not part of
# fancychunk's core pipeline. Any deterministic embedder works.
embeddings = my_embedder(chunklets)
chunks, chunk_embeddings = split_chunks(chunklets, embeddings, max_size=2048)
paths = heading_paths(chunks)
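The quick start leaves my_embedder undefined. As a stand-in for trying the pipeline end to end, here is a minimal sketch of a deterministic toy embedder. It is not part of fancychunk, and it assumes split_chunks accepts a 2-D array with one row per chunklet; the function name, the dim parameter, and the hashing scheme are illustrative choices only. It is not meant to give useful retrieval quality.

```python
import hashlib

import numpy as np


def my_embedder(texts: list[str], dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedder (hypothetical helper, not fancychunk API).

    Each lowercase word is hashed into one of `dim` buckets, and every row is
    L2-normalized. hashlib is used instead of the built-in hash() because the
    latter is randomized per process and would break determinism.
    """
    matrix = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % dim
            matrix[row, bucket] += 1.0
    # Guard against division by zero for empty texts (their rows stay all-zero).
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)
```

For real use, swap this for a proper embedding model that returns one vector per chunklet.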
Late chunking — bring your own embedder
embed_with_late_chunking is an optional stage that improves
retrieval quality on documents with anaphoric references ("it",
"this method", "the algorithm") by giving each sentence an embedding
computed in the context of its neighbours. It gains about 4 MTEB
points on retrieval benchmarks vs. naive per-chunklet embedding, at
the price of ~30% more compute.
The library doesn't ship any embedding model. It owns the
algorithm — segment planning with backward preamble, mean-pool per
sentence, preamble discard, optional L2 normalize — and delegates
everything tokenizer-specific to a caller-supplied
SegmentEmbedder.
The contract is two methods and one attribute:
from typing import Protocol

from numpy.typing import NDArray


class SegmentEmbedder(Protocol):
    n_ctx: int

    def count_tokens(self, sentences: list[str]) -> list[int]: ...

    def embed_segment(
        self, sentences: list[str]
    ) -> tuple[NDArray, list[int]]: ...
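To make the mean-pool, preamble-discard, and optional L2-normalize steps named above concrete, here is a minimal sketch. It is not the library's code. It assumes a token-embedding matrix of shape (n_tokens, dim) and a list of per-sentence token counts are at hand; whether that matches what embed_segment returns is an assumption to check against the spec. The names pool_sentences and n_preamble are hypothetical.

```python
import numpy as np


def pool_sentences(
    token_embeddings: np.ndarray,
    token_counts: list[int],
    n_preamble: int = 0,
    normalize: bool = True,
) -> np.ndarray:
    """Mean-pool token vectors per sentence, then drop preamble sentences.

    Hypothetical helper: `token_counts[i]` is assumed to be the number of
    token rows belonging to sentence i, in order.
    """
    if any(count <= 0 for count in token_counts):
        raise ValueError("every sentence needs at least one token")
    if sum(token_counts) != token_embeddings.shape[0]:
        raise ValueError("token counts do not cover the token matrix")

    # Cut the token matrix into one span of rows per sentence.
    boundaries = np.cumsum(token_counts)[:-1]
    spans = np.split(token_embeddings, boundaries)
    pooled = np.stack([span.mean(axis=0) for span in spans])

    # Preamble sentences only supply context; their vectors are discarded.
    pooled = pooled[n_preamble:]

    if normalize:
        norms = np.linalg.norm(pooled, axis=1, keepdims=True)
        pooled = pooled / np.maximum(norms, 1e-12)
    return pooled
```

The preamble sentences still influence the kept vectors, because the model saw them while producing the token embeddings; only their own pooled rows are dropped.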
Adapters for three deployment shapes ship as runnable examples:
| File | Backend | Best for |
|---|---|---|
| examples/embedders/qwen3_mlx.py | MLX + Qwen3-Embedding | Apple Silicon; offline / batch |
| examples/embedders/huggingface_offsets.py | HuggingFace transformers | Any platform; recommended default |
| examples/embedders/remote_http.py | HTTP client + local tokenizer | When the GPU lives on another machine |
See examples/embedders/README.md
for guidance on picking an alignment method (offset-based vs.
sentinel-token), handling special tokens, and writing your own
adapter — typically ~20 lines of glue.
Wire it into the pipeline between stages 2 and 3:
from examples.embedders.huggingface_offsets import HFOffsetEmbedder
from fancychunk import (
    embed_with_late_chunking,
    split_chunklets,
    split_chunks,
    split_sentences,
)
embedder = HFOffsetEmbedder("BAAI/bge-m3")
sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)
# Per-sentence embeddings with surrounding context.
sentence_embeddings = embed_with_late_chunking(sentences, embedder)
# Aggregate to per-chunklet embeddings (mean-pool over the sentences
# inside each chunklet). aggregate_to_chunklets is a helper the caller
# writes; it is not imported from fancychunk.
chunklet_embeddings = aggregate_to_chunklets(
    sentence_embeddings, sentences, chunklets
)
chunks, _ = split_chunks(chunklets, chunklet_embeddings, max_size=2048)
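The snippet above calls aggregate_to_chunklets, which the caller has to provide. A minimal sketch follows, under two assumptions that are not confirmed by this README: sentence_embeddings is a 2-D array with one row per sentence, and each chunklet is the exact concatenation of a run of consecutive sentences. If your chunklets are built differently, the alignment step needs to change.

```python
import numpy as np


def aggregate_to_chunklets(
    sentence_embeddings: np.ndarray,
    sentences: list[str],
    chunklets: list[str],
) -> np.ndarray:
    """Mean-pool sentence rows into one row per chunklet (caller-side sketch).

    Assumes "".join(sentences) == "".join(chunklets) and that chunklet
    boundaries always coincide with sentence boundaries.
    """
    sentence_embeddings = np.asarray(sentence_embeddings)
    rows = []
    start = 0
    for chunklet in chunklets:
        end = start
        covered = 0
        # Consume sentences until their total length matches the chunklet.
        while covered < len(chunklet):
            if end >= len(sentences):
                raise ValueError("ran out of sentences while covering a chunklet")
            covered += len(sentences[end])
            end += 1
        if covered != len(chunklet) or end == start:
            raise ValueError("chunklet does not align with sentence boundaries")
        rows.append(sentence_embeddings[start:end].mean(axis=0))
        start = end
    if start != len(sentences):
        raise ValueError("some sentences were not assigned to any chunklet")
    return np.stack(rows)
```

Aligning by length rather than by string search keeps the helper correct when the same sentence text appears more than once in a document.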
Observability
Every public stage emits an OpenTelemetry span with attributes that
describe input/output sizes and the option choices that affected the
outcome. The library depends only on opentelemetry-api; spans are
zero-cost no-ops until the host application configures an SDK and
exporter.
Span names are fancychunk.<function> (e.g.
fancychunk.split_sentences). Attribute keys use the
fancychunk.<key> namespace:
| Stage | Attribute keys |
|---|---|
| split_sentences | document.length, min_len, max_len, segmenter, sentences.count, short_circuit |
| split_chunklets | sentences.count, max_size, custom_costs, chunklets.count, short_circuit |
| split_chunks | chunklets.count, max_size, embedding.dim, chunks.count, short_circuit |
| embed_with_late_chunking | sentences.count, embedder, embedder.n_ctx, budget, preamble_budget, preamble_fraction, normalize, segments.count, embedding.dim |
| heading_paths | chunks.count, paths.non_empty |
To see them locally, install opentelemetry-sdk and configure a
console exporter:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
# subsequent fancychunk calls now emit spans to stdout
The library also exposes a standard logging.Logger at
fancychunk (currently silent by default; future versions may add
INFO-level breadcrumbs at stage transitions).
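Because the library attaches no handlers of its own, the host application decides where any fancychunk log records go. A minimal sketch using only the standard library is shown here; the format string and the choice of stderr are illustrative, and with the current release the logger may simply stay quiet.

```python
import logging
import sys

# The logger name "fancychunk" is the one the library exposes.
logger = logging.getLogger("fancychunk")

# Route records to stderr with a timestamped format (illustrative choice).
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```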
What it does
Given a Markdown document, fancychunk partitions it into chunks where each chunk:
- Respects sentence and paragraph boundaries.
- Targets a configurable maximum size.
- Begins at a structurally meaningful point (heading, paragraph start).
- Groups together semantically related material, splitting where the topic shifts.
Optionally:
- When paired with a token-level embedding model, fancychunk can produce per-sentence embeddings that incorporate surrounding-document context ("late chunking"). The caller aggregates them to per-chunklet level (typically by mean-pool over the sentences in each chunklet) before passing them to the semantic-chunking stage.
- For each chunk, fancychunk can compute the Markdown heading path that was in scope at the chunk's start, suitable for prepending as embedding context.
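To illustrate what a heading path "in scope at the chunk's start" means, here is a simplified standard-library sketch. It is not how fancychunk implements heading_paths (the repo layout indicates markdown-it is used); it only handles ATX headings (lines starting with #), skips fenced code, and treats a heading that begins exactly at the offset as not yet in scope. The function name heading_path_at and its offset parameter are hypothetical.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, title, optional closing '#'s.
ATX_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = "`" * 3


def heading_path_at(document: str, offset: int) -> list[str]:
    """Return the titles of the headings in scope at character `offset`."""
    path: list[tuple[int, str]] = []
    position = 0
    in_fence = False
    for line in document.splitlines(keepends=True):
        if position >= offset:
            break
        if line.strip().startswith(FENCE):
            # Toggle fenced-code state so '#' comments are not read as headings.
            in_fence = not in_fence
        elif not in_fence:
            match = ATX_HEADING.match(line.rstrip("\n"))
            if match:
                level = len(match.group(1))
                # A new heading closes every open heading at the same or deeper level.
                while path and path[-1][0] >= level:
                    path.pop()
                path.append((level, match.group(2)))
        position += len(line)
    return [title for _, title in path]
```

For a document with "# Guide" followed later by "## Usage", a chunk that starts in the Usage section gets the path Guide, Usage, which can be prepended to the chunk text before embedding.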
What it does not do
- It does not parse PDFs, Word documents, or HTML. Input is Markdown.
- It does not embed text in the core three-stage pipeline. Embedding
is the caller's responsibility; fancychunk consumes pre-computed
chunklet embeddings for the semantic-chunking stage. (The optional
embed_with_late_chunking helper does invoke an embedder, but it is opt-in and requires the caller to supply one.)
- It does not store, index, or retrieve. Output is a list of strings.
- It does not generate. There is no LLM in the loop.
How to read the specs
The specs in docs/specs/ are behavioral, not
prescriptive about implementation. A spec line says what a function
must do, not how to do it. Implementations are free to choose
tools, algorithms, libraries, and internal architecture.
Specs are numbered. SPEC-CHUNK-NNN identifiers within each spec correspond to a single testable behavior; the acceptance checklist tracks every ID.
Repo layout
fancychunk/
├── README.md # This file
├── LICENSE # MIT
├── pyproject.toml # Package metadata + runtime deps
├── docs/specs/
│ ├── README.md # Glossary and reading order
│ ├── 00-pipeline-overview.md # End-to-end data flow
│ ├── 01-sentence-splitting.md # Stage 1
│ ├── 02-chunklet-grouping.md # Stage 2
│ ├── 03-semantic-chunking.md # Stage 3
│ ├── 04-late-chunking.md # Optional embed strategy
│ ├── 05-contextual-headings.md # Optional helper
│ ├── contracts/ # Public API signatures
│ ├── test-vectors/ # Concrete input → expected output pairs
│ └── acceptance/ # Pass/fail criteria
├── src/fancychunk/ # Implementation
│ ├── sentences.py # Stage 1 — sentence splitting
│ ├── chunklets.py # Stage 2 — chunklet grouping
│ ├── chunks.py # Stage 3 — semantic chunking
│ ├── late_chunking.py # Stage 4 — late chunking (optional)
│ ├── headings.py # Stage 5 — heading paths (optional)
│ ├── _markdown.py # Markdown-it heading + opener helpers
│ ├── _segmenter.py # SaT default + punctuation fallback
│ ├── _constants.py # Named constants from the specs
│ └── errors.py # Exception hierarchy
└── tests/ # pytest suite covering every TV-*
Production readiness
This is an alpha release (0.1.x). The behaviour the public API
documents is fully spec-conforming and locked in by the 88-test
suite; what's not yet promised:
- API stability. Names and defaults are unlikely to change but
aren't yet contract-stable. SemVer applies once the version hits
1.0.0.
- SaT model on first run. The default segmenter downloads ~408 MB
of weights from Hugging Face on first call. For production
deployment, either pre-warm the cache during image build or pass
segmenter=punctuation_segmenter if you can tolerate its quality.
- Thread safety. The module-level SaT singleton and markdown-it parser are reentrant for read; the library doesn't synchronise. Concurrent calls from multiple threads work because every operation is read-only. Concurrent first-time SaT loading from multiple threads may load the model twice (harmless but wasteful); pre-warm if this matters.
- No global state writes. No caches, no temp files, no logging
side effects. The library does not call
logging.basicConfig and attaches no handlers.
- Determinism. Cross-run reproducibility is guaranteed for every stage given a deterministic segmenter / embedder (see SPEC-CHUNK-901 in the specs).
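If the duplicate first-time load mentioned under thread safety matters in your deployment, one caller-side option is to funnel the warm-up through a lock. The sketch below is generic standard-library code, not part of fancychunk; prewarm_once and the warm callable are hypothetical names.

```python
import threading
from collections.abc import Callable

_warm_lock = threading.Lock()
_warmed = False


def prewarm_once(warm: Callable[[], object]) -> None:
    """Run `warm` at most once, even when called from many threads."""
    global _warmed
    if _warmed:
        return
    with _warm_lock:
        # Re-check inside the lock: another thread may have finished first.
        if not _warmed:
            warm()
            _warmed = True
```

In an application, warm could be something like lambda: split_sentences("Warm up.", max_len=2048), assuming such a small call is enough to trigger the model load. Calling prewarm_once at startup, before worker threads are spawned, is simpler still.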
CI runs pyright in strict mode and pytest against Python 3.12
and 3.13 on every push. Tests use the lightweight punctuation
segmenter so CI doesn't need the SaT weights; set
FANCYCHUNK_TEST_USE_SAT=1 to exercise the real model.
Releases
Tags of the form vX.Y.Z on main trigger the release workflow
(.github/workflows/release.yml), which builds sdist + wheel and
publishes to PyPI via Trusted Publishing
— no API tokens stored anywhere. The first publish has to be done
manually (to reserve the project name on PyPI); subsequent releases
ride the workflow.
To cut a release:
# 1. Update the version (single source of truth is pyproject.toml).
# 2. Update CHANGELOG.md.
# 3. Tag and push:
git tag -a v0.1.1 -m "Describe the release"
git push origin v0.1.1
The release workflow takes over from there.
Acknowledgments
The three-stage pipeline (sentence → chunklet → chunk), the late-chunking strategy, and the contextual-headings helper are inspired by the chunking pipeline in raglite. Specific techniques cite their originators inline in the specs: the SaT segmenter, Greg Kamradt's "5 levels" taxonomy, Arora et al.'s discourse-vector technique, the Weaviate / Jina late-chunking work, and Dan Stites's contextual-headings post.