Text chunking for retrieval-augmented generation.

fancychunk

A small, focused library for splitting text documents into semantically coherent chunks suitable for retrieval-augmented generation.

Status: initial implementation. The full specification lives in docs/specs/; the public API in docs/specs/contracts/public-api.md; the test vectors in docs/specs/test-vectors/. The implementation lives in src/fancychunk/ and covers the three required pipeline stages plus the two optional helpers (embed_with_late_chunking, heading_paths).

Quick start

from fancychunk import (
    heading_paths,
    split_chunklets,
    split_chunks,
    split_sentences,
)

# Read the Markdown document to chunk.
with open("README.md", encoding="utf-8") as f:
    doc = f.read()

sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)

# Caller supplies the embedding matrix; embedding is not part of
# fancychunk's core pipeline. Any deterministic embedder works.
# my_embedder is a placeholder for your own function.
embeddings = my_embedder(chunklets)
chunks, chunk_embeddings = split_chunks(chunklets, embeddings, max_size=2048)
paths = heading_paths(chunks)
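
For a smoke test without a real model, my_embedder can be a toy stand-in. The sketch below is a hashed bag-of-words embedder. It assumes the embedding matrix is a float array of shape (number of chunklets, dim), which is a guess at the contract; docs/specs/contracts/public-api.md is authoritative. The dim parameter and the hashing scheme are illustrative choices, not part of fancychunk.

```python
import hashlib

import numpy as np


def my_embedder(texts: list[str], dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedder: deterministic, no model needed."""
    matrix = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            # hashlib is stable across runs, unlike the built-in hash().
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            col = int.from_bytes(digest[:4], "big") % dim
            matrix[row, col] += 1.0
    # L2-normalize rows; all-zero rows (empty text) stay all-zero.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0.0, 1.0, norms)
```

Such an embedder captures word overlap only, so it is useful for checking that the pipeline runs end to end, not for judging chunk quality.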

Late chunking — bring your own embedder

embed_with_late_chunking is an optional stage that improves retrieval quality on documents with anaphoric references ("it", "this method", "the algorithm") by giving each sentence an embedding computed in the context of its neighbours. It gains about 4 MTEB points on retrieval benchmarks vs. naive per-chunklet embedding, at the price of ~30% more compute.

The library doesn't ship any embedding model. It owns the algorithm — segment planning with backward preamble, mean-pool per sentence, preamble discard, optional L2 normalize — and delegates everything tokenizer-specific to a caller-supplied SegmentEmbedder. The contract is two methods and one attribute:

class SegmentEmbedder(Protocol):
    n_ctx: int
    def count_tokens(self, sentences: list[str]) -> list[int]: ...
    def embed_segment(
        self, sentences: list[str]
    ) -> tuple[NDArray, list[int]]: ...
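
The pooling half of the algorithm described above (mean-pool per sentence, preamble discard, optional L2 normalize) can be sketched roughly as follows. The sketch assumes embed_segment returns a token-embedding matrix of shape (n_tokens, dim) plus one token count per sentence, with special tokens already excluded. That reading of the return type is an assumption, and pool_segment and n_preamble are hypothetical names; docs/specs/04-late-chunking.md defines the real behaviour.

```python
import numpy as np


def pool_segment(
    token_embeddings: np.ndarray,  # assumed shape: (n_tokens, dim)
    token_counts: list[int],       # assumed: tokens per sentence, in order
    n_preamble: int,               # leading sentences that are context only
    normalize: bool = True,
) -> np.ndarray:
    """Mean-pool token embeddings per sentence and drop preamble sentences."""
    if any(c <= 0 for c in token_counts):
        raise ValueError("every sentence needs at least one token")
    if sum(token_counts) != token_embeddings.shape[0]:
        raise ValueError("token counts must cover every token row")
    if not 0 <= n_preamble < len(token_counts):
        raise ValueError("at least one non-preamble sentence is required")
    pooled = []
    start = 0
    for count in token_counts:
        # One sentence's tokens form a contiguous block of rows.
        pooled.append(token_embeddings[start : start + count].mean(axis=0))
        start += count
    # Preamble sentences only supply context; their vectors are discarded.
    out = np.stack(pooled[n_preamble:])
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.where(norms == 0.0, 1.0, norms)
    return out
```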

Adapters for three deployment shapes ship as runnable examples:

| File | Backend | Best for |
| --- | --- | --- |
| examples/embedders/qwen3_mlx.py | MLX + Qwen3-Embedding | Apple Silicon; offline / batch |
| examples/embedders/huggingface_offsets.py | HuggingFace transformers | Any platform; recommended default |
| examples/embedders/remote_http.py | HTTP client + local tokenizer | When the GPU lives on another machine |

See examples/embedders/README.md for guidance on picking an alignment method (offset-based vs. sentinel-token), handling special tokens, and writing your own adapter — typically ~20 lines of glue.

Wire it into the pipeline between stages 2 and 3:

from examples.embedders.huggingface_offsets import HFOffsetEmbedder
from fancychunk import (
    embed_with_late_chunking,
    split_chunklets,
    split_chunks,
    split_sentences,
)

embedder = HFOffsetEmbedder("BAAI/bge-m3")

sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)

# Per-sentence embeddings with surrounding context.
sentence_embeddings = embed_with_late_chunking(sentences, embedder)

# Aggregate to per-chunklet embeddings (mean-pool over the sentences
# inside each chunklet). aggregate_to_chunklets is a helper the caller
# writes; it is the caller's responsibility, not part of the pipeline.
chunklet_embeddings = aggregate_to_chunklets(
    sentence_embeddings, sentences, chunklets
)

chunks, _ = split_chunks(chunklets, chunklet_embeddings, max_size=2048)
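
The aggregate_to_chunklets helper used above is left to the caller. One possible shape for it is sketched below. It assumes every chunklet is the exact concatenation of a run of consecutive sentences, in order, so the two lists can be aligned by character length. That alignment rule is an assumption about split_chunklets output; check docs/specs/02-chunklet-grouping.md before relying on it.

```python
import numpy as np


def aggregate_to_chunklets(
    sentence_embeddings: np.ndarray,  # shape: (n_sentences, dim)
    sentences: list[str],
    chunklets: list[str],
) -> np.ndarray:
    """Mean-pool sentence embeddings over the sentences in each chunklet.

    Assumes every chunklet is the exact concatenation of a run of
    consecutive sentences, in order.
    """
    rows = []
    i = 0
    for chunklet in chunklets:
        start, consumed = i, 0
        # Consume sentences until their total length matches the chunklet.
        while i < len(sentences) and consumed < len(chunklet):
            consumed += len(sentences[i])
            i += 1
        if consumed != len(chunklet) or i == start:
            raise ValueError("chunklet is not a concatenation of sentences")
        rows.append(sentence_embeddings[start:i].mean(axis=0))
    if i != len(sentences):
        raise ValueError("sentences left over after the last chunklet")
    return np.stack(rows)
```

If late chunking returned L2-normalized sentence vectors, the pooled rows are generally no longer unit length; re-normalize them if the downstream stage expects that.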

Observability

Every public stage emits an OpenTelemetry span with attributes that describe input/output sizes and the option choices that affected the outcome. The library depends only on opentelemetry-api; spans are zero-cost no-ops until the host application configures an SDK and exporter.

Span names are fancychunk.<function> (e.g. fancychunk.split_sentences). Attribute keys use the fancychunk.<key> namespace:

| Stage | Attribute keys |
| --- | --- |
| split_sentences | document.length, min_len, max_len, segmenter, sentences.count, short_circuit |
| split_chunklets | sentences.count, max_size, custom_costs, chunklets.count, short_circuit |
| split_chunks | chunklets.count, max_size, embedding.dim, chunks.count, short_circuit |
| embed_with_late_chunking | sentences.count, embedder, embedder.n_ctx, budget, preamble_budget, preamble_fraction, normalize, segments.count, embedding.dim |
| heading_paths | chunks.count, paths.non_empty |

To see them locally, install opentelemetry-sdk and configure a console exporter:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# subsequent fancychunk calls now emit spans to stdout

The library also exposes a standard logging.Logger at fancychunk (currently silent by default; future versions may add INFO-level breadcrumbs at stage transitions).
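
Because the library attaches no handlers itself, the host application opts in. A minimal sketch using only the standard library follows; the format string is an arbitrary choice, and with the current release the logger may well stay silent even after this.

```python
import logging

# "fancychunk" is the logger name documented above.
logger = logging.getLogger("fancychunk")

handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```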

What it does

Given a Markdown document, fancychunk partitions it into chunks where each chunk:

  • Respects sentence and paragraph boundaries.
  • Targets a configurable maximum size.
  • Begins at a structurally meaningful point (heading, paragraph start).
  • Groups together semantically related material, splitting where the topic shifts.

Optionally:

  • When paired with a token-level embedding model, fancychunk can produce per-sentence embeddings that incorporate surrounding-document context ("late chunking"). The caller aggregates them to per-chunklet level (typically by mean-pool over the sentences in each chunklet) before passing them to the semantic-chunking stage.
  • For each chunk, fancychunk can compute the Markdown heading path that was in scope at the chunk's start, suitable for prepending as embedding context.
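
To make the heading-path idea concrete, here is a simplified illustration that tracks ATX headings ("#" to "######") with a regular expression and a stack. It is not the library's implementation, which per the repo layout builds on markdown-it; heading_path_at and its offset argument are hypothetical names. The sketch ignores Setext headings and closing "#" sequences, and it counts a heading that starts exactly at the offset as in scope, which is a choice the spec may make differently.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title.
ATX_HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")


def heading_path_at(markdown: str, offset: int) -> list[str]:
    """Return the headings in scope at character `offset`, outermost first."""
    stack: list[tuple[int, str]] = []  # (level, title)
    pos = 0
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if pos > offset:
            break
        stripped = line.strip()
        if stripped.startswith("```") or stripped.startswith("~~~"):
            in_fence = not in_fence  # ignore '#' lines inside fenced code
        elif not in_fence:
            match = ATX_HEADING.match(line.rstrip("\r\n"))
            if match:
                level = len(match.group(1))
                # A new heading closes headings at the same or deeper level.
                while stack and stack[-1][0] >= level:
                    stack.pop()
                stack.append((level, match.group(2)))
        pos += len(line)
    return [title for _, title in stack]
```

A caller could then prepend something like " > ".join(path) to the chunk text before embedding it.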

What it does not do

  • It does not parse PDFs, Word documents, or HTML. Input is Markdown.
  • It does not embed text in the core three-stage pipeline. Embedding is the caller's responsibility; fancychunk consumes pre-computed chunklet embeddings for the semantic-chunking stage. (The optional embed_with_late_chunking helper does invoke an embedder, but it is opt-in and requires the caller to supply one.)
  • It does not store, index, or retrieve. Output is a list of strings.
  • It does not generate. There is no LLM in the loop.

How to read the specs

The specs in docs/specs/ are behavioral, not prescriptive about implementation. A spec line says what a function must do, not how to do it. Implementations are free to choose tools, algorithms, libraries, and internal architecture.

Specs are numbered. Each SPEC-CHUNK-NNN identifier within a spec corresponds to a single testable behavior; the acceptance checklist tracks every ID.

Repo layout

fancychunk/
├── README.md                     # This file
├── LICENSE                       # MIT
├── pyproject.toml                # Package metadata + runtime deps
├── docs/specs/
│   ├── README.md                 # Glossary and reading order
│   ├── 00-pipeline-overview.md   # End-to-end data flow
│   ├── 01-sentence-splitting.md  # Stage 1
│   ├── 02-chunklet-grouping.md   # Stage 2
│   ├── 03-semantic-chunking.md   # Stage 3
│   ├── 04-late-chunking.md       # Optional embed strategy
│   ├── 05-contextual-headings.md # Optional helper
│   ├── contracts/                # Public API signatures
│   ├── test-vectors/             # Concrete input → expected output pairs
│   └── acceptance/               # Pass/fail criteria
├── src/fancychunk/               # Implementation
│   ├── sentences.py              # Stage 1 — sentence splitting
│   ├── chunklets.py              # Stage 2 — chunklet grouping
│   ├── chunks.py                 # Stage 3 — semantic chunking
│   ├── late_chunking.py          # Late chunking (optional helper)
│   ├── headings.py               # Heading paths (optional helper)
│   ├── _markdown.py              # Markdown-it heading + opener helpers
│   ├── _segmenter.py             # SaT default + punctuation fallback
│   ├── _constants.py             # Named constants from the specs
│   └── errors.py                 # Exception hierarchy
└── tests/                        # pytest suite covering every TV-*

Production readiness

This is an alpha release (0.1.x). The behaviour the public API documents is fully spec-conforming and locked in by the 88-test suite. What is and isn't promised beyond that:

  • API stability. Names and defaults are unlikely to change but aren't yet contract-stable. SemVer applies once the version hits 1.0.0.
  • SaT model on first run. The default segmenter downloads ~408 MB of weights from Hugging Face on first call. For production deployment, either pre-warm the cache during image build or pass segmenter=punctuation_segmenter if you can tolerate its quality.
  • Thread safety. The module-level SaT singleton and markdown-it parser are safe for concurrent reads, and the library itself does no synchronisation. Concurrent calls from multiple threads work because every operation is read-only. Concurrent first-time SaT loading from multiple threads may load the model twice (harmless but wasteful); pre-warm if this matters.
  • No global state writes. No caches, no temp files, no logging side effects. The library does not call logging.basicConfig and attaches no handlers.
  • Determinism. Cross-run reproducibility is guaranteed for every stage given a deterministic segmenter / embedder (see SPEC-CHUNK-901 in the specs).
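
If pre-warming at startup is not an option, a host application can guard the first SaT load itself. The sketch below is a generic double-checked lock around an expensive loader. LoadOnce is a hypothetical helper, not part of fancychunk, and the loader you pass in is whatever triggers the first load in your application (for example, a throwaway split_sentences call).

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LoadOnce(Generic[T]):
    """Run an expensive loader at most once, even under concurrent first calls."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded = False
        self._value: Optional[T] = None

    def get(self) -> T:
        if not self._loaded:  # fast path: no locking once loaded
            with self._lock:
                if not self._loaded:  # re-check while holding the lock
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]
```

Worker threads call get() before their first fancychunk call; only one of them pays for the load and the rest wait on the lock.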

CI runs pyright in strict mode and pytest against Python 3.12 and 3.13 on every push. Tests use the lightweight punctuation segmenter so CI doesn't need the SaT weights; set FANCYCHUNK_TEST_USE_SAT=1 to exercise the real model.

Releases

Tags of the form vX.Y.Z on main trigger the release workflow (.github/workflows/release.yml), which builds sdist + wheel and publishes to PyPI via Trusted Publishing — no API tokens stored anywhere. The first publish has to be done manually (to reserve the project name on PyPI); subsequent releases ride the workflow.

To cut a release:

# 1. Update the version (single source of truth is pyproject.toml).
# 2. Update CHANGELOG.md.
# 3. Tag and push:
git tag -a v0.1.1 -m "Describe the release"
git push origin v0.1.1

The release workflow takes over from there.

Acknowledgments

The three-stage pipeline (sentence → chunklet → chunk), the late-chunking strategy, and the contextual-headings helper are inspired by the chunking pipeline in raglite. Specific techniques cite their originators inline in the specs: the SaT segmenter, Greg Kamradt's "5 levels" taxonomy, Arora et al.'s discourse-vector technique, the Weaviate / Jina late-chunking work, and Dan Stites's contextual-headings post.

Download files

Source Distribution

fancychunk-0.1.0.tar.gz (87.5 kB)

Uploaded Source

Built Distribution

fancychunk-0.1.0-py3-none-any.whl (30.7 kB)

Uploaded Python 3

File details

Details for the file fancychunk-0.1.0.tar.gz.

File metadata

  • Download URL: fancychunk-0.1.0.tar.gz
  • Upload date:
  • Size: 87.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fancychunk-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 8dd62827627cb15463697dab22d19d3f77c8a03060a20ee20f8124ee50ef236a |
| MD5 | 2e25dc81c5a6412b4f306deccb674596 |
| BLAKE2b-256 | 0713a18f6d259c963a83e69e74eb0aaac08720e4102f2281297850d0e865a146 |

Provenance

The following attestation bundles were made for fancychunk-0.1.0.tar.gz:

Publisher: release.yml on emerose/fancychunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fancychunk-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: fancychunk-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 30.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fancychunk-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | c468823640f6944b69ba7c1eef85782d0d78ddf4ed65708dba16b63828c6dffb |
| MD5 | 1920c0184086c27b3c49aebda22b45a7 |
| BLAKE2b-256 | 1ed7e04aebb23d09fa599213369b63b34c46ebf847a0e9ce56d64f777761bee4 |

Provenance

The following attestation bundles were made for fancychunk-0.1.0-py3-none-any.whl:

Publisher: release.yml on emerose/fancychunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
