Text chunking for retrieval-augmented generation.

These details have not been verified by PyPI

Project description

fancychunk

A small, focused library for splitting text documents into semantically coherent chunks suitable for retrieval-augmented generation.

Status: initial implementation. The full specification lives in docs/specs/; the public API in docs/specs/contracts/public-api.md; the test vectors in docs/specs/test-vectors/. The implementation lives in src/fancychunk/ and covers the three required pipeline stages plus the two optional helpers (embed_with_late_chunking, heading_paths).

Quick start

import numpy as np
from fancychunk import (
    split_sentences,
    split_chunklets,
    split_chunks,
    heading_paths,
)

with open("README.md", encoding="utf-8") as f:
    doc = f.read()
sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)

# Caller supplies the embedding matrix; embedding is not part of
# fancychunk's core pipeline. Any deterministic embedder works.
# `my_embedder` is a placeholder for your own function.
embeddings = my_embedder(chunklets)
chunks, chunk_embeddings = split_chunks(chunklets, embeddings, max_size=2048)
paths = heading_paths(chunks)

Late chunking — bring your own embedder

embed_with_late_chunking is an optional stage that improves retrieval quality on documents with anaphoric references ("it", "this method", "the algorithm") by giving each sentence an embedding computed in the context of its neighbours. It gains about 4 MTEB points on retrieval benchmarks over naive per-chunklet embedding, at the price of ~30% more compute.

The library doesn't ship any embedding model. It owns the algorithm — segment planning with backward preamble, mean-pool per sentence, preamble discard, optional L2 normalize — and delegates everything tokenizer-specific to a caller-supplied SegmentEmbedder. The contract is two methods and one attribute:

class SegmentEmbedder(Protocol):
    n_ctx: int
    def count_tokens(self, sentences: list[str]) -> list[int]: ...
    def embed_segment(
        self, sentences: list[str]
    ) -> tuple[NDArray, list[int]]: ...
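
To make "segment planning with backward preamble" concrete, here is a minimal sketch of one way such a planner could work. It is an illustration under stated assumptions, not fancychunk's implementation: the name plan_segments, its parameters, and the greedy packing rule are all hypothetical. Each segment is a triple (preamble_start, body_start, end). Sentences in [preamble_start, body_start) are replayed only as context and their vectors would be discarded; sentences in [body_start, end) keep their embeddings.

```python
def plan_segments(token_counts, budget, preamble_budget):
    """Greedily pack sentences into context windows (hypothetical sketch).

    token_counts: tokens per sentence, e.g. as returned by count_tokens().
    budget: maximum tokens per segment (at most the embedder's n_ctx).
    preamble_budget: maximum tokens of already-embedded sentences to
        replay at the start of each segment purely as context.
    """
    segments = []
    n = len(token_counts)
    body_start = 0
    while body_start < n:
        # Walk backwards from the body to collect the preamble.
        preamble_start = body_start
        used = 0
        while (
            preamble_start > 0
            and used + token_counts[preamble_start - 1] <= preamble_budget
        ):
            preamble_start -= 1
            used += token_counts[preamble_start]

        # Fill the rest of the window with sentences not yet embedded.
        end = body_start
        while end < n and used + token_counts[end] <= budget:
            used += token_counts[end]
            end += 1

        if end == body_start:
            # The next sentence does not fit after the preamble: embed it
            # alone and leave any overflow handling to the embedder.
            preamble_start = body_start
            end = body_start + 1

        segments.append((preamble_start, body_start, end))
        body_start = end
    return segments


print(plan_segments([10, 10, 10, 10, 10], budget=30, preamble_budget=10))
```

With five 10-token sentences, a 30-token budget, and a 10-token preamble budget, this yields [(0, 0, 3), (2, 3, 5)]: the second segment replays sentence 2 as context for sentences 3 and 4.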

Adapters for three deployment shapes ship as runnable examples:

File	Backend	Best for
`examples/embedders/qwen3_mlx.py`	MLX + Qwen3-Embedding	Apple Silicon; offline / batch
`examples/embedders/huggingface_offsets.py`	HuggingFace transformers	Any platform; recommended default
`examples/embedders/remote_http.py`	HTTP client + local tokenizer	When the GPU lives on another machine

See examples/embedders/README.md for guidance on picking an alignment method (offset-based vs. sentinel-token), handling special tokens, and writing your own adapter — typically ~20 lines of glue.
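
As a rough illustration of what offset-based alignment usually means, the sketch below assigns each token to the sentence whose character span contains the token's start offset, mean-pools per sentence, and optionally L2-normalizes. It is a simplified assumption, not the code in the shipped adapters: pool_sentences and its parameters are hypothetical names, and it uses plain Python lists instead of arrays to stay dependency-free.

```python
import math


def pool_sentences(token_vectors, token_offsets, sentence_spans, normalize=True):
    """Mean-pool token vectors into one vector per sentence (sketch).

    token_offsets: (start, end) character offsets, one pair per token.
    sentence_spans: (start, end) character spans, one pair per sentence.
    Tokens with an empty span (start == end) are skipped; Hugging Face
    fast tokenizers typically report special tokens that way.
    """
    dim = len(token_vectors[0]) if token_vectors else 0
    sums = [[0.0] * dim for _ in sentence_spans]
    counts = [0] * len(sentence_spans)

    for vector, (start, end) in zip(token_vectors, token_offsets):
        if start == end:
            continue
        for index, (span_start, span_end) in enumerate(sentence_spans):
            if span_start <= start < span_end:
                for d, value in enumerate(vector):
                    sums[index][d] += value
                counts[index] += 1
                break

    pooled = []
    for total, count in zip(sums, counts):
        # A sentence with no tokens keeps its all-zero vector.
        mean = [value / count for value in total] if count else total
        if normalize:
            norm = math.sqrt(sum(value * value for value in mean))
            if norm > 0.0:
                mean = [value / norm for value in mean]
        pooled.append(mean)
    return pooled
```

A sentinel-token approach would instead insert marker tokens between sentences and split on them; which one suits a given tokenizer is the subject of the examples README.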

Wire embed_with_late_chunking into the pipeline between stages 2 and 3:

from examples.embedders.huggingface_offsets import HFOffsetEmbedder
from fancychunk import (
    embed_with_late_chunking,
    split_chunklets,
    split_chunks,
    split_sentences,
)

embedder = HFOffsetEmbedder("BAAI/bge-m3")

sentences = split_sentences(doc, max_len=2048)
chunklets = split_chunklets(sentences, max_size=2048)

# Per-sentence embeddings with surrounding context.
sentence_embeddings = embed_with_late_chunking(sentences, embedder)

# Aggregate to per-chunklet (mean-pool over the sentences inside
# each chunklet). This is the caller's responsibility:
# aggregate_to_chunklets is your own helper, not a fancychunk import.
chunklet_embeddings = aggregate_to_chunklets(
    sentence_embeddings, sentences, chunklets
)

chunks, _ = split_chunks(chunklets, chunklet_embeddings, max_size=2048)
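
The aggregate_to_chunklets helper above is left to the caller. One possible sketch follows, under an assumption the README does not confirm: every chunklet is the exact concatenation of a contiguous run of sentences, in order. If that does not hold for your data, the function raises instead of guessing. It works on lists of lists as well as on array rows and returns plain lists.

```python
def aggregate_to_chunklets(sentence_embeddings, sentences, chunklets):
    """Mean-pool per-sentence vectors into one vector per chunklet.

    Assumption (hypothetical): each chunklet equals the concatenation of
    consecutive sentences, so string lengths are enough to align them.
    """
    pooled = []
    i = 0
    for chunklet in chunklets:
        start, covered = i, 0
        # Consume sentences until their combined length matches the chunklet.
        while i < len(sentences) and covered < len(chunklet):
            covered += len(sentences[i])
            i += 1
        if i == start or covered != len(chunklet):
            raise ValueError("chunklet does not align with sentence boundaries")
        rows = sentence_embeddings[start:i]
        # Column-wise mean over the sentences inside this chunklet.
        pooled.append([sum(column) / len(rows) for column in zip(*rows)])
    if i != len(sentences):
        raise ValueError("sentences left over after the last chunklet")
    return pooled
```

If the next stage expects an array, wrap the result, for example with numpy.asarray(...), before passing it on.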

Observability

Every public stage emits an OpenTelemetry span with attributes that describe input/output sizes and the option choices that affected the outcome. The library depends only on opentelemetry-api; spans are zero-cost no-ops until the host application configures an SDK and exporter.

Span names are fancychunk.<function> (e.g. fancychunk.split_sentences). Attribute keys use the fancychunk.<key> namespace:

Stage	Attribute keys
`split_sentences`	`document.length`, `min_len`, `max_len`, `segmenter`, `sentences.count`, `short_circuit`
`split_chunklets`	`sentences.count`, `max_size`, `custom_costs`, `chunklets.count`, `short_circuit`
`split_chunks`	`chunklets.count`, `max_size`, `embedding.dim`, `chunks.count`, `short_circuit`
`embed_with_late_chunking`	`sentences.count`, `embedder`, `embedder.n_ctx`, `budget`, `preamble_budget`, `preamble_fraction`, `normalize`, `segments.count`, `embedding.dim`
`heading_paths`	`chunks.count`, `paths.non_empty`

To see them locally, install opentelemetry-sdk and configure a console exporter:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# subsequent fancychunk calls now emit spans to stdout

The library also exposes a standard logging.Logger at fancychunk (currently silent by default; future versions may add INFO-level breadcrumbs at stage transitions).
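
Because the library attaches no handlers of its own, the host application decides where its log records go. A minimal sketch using only the standard library; the handler and format string are arbitrary choices, and only the logger name fancychunk comes from this README:

```python
import logging

# Route records from the "fancychunk" logger to stderr at INFO level.
logger = logging.getLogger("fancychunk")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Since the logger is currently silent, this produces no output today; it only ensures that any records emitted by future versions are not dropped.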

What it does

Given a Markdown document, fancychunk partitions it into chunks where each chunk:

Respects sentence and paragraph boundaries.
Targets a configurable maximum size.
Begins at a structurally meaningful point (heading, paragraph start).
Groups together semantically related material, splitting where the topic shifts.
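
To illustrate the last point, "splitting where the topic shifts", here is a deliberately simple stand-in technique: recursively cut at the pair of neighbouring chunklets with the lowest cosine similarity until every chunk fits within max_size. This is not fancychunk's stage 3 algorithm (see the specs for that); split_at_topic_shifts is a hypothetical name, sizes are measured in characters, and chunks are assumed to be plain concatenations of chunklets.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_at_topic_shifts(chunklets, embeddings, max_size):
    """Recursively cut at the least-similar neighbouring pair (sketch)."""
    if not chunklets:
        return []

    # similarity[k] compares chunklet k with chunklet k + 1.
    similarity = [
        cosine(embeddings[k], embeddings[k + 1])
        for k in range(len(chunklets) - 1)
    ]

    def split(lo, hi):
        size = sum(len(c) for c in chunklets[lo:hi])
        if size <= max_size or hi - lo == 1:
            return [(lo, hi)]
        # Cut before index `cut`, where adjacent similarity is lowest.
        cut = min(range(lo + 1, hi), key=lambda k: similarity[k - 1])
        return split(lo, cut) + split(cut, hi)

    return ["".join(chunklets[lo:hi]) for lo, hi in split(0, len(chunklets))]
```

For four 4-character chunklets whose embeddings form two clear clusters and max_size=8, the cut lands between the clusters and two 8-character chunks come back.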

Optionally:

When paired with a token-level embedding model, fancychunk can produce per-sentence embeddings that incorporate surrounding-document context ("late chunking"). The caller aggregates them to per-chunklet level (typically by mean-pool over the sentences in each chunklet) before passing them to the semantic-chunking stage.
For each chunk, fancychunk can compute the Markdown heading path that was in scope at the chunk's start, suitable for prepending as embedding context.
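
The idea of a heading path can be sketched in a few lines. The version below is a simplified illustration, not the library's heading_paths: per the repo layout the real helper builds on markdown-it, whereas this hypothetical heading_path_at uses a regular expression, handles only ATX headings (lines starting with #), and skips fenced code blocks. It treats a heading that starts exactly at the given offset as already in scope, which is a choice made here rather than something the README specifies.

```python
import re

ATX_HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$")


def heading_path_at(markdown, offset):
    """Return the heading titles in scope at a character offset (sketch)."""
    stack = []  # (level, title) pairs, outermost first
    position = 0
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if position > offset:
            break
        text = line.rstrip("\r\n")
        if text.lstrip().startswith(("```", "~~~")):
            # Toggle on fence lines so '#' inside code is not a heading.
            in_fence = not in_fence
        elif not in_fence:
            match = ATX_HEADING.match(text)
            if match:
                level = len(match.group(1))
                # A new heading closes every heading at its level or deeper.
                while stack and stack[-1][0] >= level:
                    stack.pop()
                stack.append((level, match.group(2)))
        position += len(line)
    return [title for _, title in stack]
```

Joining the returned titles (for example with " > ") gives a string that can be prepended to a chunk as embedding context.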

What it does not do

It does not parse PDFs, Word documents, or HTML. Input is Markdown.
It does not embed text in the core three-stage pipeline. Embedding is the caller's responsibility; fancychunk consumes pre-computed chunklet embeddings for the semantic-chunking stage. (The optional embed_with_late_chunking helper does invoke an embedder, but it is opt-in and requires the caller to supply one.)
It does not store, index, or retrieve. Output is a list of strings.
It does not generate. There is no LLM in the loop.

How to read the specs

The specs in docs/specs/ are behavioral, not prescriptive about implementation. A spec line says what a function must do, not how to do it. Implementations are free to choose tools, algorithms, libraries, and internal architecture.

Specs are numbered. SPEC-CHUNK-NNN identifiers within each spec correspond to a single testable behavior; the acceptance checklist tracks every ID.

Repo layout

fancychunk/
├── README.md                     # This file
├── LICENSE                       # MIT
├── pyproject.toml                # Package metadata + runtime deps
├── docs/specs/
│   ├── README.md                 # Glossary and reading order
│   ├── 00-pipeline-overview.md   # End-to-end data flow
│   ├── 01-sentence-splitting.md  # Stage 1
│   ├── 02-chunklet-grouping.md   # Stage 2
│   ├── 03-semantic-chunking.md   # Stage 3
│   ├── 04-late-chunking.md       # Optional embed strategy
│   ├── 05-contextual-headings.md # Optional helper
│   ├── contracts/                # Public API signatures
│   ├── test-vectors/             # Concrete input → expected output pairs
│   └── acceptance/               # Pass/fail criteria
├── src/fancychunk/               # Implementation
│   ├── sentences.py              # Stage 1 — sentence splitting
│   ├── chunklets.py              # Stage 2 — chunklet grouping
│   ├── chunks.py                 # Stage 3 — semantic chunking
│   ├── late_chunking.py          # Stage 4 — late chunking (optional)
│   ├── headings.py               # Stage 5 — heading paths (optional)
│   ├── _markdown.py              # Markdown-it heading + opener helpers
│   ├── _segmenter.py             # SaT default + punctuation fallback
│   ├── _constants.py             # Named constants from the specs
│   └── errors.py                 # Exception hierarchy
└── tests/                        # pytest suite covering every TV-*

Production readiness

This is an alpha release (0.1.x). The documented public API behaviour is fully spec-conforming and locked in by the 88-test suite. Caveats and guarantees to be aware of:

API stability. Names and defaults are unlikely to change but aren't yet contract-stable. SemVer applies once the version hits 1.0.0.
SaT model on first run. The default segmenter downloads ~408 MB of weights from Hugging Face on first call. For production deployment, either pre-warm the cache during image build or pass segmenter=punctuation_segmenter if you can tolerate its quality.
Thread safety. The module-level SaT singleton and markdown-it parser are safe for concurrent reads; the library itself does no synchronisation. Concurrent calls from multiple threads work because every operation is read-only. Concurrent first-time SaT loading from multiple threads may load the model twice (harmless but wasteful), so pre-warm if this matters.
No global state writes. No caches, no temp files, no logging side effects. The library does not call logging.basicConfig and attaches no handlers.
Determinism. Cross-run reproducibility is guaranteed for every stage given a deterministic segmenter / embedder (see SPEC-CHUNK-901 in the specs).

CI runs pyright in strict mode and pytest against Python 3.12 and 3.13 on every push. Tests use the lightweight punctuation segmenter so CI doesn't need the SaT weights; set FANCYCHUNK_TEST_USE_SAT=1 to exercise the real model.

Releases

Tags of the form vX.Y.Z on main trigger the release workflow (.github/workflows/release.yml), which builds sdist + wheel and publishes to PyPI via Trusted Publishing — no API tokens stored anywhere. The first publish has to be done manually (to reserve the project name on PyPI); subsequent releases ride the workflow.

To cut a release:

# 1. Update the version (single source of truth is pyproject.toml).
# 2. Update CHANGELOG.md.
# 3. Tag and push:
git tag -a v0.1.1 -m "Describe the release"
git push origin v0.1.1

The release workflow takes over from there.

Acknowledgments

The three-stage pipeline (sentence → chunklet → chunk), the late-chunking strategy, and the contextual-headings helper are inspired by the chunking pipeline in raglite. Specific techniques cite their originators inline in the specs: the SaT segmenter, Greg Kamradt's "5 levels" taxonomy, Arora et al.'s discourse-vector technique, the Weaviate / Jina late-chunking work, and Dan Stites's contextual-headings post.

Project details

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
Topic
- Text Processing :: Linguistic

Release history

Version	Release date
0.8.0	May 29, 2026
0.7.0	May 29, 2026
0.6.2	May 29, 2026
0.6.1	May 29, 2026
0.6.0	May 28, 2026
0.5.1	May 28, 2026
0.5.0	May 27, 2026
0.4.0	May 27, 2026
0.3.0	May 27, 2026
0.2.0	May 27, 2026
0.1.1	May 26, 2026
0.1.0 (this version)	May 25, 2026

Download files

Source Distribution

fancychunk-0.1.0.tar.gz (87.5 kB)

Uploaded May 25, 2026 Source

Built Distribution

fancychunk-0.1.0-py3-none-any.whl (30.7 kB)

Uploaded May 25, 2026 Python 3

File details

Details for the file fancychunk-0.1.0.tar.gz.

File metadata

Download URL: fancychunk-0.1.0.tar.gz
Upload date: May 25, 2026
Size: 87.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fancychunk-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8dd62827627cb15463697dab22d19d3f77c8a03060a20ee20f8124ee50ef236a`
MD5	`2e25dc81c5a6412b4f306deccb674596`
BLAKE2b-256	`0713a18f6d259c963a83e69e74eb0aaac08720e4102f2281297850d0e865a146`

Provenance

The following attestation bundles were made for fancychunk-0.1.0.tar.gz:

Publisher: release.yml on emerose/fancychunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fancychunk-0.1.0.tar.gz
- Subject digest: 8dd62827627cb15463697dab22d19d3f77c8a03060a20ee20f8124ee50ef236a
- Sigstore transparency entry: 1631373587
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: emerose/fancychunk@ddb092bd37608008355fe73606a9b79bbe42a3f3
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/emerose
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ddb092bd37608008355fe73606a9b79bbe42a3f3
- Trigger Event: push

File details

Details for the file fancychunk-0.1.0-py3-none-any.whl.

File metadata

Download URL: fancychunk-0.1.0-py3-none-any.whl
Upload date: May 25, 2026
Size: 30.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fancychunk-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c468823640f6944b69ba7c1eef85782d0d78ddf4ed65708dba16b63828c6dffb`
MD5	`1920c0184086c27b3c49aebda22b45a7`
BLAKE2b-256	`1ed7e04aebb23d09fa599213369b63b34c46ebf847a0e9ce56d64f777761bee4`

Provenance

The following attestation bundles were made for fancychunk-0.1.0-py3-none-any.whl:

Publisher: release.yml on emerose/fancychunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fancychunk-0.1.0-py3-none-any.whl
- Subject digest: c468823640f6944b69ba7c1eef85782d0d78ddf4ed65708dba16b63828c6dffb
- Sigstore transparency entry: 1631373612
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: emerose/fancychunk@ddb092bd37608008355fe73606a9b79bbe42a3f3
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/emerose
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ddb092bd37608008355fe73606a9b79bbe42a3f3
- Trigger Event: push
