Hierarchical document chunking for RAG. Structure in, structure out.
Project description
poma-primecut-nano
Chunk markdown documents without losing structure.
Most RAG chunkers split text by tokens, sentences, or paragraphs — then your retrieval returns orphaned fragments with no context. A paragraph about "rate limiting" arrives without its parent heading ("API Authentication"), so the LLM hallucinates the rest.
poma-primecut-nano chunks by heading hierarchy and returns self-contained retrieval units that carry their full ancestor path. Your search results read like compressed versions of the original document — not a pile of disconnected snippets.
pip install poma-primecut-nano
The problem: fragment soup
Every RAG pipeline chunks documents. Most do it wrong:
Standard chunker output for "rate limiting" query:
❌ "Requests are limited to 100/minute per API key.
Exceeding this limit returns HTTP 429."
(What section? What API? What authentication model? Lost.)
poma-primecut-nano output for the same query:
✅ API Reference
Authentication
All endpoints require Bearer token authentication.
[...]
Rate Limiting
Requests are limited to 100/minute per API key.
Exceeding this limit returns HTTP 429.
The second result carries context. The first requires the LLM to guess.
How it works
Three steps. You own the middle one (search).
Step 1: Chunk your document
from poma_primecut_nano import chunk
chunks = chunk(markdown_text)
One call. Each chunk has: chunk_index, content, depth, parent_chunk_index.
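To make that chunk shape concrete, here is a minimal toy chunker — a hypothetical `sketch_chunk`, not the library's implementation — that derives `depth` and `parent_chunk_index` from markdown heading levels using a stack of open headings:

```python
import re

def sketch_chunk(markdown_text):
    """Toy heading-based chunker illustrating the chunk record shape.
    A simplified sketch only; the real chunk() handles lists, code
    blocks, and tables as well."""
    chunks = []
    stack = []  # (depth, chunk_index) of currently open headings
    for block in [b.strip() for b in markdown_text.split("\n\n") if b.strip()]:
        m = re.match(r"^(#+)\s+(.*)", block)
        if m:
            depth = len(m.group(1))
            # close headings at the same or deeper level
            while stack and stack[-1][0] >= depth:
                stack.pop()
            parent = stack[-1][1] if stack else None
            idx = len(chunks)
            chunks.append({"chunk_index": idx, "content": m.group(2),
                           "depth": depth, "parent_chunk_index": parent})
            stack.append((depth, idx))
        else:
            # body text hangs off the nearest open heading
            parent = stack[-1][1] if stack else None
            depth = (stack[-1][0] + 1) if stack else 1
            chunks.append({"chunk_index": len(chunks), "content": block,
                           "depth": depth, "parent_chunk_index": parent})
    return chunks

doc = "# API Reference\n\n## Rate Limiting\n\nRequests are limited to 100/minute."
for c in sketch_chunk(doc):
    print(c)
```

The parent links are what the later steps rely on: every chunk can be traced back to the document root.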
Step 2: Build chunksets (retrieval units)
from poma_primecut_nano import chunks_to_chunksets
chunksets = chunks_to_chunksets(chunks)
# Each chunkset is a root-to-leaf path through the heading tree.
# Use chunkset["to_embed"] for your embedding model.
# Store chunkset["chunk_ids"] alongside the vector.
A chunkset for a deeply nested paragraph includes its parent heading, grandparent section, and document title — all in one retrieval unit. When your vector DB returns this chunkset, the LLM gets complete context.
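The root-to-leaf idea can be sketched in a few lines. This hypothetical `sketch_chunksets` mirrors the simple (non-optimized) strategy described above — one chunkset per leaf, ancestors prepended — but it is an illustration, not the library's code:

```python
def sketch_chunksets(chunks):
    """Toy chunkset builder: one chunkset per leaf chunk, with the
    contents of every ancestor prepended (root first)."""
    by_index = {c["chunk_index"]: c for c in chunks}
    parents = {c["parent_chunk_index"] for c in chunks
               if c["parent_chunk_index"] is not None}
    chunksets = []
    for c in chunks:
        if c["chunk_index"] in parents:
            continue  # only leaves seed a chunkset
        path, node = [], c
        while node is not None:
            path.append(node)
            p = node["parent_chunk_index"]
            node = by_index[p] if p is not None else None
        path.reverse()  # root-to-leaf order
        chunksets.append({
            "chunkset_index": len(chunksets),
            "chunk_ids": [n["chunk_index"] for n in path],
            "contents": [n["content"] for n in path],
            "to_embed": "\n".join(n["content"] for n in path),
        })
    return chunksets

chunks = [
    {"chunk_index": 0, "content": "API Reference", "depth": 1, "parent_chunk_index": None},
    {"chunk_index": 1, "content": "Rate Limiting", "depth": 2, "parent_chunk_index": 0},
    {"chunk_index": 2, "content": "Requests are limited to 100/minute.", "depth": 3, "parent_chunk_index": 1},
]
print(sketch_chunksets(chunks)[0]["to_embed"])
# → API Reference
#   Rate Limiting
#   Requests are limited to 100/minute.
```

Embedding the joined `to_embed` text means the vector for a leaf paragraph also encodes its headings, which is exactly why a "rate limiting" query can land on a chunkset that still knows it lives under "API Reference".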
Step 3: Assemble results from search hits
After your vector search returns matching chunk IDs:
from poma_primecut_nano import expand_chunk_ids, assemble_context
# Expand hits to include ancestor headings
expanded = expand_chunk_ids(chunks, hit_chunk_ids)
# Assemble into readable text with [...] gap markers
context = assemble_context(chunks, expanded)
The output is a coherent cheatsheet — multiple hits from the same document merged with [...] gap markers between non-contiguous sections.
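The expand-then-assemble step can be approximated like this. `sketch_expand` and `sketch_assemble` are hypothetical stand-ins for `expand_chunk_ids` and `assemble_context`, shown only to illustrate the mechanics (ancestor walk, then document-order emission with gap markers):

```python
def sketch_expand(chunks, hit_ids):
    # Walk parent links upward from each hit and collect every ancestor.
    by_index = {c["chunk_index"]: c for c in chunks}
    expanded = set()
    for cid in hit_ids:
        while cid is not None:
            expanded.add(cid)
            cid = by_index[cid]["parent_chunk_index"]
    return sorted(expanded)

def sketch_assemble(chunks, expanded_ids):
    # Emit kept chunks in document order; insert a [...] marker
    # wherever one or more chunks were skipped between kept ones.
    keep = set(expanded_ids)
    out, skipped = [], False
    for c in chunks:  # chunks assumed to be in document order
        if c["chunk_index"] in keep:
            if skipped and out:
                out.append("[...]")
            out.append(c["content"])
            skipped = False
        else:
            skipped = True
    return "\n".join(out)

chunks = [
    {"chunk_index": 0, "content": "API Reference", "parent_chunk_index": None},
    {"chunk_index": 1, "content": "Authentication", "parent_chunk_index": 0},
    {"chunk_index": 2, "content": "All endpoints require Bearer tokens.", "parent_chunk_index": 1},
    {"chunk_index": 3, "content": "Rate Limiting", "parent_chunk_index": 0},
    {"chunk_index": 4, "content": "Requests are limited to 100/minute.", "parent_chunk_index": 3},
]
print(sketch_assemble(chunks, sketch_expand(chunks, [4])))
# → API Reference
#   [...]
#   Rate Limiting
#   Requests are limited to 100/minute.
```

A single hit on the rate-limit paragraph comes back with its section heading and document title attached, and the `[...]` marker signals that unrelated material was elided.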
Works with everything
poma-primecut-nano is a pure chunking + assembly library. It has no opinion on how you search, embed, or store — bring whatever you already use:
| Vector DB | Embedding | Framework |
|---|---|---|
| FAISS | OpenAI | LangChain |
| Chroma | Cohere | LlamaIndex |
| Qdrant | Voyage | Haystack |
| Pinecone | BGE-M3 | Vercel AI SDK |
| Weaviate | Jina | custom |
| TurboPuffer | model2vec | none needed |
| pgvector | any | any |
These are examples — any vector DB, any embedding model, any framework (or none) will work. The chunker produces plain Python dicts; the assembler takes plain lists of IDs.
Integration pattern:
from poma_primecut_nano import chunk, chunks_to_chunksets, expand_chunk_ids, assemble_context
# === Indexing ===
chunks = chunk(document_text)
chunksets = chunks_to_chunksets(chunks)
for cs in chunksets:
    vector = your_embedding_model.embed(cs["to_embed"])
    your_vector_db.upsert(id=cs["chunkset_index"], vector=vector, metadata={
        "chunk_ids": cs["chunk_ids"],
        "file": "docs/api.md",
    })
# === Retrieval ===
query_vector = your_embedding_model.embed(user_query)
results = your_vector_db.search(query_vector, top_k=10)
# Collect all chunk IDs from matching chunksets
hit_ids = [cid for r in results for cid in r.metadata["chunk_ids"]]
# Expand + assemble (chunks must be stored/cached per file)
expanded = expand_chunk_ids(chunks, hit_ids)
context = assemble_context(chunks, expanded)
# Feed to LLM
answer = llm.complete(f"Based on this context:\n{context}\n\nQuestion: {user_query}")
API reference
Chunking
| Function | Input | Output |
|---|---|---|
| chunk(text) | Raw markdown string | List of {chunk_index, content, depth, parent_chunk_index} |
Chunkset building
| Function | Input | Output |
|---|---|---|
| chunks_to_chunksets(chunks) | Chunks with parent links | List of {chunkset_index, chunk_ids, contents, to_embed} |
| chunks_to_chunksets_optimized(chunks) | Same | Fewer, more distinct chunksets (collapse/merge algorithm) |
| build_ancestor_maps(chunks) | Same | (parent_map, ancestors_map) dicts |
Result assembly
| Function | Input | Output |
|---|---|---|
| expand_chunk_ids(chunks, hit_ids) | All chunks + search hit IDs | Expanded IDs including ancestors |
| expand_chunk_ids_deep(chunks, hit_ids) | Same | Smarter expansion: deepest-unique + subtrees |
| assemble_context(chunks, expanded_ids) | All chunks + expanded IDs | Readable text with [...] gap markers |
Utility
| Function | Input | Output |
|---|---|---|
| normalize_for_embedding(text) | Raw text | Clean text for embedding (HTML strip, NFKD, whitespace) |
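A rough approximation of that cleanup, using only the standard library — the library's exact rules may differ, and `sketch_normalize` is a hypothetical name:

```python
import re
import unicodedata

def sketch_normalize(text):
    """Approximate embedding-text cleanup: strip HTML tags, apply NFKD
    Unicode normalization, and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)        # crude HTML tag strip
    text = unicodedata.normalize("NFKD", text)  # compatibility decomposition
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(sketch_normalize("<p>Limit:  \u2460 request/sec</p>"))
```

NFKD matters for embeddings because compatibility characters (circled digits, ligatures, full-width forms) decompose to the plain ASCII forms the model has actually seen during training.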
Optimized chunksets
chunks_to_chunksets() is simple and predictable: one chunkset per leaf chunk, with ancestors prepended.
chunks_to_chunksets_optimized() produces fewer, more distinct chunksets using a collapse/merge/sibling-fill algorithm:
- Collapse contiguous same-depth chunks into blocks within a token budget
- Merge adjacent blocks upward when they fit together
- Fill preceding sibling blocks for richer context
- Deduplicate by removing subset chunksets
This reduces embedding costs (fewer vectors) while improving retrieval quality (each chunkset is more self-contained). Ported from POMA's production pipeline.
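To illustrate just the first pass (collapse), here is a sketch with a naive whitespace token count — the library's real algorithm, budgets, and the merge/sibling-fill passes differ, and `sketch_collapse` is a hypothetical name:

```python
def sketch_collapse(chunks, token_budget=50):
    """Collapse runs of contiguous same-depth chunks into blocks,
    as long as the block stays within a (naive) token budget."""
    blocks, current, current_tokens = [], [], 0
    for c in chunks:  # chunks assumed to be in document order
        n = len(c["content"].split())  # crude whitespace token count
        same_depth = bool(current) and current[-1]["depth"] == c["depth"]
        contiguous = bool(current) and c["chunk_index"] == current[-1]["chunk_index"] + 1
        if same_depth and contiguous and current_tokens + n <= token_budget:
            current.append(c)
            current_tokens += n
        else:
            if current:
                blocks.append(current)
            current, current_tokens = [c], n
    if current:
        blocks.append(current)
    return blocks

paras = [
    {"chunk_index": 0, "content": "First paragraph here.", "depth": 3},
    {"chunk_index": 1, "content": "Second paragraph here.", "depth": 3},
    {"chunk_index": 2, "content": "Third paragraph here.", "depth": 3},
    {"chunk_index": 3, "content": "Next Section", "depth": 2},
]
blocks = sketch_collapse(paras)
print([len(b) for b in blocks])  # → [3, 1]
```

Three sibling paragraphs become one block (one embedding instead of three), while the depth change at the section heading starts a new block.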
What this is (and isn't)
poma-primecut-nano is the chunking and assembly layer extracted from POMA's document processing platform. It works best on structured markdown — headings, lists, code blocks, tables.
It does not include: search, vector storage, embeddings, or LLM integration. Those are your choices.
Garbage in, garbage out
Chunking quality depends on input quality. If your markdown has flat walls of text with no headings, there's no hierarchy to preserve. For best results:
- Use well-structured markdown with heading levels (#, ##, ###)
- Keep lists and code blocks properly indented
- If you're processing raw documents (PDFs, Word, scans), you need an ingestion step first
POMA PrimeCut converts any document into clean, hierarchically structured markdown — the ideal input for this library. Available as cloud API (pay-as-you-go from €0.003/page) and on-prem.
Related projects
- poma-memory — Persistent context for AI coding agents. Uses poma-primecut-nano under the hood, adds BM25 + semantic search, SQLite persistence, incremental indexing, and an MCP server for Claude Code.
- POMA PrimeCut — Production document processing: OCR, ML-powered structural analysis, format conversion. Cloud and on-prem.
License
MIT
File details
Details for the file poma_primecut_nano-0.1.0.tar.gz.
File metadata
- Download URL: poma_primecut_nano-0.1.0.tar.gz
- Upload date:
- Size: 19.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a9738ad45acbc180ab0305efd5b30f62468f4efea01385240918778a243acb48 |
| MD5 | 5a65a0b0bc99ec590c712f9cc5c553a1 |
| BLAKE2b-256 | 7b4055df2507b7b8931e214eedbf5d57d87d3fcbc25bea5eae24b603cc9dbb89 |
Provenance
The following attestation bundles were made for poma_primecut_nano-0.1.0.tar.gz:
Publisher: pypi-publish.yml on poma-ai/poma-primecut-nano
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: poma_primecut_nano-0.1.0.tar.gz
- Subject digest: a9738ad45acbc180ab0305efd5b30f62468f4efea01385240918778a243acb48
- Sigstore transparency entry: 1202896692
- Permalink: poma-ai/poma-primecut-nano@b9de536822b904856b2e8502d1e7dddaf5869724
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/poma-ai
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@b9de536822b904856b2e8502d1e7dddaf5869724
- Trigger Event: push
File details
Details for the file poma_primecut_nano-0.1.0-py3-none-any.whl.
File metadata
- Download URL: poma_primecut_nano-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 32708f15504e916a462bd70e7d096d586d6bea2a49acec64de33f1864d356546 |
| MD5 | dd27452c1ae92b53fd870cfb6b27d6a2 |
| BLAKE2b-256 | 818c315d633620b9412b4823f15c4c97fd3d6eb4c1710dede8b5b13ad3054a68 |
Provenance
The following attestation bundles were made for poma_primecut_nano-0.1.0-py3-none-any.whl:
Publisher: pypi-publish.yml on poma-ai/poma-primecut-nano
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: poma_primecut_nano-0.1.0-py3-none-any.whl
- Subject digest: 32708f15504e916a462bd70e7d096d586d6bea2a49acec64de33f1864d356546
- Sigstore transparency entry: 1202896695
- Permalink: poma-ai/poma-primecut-nano@b9de536822b904856b2e8502d1e7dddaf5869724
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/poma-ai
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@b9de536822b904856b2e8502d1e7dddaf5869724
- Trigger Event: push