Hierarchical RAG: FAISS retrieval + subtree expansion over a parent/child chunk hierarchy.
HypRAG
Hierarchical retrieval for structured documents. FAISS cosine k-NN seeds a result set, then subtree_expand walks the chunk parent/child graph to pull every parent, sibling, and child of each hit. The flat encoder finds the right region of the document; the hierarchy walker fills in the surrounding context.
+154% Recall@5 on GDPR · +120% on CPython stdlib · <1 ms/query CPU · no GPU
What it does
Most RAG pipelines treat documents as flat bags of chunks. When the right answer lives in paragraph 15(1)(c) of a regulation, flat retrieval returns the chunk for 15(1)(c) — but loses the article header, the surrounding paragraphs, and the chapter context that make the answer interpretable.
HypRAG keeps that structure. Each chunk carries a node_path (e.g. gdpr.ch3.art15.p1.pa) and a depth tag. After the FAISS lookup, subtree_expand returns the parent, the siblings, and the children of every hit. Same recall as flat FAISS at the seed step, but a much higher hit rate after expansion — the answer arrives with its scaffolding intact.
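To make the path convention concrete, here is a minimal sketch of a chunk record. It is not HypRAG's actual class: the `Chunk` name, the `text` field, and the rules "parent = path minus its last dotted segment" and "depth = number of dots" are assumptions inferred from the `gdpr.ch3.art15.p1.pa` example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; only node_path and depth are named in the README."""

    node_path: str
    text: str

    @property
    def parent_path(self) -> Optional[str]:
        # Assumption: the parent is the path with its last dotted segment removed.
        head, sep, _ = self.node_path.rpartition(".")
        return head if sep else None

    @property
    def depth(self) -> int:
        # Assumption: the root has depth 0 and each dotted segment adds one level.
        return self.node_path.count(".")


chunk = Chunk("gdpr.ch3.art15.p1.pa", "example text")
print(chunk.parent_path)  # gdpr.ch3.art15.p1
print(chunk.depth)        # 4
```

With paths shaped like this, a parent, child, or sibling lookup reduces to comparing strings, which is what keeps the expansion step cheap.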
Benchmarks
GDPR (EU 2016/679) — 821 chunks, 20 hand-labeled queries, BGE-base, K=5
| Condition | Recall@5 | Precision | Latency |
|---|---|---|---|
| FAISS (flat) | 0.286 | 0.590 | 0.1 ms |
| FAISS + subtree_expand | 0.727 | 0.441 | 0.6 ms |
| Hybrid (BM25+FAISS) + expand | 0.683 | 0.408 | 1.8 ms |
Expansion lift: +154.2 %. BM25 hybrid hurts on regulatory text (uniform vocabulary).
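The lift figure is a plain relative improvement over the flat FAISS baseline. A short check using the Recall@5 values from the table above (the helper name `relative_lift` is made up for this illustration):

```python
def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Recall@5 values copied from the GDPR table: flat FAISS vs. FAISS + subtree_expand.
print(f"{relative_lift(0.286, 0.727):+.1f} %")  # +154.2 %
```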
Chunker generalisation — same GDPR corpus, different chunkers
| Chunker | Chunks | FAISS | FAISS+expand | Lift |
|---|---|---|---|---|
| GDPRChunker (domain-specific) | 821 | 0.221 | 0.549 | +148 % |
| HTMLChunker (generic, no domain) | 896 | 0.256 | 0.564 | +120 % |
The expansion lift is algorithm-driven, not chunker-biased. A source-agnostic chunker that only uses HTML heading levels and <ol>/<ul> nesting reaches essentially the same post-expansion recall as a hand-crafted GDPR parser.
CPython stdlib — 16k chunks, K=5
| Condition | Recall@5 |
|---|---|
| FAISS (flat) | 0.092 |
| FAISS + subtree_expand | 0.203 |
Expansion lift: +120 %.
Reproducing the GDPR numbers:
python -m benchmarks.run_legal_comparison --html-path gdpr_corpus.html
python -m benchmarks.compare_chunkers --html-path gdpr_corpus.html
Install
pip install hyprag # core (faiss, sentence-transformers, numpy)
pip install hyprag[legal] # adds beautifulsoup4 for HTML chunkers
pip install hyprag[api] # adds fastapi + uvicorn for the HTTP server
pip install hyprag[dev] # pytest, ruff, mypy
Quick start — Python codebase
from hyprag.retriever import HypragRetriever
r = HypragRetriever() # default encoder: BAAI/bge-base-en-v1.5
r.index_path("./myproject") # AST-based chunker, module → class → method
for chunk in r.query("how does the parser handle escape sequences?", k=5):
    print(chunk.depth, chunk.node_path, chunk.start_line)
Quick start — GDPR (or any hierarchical HTML)
from hyprag.chunkers import GDPRChunker # domain-specific, +154% lift
from hyprag.chunkers import HTMLChunker # generic, +120% lift, zero domain knowledge
from hyprag.retriever import HypragRetriever
# Fetch the corpus once (per-article from gdpr-info.eu; takes ~5 min)
chunks = GDPRChunker().load() # or .load(html_path=Path("..."))
r = HypragRetriever()
r.index_chunks(chunks)
for chunk in r.query("when must a data breach be reported?", k=5):
    print(chunk.depth, chunk.node_path)
    print(chunk.text[:200])
HTMLChunker works on any HTML document — Wikipedia, documentation, statutes — using only <h1>–<h6> levels and <ol>/<ul>/<li> nesting as hierarchy signals.
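To illustrate how heading levels and list nesting alone can yield a hierarchy, here is a toy extractor built on the standard library's `html.parser`. It is not HTMLChunker's implementation; the `OutlineParser` class and its depth rule (open headings plus list nesting) are assumptions made for this sketch, and text that follows a nested list inside the same `<li>` is ignored.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
LISTS = {"ol", "ul"}


class OutlineParser(HTMLParser):
    """Toy outline builder: collects (depth, text) for headings and list items."""

    def __init__(self) -> None:
        super().__init__()
        self.heading_levels = []  # stack of currently open heading levels
        self.list_depth = 0       # current <ol>/<ul> nesting depth
        self.nodes = []           # collected (depth, text) pairs
        self._depth = None        # depth of the node being captured, if any
        self._buffer = []         # text fragments of the node being captured

    def _flush(self) -> None:
        # Emit the node captured so far (if it has text) and stop capturing.
        text = " ".join("".join(self._buffer).split())
        if self._depth is not None and text:
            self.nodes.append((self._depth, text))
        self._depth, self._buffer = None, []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._flush()
            level = int(tag[1])
            # A new heading closes every open heading at the same or a deeper level.
            while self.heading_levels and self.heading_levels[-1] >= level:
                self.heading_levels.pop()
            self.heading_levels.append(level)
            self._depth = len(self.heading_levels) - 1
        elif tag in LISTS:
            self._flush()
            self.list_depth += 1
        elif tag == "li":
            self._flush()
            self._depth = len(self.heading_levels) + self.list_depth - 1

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag == "li":
            self._flush()
        elif tag in LISTS:
            self._flush()
            self.list_depth = max(0, self.list_depth - 1)

    def handle_data(self, data):
        if self._depth is not None:
            self._buffer.append(data)


html = """
<h1>Regulation</h1>
<h2>Article 15</h2>
<ol><li>First paragraph<ol><li>Point a</li><li>Point b</li></ol></li></ol>
<h2>Article 16</h2>
"""
parser = OutlineParser()
parser.feed(html)
parser.close()
for depth, text in parser.nodes:
    print("  " * depth + text)
```

A real chunker would also need stable node paths and the body text under each heading; the sketch only shows that depth can be recovered without any domain knowledge.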
HTTP API
uvicorn api.main:app --reload
POST /index/gdpr, POST /index/codebase, POST /index/texts build indexes. POST /search queries them. Each request is authenticated via X-API-Key; tiering (free / paid) caps vectors, queries/day, and TTL — see api/auth.py.
Every IndexResponse returns depth_distribution and warnings, so callers can verify the chunker recovered the hierarchy as expected without inspecting internals.
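A request to the /search endpoint might be built as below. Only the route and the X-API-Key header come from the description above; the JSON field names (`query`, `k`), the key value, and the helper `build_search_request` are assumptions, so check the server's request models before relying on them. The address is uvicorn's default bind.

```python
import json
import urllib.request


def build_search_request(base_url: str, api_key: str, query: str, k: int = 5):
    """Build (but do not send) a POST /search request.

    The payload field names are hypothetical; only the route and the
    X-API-Key header are taken from the README.
    """
    payload = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


req = build_search_request(
    "http://127.0.0.1:8000",  # uvicorn's default host and port
    "demo-key",               # placeholder API key
    "when must a data breach be reported?",
)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```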
Subtree expansion
subtree_expand(results, corpus) is the core algorithm. Given any list of seed chunks and the full corpus, it returns the seeds plus every chunk that is:
- a parent: chunk.node_path matches a seed's parent_path
- a child: chunk.parent_path matches a seed's node_path
- a sibling: same parent_path as a seed
All three are toggleable; max_expand caps the result size. The walk is O(N) per query, where N is the number of chunks in the corpus, so it is cheap enough to run on every search.
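The three rules above can be sketched as a single pass over the corpus. This is an illustration, not the library's code: the `Chunk` dataclass, the keyword names `parents`, `children`, and `siblings`, and the reading of max_expand as a cap on the total result (with seeds always kept) are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Chunk:
    """Hypothetical minimal chunk: just the two paths the rules compare."""

    node_path: str
    parent_path: Optional[str]


def subtree_expand(
    results: Iterable[Chunk],
    corpus: Iterable[Chunk],
    parents: bool = True,
    children: bool = True,
    siblings: bool = True,
    max_expand: Optional[int] = None,
) -> List[Chunk]:
    seeds = list(results)
    seed_paths = {s.node_path for s in seeds}
    seed_parents = {s.parent_path for s in seeds if s.parent_path is not None}

    expanded = list(seeds)  # assumption: seeds are always kept, in their original order
    for chunk in corpus:    # one O(N) pass; every test is a set lookup
        if max_expand is not None and len(expanded) >= max_expand:
            break
        if chunk.node_path in seed_paths:
            continue  # already present as a seed
        is_parent = parents and chunk.node_path in seed_parents
        is_child = children and chunk.parent_path in seed_paths
        is_sibling = siblings and chunk.parent_path in seed_parents
        if is_parent or is_child or is_sibling:
            expanded.append(chunk)
    return expanded


corpus = [
    Chunk("gdpr.ch3.art15", "gdpr.ch3"),
    Chunk("gdpr.ch3.art15.p1", "gdpr.ch3.art15"),
    Chunk("gdpr.ch3.art15.p1.pa", "gdpr.ch3.art15.p1"),
    Chunk("gdpr.ch3.art15.p1.pb", "gdpr.ch3.art15.p1"),
    Chunk("gdpr.ch3.art15.p2", "gdpr.ch3.art15"),
    Chunk("gdpr.ch3.art16", "gdpr.ch3"),
]
seed = corpus[1]  # pretend FAISS returned paragraph 1 of Article 15
for chunk in subtree_expand([seed], corpus):
    print(chunk.node_path)
```

In this toy corpus the seed comes back with its article header, both of its points, and the neighbouring paragraph, while Article 16 is left out because it shares no parent with the seed.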
What's deliberately not here
- No geometry. Earlier versions used a Poincaré-ball backend for hyperbolic embeddings. Four experiments across two corpora produced numerically identical results to FAISS at up to 257× the latency. Removed in v0.5.0; the git history preserves the code.
- No LLM summaries. Tested; recall regressed. Not coming back.
- No cross-encoder reranking by default. bge-reranker-base hurt on code (Recall@5 0.349 → 0.080). Plug your own in if you have a domain-tuned one.
- No BM25 by default. Hurts on legal text (uniform vocabulary). Opt-in per-request via HybridRetriever for code corpora where identifiers carry signal.
Status
v0.5.x. The algorithm is stable. The API is stable. The chunkers are tested against real corpora. What's missing is a hosted demo and packaging polish.
License
MIT.