
Hierarchical RAG: FAISS retrieval + subtree expansion over a parent/child chunk hierarchy.


HypRAG

Hierarchical retrieval for structured documents. FAISS cosine k-NN seeds a result set, then subtree_expand walks the chunk parent/child graph to pull every parent, sibling, and child of each hit. The flat encoder finds the right region of the document; the hierarchy walker fills in the surrounding context.

+154% Recall@5 on GDPR  ·  +120% on CPython stdlib  ·  <1 ms/query CPU  ·  no GPU

What it does

Most RAG pipelines treat documents as flat bags of chunks. When the right answer lives in paragraph 15(1)(c) of a regulation, flat retrieval returns the chunk for 15(1)(c) — but loses the article header, the surrounding paragraphs, and the chapter context that make the answer interpretable.

HypRAG keeps that structure. Each chunk carries a node_path (e.g. gdpr.ch3.art15.p1.pa) and a depth tag. After the FAISS lookup, subtree_expand returns the parent, the siblings, and the children of every hit. Same recall as flat FAISS at the seed step, but a much higher hit rate after expansion — the answer arrives with its scaffolding intact.
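The node_path scheme is easy to picture with plain string operations. A minimal sketch (the helper names here are illustrative, not hyprag's actual API; only the dotted node_path format and the depth tag come from the description above):

```python
# Sketch: deriving hierarchy relations from dotted node paths.
# Helper names are hypothetical; only the path format is from the README.

def parent_path(node_path: str) -> str:
    """The parent's path is the node path with the last segment dropped."""
    return node_path.rsplit(".", 1)[0]

def depth(node_path: str) -> int:
    """Depth = number of dot-separated segments."""
    return node_path.count(".") + 1

seed = "gdpr.ch3.art15.p1"
print(parent_path(seed))   # gdpr.ch3.art15
print(depth(seed))         # 4
```

Two chunks are siblings exactly when `parent_path` returns the same string for both, which is what makes the expansion step a set-membership test rather than a graph traversal.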

Benchmarks

GDPR (EU 2016/679) — 821 chunks, 20 hand-labeled queries, BGE-base, K=5

Condition                       Recall@5   Precision   Latency
FAISS (flat)                    0.286      0.590       0.1 ms
FAISS + subtree_expand          0.727      0.441       0.6 ms
Hybrid (BM25+FAISS) + expand    0.683      0.408       1.8 ms

Expansion lift: +154.2 %. BM25 hybrid hurts on regulatory text (uniform vocabulary).

Chunker generalisation — same GDPR corpus, different chunkers

Chunker                             Chunks   FAISS   FAISS+expand   Lift
GDPRChunker (domain-specific)       821      0.221   0.549          +148 %
HTMLChunker (generic, no domain)    896      0.256   0.564          +120 %

The expansion lift is algorithm-driven, not chunker-biased. A source-agnostic chunker that only uses HTML heading levels and <ol>/<ul> nesting reaches essentially the same post-expansion recall as a hand-crafted GDPR parser.

CPython stdlib — 16k chunks, K=5

Condition                 Recall@5
FAISS (flat)              0.092
FAISS + subtree_expand    0.203

Expansion lift: +120 %.

Reproducing the GDPR numbers:

python -m benchmarks.run_legal_comparison --html-path gdpr_corpus.html
python -m benchmarks.compare_chunkers      --html-path gdpr_corpus.html

Install

pip install hyprag                       # core (faiss, sentence-transformers, numpy)
pip install hyprag[legal]                # adds beautifulsoup4 for HTML chunkers
pip install hyprag[api]                  # adds fastapi + uvicorn for the HTTP server
pip install hyprag[dev]                  # pytest, ruff, mypy

Quick start — Python codebase

from hyprag.retriever import HypragRetriever

r = HypragRetriever()              # default encoder: BAAI/bge-base-en-v1.5
r.index_path("./myproject")        # AST-based chunker, module → class → method

for chunk in r.query("how does the parser handle escape sequences?", k=5):
    print(chunk.depth, chunk.node_path, chunk.start_line)

Quick start — GDPR (or any hierarchical HTML)

from hyprag.chunkers import GDPRChunker     # domain-specific, +154% lift
from hyprag.chunkers import HTMLChunker     # generic, +120% lift, zero domain knowledge
from hyprag.retriever import HypragRetriever

# Fetch the corpus once (per-article from gdpr-info.eu; takes ~5 min)
chunks = GDPRChunker().load()              # or .load(html_path=Path("..."))

r = HypragRetriever()
r.index_chunks(chunks)

for chunk in r.query("when must a data breach be reported?", k=5):
    print(chunk.depth, chunk.node_path)
    print(chunk.text[:200])

HTMLChunker works on any HTML document — Wikipedia, documentation, statutes — using only <h1> through <h6> heading levels and <ol>/<ul>/<li> nesting as hierarchy signals.
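The heading-only approach can be sketched with the stdlib parser: keep a stack of open headings, pop anything at the same or deeper level when a new heading arrives, and the stack spells out the node path. This is a sketch in the spirit of HTMLChunker, not hyprag's implementation (which also handles <ol>/<ul> nesting):

```python
# Sketch: recovering a hierarchy from <h1>-<h6> levels alone.
# Illustrative only; not HTMLChunker's actual code.
from html.parser import HTMLParser

class HeadingTree(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # (level, path segment) pairs currently open
        self.paths = []        # one node_path per heading seen
        self._level = None     # level of the heading we are inside, if any

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])

    def handle_data(self, data):
        if self._level is None:
            return
        # close headings at the same or deeper level before opening this one
        while self.stack and self.stack[-1][0] >= self._level:
            self.stack.pop()
        seg = data.strip().lower().replace(" ", "_")
        self.stack.append((self._level, seg))
        self.paths.append(".".join(s for _, s in self.stack))
        self._level = None

html = "<h1>GDPR</h1><h2>Chapter 3</h2><h3>Article 15</h3><h2>Chapter 4</h2>"
t = HeadingTree()
t.feed(html)
print(t.paths)
# ['gdpr', 'gdpr.chapter_3', 'gdpr.chapter_3.article_15', 'gdpr.chapter_4']
```

The pop-then-push step is what makes a later <h2> a sibling of an earlier <h2> rather than a child of the intervening <h3>.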

HTTP API

uvicorn api.main:app --reload

POST /index/gdpr, POST /index/codebase, POST /index/texts build indexes. POST /search queries them. Each request is authenticated via X-API-Key; tiering (free / paid) caps vectors, queries/day, and TTL — see api/auth.py.

Every IndexResponse returns depth_distribution and warnings, so callers can verify the chunker recovered the hierarchy as expected without inspecting internals.
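The shape of a search call, per the auth scheme above: the X-API-Key header and the endpoint paths come from this README, but the body field names are assumptions — check api/auth.py and the FastAPI-generated /docs for the actual schema.

```python
# Sketch of a POST /search request. Only the header name and endpoint
# paths are from the README; the payload fields are assumed.
import json

headers = {
    "X-API-Key": "your-key-here",
    "Content-Type": "application/json",
}
payload = {"query": "when must a data breach be reported?", "k": 5}

# With the stdlib, against a local uvicorn instance:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/search",
#     data=json.dumps(payload).encode(),
#     headers=headers,
# )
# resp = urllib.request.urlopen(req)

print(json.dumps(payload))
```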

Subtree expansion

subtree_expand(results, corpus) is the core algorithm. Given any list of seed chunks and the full corpus, it returns the seeds plus every chunk that is:

  • a parent: its node_path matches a seed's parent_path
  • a child: its parent_path matches a seed's node_path
  • a sibling: its parent_path matches a seed's parent_path

All three are toggleable; max_expand caps the result size. The walk is O(N) per query — cheap enough to run on every search.
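The walk above can be sketched as a single pass with set-membership tests. A minimal reimplementation from the description, assuming chunks carry node_path and parent_path fields as stated (this mirrors the README's parameter names but is not hyprag's code):

```python
# Sketch of subtree_expand's O(N) walk, reconstructed from the text above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    node_path: str
    parent_path: str

def subtree_expand(results, corpus, parents=True, children=True,
                   siblings=True, max_expand=None):
    seeds = list(results)
    seed_paths = {c.node_path for c in seeds}
    seed_parents = {c.parent_path for c in seeds}
    out = list(seeds)
    for c in corpus:                       # single pass over the corpus
        if c.node_path in seed_paths:
            continue                       # already a seed
        hit = ((parents and c.node_path in seed_parents) or
               (children and c.parent_path in seed_paths) or
               (siblings and c.parent_path in seed_parents))
        if hit:
            out.append(c)
            if max_expand is not None and len(out) - len(seeds) >= max_expand:
                break
    return out

corpus = [
    Chunk("gdpr.ch3.art15",      "gdpr.ch3"),         # parent of the seed
    Chunk("gdpr.ch3.art15.p1",   "gdpr.ch3.art15"),   # the seed itself
    Chunk("gdpr.ch3.art15.p2",   "gdpr.ch3.art15"),   # sibling
    Chunk("gdpr.ch3.art15.p1.a", "gdpr.ch3.art15.p1"),# child
    Chunk("gdpr.ch3.art16",      "gdpr.ch3"),         # unrelated article
]
expanded = subtree_expand([corpus[1]], corpus)
print([c.node_path for c in expanded])
# seed, parent, sibling, and child come back; art16 does not
```

Because each relation reduces to "is this path in a seed set?", no tree structure needs to be materialised: the expansion is linear in corpus size regardless of hierarchy depth.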

What's deliberately not here

  • No geometry. Earlier versions used a Poincaré-ball backend for hyperbolic embeddings. Four experiments across two corpora produced numerically identical results to FAISS at up to 257× the latency. Removed in v0.5.0; the git history preserves the code.
  • No LLM summaries. Tested; recall regressed. Not coming back.
  • No cross-encoder reranking by default. bge-reranker-base hurt on code (Recall@5 0.349 → 0.080). Plug your own in if you have a domain-tuned one.
  • No BM25 by default. Hurts on legal text (uniform vocabulary). Opt-in per-request via HybridRetriever for code corpora where identifiers carry signal.

Status

v0.5.x. The algorithm is stable. The API is stable. The chunkers are tested against real corpora. What's missing is a hosted demo and packaging polish.

License

MIT.
