
Hierarchical RAG: FAISS retrieval + subtree expansion over a parent/child chunk hierarchy.


HypRAG

Hierarchical retrieval for structured documents. FAISS cosine k-NN seeds a result set, then subtree_expand walks the chunk parent/child graph to pull every parent, sibling, and child of each hit. The flat encoder finds the right region of the document; the hierarchy walker fills in the surrounding context.

+154% Recall@5 on GDPR  ·  +120% on CPython stdlib  ·  <1 ms/query CPU  ·  no GPU

What it does

Most RAG pipelines treat documents as flat bags of chunks. When the right answer lives in paragraph 15(1)(c) of a regulation, flat retrieval returns the chunk for 15(1)(c) — but loses the article header, the surrounding paragraphs, and the chapter context that make the answer interpretable.

HypRAG keeps that structure. Each chunk carries a node_path (e.g. gdpr.ch3.art15.p1.pa) and a depth tag. After the FAISS lookup, subtree_expand returns the parent, the siblings, and the children of every hit. Same recall as flat FAISS at the seed step, but a much higher hit rate after expansion — the answer arrives with its scaffolding intact.
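The node_path scheme is easy to picture with plain string operations. A minimal sketch (the helper names here are illustrative, not hyprag's actual API; only the dotted node_path format and the depth tag come from the description above):

```python
# Sketch: deriving hierarchy relations from dotted node paths.
# Helper names are hypothetical; only the path format is from the README.

def parent_path(node_path: str) -> str:
    """The parent's path is the node path with the last segment dropped."""
    return node_path.rsplit(".", 1)[0]

def depth(node_path: str) -> int:
    """Depth = number of dot-separated segments."""
    return node_path.count(".") + 1

seed = "gdpr.ch3.art15.p1"
print(parent_path(seed))   # gdpr.ch3.art15
print(depth(seed))         # 4
```

Two chunks are siblings exactly when `parent_path` returns the same string for both, which is what makes the expansion step a set-membership test rather than a graph traversal.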

Benchmarks

GDPR (EU 2016/679) — 821 chunks, 20 hand-labeled queries, BGE-base, K=5

Condition                       Recall@5   Precision   Latency
FAISS (flat)                    0.286      0.590       0.1 ms
FAISS + subtree_expand          0.727      0.441       0.6 ms
Hybrid (BM25+FAISS) + expand    0.683      0.408       1.8 ms

Expansion lift: +154.2 %. BM25 hybrid hurts on regulatory text (uniform vocabulary).

Chunker generalisation — same GDPR corpus, different chunkers

Chunker                             Chunks   FAISS   FAISS+expand   Lift
GDPRChunker (domain-specific)       821      0.221   0.549          +148 %
HTMLChunker (generic, no domain)    896      0.256   0.564          +120 %

The expansion lift is algorithm-driven, not chunker-biased. A source-agnostic chunker that only uses HTML heading levels and <ol>/<ul> nesting reaches essentially the same post-expansion recall as a hand-crafted GDPR parser.

CPython stdlib — 16k chunks, K=5

Condition                 Recall@5
FAISS (flat)              0.092
FAISS + subtree_expand    0.203

Expansion lift: +120 %.

Reproducing the GDPR numbers:

python -m benchmarks.run_legal_comparison --html-path gdpr_corpus.html
python -m benchmarks.compare_chunkers      --html-path gdpr_corpus.html

Install

pip install hyprag                       # core (faiss, sentence-transformers, numpy)
pip install hyprag[legal]                # adds beautifulsoup4 for HTML chunkers
pip install hyprag[api]                  # adds fastapi + uvicorn for the HTTP server
pip install hyprag[dev]                  # pytest, ruff, mypy

Quick start — Python codebase

from hyprag.retriever import HypragRetriever

r = HypragRetriever()              # default encoder: BAAI/bge-base-en-v1.5
r.index_path("./myproject")        # AST-based chunker, module → class → method

for chunk in r.query("how does the parser handle escape sequences?", k=5):
    print(chunk.depth, chunk.node_path, chunk.start_line)

Quick start — GDPR (or any hierarchical HTML)

from hyprag.chunkers import GDPRChunker     # domain-specific, +154% lift
from hyprag.chunkers import HTMLChunker     # generic, +120% lift, zero domain knowledge
from hyprag.retriever import HypragRetriever

# Fetch the corpus once (per-article from gdpr-info.eu; takes ~5 min)
chunks = GDPRChunker().load()              # or .load(html_path=Path("..."))

r = HypragRetriever()
r.index_chunks(chunks)

for chunk in r.query("when must a data breach be reported?", k=5):
    print(chunk.depth, chunk.node_path)
    print(chunk.text[:200])

HTMLChunker works on any HTML document — Wikipedia, documentation, statutes — using only <h1> through <h6> heading levels and <ol>/<ul>/<li> nesting as hierarchy signals.
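The heading-only approach can be sketched with the stdlib parser: keep a stack of open headings, pop anything at the same or deeper level when a new heading arrives, and the stack spells out the node path. This is a sketch in the spirit of HTMLChunker, not hyprag's implementation (which also handles <ol>/<ul> nesting):

```python
# Sketch: recovering a hierarchy from <h1>-<h6> levels alone.
# Illustrative only; not HTMLChunker's actual code.
from html.parser import HTMLParser

class HeadingTree(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # (level, path segment) pairs currently open
        self.paths = []        # one node_path per heading seen
        self._level = None     # level of the heading we are inside, if any

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])

    def handle_data(self, data):
        if self._level is None:
            return
        # close headings at the same or deeper level before opening this one
        while self.stack and self.stack[-1][0] >= self._level:
            self.stack.pop()
        seg = data.strip().lower().replace(" ", "_")
        self.stack.append((self._level, seg))
        self.paths.append(".".join(s for _, s in self.stack))
        self._level = None

html = "<h1>GDPR</h1><h2>Chapter 3</h2><h3>Article 15</h3><h2>Chapter 4</h2>"
t = HeadingTree()
t.feed(html)
print(t.paths)
# ['gdpr', 'gdpr.chapter_3', 'gdpr.chapter_3.article_15', 'gdpr.chapter_4']
```

The pop-then-push step is what makes a later <h2> a sibling of an earlier <h2> rather than a child of the intervening <h3>.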

HTTP API

uvicorn api.main:app --reload

POST /index/gdpr, POST /index/codebase, POST /index/texts build indexes. POST /search queries them. Each request is authenticated via X-API-Key; tiering (free / paid) caps vectors, queries/day, and TTL — see api/auth.py.

Every IndexResponse returns depth_distribution and warnings, so callers can verify the chunker recovered the hierarchy as expected without inspecting internals.
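The shape of a search call, per the auth scheme above: the X-API-Key header and the endpoint paths come from this README, but the body field names are assumptions — check api/auth.py and the FastAPI-generated /docs for the actual schema.

```python
# Sketch of a POST /search request. Only the header name and endpoint
# paths are from the README; the payload fields are assumed.
import json

headers = {
    "X-API-Key": "your-key-here",
    "Content-Type": "application/json",
}
payload = {"query": "when must a data breach be reported?", "k": 5}

# With the stdlib, against a local uvicorn instance:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/search",
#     data=json.dumps(payload).encode(),
#     headers=headers,
# )
# resp = urllib.request.urlopen(req)

print(json.dumps(payload))
```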

Subtree expansion

subtree_expand(results, corpus) is the core algorithm. Given any list of seed chunks and the full corpus, it returns the seeds plus every chunk that is:

  • a parent: its node_path matches a seed's parent_path
  • a child: its parent_path matches a seed's node_path
  • a sibling: its parent_path matches a seed's parent_path

All three are toggleable; max_expand caps the result size. The walk is O(N) per query — cheap enough to run on every search.
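The walk above can be sketched as a single pass with set-membership tests. A minimal reimplementation from the description, assuming chunks carry node_path and parent_path fields as stated (this mirrors the README's parameter names but is not hyprag's code):

```python
# Sketch of subtree_expand's O(N) walk, reconstructed from the text above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    node_path: str
    parent_path: str

def subtree_expand(results, corpus, parents=True, children=True,
                   siblings=True, max_expand=None):
    seeds = list(results)
    seed_paths = {c.node_path for c in seeds}
    seed_parents = {c.parent_path for c in seeds}
    out = list(seeds)
    for c in corpus:                       # single pass over the corpus
        if c.node_path in seed_paths:
            continue                       # already a seed
        hit = ((parents and c.node_path in seed_parents) or
               (children and c.parent_path in seed_paths) or
               (siblings and c.parent_path in seed_parents))
        if hit:
            out.append(c)
            if max_expand is not None and len(out) - len(seeds) >= max_expand:
                break
    return out

corpus = [
    Chunk("gdpr.ch3.art15",      "gdpr.ch3"),         # parent of the seed
    Chunk("gdpr.ch3.art15.p1",   "gdpr.ch3.art15"),   # the seed itself
    Chunk("gdpr.ch3.art15.p2",   "gdpr.ch3.art15"),   # sibling
    Chunk("gdpr.ch3.art15.p1.a", "gdpr.ch3.art15.p1"),# child
    Chunk("gdpr.ch3.art16",      "gdpr.ch3"),         # unrelated article
]
expanded = subtree_expand([corpus[1]], corpus)
print([c.node_path for c in expanded])
# seed, parent, sibling, and child come back; art16 does not
```

Because each relation reduces to "is this path in a seed set?", no tree structure needs to be materialised: the expansion is linear in corpus size regardless of hierarchy depth.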

What's deliberately not here

  • No geometry. Earlier versions used a Poincaré-ball backend for hyperbolic embeddings. Four experiments across two corpora produced numerically identical results to FAISS at up to 257× the latency. Removed in v0.5.0; the git history preserves the code.
  • No LLM summaries. Tested; recall regressed. Not coming back.
  • No cross-encoder reranking by default. bge-reranker-base hurt on code (Recall@5 0.349 → 0.080). Plug your own in if you have a domain-tuned one.
  • No BM25 by default. Hurts on legal text (uniform vocabulary). Opt-in per-request via HybridRetriever for code corpora where identifiers carry signal.

Status

v0.5.x. The algorithm is stable. The API is stable. The chunkers are tested against real corpora. What's missing is a hosted demo and packaging polish.

License

MIT.
