Skip to main content

Hybrid structure-aware retrieval — BM25 + embeddings + document graph expansion. Runs fully offline, no API key required.

Project description

RAGNav

PyPI Python License: MIT Code style: black PyPI Downloads Open in Colab

Production-grade hybrid retrieval — BM25 plus dense embeddings plus optional structure-aware expansion. With sentence-transformers (default), you can index and query without any API key or separate LLM client for the common path below.

Result Detail
SQuAD R@3 0.956 (500 questions, hybrid RRF, zero paid API calls)
CUAD span S@3 0.071 (clause-friendly metric; see benchmarks)

Frontier-LLM “build an index per document” stacks optimize a different cost and reproducibility tradeoff; RAGNav targets pip-install hybrid search with offline, reproducible benchmarks.

pip install ragnav[embeddings]
from ragnav import RAGNavIndex, RAGNavRetriever
from ragnav.ingest.markdown import ingest_markdown_string

doc, blocks = ingest_markdown_string(
    "Paris is the capital of France.",
    name="demo.md",
)
index = RAGNavIndex.build(
    documents=[doc],
    blocks=blocks,
    use_sentence_transformers=True,
    vector_model="all-MiniLM-L6-v2",
    embed_batch_size=32,
)
retriever = RAGNavRetriever(index=index)
result = retriever.retrieve(
    "What is the capital of France?",
    top_k=5,
    expand_structure=False,
    expand_graph=False,
)
print(result.blocks[0].text)
print(result.confidence)

RAGNav architecture

Regenerate this figure: python3 scripts/gen_architecture.py (needs Pillow, e.g. pip install ragnav[dev]).

Long PDFs and paper mode

For papers and long PDFs, use navigation-first routing (pages → evidence → optional link_to refs). That workflow is not required for the markdown snippet above; see Quickstart (Python): papers and the CLI section below.

The problem (why long-document QA fails)

LLMs have finite context windows and degrade on long inputs (“lost in the middle” effects). In long PDFs (papers, reports, manuals), naive retrieval often returns plausible text but misses the right place.

Why classic vector + chunk RAG fails (in PDFs)

  1. Intent mismatch: the query expresses intent; the most similar text isn’t always the most relevant.
  2. Hard chunking breaks meaning: chunks cut across sections/tables/captions, losing provenance and coherence.
  3. Similarity ≠ relevance: many sections look semantically similar (especially in technical documents).
  4. Cross-references: “see Figure 3 / Table 2 / Appendix A / Section 4.1” rarely matches the referenced content.
  5. No navigation: users don’t want “top-k chunks”; they want where the answer lives + traceable evidence.

RAGNav’s approach (navigation-first retrieval loop)

RAGNav is built around a simple loop:

  1. Ingest (paper mode): PDF → blocks with anchors={"page": N} + edges (parent, next, link_to).
  2. Route: query → rank likely pages.
  3. Retrieve: search within routed pages (hybrid BM25 + embeddings).
  4. Expand: add coherence (section headers + adjacent “next” blocks).
  5. Follow refs (optional): traverse link_to edges (Figure/Table/Appendix/Section).
  6. Answer: generate from retrieved evidence (optionally with inline citations).

The “index” (what the model navigates)

RAGNav normalizes everything into a small graph:

Block {
  block_id: "pdf:paper.pdf#b19"
  doc_id: "pdf:paper.pdf"
  text: "..."
  anchors: { page: 5, line_start: 12, line_end: 20 }
}

Edge {
  type: "parent" | "next" | "link_to" | ...
  src: block_id
  dst: block_id
}

This graph is the in-process “index” the retriever navigates: pages, headings, cross-references, and provenance.

Vector RAG vs RAGNav (paper-mode)

Problem Vector + chunks RAGNav (navigation-first)
Find “where” in a paper Not explicit Routes pages + sections
Cross-references (“see Appendix”) Usually missed Follows link_to edges
Provenance Weak (chunk ids) Page + block ids + anchors
Coherence Fragmented Deterministic expansion (parent/next)
Evaluation Ad-hoc Built-in offline suites + scorecard

Use cases

  • Research papers (PDF): page routing + cross-ref following.
  • Reports / manuals / specs: structure-aware retrieval (coherent evidence, not fragments).
  • Grounded answers: inline citations [[block_id]] per sentence (optional).
  • Security baseline: drop prompt-injection blocks and redact obvious secrets (optional).
  • GraphRAG: entity graph + multi-hop traversal with provenance (optional).

Acknowledgements & prior art

RAGNav is an independent project. It builds on long-standing information retrieval practice (lexical BM25, hybrid fusion with dense retrieval) and open embedding models — document structure as a first-class signal has roots in IR, digital libraries, and structured PDF tooling, not a single product.

  • PyMuPDF: PDF text extraction is powered by pymupdf (optional dependency).
  • BM25 / classic IR: Lexical retrieval uses BM25-style scoring.
  • Mistral: Optional LLM/embedding client for chat, routing, and API-backed embed fallback.

RAGNav is not affiliated with the vendors above. If you notice missing or incorrect attribution, please open an issue.


Install

From PyPI (recommended):

pip install ragnav[embeddings]

Optional extras: ragnav[pdf], ragnav[messy] (HTML), ragnav[reranking], ragnav[mistral], ragnav[all] (see pyproject.toml; mistral is omitted from all).

Development (clone)

git clone https://github.com/irfanalidv/RAGNav.git
cd RAGNav
pip install -e ".[dev,pdf,messy]"
# optional: pip install -e ".[mistral]"

Setup (Mistral)

Do not hardcode or commit keys. Use env vars:

export MISTRAL_API_KEY="your_key_here"

Quickstart (CLI): run on an arXiv PDF URL

Install:

pip install "ragnav[mistral,pdf]"
export MISTRAL_API_KEY="..."

Run (recommended: paper-mode navigation):

ragnav paper-pdf --pdf-url "https://arxiv.org/pdf/2507.13334.pdf" --query "What is Context Engineering?"

Jupyter notebook quickstart

Open:

  • cookbook/ragnav_quickstart.ipynb — offline SQuAD demo + confidence + QueryFallback (run in Colab)
  • cookbook/ragnav_paper_quickstart.ipynb

Other modes (optional):

  • Hybrid (BM25 + embeddings, generic PDF blocks):
ragnav hybrid-pdf --pdf-url "https://arxiv.org/pdf/2507.13334.pdf" --query "What is Context Engineering?"
  • Vectorless (BM25-only, generic PDF blocks):
ragnav vectorless-pdf --pdf-url "https://arxiv.org/pdf/2507.13334.pdf" --query "What is Context Engineering?"
  • Agentic retrieval loop:
ragnav agentic-pdf --pdf-url "https://arxiv.org/pdf/2507.13334.pdf" --query "Summarize the paper's main contribution."

Real example output (paper-mode navigation)

This repo includes a paper-mode demo that downloads an arXiv PDF and runs page routing + retrieval:

python3 examples/papers/ragnav_paper_rag_pdf.py \
  --pdf-url "https://arxiv.org/pdf/2507.13334.pdf" \
  --pdf-name "2507.13334.pdf" \
  --max-pages 25

Output (real, trimmed):

## Routed pages
- doc_id=pdf:2507.13334.pdf page=4 score=0.5423 N=3
- doc_id=pdf:2507.13334.pdf page=14 score=0.5298 N=7
- doc_id=pdf:2507.13334.pdf page=9 score=0.4662 N=4
- doc_id=pdf:2507.13334.pdf page=5 score=0.4597 N=3

## Retrieved evidence blocks (first 10)
- page=14  title=Sr-Nle [1130]  id=pdf:2507.13334.pdf#b106
- page=2  title=Related Work  id=pdf:2507.13334.pdf#b11
...

Quickstart (Python): papers (recommended)

PaperRAG (page routing + cross-ref following)

from ragnav.llm.mistral import MistralClient
from ragnav.net import download_pdf
from ragnav.papers import PaperRAG, PaperRAGConfig

llm = MistralClient()
cfg = PaperRAGConfig(max_pages=25, top_pages=4, follow_refs=True)

pdf_bytes = download_pdf("https://arxiv.org/pdf/2507.13334.pdf")
paper = PaperRAG.from_pdf_bytes(pdf_bytes, llm=llm, pdf_name="paper.pdf", cfg=cfg)
print(paper.answer("What experiments were conducted?", cfg=cfg))

Grounded answering (inline citations per sentence)

print(paper.answer_cited("What does Figure 1 show?", cfg=cfg))

Output format:

Sentence one [[pdf:paper.pdf#b12]].
Sentence two [[pdf:paper.pdf#b47]] [[pdf:paper.pdf#b48]].

Quickstart: GraphRAG (entity multi-hop with provenance)

from ragnav.graphrag import build_entity_graph, EntityGraphRetriever

eg = build_entity_graph(blocks)  # blocks are RAGNav Block objects
egr = EntityGraphRetriever(graph=eg, blocks_by_id={b.block_id: b for b in blocks})

out = egr.retrieve("Which dataset was BERT evaluated on?")
for b in out["blocks"][:3]:
    print(b.block_id, b.anchors.get("page"))

Networked PDF demo:

pip install "ragnav[mistral,pdf]"
export MISTRAL_API_KEY="..."
python3 examples/graphs/ragnav_entity_graphrag_pdf.py

Production features

Features PageIndex does not have:

Feature What it does
ConfidenceLevel Every retrieval result carries HIGH/MEDIUM/LOW confidence so you can decide whether to show the answer or say "I'm not sure."
QueryFallback On LOW/MEDIUM confidence, automatically retries with LLM-generated query rephrasing. Prevents silent failures.
CostTracker Tracks token usage and cost per LLM call. Set a budget_usd to get BudgetExceededError before you overspend.
CrossEncoderReranker Optional second-stage reranker with ≥50 first-stage candidates (see retrieve()). On small SQuAD-style corpora the default MS MARCO MiniLM reranker can trail hybrid RRF alone; use domain-tuned models or skip reranking when the pool is easy.
Multi-format ingest PDF, markdown, HTML, email chains, chat logs, legal/numbered documents.
No API key required Runs fully offline with sentence-transformers.

Use any LLM

RAGNav works with any LLM. Built-in: Mistral. For others, one small wrapper:

from openai import OpenAI

from ragnav.llm.base import LLMClient
from ragnav.llm.mistral import MistralClient

# Mistral (built-in) — MISTRAL_API_KEY
llm = MistralClient()

# OpenAI — pip install openai — OPENAI_API_KEY
class OpenAIClient(LLMClient):
    def __init__(self):
        self.client = OpenAI()

    def chat(self, *, messages, model=None, temperature=0.0):
        r = self.client.chat.completions.create(
            model=model or "gpt-4o-mini",
            messages=messages,
            temperature=temperature,
        )
        return r.choices[0].message.content or ""

    def embed(self, *, inputs, model=None):
        r = self.client.embeddings.create(
            model=model or "text-embedding-3-small",
            input=list(inputs),
        )
        return [d.embedding for d in r.data]

# Pass either client to RAGNavRetriever(..., llm=...), PaperRAG(..., llm=...), QueryFallback(..., llm=...).

Anthropic, Groq, Ollama — same pattern (~10 lines each).

Benchmarks

Reproduce with benchmarks/squad_benchmark.py and benchmarks/cuad_benchmark.py after pip install ragnav[embeddings] datasets. No API key for SQuAD or CUAD. Default hybrid path uses RRF; optional cross-encoder reranking is RAGNavRetriever(reranker=...).

Retrieval accuracy

Dataset Method R@1 R@3 R@5 MRR@10
SQuAD BM25-only 0.852 0.932 0.950 0.896
SQuAD Embedding-only 0.772 0.906 0.942 0.844
SQuAD RAGNav hybrid (RRF 0.5/0.5) 0.864 0.956 0.978 0.912
SQuAD Hybrid RRF + cross-encoder reranker 0.862 0.944 0.968 0.906
CUAD (block-level) BM25-only 0.017 0.040 0.044 0.032
CUAD (block-level) RAGNav hybrid (legal ingest + RRF) 0.007 0.047 0.051 0.027
CUAD (block-level) RAGNav + graph expansion 0.007 0.047 0.051 0.027

CUAD — span recall (concatenated top-k blocks)

Gold answer span may sit across legal-ingest block boundaries; span S@k is true if any gold string appears in the concatenation of the top-k retrieved blocks’ text (fairer for clauses).

Dataset Method S@1 S@3 S@5 MRR@10
CUAD (span) BM25-only 0.020 0.061 0.071 0.044
CUAD (span) RAGNav hybrid (legal ingest + RRF) 0.010 0.071 0.074 0.037
CUAD (span) RAGNav + graph expansion 0.010 0.071 0.074 0.037

SQuAD: 500 questions, 447 unique passages, rajpurkar/squad validation set, CC BY-SA 4.0

CUAD: 300 questions sampled (297 with gold locatable in the indexed blocks after legal ingest), theatticusproject/cuad-qa test JSON (official zip), CC BY 4.0. Block-level R@k requires a gold block_id in the top-k list; span S@k only requires the gold answer text to appear in the merged text of those blocks.

vs. PageIndex (illustrative)

PageIndex RAGNav
Requires GPT-4o / paid LLM for core tree workflow Yes No — hybrid retrieval with local embeddings by default
Fully offline (no API key) No Yes
SQuAD R@3 Not published 0.956 (hybrid RRF)
CUAD clause retrieval (span S@3) Not published 0.071 (hybrid RRF + legal ingest; block-level R@3 in results file)
Handles markdown / chat / email No Yes
Structure-aware graph expansion No Yes

FinanceBench: Often cited with a frontier LLM and a finance-PDF setup. RAGNav does not ship that harness: it would imply paid API runs and a different evaluation contract than the offline SQuAD/CUAD suites above. Treat FinanceBench as out of scope for this repo until a reproducible, keyless or clearly documented protocol is added.

One-command scorecard (offline)

python3 -m benchmarks.scorecard

Example output:

{
  "ok": true,
  "suites": [
    { "name": "offline_smoke", "ok": true },
    { "name": "paper_eval", "ok": true, "json": { "suite": "paper_crossref_v1", "follow_refs_true": { "block_hit_rate": 1.0 } } },
    { "name": "entity_eval_excerpt", "ok": true, "json": { "suite": "entity_excerpt_v1" } },
    { "name": "security_eval", "ok": true }
  ]
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragnav-0.3.0.tar.gz (64.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ragnav-0.3.0-py3-none-any.whl (77.7 kB view details)

Uploaded Python 3

File details

Details for the file ragnav-0.3.0.tar.gz.

File metadata

  • Download URL: ragnav-0.3.0.tar.gz
  • Upload date:
  • Size: 64.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for ragnav-0.3.0.tar.gz
Algorithm Hash digest
SHA256 c03e16c2a56d77ea97b834ee4aee003dff74fad0700fe3e2adb7f418245d55f3
MD5 5dd9e782660ce0fa67b08ab3dfe65ce2
BLAKE2b-256 54b46d6fdcd9517fe8dfd74ca9008f88fc2a8a3204cb6d2c2c6eb3576e39e421

See more details on using hashes here.

File details

Details for the file ragnav-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: ragnav-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 77.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for ragnav-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3fcb927606a71a85d301087ea401ddb946e31b50fed4326b5bbb7e9e9568fa08
MD5 bcc4d76708663e322bb0438a815725ab
BLAKE2b-256 505a4beefd1c8c8c455a8cf25257149ec190a4e21fee9457e805e2232b7cc116

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page