Document-grounded Q&A with sentence-level citations and faithfulness verification
verifiable-rag
Document-grounded Q&A with sentence-level citations, NLI verification, and calibrated refusal.
Status: pre-alpha · v0.5 launch sprint · interfaces are still subject to change
📚 Full documentation at firish.github.io/rag-rack — quickstart, concept guides, how-to recipes, API reference, benchmark reports.
What this is
A Python library for building RAG pipelines that:
- Produce sentence-level citations: every generated sentence traces back to exact source spans (doc_id, page, char_start, char_end).
- Verify every claim via NLI against its cited span before returning it.
- Refuse when uncertain — calibrated abstention with a user-tunable strictness slider, not a "say I don't know" prompt.
- Are fully auditable — inspect retrieval scores, reranker decisions, per-claim NLI results, and a self-contained HTML report per query.
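To make the span tuple in the list above concrete, here is a minimal sketch of how a (doc_id, page, char_start, char_end) citation could be resolved back to source text. The `SourceSpan` class and `resolve` helper are hypothetical names chosen for illustration, not the library's types, and the offsets are assumed to be relative to the page text with an exclusive end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical span record; field names follow the tuple in the list above."""
    doc_id: str
    page: int
    char_start: int  # inclusive offset (assumed relative to the page text)
    char_end: int    # exclusive offset


def resolve(span: SourceSpan, pages: dict[tuple[str, int], str]) -> str:
    """Return the exact source text a citation points at."""
    page_text = pages[(span.doc_id, span.page)]
    if not 0 <= span.char_start <= span.char_end <= len(page_text):
        raise ValueError(f"span out of bounds: {span}")
    return page_text[span.char_start:span.char_end]


# Toy page text; offsets are computed rather than hard-coded.
text = "Penicillin was discovered in 1928. It inhibits cell wall synthesis."
pages = {("demo", 1): text}
span = SourceSpan("demo", 1, text.index("It inhibits"), len(text))
print(resolve(span, pages))
```

Keeping the span as plain offsets means a citation can be checked by slicing the original text, without re-running retrieval.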
One benchmark result that drives the design: on RAGTruth (the canonical 2,700-example RAG hallucination benchmark), a dual NLI ensemble of two small open-source models (HHEM-2.1-open + MiniCheck-Flan-T5-Large) matches Claude Sonnet 4.6 as a judge — AUROC 0.844 vs 0.846 — at ~250× lower per-call cost. Full result in benchmarks/PUBLISHED_ragtruth.md.
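For readers unfamiliar with the metric, the sketch below shows how AUROC can be computed for two scorers and for their ensemble. It is illustrative only: the unweighted mean is an assumed combination rule (the text above does not say how the two models are combined), and the labels and scores are invented toy values, not RAGTruth data.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """AUROC via the Mann-Whitney U statistic; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble(scores_a: list[float], scores_b: list[float]) -> list[float]:
    """Assumed combination rule: unweighted mean of the two verifier scores."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Toy data invented for illustration (label 1 = faithful response).
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.8, 0.5, 0.2, 0.1]
model_b = [0.7, 0.8, 0.3, 0.2, 0.6, 0.1]

for name, scores in [("a", model_a), ("b", model_b), ("mean", ensemble(model_a, model_b))]:
    print(name, round(auroc(labels, scores), 3))
```

In this toy set each model misranks one pair, and the mean repairs both because the two models err on different examples. That is the usual motivation for pairing verifiers.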
Quickstart
The bundled demo document ships with the package. No setup required beyond an LLM API key:
import verifiable_rag
from verifiable_rag.demo import sample_paper_path
answer = verifiable_rag.ask(
"What is the mechanism of action of penicillin?",
docs=sample_paper_path(),
)
print(answer.text)
Or run it as a one-liner from the shell:
export ANTHROPIC_API_KEY=...
python -c "import verifiable_rag; from verifiable_rag.demo import sample_paper_path; \
print(verifiable_rag.ask('Who discovered penicillin?', docs=sample_paper_path()).text)"
For an actual production setup, point docs= at your own PDFs and pick a preset:
import verifiable_rag
answer = verifiable_rag.ask(
"What did the authors find?",
docs=["paper1.pdf", "paper2.pdf"],
preset="hybrid_balanced", # RECOMMENDED — Cohere + Dual NLI + Haiku
output_html="audit.html", # optional — write the HTML audit report
)
print(answer.text)
# Programmatic access to the audit trail:
for sentence in answer.unsupported_sentences:  # sentences the verifier flagged
    print(f"⚠ unsupported: {sentence.text}")
# Or emit a structured audit dump for logging / metrics
# (metrics_client stands in for your own metrics client):
metrics_client.emit(answer.audit_trail())
See examples/ for runnable demos covering the headline UX patterns. The full quickstart walks through each step in detail.
Presets
Five named presets cover most use cases. Switch via preset="..." or call the factories directly:
| Preset | Components | Required keys | When to use |
|---|---|---|---|
| local_minimal | BGE + PyMuPDF + Haiku, no verifier | ANTHROPIC_API_KEY | Hobbyist / quickest start |
| local_verified | + BGE rerank + HHEM NLI | ANTHROPIC_API_KEY | Local with verification |
| hybrid_balanced | Docling + Cohere + Dual NLI + constrained Haiku | ANTHROPIC_API_KEY + COHERE_API_KEY | Default (the published baseline) |
| hybrid_strict | Same as balanced, refuse below faithfulness 0.7 | same | Higher-trust use cases |
| hybrid_paranoid | Sonnet generator, refuse below faithfulness 0.9 | same | Compliance / high-trust |
For mix-and-match outside the presets, use verifiable_rag.build_pipeline(...) or load a YAML config (see examples/pipeline.yaml, Pipeline.from_yaml(), and the YAML config guide).
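The refusal cutoffs in the preset table can be read as a simple threshold rule. The sketch below illustrates that reading. `REFUSAL_THRESHOLDS`, `Decision`, and `decide` are hypothetical names, faithfulness is assumed to be a score in [0, 1], and only the two presets whose cutoffs are stated in the table are included.

```python
from dataclasses import dataclass

# Cutoffs copied from the preset table; other presets are omitted because
# their cutoffs are not stated there.
REFUSAL_THRESHOLDS = {"hybrid_strict": 0.7, "hybrid_paranoid": 0.9}


@dataclass(frozen=True)
class Decision:
    refused: bool
    reason: str


def decide(faithfulness: float, preset: str) -> Decision:
    """Hypothetical helper: refuse when faithfulness falls below the preset's cutoff."""
    if not 0.0 <= faithfulness <= 1.0:
        raise ValueError("faithfulness is assumed to lie in [0, 1]")
    threshold = REFUSAL_THRESHOLDS[preset]
    if faithfulness < threshold:
        return Decision(True, f"faithfulness {faithfulness:.2f} < {threshold:.2f}")
    return Decision(False, "ok")


# The same answer passes under one preset and is refused under the other.
print(decide(0.8, "hybrid_strict"))
print(decide(0.8, "hybrid_paranoid"))
```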
Architecture
PDF/DOCX → Parser → Document model → Chunker → Indexer
                                                  ↓
Answer ← Abstention ← Verifier ← Generator ← Retriever + Reranker
Every step preserves character-level spans. Every generated sentence carries (supporting_sentence_ids, confidence) linked to exact source locations. Citation granularity is decoupled from chunk granularity by design.
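One way to picture the decoupling described above: sentences carry their own character offsets, and chunks are just groups of sentence IDs. The sketch below assumes a naive punctuation-based splitter and fixed-size chunks. `Sentence`, `split_sentences`, and `chunk` are illustrative names, not the library's parser or chunker.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    sentence_id: int
    char_start: int
    char_end: int  # exclusive


def split_sentences(text: str) -> list[Sentence]:
    """Naive splitter (assumption): a sentence ends at '.', '!' or '?'."""
    return [
        Sentence(i, m.start(), m.end())
        for i, m in enumerate(re.finditer(r"\S[^.!?]*[.!?]", text))
    ]


def chunk(sentences: list[Sentence], size: int) -> list[list[int]]:
    """Group sentence IDs into fixed-size chunks for indexing."""
    ids = [s.sentence_id for s in sentences]
    return [ids[i:i + size] for i in range(0, len(ids), size)]


text = "Alpha is first. Beta is second! Gamma is third?"
sentences = split_sentences(text)
chunks = chunk(sentences, size=2)

# Retrieval works on chunks, but a citation can still name one sentence.
cited = sentences[1]
print(chunks, repr(text[cited.char_start:cited.char_end]))
```

Changing the chunk size alters what the retriever sees but leaves sentence IDs and offsets untouched, so citations stay stable across chunking strategies.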
Audit trail
Every Answer exposes its full audit trail:
answer = verifiable_rag.ask(question, docs=...)
answer.text # final answer string
answer.sentences # list of CitedSentence with supporting_sentence_ids
answer.verification_results # per-sentence NLI checks
answer.retrieved_chunks # the reranked passages the generator saw
# Convenience accessors:
answer.supported_sentences # list[CitedSentence] (passed verification)
answer.unsupported_sentences # list[CitedSentence] (verifier flagged)
answer.verification_for(idx) # VerificationResult | None for a sentence index
answer.cited_sentence_ids # frozenset of all source IDs cited
answer.min_nli_score # worst-case sentence — the bottleneck
answer.audit_trail() # JSON-serializable dict for logging / metrics
# Or render the full audit as a self-contained HTML page:
answer.to_html() # returns HTML string
# or pass output_html="report.html" to verifiable_rag.ask()
The HTML report includes the query, the answer with per-sentence verification color coding, the faithfulness components, per-sentence NLI scores, and every reranked passage with its retrieval score — citations are anchored links into the passage list.
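As a rough illustration of what the convenience accessors compute, the sketch below derives supported and unsupported sentence indices and the worst-case score from per-sentence NLI results, then serializes the summary for logging. It uses plain dicts rather than the library's types. `summarize`, the dict keys, and the 0.5 cutoff are all assumptions made for the example.

```python
import json


def summarize(results: list[dict], threshold: float = 0.5) -> dict:
    """Hypothetical summary over per-sentence NLI scores (plain dicts, not library types)."""
    if not results:
        raise ValueError("no verification results to summarize")
    supported = [r["index"] for r in results if r["nli_score"] >= threshold]
    unsupported = [r["index"] for r in results if r["nli_score"] < threshold]
    return {
        "supported": supported,
        "unsupported": unsupported,
        # The lowest-scoring sentence is the bottleneck for the whole answer.
        "min_nli_score": min(r["nli_score"] for r in results),
    }


# Invented scores for three generated sentences.
results = [
    {"index": 0, "nli_score": 0.93},
    {"index": 1, "nli_score": 0.31},
    {"index": 2, "nli_score": 0.78},
]
record = json.dumps(summarize(results), sort_keys=True)
print(record)
```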
Installation
pip install verifiable-rag # core, no heavy deps
pip install "verifiable-rag[docling,bge,lancedb]" # parser + embedder + index
pip install "verifiable-rag[hhem,minicheck]" # NLI verifiers (adds torch + transformers)
pip install "verifiable-rag[litellm]" # LLM-judge verifier
pip install "verifiable-rag[yaml]" # YAML config loader
pip install "verifiable-rag[all]" # everything
First-run model downloads
Verifier model weights are not bundled in the wheel. They are downloaded lazily from the Hugging Face Hub on first use and cached in ~/.cache/huggingface/hub/, so later runs reuse the local copy.
| Verifier | Model | Size |
|---|---|---|
| HHEMVerifier | vectara/hallucination_evaluation_model | ~600 MB |
| MiniCheckVerifier | lytang/MiniCheck-Flan-T5-Large | ~770 MB |
| LLMJudgeVerifier | (hosted API, no local model) | 0 |
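To check ahead of time whether the first run will trigger a download, you can look for the model folders in the Hub cache. The sketch below uses only the standard library and relies on the Hub's `models--<org>--<name>` folder naming and the `HF_HUB_CACHE` / `HF_HOME` environment variables. The helper names are hypothetical, and an existing folder is only a hint: it does not prove the download completed.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Resolve the Hub cache directory, honoring HF_HUB_CACHE and HF_HOME if set."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def model_cache_path(repo_id: str) -> Path:
    """Hub cache folders are named 'models--<org>--<name>'."""
    return hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id: str) -> bool:
    """True if a cache folder exists for the repo (may still be incomplete)."""
    return model_cache_path(repo_id).is_dir()


for repo_id in ("vectara/hallucination_evaluation_model", "lytang/MiniCheck-Flan-T5-Large"):
    status = "cached" if is_cached(repo_id) else "will download on first use"
    print(f"{repo_id}: {status}")
```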
Published benchmark results
| Benchmark | Headline | Report | Blog post |
|---|---|---|---|
| ALCE (Princeton citation quality) | Constrained decoding beats prompted by +4–7 F1 under dual-LLM-judge cross-validation | report | post |
| RAGTruth (hallucination detection) | Dual NLI ensemble matches the Sonnet judge at ~1/250 the per-call cost (AUROC 0.844 vs 0.846) | report | post |
| LitQA2 (biomedical scientific Q&A) | Constrained decoding lifts MC; contextual retrieval is a null result on saturated retrieval | report | post |
Roadmap
| Phase | Milestone | Status |
|---|---|---|
| 0–1 | Repo skeleton, data model, baseline pipeline | ✅ done |
| 2 | Eval harness + BENCHMARKS.md | ✅ done |
| 3 | Sentence-level citations (prompted / constrained / SAFE) | ✅ done |
| 4 | Faithfulness verification + calibrated refusal (v0.4) | ✅ done |
| 5 | Hardening, mkdocs docs, Gradio demo on HF Spaces (v0.5) | in progress |
| 6 | Launch — PyPI release + Show HN | pending |
Contributing
See CLAUDE.md for architecture decisions, hard rules, and contribution conventions. Methodology critiques on the published benchmarks are especially welcome — eval rigor is the whole moat, and the only way to find the holes is to invite people to look for them.
License
MIT