# ragfallback

Prevents silent RAG failures — chunk quality, retrieval fallback, adaptive querying, and answer evaluation in one library.
ragfallback prevents silent RAG failures across the full pipeline — from bad chunks at ingest, through retrieval outages at runtime, to invisible answer quality degradation in production.
## What it prevents
| # | Real production failure | Module | Example |
|---|---|---|---|
| 1 | Query mismatch → silent empty results | `AdaptiveRAGRetriever` + `QueryVariationsStrategy` | `uc6_adaptive_rag.py` |
| 2 | Embedding model switch corrupts index dimensions | `EmbeddingGuard` | `uc2_embedding_guard.py` |
| 3 | Bad chunks (too short, mid-sentence) poison retrieval | `ChunkQualityChecker` | `uc3_chunk_quality.py` |
| 4 | Retrieved chunks overflow LLM context window | `ContextWindowGuard` | `uc4_context_window.py` |
| 5 | Keyword queries fail dense retrieval silently | `SmartThresholdHybridRetriever` | `uc5_hybrid_failover.py` |
| 6 | Primary retriever outage returns empty, no fallback | `FailoverRetriever` | `uc5_hybrid_failover.py` |
| 7 | Multi-step questions always fail single-shot RAG | `MultiHopFallbackStrategy` | `uc6_multi_hop_demo.py` |
| 8 | Index serves stale data after document updates | `StaleIndexDetector` | — |
| 9 | Answer quality invisible in production | `RAGEvaluator` | `uc7_rag_evaluator.py` |
| 10 | Cross-boundary answers lost between adjacent chunks | `OverlappingContextStitcher` | `uc8_context_stitcher.py` |
| 11 | Metric regression after model/embedder/chunker change | `GoldenRunner` + `BaselineRegistry` | `examples/ci_regression_gate.py` |
## Quick start

```bash
pip install ragfallback[chroma,huggingface,real-data]
```
```python
from datasets import load_dataset
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from ragfallback.diagnostics import ChunkQualityChecker, EmbeddingGuard, RetrievalHealthCheck
from ragfallback.evaluation import RAGEvaluator

# 1 — load 50 real Wikipedia passages (SQuAD, CC BY-SA 4.0)
ds = load_dataset("rajpurkar/squad", split="validation")
seen, docs, probes = set(), [], []
for row in ds:
    ctx = row["context"].strip()
    if ctx not in seen and len(seen) < 50:
        seen.add(ctx)
        docs.append(Document(page_content=ctx, metadata={"source": "squad"}))
    if row["answers"]["text"]:
        probes.append({"question": row["question"],
                       "ground_truth": row["answers"]["text"][0]})
print(f"Loaded {len(docs)} real passages, {len(probes)} Q&A pairs")

# 2 — check chunk quality before embedding
report = ChunkQualityChecker().check(docs)
print(report.summary())

# 3 — guard embedding dimensions before writing to any index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
EmbeddingGuard(expected_dim=384).validate(embeddings).raise_if_failed()

# 4 — build index and smoke-test retrieval with real Q&A probes
store = Chroma.from_documents(docs, embeddings, persist_directory="./my_index")
health = RetrievalHealthCheck(k=4).run_substring_probes(
    store,
    {p["question"]: p["ground_truth"][:50] for p in probes[:10]},
)
print(f"Retrieval hit rate: {health.hit_rate:.0%}")

# 5 — evaluate answer quality on a real question
question = probes[0]["question"]
retrieved = store.as_retriever(search_kwargs={"k": 4}).invoke(question)
answer = retrieved[0].page_content if retrieved else "Not found"
score = RAGEvaluator().evaluate(
    question, answer,
    [d.page_content for d in retrieved],
    ground_truth=probes[0]["ground_truth"],
)
print(score.report())
```
Expected output (actual numbers — run it yourself):

```text
Loaded 50 real passages, 2627 Q&A pairs
[PASS] chunks=50 | len min/avg/max=144/618/2095
Retrieval hit rate: 100%
========================================================
RAG evaluation
========================================================
Context precision : 100.00%
Faithfulness      : 95.00%
Answer relevance  : 40.00%
Recall (gold hit) : 100.00%
Overall           : 84.00%
Pass (>=70%)      : True
```
## Configuration

Most features work with no API key — chunk checking, embedding validation, hybrid retrieval, and evaluation all run locally.

LLM-dependent features (`AdaptiveRAGRetriever`, `QueryVariationsStrategy`, `MultiHopFallbackStrategy`) need a model. Copy `.env.example` to `.env` and fill in:

```bash
cp .env.example .env
```

```ini
MISTRAL_API_KEY=your_key_here
MISTRAL_MODEL=mistral-small-latest  # default, override if needed
```

Get a free Mistral key at console.mistral.ai. The library also supports any LangChain-compatible LLM — pass it directly to `AdaptiveRAGRetriever(llm=your_llm)`.
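For example, wiring a Mistral chat model through LangChain might look like this (a minimal sketch: `langchain-mistralai` is a separate install, and `store` stands in for your vector store):

```python
import os

from langchain_mistralai import ChatMistralAI  # pip install langchain-mistralai

from ragfallback import AdaptiveRAGRetriever

# ChatMistralAI reads MISTRAL_API_KEY from the environment (e.g. loaded from .env)
llm = ChatMistralAI(model=os.getenv("MISTRAL_MODEL", "mistral-small-latest"))

retriever = AdaptiveRAGRetriever(vector_store=store, llm=llm)
```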
## Full pipeline

```text
Your documents
      │
      ▼
[ChunkQualityChecker]           ← bad splits, short/duplicate chunks
      │
      ▼
[EmbeddingGuard]                ← dimension / NaN / zero-vector checks before write
[EmbeddingQualityProbe]         ← domain mismatch heuristic (generic model on jargon)
[sanitize_documents]            ← JSON-safe metadata before any vector store write
      │
      ▼
Vector store (Chroma / FAISS / Qdrant / …)
      │
      ▼
[StaleIndexDetector]            ← SHA256 manifest: source files vs last build
      │
      ▼
[RetrievalHealthCheck]          ← labeled recall@k or quick substring smoke probes
      │
      ▼
[SmartThresholdHybridRetriever] ← threshold + optional BM25 fallback
[FailoverRetriever]             ← primary → fallback on exception or empty results
      │
      ▼
[ContextWindowGuard]            ← rank + trim chunks to token budget (8 model presets)
[OverlappingContextStitcher]    ← merge adjacent chunks from same source
      │
      ▼
[AdaptiveRAGRetriever]          ← QueryVariationsStrategy / MultiHopFallbackStrategy
      │
      ▼
[RAGEvaluator]                  ← recall@k, nDCG, faithfulness (heuristic + LLM judge)
```
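To make the runtime half of the diagram concrete, here is a hedged sketch that chains failover, stitching, and context trimming using only the APIs from the module reference below; `chroma_retriever`, `faiss_retriever`, and `embeddings` are placeholders, and the `.invoke()` call assumes the standard LangChain retriever protocol:

```python
from ragfallback.diagnostics import ContextWindowGuard, OverlappingContextStitcher
from ragfallback.retrieval import FailoverRetriever

# Primary dense retriever with a secondary fallback (retrievers are placeholders)
retriever = FailoverRetriever(primary=chroma_retriever, fallback=faiss_retriever, min_results=1)

query = "What is the refund policy?"
docs = retriever.invoke(query)  # LangChain retriever protocol assumed

# Merge adjacent chunks from the same source, then trim to the model's token budget
merged = OverlappingContextStitcher().stitch(docs)
guard = ContextWindowGuard.from_model_name("gpt-4o")
selected, report = guard.select(query, merged, embeddings)
```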
## Module reference

### ragfallback.diagnostics

**ChunkQualityChecker** — detects too-short, too-long, mid-sentence, and duplicate chunks before embedding.

```python
from ragfallback.diagnostics import ChunkQualityChecker

report = ChunkQualityChecker(min_chars=100, max_chars=8000).check(docs)
if report.has_issues:
    fixed = ChunkQualityChecker().auto_fix(docs)
```
**EmbeddingGuard** — validates dimension, NaN, and zero-vectors before writing to any index.

```python
from ragfallback.diagnostics import EmbeddingGuard

guard = EmbeddingGuard(expected_dim=384)
guard.validate(embeddings_model).raise_if_failed()     # model-level check
guard.validate_raw_vectors(vectors).raise_if_failed()  # pre-computed vectors
```
**EmbeddingQualityProbe** — heuristic domain-fit check: if similarity scores are uniformly low, the model is likely a poor domain match.

```python
from ragfallback.diagnostics import EmbeddingQualityProbe

result = EmbeddingQualityProbe().run(embeddings, query="...", reference_snippets=[...])
if not result.ok:
    print(result.warnings)  # "consider domain-specific model"
```
**RetrievalHealthCheck** — labeled recall@k or quick substring smoke probes against a live vector store.

```python
from ragfallback.diagnostics import RetrievalHealthCheck

health = RetrievalHealthCheck(k=5)
report = health.run_substring_probes(vector_store, {"What is Python?": "high-level language"})
print(report.hit_rate, report.avg_latency_ms)
```
**StaleIndexDetector** — SHA256 manifest to catch when source files changed since the last index build.

```python
from ragfallback.diagnostics import StaleIndexDetector

det = StaleIndexDetector(manifest_path="./index_manifest.json")
det.record_paths(["./docs/policy.md"])          # record after build
report = det.check_paths(["./docs/policy.md"])  # check before serving
if report.has_stale:
    print(report.summary())
```
**ContextWindowGuard** — ranks and trims retrieved chunks to fit a token budget; 8 model presets included.

```python
from ragfallback.diagnostics import ContextWindowGuard

guard = ContextWindowGuard.from_model_name("gpt-4o")
selected, report = guard.select(query, retrieved_docs, embeddings)
```
**OverlappingContextStitcher** — merges consecutive chunks from the same source so cross-boundary answers aren't split.

```python
from ragfallback.diagnostics import OverlappingContextStitcher

merged = OverlappingContextStitcher().stitch(retrieved_docs)
```
**sanitize_documents** — normalizes list/dict/bytes metadata to JSON-safe scalars before any vector store write.

```python
from ragfallback.diagnostics import sanitize_documents

clean_docs = sanitize_documents(dirty_docs)  # safe for Chroma, Pinecone, Qdrant
```
### ragfallback.retrieval

**SmartThresholdHybridRetriever** — score-threshold gating with automatic BM25 fallback when dense scores are weak. Supports distance, similarity, and relative score modes.

```python
from ragfallback.retrieval import SmartThresholdHybridRetriever

retriever = SmartThresholdHybridRetriever.from_documents(
    docs, embeddings, dense_threshold=0.5, k=4
)  # pip install ragfallback[hybrid] for BM25
```
**FailoverRetriever** — if the primary retriever raises or returns fewer than `min_results` docs, automatically switches to a secondary.

```python
from ragfallback.retrieval import FailoverRetriever

retriever = FailoverRetriever(primary=chroma_retriever, fallback=faiss_retriever, min_results=1)
```
### ragfallback.core

**AdaptiveRAGRetriever** — wraps a vector store with retry logic and pluggable fallback strategies. On each attempt it retrieves, scores confidence, and either returns the answer or tries the next strategy.

```python
from ragfallback import AdaptiveRAGRetriever
from ragfallback.strategies import QueryVariationsStrategy

retriever = AdaptiveRAGRetriever(
    vector_store=store,
    llm=llm,  # any LangChain LLM
    strategies=[QueryVariationsStrategy(num_variations=2)],
    confidence_threshold=0.7,
    max_attempts=3,
)
result = retriever.retrieve("What is the refund policy?")
print(result.answer, result.confidence, result.attempts_used)
```

Requires `MISTRAL_API_KEY` (or any LangChain-compatible LLM passed via `llm=`).
### ragfallback.strategies

**QueryVariationsStrategy** — an LLM rewrites the original query into N variations to broaden retrieval recall. Requires an LLM.

**MultiHopFallbackStrategy** — decomposes complex multi-step questions into sub-questions, retrieves each independently, then synthesises a final answer. Requires an LLM.

```python
from ragfallback.strategies import MultiHopFallbackStrategy

result = MultiHopFallbackStrategy(max_hops=3).run(question, retriever, llm)
print(result.final_answer, result.total_hops)
```
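The two strategies can also be stacked in one retriever; a sketch assuming `AdaptiveRAGRetriever` tries the listed strategies in order whenever confidence stays below the threshold (`store` and `llm` are placeholders):

```python
from ragfallback import AdaptiveRAGRetriever
from ragfallback.strategies import MultiHopFallbackStrategy, QueryVariationsStrategy

# Cheap query rewrites first; fall back to multi-hop decomposition for hard questions
retriever = AdaptiveRAGRetriever(
    vector_store=store,
    llm=llm,
    strategies=[
        QueryVariationsStrategy(num_variations=3),
        MultiHopFallbackStrategy(max_hops=2),
    ],
    confidence_threshold=0.7,
)
```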
### ragfallback.tracking

**CostTracker** — token cost ledger for a RAG session. Records spend per operation, enforces an optional budget ceiling, and surfaces a report at the end.

```python
from ragfallback import CostTracker

tracker = CostTracker(budget_usd=1.0)
tracker.record(model="mistral-small-latest", input_tokens=500, output_tokens=200)
print(tracker.get_report())  # total cost, budget remaining
```
**MetricsCollector** — records latency, success/failure counts, and confidence scores across retrieval attempts.

```python
from ragfallback import MetricsCollector

metrics = MetricsCollector()
# passed automatically to AdaptiveRAGRetriever; or record manually:
metrics.record_attempt(success=True, latency_ms=120, confidence=0.85)
print(metrics.get_stats())
```
### ragfallback.evaluation

**RAGEvaluator** — scores recall@k, nDCG, and faithfulness without external services. Optional LLM judge hook for higher accuracy.

```python
from ragfallback.evaluation import RAGEvaluator

ev = RAGEvaluator()
score = ev.evaluate(question, answer, context_docs, ground_truth="...")
print(score.overall_score, score.faithfulness_score, score.recall_at_k)
print(ev.batch_summary([score]))
```
## Examples — real public datasets

| Example | Dataset | Command |
|---|---|---|
| UC-1: retrieval health | SQuAD Wikipedia | `python examples/uc1_retrieval_health.py` |
| UC-2: embedding guard | — (dimension check) | `python examples/uc2_embedding_guard.py` |
| UC-3: chunk quality | SQuAD Wikipedia | `python examples/uc3_chunk_quality.py` |
| UC-4: context window | sample KB | `python examples/uc4_context_window.py` |
| UC-5: hybrid + failover | FAISS + BM25 | `python examples/uc5_hybrid_failover.py` |
| UC-6: adaptive RAG | SQuAD Wikipedia (needs `MISTRAL_API_KEY` or Ollama) | `python examples/uc6_adaptive_rag.py` |
| UC-7: RAG evaluator | PubMedQA (MIT) — real medical Q&A | `python examples/uc7_rag_evaluator.py` |
| UC-8: context stitcher | ChromaDB + HR chunks | `python examples/uc8_context_stitcher.py` |
| UC-9: embedding probe | — (similarity check) | `python examples/uc9_embedding_probe.py` |
| UC-10: metadata sanitizer | ChromaDB dirty docs | `python examples/uc10_metadata_sanitizer.py` |
| End-to-end on SQuAD | SQuAD Wikipedia (CC BY-SA 4.0) | `python examples/real_data_demo.py` |
| Financial news RAG | nickmuchi/financial-classification (Apache 2.0) | `python examples/financial_risk_analysis.py` |
| Legal contract RAG | theatticusproject/cuad-qa (CC BY 4.0) | `python examples/legal_document_analysis.py` |
| Medical abstract RAG | qiaojin/PubMedQA (MIT) | `python examples/medical_research_synthesis.py` |
| MLOps: build golden dataset | SQuAD (CC BY-SA 4.0) + SciQ (CC BY-NC 3.0) | `python examples/build_golden_dataset.py` |
| MLOps: full demo | SQuAD golden set, zero API keys | `python examples/mlops_demo.py` |
| MLOps: CI regression gate | SQuAD golden set, committed baseline | `python examples/ci_regression_gate.py` |
## Verified numbers — SQuAD Wikipedia validation set

`python examples/real_data_demo.py` runs every module on 200 real Wikipedia passages. Numbers below are printed by the script on every run — not made up.

```text
Passages indexed : 200 real Wikipedia passages
Q&A pairs        : 10 570 (ground truth available)

ChunkQualityChecker : 1 violation (avg 662 chars/passage)
EmbeddingGuard      : OK — dim 384 matches expected 384

RetrievalHealthCheck (20 real Q&A substring probes):
    Hit rate    : 100.0%
    Avg latency : 25 ms per query

RAGEvaluator (10 real Q&A pairs, heuristic, no LLM judge):
    Pass rate        : 2/10 (heuristic; rises with LLM judge)
    Avg recall@k     : 100.0%
    Avg faithfulness : 79.5%
    Avg overall      : 62.9%
```

Install: `pip install ragfallback[chroma,huggingface,real-data]`
Dataset: rajpurkar/squad — CC BY-SA 4.0
## Install

```bash
pip install ragfallback                      # core only
pip install ragfallback[chroma,huggingface]  # golden path (no API keys)
pip install ragfallback[faiss,huggingface]   # FAISS instead of Chroma
pip install ragfallback[hybrid]              # adds BM25 (rank_bm25)
pip install ragfallback[real-data]           # real dataset examples (HuggingFace datasets)
pip install ragfallback[mlops]               # MLOps eval layer (RAGAS + MLflow + Locust)
```
| Extra | Installs |
|---|---|
| `chroma` | `chromadb` |
| `faiss` | `faiss-cpu` |
| `huggingface` | `sentence-transformers`, `huggingface-hub` |
| `hybrid` | `rank_bm25`, `langchain-community` |
| `real-data` | `datasets` |
| `openai` | `langchain-openai`, `openai` |
| `mlops` | `ragas`, `mlflow`, `locust`, `aiohttp` |
## Subpackage import map

```python
from ragfallback import AdaptiveRAGRetriever, QueryResult, CostTracker, MetricsCollector
from ragfallback.diagnostics import (
    ChunkQualityChecker, EmbeddingGuard, EmbeddingQualityProbe,
    RetrievalHealthCheck, StaleIndexDetector, ContextWindowGuard,
    OverlappingContextStitcher, sanitize_documents, sanitize_metadata,
)
from ragfallback.retrieval import SmartThresholdHybridRetriever, FailoverRetriever
from ragfallback.strategies import QueryVariationsStrategy, MultiHopFallbackStrategy
from ragfallback.evaluation import RAGEvaluator
from ragfallback.mlops import (
    RagasHook, RagasReport,
    BaselineRegistry, RegressionError,
    GoldenRunner, GoldenReport,
    QuerySimulator, SimQuery,
    MLflowLogger,
    generate_locustfile,
)
```
## MLOps — Evaluation & Regression Gate

ragfallback ships a complete MLOps evaluation layer for RAG pipelines. No API keys required — all metrics use local heuristics by default, with optional RAGAS + MLflow when installed.

### Install

```bash
pip install ragfallback[chroma,huggingface,real-data,mlops]
```
### Full eval loop

```python
import asyncio

from ragfallback.mlops import GoldenRunner, RagasHook, BaselineRegistry

# 1 — Build evaluation hook (heuristic by default; RAGAS when installed)
hook = RagasHook(llm=None, embeddings=embeddings)

# 2 — Run against 75 real SQuAD QA pairs
runner = GoldenRunner(
    retriever=retriever,  # AdaptiveRAGRetriever instance
    ragas_hook=hook,
    dataset="examples/golden_qa.json",
)
report = asyncio.run(runner.run_async())
print(f"Recall@3      : {report.recall_at_3:.3f}")
print(f"Faithfulness  : {report.ragas.faithfulness:.3f}")
print(f"Latency P95   : {report.latency_p95_ms:.0f}ms")
print(f"Fallback rate : {report.fallback_rate:.1%}")

# 3 — Regression gate: fails if any metric drops > 5% vs baseline
registry = BaselineRegistry("baselines.json")
registry.compare_or_fail(report, dataset="my_dataset")  # raises RegressionError if degraded
registry.update(report, dataset="my_dataset")           # save new baseline
```
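In CI, the raised `RegressionError` maps naturally onto an exit code; a minimal sketch continuing from the `report` above (the dataset name and the update-on-pass policy are assumptions, mirror your own baseline workflow):

```python
import sys

from ragfallback.mlops import BaselineRegistry, RegressionError

registry = BaselineRegistry("examples/baselines.json")
try:
    registry.compare_or_fail(report, dataset="squad_golden")
except RegressionError as err:
    print(f"Regression detected: {err}")
    sys.exit(1)  # fail the pipeline

registry.update(report, dataset="squad_golden")  # refresh the baseline on pass
sys.exit(0)
```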
### Adversarial query simulation

```python
from ragfallback.mlops import QuerySimulator

sim = QuerySimulator()
queries = ["What is the refund policy?", "How do API rate limits work?"]

# 4 types: short_keyword, long_nl, ambiguous, out_of_domain
mixed = sim.simulate(queries)

# All 4 types for every query — for stress testing
unhappy = sim.simulate_unhappy_paths(queries)
```
### Load testing

```python
from ragfallback.mlops import generate_locustfile

generate_locustfile("locustfile.py", endpoint="http://localhost:8000")
# Run: locust -f locustfile.py --host http://localhost:8000 --users 50
```
### CI regression gate (GitHub Actions)

The included workflow (the `mlops-regression-gate` job in `.github/workflows/test.yml`) runs on every push to `main`:

- Pulls 75 SQuAD samples from HuggingFace (open data, CC BY-SA 4.0)
- Indexes them in ChromaDB using `all-MiniLM-L6-v2` (no API key)
- Runs `GoldenRunner` async — computes recall@3, recall@5, latency P95
- Calls `compare_or_fail()` against `examples/baselines.json` (committed)
- Fails the pipeline if any metric regresses more than 5%

```bash
# Run the CI gate locally
python examples/build_golden_dataset.py  # one-time setup
python examples/ci_regression_gate.py    # exits 0 (pass) or 1 (fail)
```
## Contributing

See CONTRIBUTING.md. The quick version: run `pytest tests/unit/ -v` before any PR, follow Google-style docstrings, use `logging` not `print`, and update `__all__` in the subpackage `__init__.py`.

## License · Changelog

MIT License — see LICENSE. Full version history in CHANGELOG.md.