Framework-agnostic benchmarking for hybrid RAG retrieval pipelines
retrieval-observatory (retobs)
Most RAG evaluation tools score end-to-end answer quality and stop there. They don't tell you which stage helped, what it cost in latency, or which queries are likely to fail before you run retrieval. retobs is an open-source multi-stage retrieval benchmark and local dashboard that reports per-stage contribution, failure diagnoses, latency-quality tradeoffs, and query difficulty, so you can decide whether to add that reranker (or switch to dense) with evidence, not intuition.
Headline result: On BEIR/FiQA, dense retrieval (all-MiniLM-L6-v2) outperforms BM25 by +132% NDCG@10 (0.369 vs 0.159) at ~130× lower latency than cross-encoder reranking. On SciFact and FiQA, dense-only is the sole Pareto-optimal pipeline. On NFCorpus, dense/rerank/RRF NDCG CIs overlap — no single winner on quality alone.
Figure: Quality-latency tradeoff (NFCorpus Pareto frontier)
Install
pip install "retrieval-observatory[demo,dashboard,dense]"
For development from source:
git clone https://github.com/AmeyaKI/retrieval-observatory.git && cd retrieval-observatory
python -m venv .venv && source .venv/bin/activate
pip install -e ".[demo,dashboard,dense]"
Quickstart (~5 minutes)
Run BM25 on 50 SciFact queries, then open the dashboard.
PyPI install (bundled example config):
CFG="$(python -c 'from retrieval_observatory import EXAMPLES_DIR; print(EXAMPLES_DIR / "quickstart_scifact.yaml")')"
retobs validate --config "$CFG"
retobs run --config "$CFG"
retobs serve --db .retobs/quickstart_scifact.db
From a git clone (repo examples/ tree):
retobs validate --config examples/quickstart_scifact.yaml
retobs run --config examples/quickstart_scifact.yaml
retobs serve --db .retobs/quickstart_scifact.db
Open http://localhost:8000 — explore metrics, latency, and query-level diagnostics.
Full examples and BEIR publish configs
The PyPI wheel includes quickstart YAMLs only. For the full examples/ demos (HTTP quickstart, temporal demo, dashboard demo with JSONL data) and multi-dataset BEIR sweeps, clone the repo:
git clone https://github.com/AmeyaKI/retrieval-observatory.git
cd retrieval-observatory
./scripts/run_beir_publish.sh full-sweep # uses configs/beir_publish/
Benchmark Results
Cross-dataset summary (full BEIR test splits, 4 independent pipelines). See results/BENCHMARK_ANALYSIS.md for motivation, Pareto analysis, classifier calibration, and limitations.
| Dataset | bm25 NDCG@10 | dense_only | rrf_hybrid | bm25__rerank | Pareto optimal |
|---|---|---|---|---|---|
| NFCorpus (323q) | 0.264 | 0.310 | 0.304 | 0.310 | bm25, dense_only |
| SciFact (300q) | 0.544 | 0.640 | 0.623 | 0.628 | dense_only |
| FiQA (648q) | 0.159 | 0.369 | 0.290 | 0.260 | dense_only |
Four pipelines: bm25, dense_only, rrf_hybrid, bm25__rerank. Stage attribution uses the bm25 → bm25__rerank prefix pair only. JSON exports and regeneration: results/RESULTS_OVERVIEW.md.
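To make the "Pareto optimal" column concrete, here is a minimal sketch of a dominance check over (NDCG@10, latency) pairs. It is not retobs's own code. The NDCG values are the NFCorpus row above and the bm25 and bm25__rerank latencies are the P50 values from the stage contribution table, but the dense_only and rrf_hybrid latencies are made-up placeholders, so read the output as an illustration of the rule and not as a measurement.

```python
from typing import NamedTuple


class PipelineResult(NamedTuple):
    name: str
    ndcg_at_10: float
    latency_ms: float


def pareto_optimal(results):
    """Return names of pipelines that no other pipeline dominates.

    A pipeline is dominated when another one is at least as good on both
    axes (higher NDCG, lower latency) and strictly better on one of them.
    """
    frontier = []
    for cand in results:
        dominated = any(
            other.ndcg_at_10 >= cand.ndcg_at_10
            and other.latency_ms <= cand.latency_ms
            and (other.ndcg_at_10 > cand.ndcg_at_10
                 or other.latency_ms < cand.latency_ms)
            for other in results
        )
        if not dominated:
            frontier.append(cand.name)
    return frontier


nfcorpus = [
    PipelineResult("bm25", 0.264, 2.0),
    PipelineResult("dense_only", 0.310, 30.0),      # latency is a placeholder
    PipelineResult("rrf_hybrid", 0.304, 35.0),      # latency is a placeholder
    PipelineResult("bm25__rerank", 0.310, 4057.0),
]
print(pareto_optimal(nfcorpus))  # ['bm25', 'dense_only']
```

bm25 stays on the frontier because nothing is faster, even though its NDCG@10 is the lowest.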
What retobs tells you
Stage Contribution: bm25 → bm25__rerank
┌───────────────┬──────────┬──────────┬──────────────┬────────────────┐
│ Metric │ Before │ After │ Δ │ Significant? │
├───────────────┼──────────┼──────────┼──────────────┼────────────────┤
│ recall@10 │ 0.1190 │ 0.1380 │ +0.0190 (+16%)│ q=0.041 ✓ │
│ ndcg@10 │ 0.2640 │ 0.3100 │ +0.0460 (+17%)│ q=0.012 ✓ │
│ Latency P50 │ 2ms │ 4,057ms │ +4,055ms │ — │
└───────────────┴──────────┴──────────┴──────────────┴────────────────┘
- Stage attribution — What did each stage add in quality, cost, and latency? BH-corrected significance on paired queries.
- Failure diagnosis — Candidate misses, lexical mismatches, reranker drops — labeled per query.
- Latency–quality tradeoff — Pareto frontier and budget slider; see whether reranking is worth it at your latency budget.
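The "BH-corrected significance" in the stage attribution bullet refers to Benjamini-Hochberg adjustment of p-values across the metrics being tested. The sketch below is the textbook step-up procedure, written out for illustration. How retobs computes the underlying p-values is not shown here, and the input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical raw p-values for four metrics tested on the same paired queries.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [value < 0.05 for value in q]
```

A metric counts as significant when its q-value falls below the chosen false discovery rate, 0.05 in this sketch.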
Core promise:
- Comparable Recall@K, NDCG@K, MRR, MAP, latency percentiles, and estimated cost per 1k queries across pipelines.
- Multi-stage pipelines with independent stage analysis and temporal recall for time-sensitive datasets.
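For readers who want the metric definitions spelled out, here is a minimal sketch of Recall@K and NDCG@K for a single query. It assumes linear gain (the relevance grade itself) and a log2 discount, which is one common convention. retobs may use a different gain function, so treat this as a reference point and not as its exact formula.

```python
import math


def recall_at_k(ranked_ids, qrels, k):
    """Fraction of relevant documents (grade > 0) found in the top k."""
    relevant = {doc_id for doc_id, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, qrels, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_grades = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(
        grade / math.log2(position + 2)
        for position, grade in enumerate(ideal_grades)
    )
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical graded labels and ranking for one query.
qrels = {"a": 2, "b": 1}
ranking = ["b", "x", "a"]
```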
How It's Different
| Tool | What it measures |
|---|---|
| BEIR | End-to-end pipeline accuracy on fixed datasets |
| RAGAs / TruLens | Answer quality given retrieved context |
| retobs | Per-stage contribution: what did each stage add in quality, cost, and latency? |
retobs is not a leaderboard and not an answer evaluator. It's a diagnostic layer between "I have a retrieval pipeline" and "I understand how to improve it."
Install (development)
python -m venv .venv
source .venv/bin/activate
# Full local development setup
pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
For a smaller install:
pip install -e ".[demo,dashboard]"
Stage Attribution in 60 Seconds
Add ablations: true to your combinations config and retobs automatically runs the prefix pipeline too:
stages:
  bm25:
    type: adapter.bm25
    config: {k: 100}
  rerank:
    type: adapter.hf_crossencoder
    config:
      model: cross-encoder/ms-marco-MiniLM-L-6-v2
      k: 10
combinations:
  include:
    - [bm25, rerank]
  ablations: true  # automatically also runs [bm25] alone; no extra config needed
retobs run then prints the stage contribution table showing exactly what the reranker added.
For a 3-stage pipeline, ablations: true generates all valid ordered subsequences — not just prefixes:
combinations:
  include:
    - [bm25, fast_rerank, precise_rerank]
  ablations: true
  # Generates: bm25 | bm25__fast_rerank | bm25__precise_rerank | bm25__fast_rerank__precise_rerank
  # Answers: does skipping fast_rerank and going direct to precise_rerank beat the cascade?
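The expansion listed in the comment above can be reproduced with a few lines of Python. This sketch assumes that a "valid" subsequence is one that keeps the first-stage retriever and preserves stage order. That rule is inferred from the four pipeline IDs listed and is not taken from the retobs source, and `ablation_pipelines` is a hypothetical helper name.

```python
from itertools import combinations


def ablation_pipelines(stages):
    """Expand one pipeline into every ordered subsequence that keeps stage 0."""
    first, rest = stages[0], stages[1:]
    pipelines = []
    for size in range(len(rest) + 1):
        # combinations() preserves the original stage order.
        for subset in combinations(rest, size):
            pipelines.append("__".join((first, *subset)))
    return pipelines


print(ablation_pipelines(["bm25", "fast_rerank", "precise_rerank"]))
# ['bm25', 'bm25__fast_rerank', 'bm25__precise_rerank',
#  'bm25__fast_rerank__precise_rerank']
```

Under this rule a pipeline with n stages expands to 2^(n-1) runs, which is worth keeping in mind before enabling ablations on long cascades.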
To test only whether a specific stage pays for itself, name it explicitly:
combinations:
  include:
    - [bm25, fast_rerank, precise_rerank]
  ablations: [fast_rerank]  # generates only: without fast_rerank vs with fast_rerank
Optionally set a latency budget to get a one-line verdict in CI:
retobs run --config my_experiment.yaml --latency-budget-ms 1000
Query Difficulty Classifier
Predict whether a query will be hard for retrieval before running your pipeline, using only query text. Labels come from post-hoc diagnostics (mean Recall across pipelines on a specific corpus), so models are dataset-specific.
# Install classifier dependencies
pip install -e ".[classifier]"
# After one or more benchmark runs on the same dataset:
retobs classifier train --dataset beir/nfcorpus
# Inspect cross-val accuracy, Brier score, and feature importances:
retobs classifier report --dataset beir/nfcorpus
# Score a single query:
retobs classifier predict --model .retobs/models/query_difficulty_beir_nfcorpus.joblib \
--query "What mitochondrial mechanisms were studied since 2019?"
# Next benchmark run auto-applies predictions when a matching model exists
retobs run --config my_experiment.yaml
The dashboard shows Classifier Calibration: mean Recall@10 (with bootstrap CIs) grouped by predicted difficulty. If predicted-hard queries have lower Recall@10 than predicted-easy ones, the classifier is doing useful work.
Caveat: The classifier predicts observatory difficulty under your pipelines on your corpus—not intrinsic question hardness. Train and evaluate on the same dataset; cross-dataset use is unsupported.
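The calibration check described above can be approximated offline. The sketch below groups per-query Recall@10 by predicted difficulty and attaches a percentile bootstrap confidence interval to each group mean. The row format, the bucket labels, and the resampling settings are assumptions made for illustration, not the dashboard's actual implementation.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    n = len(values)
    resampled = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(values), lower, upper


def calibration_by_bucket(rows):
    """rows: iterable of (predicted_label, recall_at_10) pairs."""
    buckets = {}
    for predicted, recall in rows:
        buckets.setdefault(predicted, []).append(recall)
    return {label: bootstrap_ci(vals) for label, vals in buckets.items()}


# Hypothetical per-query results.
rows = [("easy", 0.8), ("easy", 0.9), ("easy", 0.7), ("easy", 1.0),
        ("hard", 0.1), ("hard", 0.2), ("hard", 0.0), ("hard", 0.3)]
report = calibration_by_bucket(rows)
```

If the interval for the predicted-hard bucket sits clearly below the interval for the predicted-easy bucket, the classifier is separating queries in a useful way.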
HTTP Quickstart
If your retrieval service is already running, point retobs at it and get metrics immediately:
# Start the mock server
pip install fastapi uvicorn rank-bm25
uvicorn examples.http_quickstart.server:app --port 8000
# Benchmark it
retobs run --config examples/http_quickstart/config.yaml
The HTTP adapter POSTs {"query": str, "k": int} and expects {"results": [{"id", "text", "score"}]}.
Quick Test Of The Observatory
# 1. Install/update editable package
source .venv/bin/activate
pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
# 2. Confirm CLI commands are registered
retobs --help
# 3. Generate a starter experiment config
retobs init --mode bm25+reranker --output my_experiment.yaml
# 4. Validate before running
retobs validate --config my_experiment.yaml
# 5. Run the benchmark (stage attribution table printed automatically)
retobs run --config my_experiment.yaml --no-cache
# 6. Open the interactive dashboard
retobs serve --db .retobs/results.db --port 8000
Open http://localhost:8000 — move the latency budget slider and watch the stage verdict update live.
Load multiple result databases in one dashboard (sidebar tabs per DB):
retobs serve --db .retobs/publish_smoke_scifact.db --db .retobs/dashboard_demo.db
# or comma-separated:
retobs serve --db .retobs/a.db,.retobs/b.db
# or env var (colon-separated):
RETOBS_DASHBOARD_DBS=.retobs/a.db:.retobs/b.db retobs serve
YAML Stage Combinations
You can define stages once and ask retobs to expand the exact combinations you want to benchmark.
experiment:
  name: my-rag-sweep
dataset:
  type: custom
  name: custom
  queries_path: data/queries.jsonl
  corpus_path: data/corpus.jsonl
  timestamp_field: timestamp
  metadata_fields: [source]
stages:
  bm25:
    type: adapter.bm25
    config: {k: 100}
  dense:
    type: adapter.hf_biencoder
    config:
      model: sentence-transformers/all-MiniLM-L6-v2
      k: 100
  rerank:
    type: adapter.hf_crossencoder
    config:
      model: cross-encoder/ms-marco-MiniLM-L-6-v2
      k: 10
combinations:
  include:
    - [bm25, rerank]
    - [dense, rerank]
  ablations: true  # auto-generates [bm25] and [dense] prefix pipelines
metrics:
  recall_at_k: [1, 5, 10, 20]
  precision_at_k: [5, 10]
  ndcg_at_k: [10]
  mrr: true
  map: true
execution:
  concurrency: 4
  timeout_seconds: 60
  cache_results: true
output:
  store: sqlite
  db_path: .retobs/results.db
Expanded pipeline IDs are stable, for example bm25, dense, bm25__rerank, and dense__rerank.
Cost is configured for relative tradeoff analysis:
costs:
  bm25:
    per_1k_queries: 0.10
  rerank:
    per_1k_queries: 1.50
retobs run and the dashboard both treat this as an estimated cost model from your YAML, not measured cloud billing telemetry.
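As a rough illustration of how such a cost model could roll up to a pipeline, the sketch below sums the configured per-stage values over the stages named in a pipeline ID. It assumes that stage costs are additive and that stages without an entry cost nothing. Both are assumptions of this sketch, and `pipeline_cost_per_1k` is a hypothetical helper.

```python
def pipeline_cost_per_1k(pipeline_id, costs):
    """Sum per_1k_queries over the stages encoded in an ID like 'bm25__rerank'."""
    return sum(
        costs.get(stage, {}).get("per_1k_queries", 0.0)
        for stage in pipeline_id.split("__")
    )


# Mirrors the YAML cost block above.
costs = {
    "bm25": {"per_1k_queries": 0.10},
    "rerank": {"per_1k_queries": 1.50},
}
estimate = pipeline_cost_per_1k("bm25__rerank", costs)
```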
Stage cache note: When execution.cache_results: true, retrieval stages are cached by hash(stage_config + upstream_candidates + query_id). The upstream candidate fingerprint ensures that two pipelines sharing the same reranker but with different first-stage retrievers (e.g. bm25→rerank vs dense→rerank) never share reranker snapshots. Stage 0 (first retriever) still shares cache entries across ablation combos as intended. Use --no-cache when you want fully independent execution for reproducibility auditing.
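A minimal sketch of a cache key with these properties is shown below. The serialization format and the choice of SHA-256 are assumptions made for illustration. The only thing taken from the note above is the set of inputs: stage config, upstream candidates, and query ID.

```python
import hashlib
import json


def stage_cache_key(stage_config, upstream_candidates, query_id):
    """Hash the stage config, the upstream candidate IDs, and the query ID."""
    payload = json.dumps(
        {
            "config": stage_config,
            "upstream": upstream_candidates,  # [] for a first-stage retriever
            "query_id": query_id,
        },
        sort_keys=True,  # stable key order gives a stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rerank_cfg = {"model": "cross-encoder/ms-marco-MiniLM-L-6-v2", "k": 10}
# Hypothetical candidate lists produced by two different first stages.
key_after_bm25 = stage_cache_key(rerank_cfg, ["d1", "d2"], "q1")
key_after_dense = stage_cache_key(rerank_cfg, ["d3", "d1"], "q1")
```

The same reranker config yields different keys once the upstream candidates differ, while a first-stage retriever with an empty upstream list always maps to the same key for a given query.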
HTTP adapter schema
The adapter.http stage wraps any REST endpoint. Your server must accept:
Request — POST with JSON body:
{"query": "user question text", "k": 100}
When query filters are set, a filters object is also included.
Response — JSON in either shape:
{"documents": [{"id": "doc_1", "text": "...", "score": 0.92}]}
[{"id": "doc_1", "text": "...", "score": 0.92}]
Each document object must include the configured ID field (default id). Text and score fields default to text and score but can be remapped:
- type: adapter.http
  url: http://localhost:8080/retrieve
  config:
    k: 100
    id_field: doc_id
    text_field: content
    score_field: relevance
See [examples/http_quickstart/server.py](examples/http_quickstart/server.py) for a reference implementation.
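To show how the two response shapes and the field remapping fit together, here is a small sketch of client-side normalization. It is not the adapter's actual code. In particular, defaulting a missing text to an empty string and a missing score to None is an assumption of this sketch, and `normalize_response` is a hypothetical name.

```python
def normalize_response(payload, id_field="id", text_field="text",
                       score_field="score"):
    """Map either response shape onto a list of {id, text, score} dicts."""
    # Accept {"documents": [...]} as well as a bare JSON list.
    docs = payload["documents"] if isinstance(payload, dict) else payload
    normalized = []
    for doc in docs:
        if id_field not in doc:
            raise ValueError(f"document is missing required field {id_field!r}")
        normalized.append({
            "id": doc[id_field],
            "text": doc.get(text_field, ""),
            "score": doc.get(score_field),
        })
    return normalized


wrapped = {"documents": [{"id": "doc_1", "text": "...", "score": 0.92}]}
remapped = [{"doc_id": "doc_7", "content": "hello", "relevance": 0.5}]
```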
Custom Python retriever via adapter.import
Use adapter.import to load a Python factory callable from your own module without editing retobs internals:
- type: adapter.import
  retriever_id: keyword
  config:
    factory: retriever:build_retriever
    k: 10
Supported factory paths:
- package.module:callable
- package.module.callable
Factory signature:
def build_retriever(corpus: dict | None, stage_cfg: dict, **kwargs):
    ...
    return retriever_or_reranker, k
Runnable example: [examples/custom_retriever/](examples/custom_retriever/)
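For orientation, both factory path styles can be resolved with the standard library alone. The sketch below shows one way to do it. It is not how retobs necessarily implements adapter.import, and `load_factory` is a hypothetical helper.

```python
import importlib


def load_factory(path):
    """Resolve 'package.module:callable' or 'package.module.callable'."""
    if ":" in path:
        module_name, attr = path.split(":", 1)
    else:
        # Treat the last dotted component as the callable name.
        module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    factory = getattr(module, attr)
    if not callable(factory):
        raise TypeError(f"{path!r} does not point to a callable")
    return factory


# Demonstrated on a standard-library function so the snippet runs anywhere.
dumps = load_factory("json:dumps")
```

The colon form is unambiguous about where the module ends, which helps when a package and a callable could share a name.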
Custom Dataset Format
queries.jsonl
{"query_id":"q1","text":"What changed in the refund policy?","relevant_doc_ids":{"doc_17":2,"doc_22":1},"temporal_anchor":"2024-01-15T00:00:00"}
relevant_doc_ids can be a list for binary labels or a dict for graded relevance.
corpus.jsonl
{"id":"doc_17","title":"Refund policy update","text":"Refunds are now processed within 7 days.","timestamp":"2024-01-10T00:00:00"}
Optional qrels.jsonl
{"query_id":"q1","doc_id":"doc_17","grade":2}
qrels.tsv in TREC-style format is also supported.
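Because relevant_doc_ids may be either a list or a dict, a loader has to normalize the two forms. A minimal sketch follows. Mapping binary labels to grade 1 is an assumption of this sketch, and `parse_query_line` is a hypothetical helper, not part of retobs.

```python
import json


def parse_query_line(line):
    """Parse one queries.jsonl line into (query_id, text, graded_labels)."""
    record = json.loads(line)
    labels = record.get("relevant_doc_ids", {})
    if isinstance(labels, list):
        # Binary labels: treat every listed document as grade 1.
        labels = {doc_id: 1 for doc_id in labels}
    return record["query_id"], record["text"], labels


graded = parse_query_line(
    '{"query_id":"q1","text":"What changed in the refund policy?",'
    '"relevant_doc_ids":{"doc_17":2,"doc_22":1}}'
)
binary = parse_query_line(
    '{"query_id":"q2","text":"Hypothetical query","relevant_doc_ids":["doc_3"]}'
)
```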
LLM-Assisted Labels
Gold labels are the default and remain the recommended evaluation source.
For unlabeled datasets, you can opt into LLM-assisted labels:
labels:
  mode: pooled_llm_judge  # gold, llm_judge, or pooled_llm_judge
  judge: gemini  # gemini, openai, or anthropic
  model: gemini-2.0-flash
  cache_path: .retobs/llm_judge_cache.db
Dashboard Features
| Feature | Description |
|---|---|
| Stage Attribution | Before/after metric table for each pipeline pair with BH-corrected significance. |
| Tradeoff Explorer | Latency budget + min quality delta sliders; verdict computed client-side. |
| Experiment Overview | Headline winner, difficulty buckets, failure-label summary, reproducibility warnings. |
| Pipeline Architecture | Stage-by-stage flow diagram with per-stage quality and latency. |
| Stage Combination Matrix | Compact view of quality, latency, and optional cost-per-1k by pipeline/stage. |
| Query Explorer | Query-level diagnostics with failure labels, missing relevant IDs, and difficulty bucket. |
| Run Comparison | Side-by-side metrics with query-ID-aligned paired bootstrap p-values. |
| Recall@K Curves | Recall trends across K with BEIR reference lines when available. |
| Stage Recall Funnel | Shows how much candidate recall survives through reranking stages. |
| Latency Breakdown | P50/P95/P99 plus profiling metrics for compute, network, and retries. |
| Segment Analysis | NDCG@10 by query metadata such as number of relevant docs. |
Example Runs
BEIR BM25 Baseline
retobs validate --config examples/beir_demo.yaml
retobs run --config examples/beir_demo.yaml
retobs serve --db .retobs/beir_demo.db
Three-Way NFCorpus Comparison
pip install -e ".[demo,dashboard,dense]"
retobs validate --config examples/nfcorpus_three_way.yaml
retobs run --config examples/nfcorpus_three_way.yaml --no-cache
retobs serve --db .retobs/nfcorpus_three_way.db
Temporal Recall Demo
pip install -e ".[demo,dashboard]"
python examples/temporal_demo/generate_data.py
retobs run --config examples/temporal_demo/config.yaml --no-cache
retobs serve --db .retobs/temporal_demo.db
This demo intentionally includes old and new relevant documents per query so recall@1 and temporal_recall@1 diverge when top-ranked hits are stale.
RRF Hybrid (BM25 + Dense)
pip install -e ".[demo,dashboard,dense]"
retobs run --config examples/rrf_hybrid.yaml
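For readers new to reciprocal rank fusion, the idea fits in a few lines: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The sketch below uses the commonly cited constant k = 60. Whether the example config uses the same constant or tie-breaking rule is not stated here, so treat both as assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one list, best document first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists from a lexical and a dense retriever.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['b', 'a', 'c']
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale.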
Dense vs BM25+Cohere Hybrid
pip install -e ".[demo,dashboard,dense,cohere]"
export COHERE_API_KEY=your-key-here
retobs run --config examples/hybrid_comparison.yaml
CLI Reference
retobs init --mode MODE --output PATH Generate starter config and sample data
retobs validate --config PATH [--db PATH] Validate config and dataset before running
retobs run --config PATH [--no-cache] Run a benchmark experiment
[--latency-budget-ms N] Print verdict against stage latency delta
retobs serve --db PATH [--db PATH ...] [--port N] Start dashboard (repeat --db for multiple SQLite files)
retobs compare RUN_ID_1 RUN_ID_2 --db PATH Compare runs with paired bootstrap tests
retobs inspect RUN_ID --query QUERY_ID [--pipeline ID] Debug per-query retrieval results
Init modes: beir, custom-jsonl, http-endpoint, bm25+dense (includes RRF), bm25+reranker (includes ablations).
Run The Test Suite
source .venv/bin/activate
pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
pytest tests/ -q
npm --prefix retrieval_observatory/dashboard/ui run build
python -m compileall retrieval_observatory -q
Dashboard Development
The dashboard UI is pre-built in the PyPI wheel, so retobs serve works after pip install with no Node.js required. When developing from a git clone and editing React sources, rebuild the UI:
cd retrieval_observatory/dashboard/ui
npm install
npm run dev # hot-reloading dev server on :5173 (proxies API to retobs serve)
npm run build # rebuild dist/ before python -m build or tagging a release
Or use make dashboard-dev / make dashboard-build from the repo root.
Optional Dependency Groups
| Group | Installs | Use for |
|---|---|---|
| demo | beir, datasets, rank-bm25 | Running BEIR datasets with BM25 |
| dashboard | fastapi, uvicorn, python-multipart | Serving the dashboard and accepting uploads |
| dense | sentence-transformers, faiss-cpu, torch | Dense bi-encoder retrieval and local cross-encoder reranking |
| dev | pytest, pytest-asyncio, coverage, respx | Running tests |
| cohere | cohere | Cohere reranking |
| langchain | langchain-core | LangChain adapter (programmatic use) |
| llamaindex | llama-index-core | LlamaIndex adapter (programmatic use) |
| pgvector | asyncpg, pgvector | Pgvector adapter |
| llm-judge | google-generativeai, anthropic, openai | LLM-assisted relevance judging |
PostgreSQL backend (asyncpg) is community-supported and not CI-tested. SQLite is recommended for evaluation workloads.
pip install -e ".[demo,dashboard,dense,dev,llm-judge]"