A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
RAGVue is a reference-free evaluation framework for Retrieval-Augmented Generation (RAG) systems that goes beyond single scores.
It provides interpretable diagnostics across retrieval, answer quality, and factual grounding, helping you pinpoint why a RAG output failed (retrieval vs generation vs grounding).
✨ What you get
- 📊 Manual Mode – choose the metrics you want
- 🤖 Agentic Mode – automatically selects and runs the right diagnostics
- 🖥️ Streamlit UI – no-code, interactive evaluation
- 🔧 Multiple Interfaces – two layers, each with the right scope:
  - Evaluation engine: Python API · CLI (ragvue-cli, ragvue-py) · FastAPI REST API
  - Interactive analysis: Streamlit UI (history, comparison, longitudinal, diagnostics)
- 🧠 Multi-LLM Judge Backend – OpenAI (default) or Anthropic Claude, switchable per run
- 18 Reference-Free Metrics, plus 4 optional Lightweight Local Metrics (no API calls)
📦 Installation
Install with pip (recommended)
pip install ragvue # core evaluation engine (OpenAI judge)
pip install "ragvue[ui]" # + Streamlit dashboard
pip install "ragvue[anthropic]" # + Anthropic/Claude judge support
pip install "ragvue[local]" # + local metrics (no API, scikit-learn)
pip install "ragvue[api]" # + FastAPI REST server
pip install "ragvue[all]" # everything
Latest release: https://pypi.org/project/ragvue/#history
Install from source
git clone <repo-url> ragvue
cd ragvue
pip install -e . # core (OpenAI judge)
pip install -e ".[ui]" # + Streamlit dashboard
pip install -e ".[anthropic]" # + Anthropic/Claude judge support
pip install -e ".[local]" # + local metrics (scikit-learn)
pip install -e ".[api]" # + FastAPI REST server
pip install -e ".[all]" # everything
Set up API keys
RAGVue uses LLMs for evaluation. It supports OpenAI (default) and Anthropic Claude as judge backends.
Create a .env file in the root directory and add the key(s) for the provider you want to use:
# For OpenAI (default)
OPENAI_API_KEY=<your-key-here>
# For Claude/Anthropic (optional)
ANTHROPIC_API_KEY=<your-key-here>
To switch to Claude:
export RAGVUE_JUDGE_PROVIDER=anthropic # or set in .env
Or select it in the Streamlit sidebar. Default provider is OpenAI (gpt-4o-mini); Claude uses claude-haiku-4-5-20251001 by default.
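If you are working in Python (for example, in a notebook), you can set the same variable programmatically before running an evaluation. A minimal sketch, assuming RAGVue reads RAGVUE_JUDGE_PROVIDER from the environment at run time, as the export above suggests:

```python
import os

# Assumption: RAGVue picks up RAGVUE_JUDGE_PROVIDER from the environment,
# mirroring the shell export shown above. Set it before evaluating.
os.environ["RAGVUE_JUDGE_PROVIDER"] = "anthropic"
```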
The RAG Advisor chat offers a separate model selector with six options across both providers:
| Model | Provider | Tier |
|---|---|---|
| gpt-4o-mini | OpenAI | fast (default) |
| gpt-4o | OpenAI | capable |
| gpt-3.5-turbo | OpenAI | budget |
| claude-haiku-4-5-20251001 | Anthropic | fast |
| claude-sonnet-4-6 | Anthropic | balanced |
| claude-opus-4-6 | Anthropic | powerful |
🏗️ Interface Design
RAGVue has two distinct layers. Understanding the split helps you choose the right interface.
| Layer | Interfaces | Purpose |
|---|---|---|
| Evaluation Engine | Python API, CLI (ragvue-cli, ragvue-py), FastAPI REST API | Headless computation – takes input items, runs metrics, returns scores. Fits into pipelines, CI/CD, notebooks, and custom tooling. |
| Interactive Analysis | Streamlit UI | Visualization, report history, run comparison, longitudinal tracking, chatbot diagnostics. Designed for humans, not pipelines. |
Rule of thumb:
- Automating or integrating into a system → use the API or CLI
- Exploring results, debugging, comparing runs → use the Streamlit UI

Not every UI feature exists in the API by design: report history, longitudinal tracking, and the diagnostic agent are analysis-layer features that belong in the dashboard, not in a REST endpoint.
🔧 Usage
RAGVue can be used via:
- A. Python API
- B. CLI tools (ragvue-cli & ragvue-py)
- C. Streamlit UI (no-code)
- D. FastAPI REST API
A. Python API
from ragvue import evaluate, load_metrics

# Each item carries the question, the generated answer, and the retrieved contexts.
items = [
    {"question": "...", "answer": "...", "contexts": [...]}
]

# load_metrics() returns the registry of available metrics; run them all here.
metrics = load_metrics().keys()
report = evaluate(items, metrics=list(metrics))
print(report)
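To run a targeted subset instead of every registered metric, pass metric names directly. A minimal sketch, assuming the snake_case names used in the REST API examples below (verify the exact keys with load_metrics()):

```python
from ragvue import evaluate

items = [
    {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "contexts": ["Paris is the capital and largest city of France."],
    }
]

# Metric names here are assumed to match the REST API examples;
# check load_metrics() for the canonical list.
report = evaluate(items, metrics=["answer_relevance", "strict_faithfulness"])
print(report)
```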
B. Command-Line Interface (CLI)
1. ragvue-cli (main CLI)
Help & List available metrics
ragvue-cli --help
ragvue-cli list-metrics
Manual mode
ragvue-cli eval --inputs <your_data.jsonl> --metrics <metric_name> --out-base report_manual --formats "json,md,csv"
Agentic mode
ragvue-cli agentic --inputs <your_data.jsonl> --out-base report_agentic --formats "json,md,csv"
2. ragvue-py (lightweight Python runner)
Help
ragvue-py --help
Manual mode
ragvue-py --input <your_data.jsonl> --metrics <metrics> --out-base report_manual --skip-agentic
Agentic mode
ragvue-py --input <your_data.jsonl> --metrics <metrics> --agentic-out report_agentic --skip-manual
C. Streamlit UI (No-Code Interface)
Launch the UI:
streamlit run streamlit_app.py
Features
- Upload JSONL files
- Manual & Agentic metric selection
- Judge provider selector – switch between OpenAI and Claude directly from the sidebar
- API key input (OpenAI or Anthropic, depending on selected provider)
- Global summary dashboard
- Individual case-level diagnostic views
- Multi-format export (JSON, Markdown, CSV, HTML)
- Live progress bar – per-item progress during evaluation
- Custom report labels – name a report before running for easy identification
- Retrieval Only mode – evaluate the retrieval pipeline without requiring generated answers
- Report history – last 10 reports saved automatically; view, filter, and delete from the Reports tab
- Report comparison – select two saved reports side-by-side and inspect per-metric deltas (B − A)
- Longitudinal tracking – persistent run registry with pipeline version and notes tags; themed metric trend chart (Plotly) and themed table across runs; automatic regression detection between the two most recent runs
- Your RAG Advisor (Early Access: not yet validated on real datasets; feedback welcome at ragvue.license@gmail.com) – an AI research thinking partner for RAG architecture advice, with four sub-tabs:
  - Getting Started – mini walkthrough explaining what the advisor is, how a session works, what to share, and what it can/can't do
  - Chat – streaming responses (token-by-token), file/diagram upload, quick starters, export; active architecture profile shown as a status chip; auto-inject banner shares your latest evaluation results in one click; 6 model options across OpenAI and Anthropic tiers; two share modes: 📄 Summary (mean scores per metric) and 🔍 Case Inspector (select any item from any saved report; injects the full question, answer, contexts, and per-metric diagnostic fields)
  - My Profile – save multiple architecture configurations (retriever, chunk size, embedding, LLM, framework, domain); full edit-in-place support; the active profile is injected into every conversation automatically
  - Analysis Tools – Before/After report comparison, Hypothesis Testing, Guided Diagnosis, Failure Mode Scanner (identifies active RAG failure modes and prioritises top issues), and Suggest Next Experiment (recommends the single highest-ROI next change with expected metric impact)
- Three UI themes – Light (default), Dark (soft dim), and Beige; selectable from the ⚙️ Settings sidebar; all components, including tables and charts, adapt to the active theme
D. FastAPI REST API
Install with API dependencies:
pip install -e ".[all]"
Start the server:
ragvue-api # default: 0.0.0.0:8000
ragvue-api --port 9000 # custom port
ragvue-api --host 127.0.0.1 # custom host
ragvue-api --reload # auto-reload for development
Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Server health check + loaded metric count |
| GET | /metrics | List all available metric names |
| POST | /evaluate | Evaluate a single item with chosen metrics |
| POST | /evaluate/batch | Evaluate multiple items (returns per-metric mean) |
| POST | /evaluate/agentic | Agentic evaluation – auto-selects metrics per item |
Example requests
Health check:
curl http://localhost:8000/health
List metrics:
curl http://localhost:8000/metrics
Evaluate a single item:
curl -X POST http://localhost:8000/evaluate \
-H "Content-Type: application/json" \
-d '{
"item": {
"question": "What is the capital of France?",
"answer": "The capital of France is Paris.",
"contexts": ["Paris is the capital and largest city of France."]
},
"metrics": ["answer_relevance", "strict_faithfulness"]
}'
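The same request from Python, sketched with the third-party requests library (not a RAGVue dependency; the exact response schema is documented at /docs):

```python
import requests

payload = {
    "item": {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "contexts": ["Paris is the capital and largest city of France."],
    },
    "metrics": ["answer_relevance", "strict_faithfulness"],
}

# POST the item to a locally running ragvue-api server.
response = requests.post("http://localhost:8000/evaluate", json=payload, timeout=120)
response.raise_for_status()
print(response.json())  # see http://localhost:8000/docs for the response schema
```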
Batch evaluation:
curl -X POST http://localhost:8000/evaluate/batch \
-H "Content-Type: application/json" \
-d '{
"items": [
{"question": "...", "answer": "...", "contexts": ["..."]},
{"question": "...", "answer": "...", "contexts": ["..."]}
],
"metrics": ["answer_relevance", "clarity"]
}'
Agentic evaluation (auto-selects metrics):
curl -X POST http://localhost:8000/evaluate/agentic \
-H "Content-Type: application/json" \
-d '{
"items": [
{"question": "...", "answer": "...", "contexts": ["..."]}
]
}'
The interactive API docs are available at http://localhost:8000/docs once the server is running.
📄 Input Format
RAGVue expects JSONL like:
{"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}
Metrics Overview
Inputs key: Q = Question, A = Answer, C = Contexts
Core Evaluation Metrics (6)
| Category | Metric | Inputs | Description |
|---|---|---|---|
| Retrieval Metrics | Retrieval Relevance | Q, C | Evaluates how useful each retrieved chunk is for addressing the information needs of the question, based on per-chunk relevance scoring. |
| Retrieval Metrics | Retrieval Coverage | Q, C | Assesses whether the retrieved context collectively provides sufficient coverage for all sub-aspects required to answer the question. |
| Answer Metrics | Answer Relevance | Q, A | Measures how well the answer aligns with the intent and scope of the question, identifying missing, irrelevant, or off-topic content. |
| Answer Metrics | Answer Completeness | Q, A | Determines whether the answer fully addresses all aspects of the question without omissions. |
| Answer Metrics | Clarity | A | Evaluates the linguistic quality of the answer, including grammar, fluency, logical flow, coherence, and overall readability. |
| Grounding | Strict Faithfulness | A, C | Evaluates how many factual claims in the answer are directly supported by the retrieved context, enforcing strict evidence alignment (entity accuracy and temporal correctness). |
Calibration Metrics (6)
| Metric | Inputs | Description |
|---|---|---|
| Calibration: Retrieval Relevance | Q, C | Measures score stability of retrieval relevance across judge configurations. |
| Calibration: Retrieval Coverage | Q, C | Measures score stability of retrieval coverage across judge configurations. |
| Calibration: Answer Relevance | Q, A | Measures score stability of answer relevance across judge configurations. |
| Calibration: Answer Completeness | Q, A | Measures score stability of answer completeness across judge configurations. |
| Calibration: Clarity | A | Measures score stability of clarity across judge configurations. |
| Calibration: Strict Faithfulness | A, C | Measures score stability of strict faithfulness across judge configurations. |
[NEW] Reference-Free Complex Failure Mode Metrics (6)
These metrics detect failure modes that the core metrics miss – no reference answers required.
| Category | Metric | Inputs | What it catches | Returns |
|---|---|---|---|---|
| Context Usage | Context Utilization | Q, A, C | Retrieved context is fetched but ignored – the answer doesn't actually use the evidence. | utilized_chunks, unused_chunks, justification |
| Answer Quality | Answer Conciseness | Q, A | Verbose, repetitive, or filler-heavy answers that obscure the core response. | redundant_parts, filler_detected, justification |
| Answer Quality | Coherence | Q, A | Internal self-contradictions, logical fallacies, non-sequiturs, and circular reasoning within the answer. | contradictions, logical_issues, justification |
| Unanswerable Handling | Negative Rejection | Q, A, C | System confidently answers when the context doesn't support any answer (should say "I don't know"). | context_sufficient, answer_refuses, justification |
| Multi-Hop Reasoning | Multi-Hop Faithfulness | Q, A, C | Broken reasoning chains – each step may look fine individually, but the chain is invalid. | reasoning_chain, valid_hops, broken_hops, justification |
| Subtle Contradictions | Implicit Contradiction | Q, A, C | Subtle contradictions strict faithfulness misses: omitted qualifiers, shifted scope, negation flips, temporal misattribution. | contradictions, contradiction_types, justification |
[NEW] Lightweight Local Metrics (4)
All 4 local metrics run entirely on your machine with zero API calls, zero cost, and near-instant execution. Each metric takes a dictionary with question, answer, and contexts fields and returns a dictionary with a name, score (0.0 to 1.0), and additional details.
- Token Overlap
- Answer Length
- Context Similarity
- Readability
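For illustration only, here is what a metric conforming to that interface could look like. This is a sketch of the contract described above, not RAGVue's actual implementation:

```python
def token_overlap(item: dict) -> dict:
    """Toy token-overlap metric matching the described interface (illustrative only)."""
    answer_tokens = set(item["answer"].lower().split())
    context_tokens = set(" ".join(item["contexts"]).lower().split())
    overlap = answer_tokens & context_tokens
    # Fraction of answer tokens that also appear in the retrieved contexts.
    score = len(overlap) / len(answer_tokens) if answer_tokens else 0.0
    return {
        "name": "token_overlap",
        "score": round(score, 3),  # 0.0 to 1.0, as described above
        "details": {"overlapping_tokens": sorted(overlap)},
    }

print(token_overlap({
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."],
}))
```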
Metric Selection Guide
By use case
| Use case | Recommended metrics |
|---|---|
| Quick quality check | Answer Relevance, Strict Faithfulness, Clarity |
| Full evaluation | All core metrics |
| Hallucination audit | Strict Faithfulness, Implicit Contradiction, Multi-Hop Faithfulness |
| Retrieval pipeline debugging | Retrieval Relevance, Retrieval Coverage, Context Utilization |
| Production safety check | Negative Rejection, Strict Faithfulness, Coherence |
| Answer quality tuning | Clarity, Answer Conciseness, Coherence, Answer Completeness |
By input availability
| What you have | Metrics you can run |
|---|---|
| Q + C only | Retrieval Relevance, Retrieval Coverage |
| Q + A only | Answer Relevance, Answer Completeness, Clarity, Answer Conciseness, Coherence |
| Q + A + C | All metrics |
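For example, with question–answer pairs but no retrieved contexts, restrict the run to answer-side metrics from the table above. A sketch, assuming evaluate() accepts items without a contexts field when only Q + A metrics are requested:

```python
from ragvue import evaluate

# Q + A only: no "contexts" field, so only answer-side metrics apply.
# Assumption: evaluate() tolerates missing contexts for these metrics.
qa_items = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris."},
]

report = evaluate(qa_items, metrics=["answer_relevance", "clarity"])
print(report)
```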
📜 Licensing
RAGVue is released under the Apache License 2.0.
For full license text, see: https://www.apache.org/licenses/LICENSE-2.0
📩 Contact
For questions, please contact: ragvue.license@gmail.com
📖 Citation
Our demo paper has been accepted to EACL 2026 (Demo Track).
Title: RAGVue: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
Status: Accepted (EACL 2026 Demo Track)
Preprint: https://arxiv.org/abs/2601.04196
ACL Anthology: https://aclanthology.org/2026.eacl-demo.35/
If you use RAGVue in your research, please cite:
@inproceedings{murugaraj-etal-2026-ragvue,
title = "{RAGVUE}: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation",
author = "Murugaraj, Keerthana and
Lamsiyah, Salima and
Theobald, Martin",
editor = "Croce, Danilo and
Leidner, Jochen and
Moosavi, Nafise Sadat",
booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 3: System Demonstrations)",
month = mar,
year = "2026",
address = "Rabat, Marocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.eacl-demo.35/",
pages = "512--526",
ISBN = "979-8-89176-382-1",
}