
A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation


RAGVue logo

RAGVue is a reference-free evaluation framework for Retrieval-Augmented Generation (RAG) systems that goes beyond single scores.
It provides interpretable diagnostics across retrieval, answer quality, and factual grounding, helping you pinpoint why a RAG output failed (retrieval vs generation vs grounding).

What you get

  • Manual Mode - choose the metrics you want
  • Agentic Mode - automatically selects and runs the right diagnostics
  • Streamlit UI - no-code, interactive evaluation
  • Multiple Interfaces - two layers, each with the right scope:
    • Evaluation engine: Python API · CLI (ragvue-cli, ragvue-py) · FastAPI REST API
    • Interactive analysis: Streamlit UI (history, comparison, longitudinal, diagnostics)
  • Multi-LLM Judge Backend - OpenAI (default) or Anthropic Claude, switchable per run

18 Reference-Free Metrics

6 core evaluation metrics - retrieval, answer quality, grounding
6 calibration / stability metrics - judge agreement & sensitivity
6 complex failure-mode metrics - real-world RAG breakdown patterns

Optional: 4 Lightweight Local Metrics (no API calls)

Use it to:

  • pinpoint retrieval misses vs hallucinations
  • compare pipeline changes across iterations (before/after reports)
  • flag unstable judge signals via calibration before trusting a metric outcome

Installation

Install from source

git clone <repo-url> ragvue
cd ragvue
pip install -e .                  # core (OpenAI judge)
pip install -e ".[anthropic]"     # + Anthropic/Claude judge support
pip install -e ".[all]"           # everything (FastAPI + local metrics + Anthropic)

Set up API keys

RAGVue uses LLMs for evaluation. It supports OpenAI (default) and Anthropic Claude as judge backends.

Create a .env file in the root directory and add the key(s) for the provider you want to use:

# For OpenAI (default)
OPENAI_API_KEY=<your-key-here>

# For Claude/Anthropic (optional)
ANTHROPIC_API_KEY=<your-key-here>

To switch to Claude:

export RAGVUE_JUDGE_PROVIDER=anthropic   # or set in .env

Or select it in the Streamlit sidebar. Default provider is OpenAI (gpt-4o-mini); Claude uses claude-haiku-4-5-20251001 by default.
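
If you drive evaluations from Python rather than the shell, the same switch can be made in-process. A minimal sketch, assuming RAGVue reads RAGVUE_JUDGE_PROVIDER and the provider's API key from the environment when an evaluation run starts:

import os

# assumption: RAGVue picks these up from the environment at evaluation time
os.environ["RAGVUE_JUDGE_PROVIDER"] = "anthropic"
os.environ["ANTHROPIC_API_KEY"] = "<your-key-here>"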

The RAG Advisor chat offers a separate model selector with six options across both providers:

Model | Provider | Tier
gpt-4o-mini | OpenAI | fast (default)
gpt-4o | OpenAI | capable
gpt-3.5-turbo | OpenAI | budget
claude-haiku-4-5-20251001 | Anthropic | fast
claude-sonnet-4-6 | Anthropic | balanced
claude-opus-4-6 | Anthropic | powerful

๐Ÿ—๏ธ Interface Design

RAGVue has two distinct layers. Understanding the split helps you choose the right interface.

Layer | Interfaces | Purpose
Evaluation Engine | Python API, CLI (ragvue-cli, ragvue-py), FastAPI REST API | Headless computation: takes input items, runs metrics, returns scores. Fits into pipelines, CI/CD, notebooks, and custom tooling.
Interactive Analysis | Streamlit UI | Visualization, report history, run comparison, longitudinal tracking, chatbot diagnostics. Designed for humans, not pipelines.

Rule of thumb:

  • Automating or integrating into a system → use the API or CLI
  • Exploring results, debugging, comparing runs → use the Streamlit UI

By design, not every UI feature exists in the API: report history, longitudinal tracking, and the diagnostic agent are analysis-layer features that belong in the dashboard, not in a REST endpoint.


Usage

RAGVue can be used via:

  • A. Python API
  • B. CLI tools (ragvue-cli & ragvue-py)
  • C. Streamlit UI (no-code)
  • D. FastAPI REST API

A. Python API

from ragvue import evaluate, load_metrics

items = [
    {"question": "...", "answer": "...", "context": [...]}
]

metrics = load_metrics().keys()
report = evaluate(items, metrics=list(metrics))

print(report)
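
A slightly larger sketch: load items from a JSONL file (see Input Format below) and run only a subset of metrics. The snake_case identifiers are the same ones used in the REST examples later on; ragvue-cli list-metrics prints the full registry.

import json

from ragvue import evaluate

# read one evaluation item per line from a JSONL file
with open("your_data.jsonl", encoding="utf-8") as f:
    items = [json.loads(line) for line in f if line.strip()]

# run only a grounding-focused subset instead of every registered metric
report = evaluate(items, metrics=["answer_relevance", "strict_faithfulness"])
print(report)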

B. Command-Line Interface (CLI)

1. ragvue-cli (main CLI)

Help & List available metrics

ragvue-cli --help
ragvue-cli list-metrics

Manual mode

ragvue-cli eval \
  --inputs <your_data.jsonl> \
  --metrics <metric_name> \
  --out-base report_manual \
  --formats "json,md,csv"

Agentic mode

ragvue-cli agentic \
  --inputs <your_data.jsonl> \
  --out-base report_agentic \
  --formats "json,md,csv"

2. ragvue-py (lightweight Python runner)

Help

ragvue-py --help

Manual mode

ragvue-py \
  --input <your_data.jsonl> \
  --metrics <metrics> \
  --out-base report_manual \
  --skip-agentic

Agentic mode

ragvue-py \
  --input <your_data.jsonl> \
  --metrics <metrics> \
  --agentic-out report_agentic \
  --skip-manual

C. Streamlit UI (No-Code Interface)

Launch the UI:

streamlit run streamlit_app.py

Features

  • Upload JSONL files
  • Manual & Agentic metric selection
  • Judge provider selector - switch between OpenAI and Claude directly from the sidebar
  • API key input (OpenAI or Anthropic, depending on selected provider)
  • Global summary dashboard
  • Individual case-level diagnostic views
  • Multi-format export (JSON, Markdown, CSV, HTML)
  • Live progress bar - per-item progress during evaluation
  • Custom report labels - name a report before running for easy identification
  • Retrieval Only mode - evaluate the retrieval pipeline without requiring generated answers
  • Report history - last 10 reports saved automatically; view, filter, and delete from the Reports tab
  • Report comparison - select two saved reports side-by-side and inspect per-metric deltas (B − A)
  • Longitudinal tracking - persistent run registry with pipeline version and notes tags; themed metric trend chart (Plotly) and themed table across runs; automatic regression detection between the two most recent runs
  • Your RAG Advisor (Early Access - not yet validated on real datasets; feedback welcome at ragvue.license@gmail.com) - AI research thinking partner for RAG architecture advice; four sub-tabs:
    • Getting Started - mini walkthrough explaining what the advisor is, how a session works, what to share, and what it can/can't do
    • Chat - streaming responses (token-by-token), file/diagram upload, quick starters, export; active architecture profile shown as a status chip; auto-inject banner shares your latest evaluation results in one click; 6 model options across OpenAI and Anthropic tiers; two share modes: Summary (mean scores per metric) and Case Inspector (select any item from any saved report - injects the full question, answer, contexts, and per-metric diagnostic fields)
    • My Profile - save multiple architecture configurations (retriever, chunk size, embedding, LLM, framework, domain); full edit-in-place support; active profile is injected into every conversation automatically
    • Analysis Tools - Before/After report comparison, Hypothesis Testing, Guided Diagnosis, Failure Mode Scanner (identifies active RAG failure modes and prioritises top issues), Suggest Next Experiment (recommends the single highest-ROI next change with expected metric impact)
  • Three UI themes - Light (default), Dark (soft dim), and Beige; selectable from the Settings sidebar; all components including tables and charts adapt to the active theme

D. FastAPI REST API

Install with API dependencies:

pip install -e ".[all]"

Start the server:

ragvue-api                        # default: 0.0.0.0:8000
ragvue-api --port 9000            # custom port
ragvue-api --host 127.0.0.1       # custom host
ragvue-api --reload               # auto-reload for development

Endpoints

Method | Endpoint | Description
GET | /health | Server health check + loaded metric count
GET | /metrics | List all available metric names
POST | /evaluate | Evaluate a single item with chosen metrics
POST | /evaluate/batch | Evaluate multiple items (returns per-metric mean)
POST | /evaluate/agentic | Agentic evaluation - auto-selects metrics per item

Example requests

Health check:

curl http://localhost:8000/health

List metrics:

curl http://localhost:8000/metrics

Evaluate a single item:

curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "item": {
      "question": "What is the capital of France?",
      "answer": "The capital of France is Paris.",
      "contexts": ["Paris is the capital and largest city of France."]
    },
    "metrics": ["answer_relevance", "strict_faithfulness"]
  }'

Batch evaluation:

curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {"question": "...", "answer": "...", "contexts": ["..."]},
      {"question": "...", "answer": "...", "contexts": ["..."]}
    ],
    "metrics": ["answer_relevance", "clarity"]
  }'

Agentic evaluation (auto-selects metrics):

curl -X POST http://localhost:8000/evaluate/agentic \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {"question": "...", "answer": "...", "contexts": ["..."]}
    ]
  }'

The interactive API docs are available at http://localhost:8000/docs once the server is running.
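
The same endpoints can be called from any HTTP client. Below is a minimal Python sketch using only the standard library, assuming the server is running on the default host and port:

import json
import urllib.request

payload = {
    "item": {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "contexts": ["Paris is the capital and largest city of France."],
    },
    "metrics": ["answer_relevance", "strict_faithfulness"],
}

# POST /evaluate with a JSON body, mirroring the curl example above
request = urllib.request.Request(
    "http://localhost:8000/evaluate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))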


Input Format

RAGVue expects JSONL like:

{"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}

Metrics Overview

Inputs key: Q = Question, A = Answer, C = Contexts

Core Evaluation Metrics (6)

Category | Metric | Inputs | Description
Retrieval Metrics | Retrieval Relevance | Q, C | Evaluates how useful each retrieved chunk is for addressing the information needs of the question, based on per-chunk relevance scoring.
Retrieval Metrics | Retrieval Coverage | Q, C | Assesses whether the retrieved context collectively provides sufficient coverage for all sub-aspects required to answer the question.
Answer Metrics | Answer Relevance | Q, A | Measures how well the answer aligns with the intent and scope of the question, identifying missing, irrelevant, or off-topic content.
Answer Metrics | Answer Completeness | Q, A | Determines whether the answer fully addresses all aspects of the question without omissions.
Answer Metrics | Clarity | A | Evaluates the linguistic quality of the answer, including grammar, fluency, logical flow, coherence, and overall readability.
Grounding | Strict Faithfulness | A, C | Evaluates how many factual claims in the answer are directly supported by the retrieved context, enforcing strict evidence alignment (entity accuracy and temporal correctness).

Calibration Metrics (6)

Metric | Inputs | Description
Calibration: Retrieval Relevance | Q, C | Measures score stability of retrieval relevance across judge configurations.
Calibration: Retrieval Coverage | Q, C | Measures score stability of retrieval coverage across judge configurations.
Calibration: Answer Relevance | Q, A | Measures score stability of answer relevance across judge configurations.
Calibration: Answer Completeness | Q, A | Measures score stability of answer completeness across judge configurations.
Calibration: Clarity | A | Measures score stability of clarity across judge configurations.
Calibration: Strict Faithfulness | A, C | Measures score stability of strict faithfulness across judge configurations.

[NEW] Reference-Free Complex Failure Mode Metrics (6)

These metrics detect failure modes that the core metrics miss, with no reference answers required.

Category | Metric | Inputs | What it catches | Returns
Context Usage | Context Utilization | Q, A, C | Retrieved context is fetched but ignored: the answer doesn't actually use the evidence. | utilized_chunks, unused_chunks, justification
Answer Quality | Answer Conciseness | Q, A | Verbose, repetitive, or filler-heavy answers that obscure the core response. | redundant_parts, filler_detected, justification
Answer Quality | Coherence | Q, A | Internal self-contradictions, logical fallacies, non-sequiturs, and circular reasoning within the answer. | contradictions, logical_issues, justification
Unanswerable Handling | Negative Rejection | Q, A, C | System confidently answers when context doesn't support any answer (should say "I don't know"). | context_sufficient, answer_refuses, justification
Multi-Hop Reasoning | Multi-Hop Faithfulness | Q, A, C | Broken reasoning chains: each step may look fine individually, but the chain is invalid. | reasoning_chain, valid_hops, broken_hops, justification
Subtle Contradictions | Implicit Contradiction | Q, A, C | Subtle contradictions strict faithfulness misses: omitted qualifiers, shifted scope, negation flips, temporal misattribution. | contradictions, contradiction_types, justification

[NEW] Lightweight Local Metrics (4)

All 4 local metrics run entirely on your machine with zero API calls, zero cost, and near-instant execution. Each metric takes a dictionary with question, answer, and contexts fields and returns a dictionary with a name, score (0.0 to 1.0), and additional details.

  • Token Overlap
  • Answer Length
  • Context Similarity
  • Readability
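
Based on that description, the contract can be pictured as plain dictionaries; the sketch below is illustrative only, and the exact layout of the details may vary per metric:

# input: the same item shape used by every other RAGVue metric
item = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."],
}

# output (illustrative shape only): each local metric reports its name,
# a score between 0.0 and 1.0, and metric-specific details
# {"name": "<metric name>", "score": <0.0-1.0>, "details": {...}}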

Metric Selection Guide

By use case

Use case | Recommended metrics
Quick quality check | Answer Relevance, Strict Faithfulness, Clarity
Full evaluation | All core metrics
Hallucination audit | Strict Faithfulness, Implicit Contradiction, Multi-Hop Faithfulness
Retrieval pipeline debugging | Retrieval Relevance, Retrieval Coverage, Context Utilization
Production safety check | Negative Rejection, Strict Faithfulness, Coherence
Answer quality tuning | Clarity, Answer Conciseness, Coherence, Answer Completeness

By input availability

What you have | Metrics you can run
Q + C only | Retrieval Relevance, Retrieval Coverage
Q + A only | Answer Relevance, Answer Completeness, Clarity, Answer Conciseness, Coherence
Q + A + C | All metrics
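
For example, the "Quick quality check" row maps directly onto the Python API from section A; the three snake_case identifiers below are the same ones used in the REST examples above:

from ragvue import evaluate

items = [
    {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}
]

# quick quality check: relevance, grounding, and readability only
report = evaluate(items, metrics=["answer_relevance", "strict_faithfulness", "clarity"])
print(report)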

Licensing

RAGVue is released under the Apache License 2.0.

For full license text, see: https://www.apache.org/licenses/LICENSE-2.0

Contact

For questions, please contact: ragvue.license@gmail.com

Citation

Our demo paper has been accepted to EACL 2026 (Demo Track).

Title: RAGVue: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
Status: Accepted (EACL 2026 Demo Track)
Preprint: https://arxiv.org/abs/2601.04196
ACL Anthology: https://aclanthology.org/2026.eacl-demo.35/

If you use RAGVue in your research, please cite:

@inproceedings{murugaraj-etal-2026-ragvue,
    title = "{RAGVUE}: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation",
    author = "Murugaraj, Keerthana  and
      Lamsiyah, Salima  and
      Theobald, Martin",
    editor = "Croce, Danilo  and
      Leidner, Jochen  and
      Moosavi, Nafise Sadat",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 3: System Demonstrations)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-demo.35/",
    pages = "512--526",
    ISBN = "979-8-89176-382-1",
  }
