
A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation


RAGVue logo

RAGVue is a reference-free evaluation framework for Retrieval-Augmented Generation (RAG) systems that goes beyond single scores.
It provides interpretable diagnostics across retrieval, answer quality, and factual grounding, helping you pinpoint why a RAG output failed (retrieval vs generation vs grounding).

What you get

  • Manual Mode - choose the metrics you want
  • Agentic Mode - automatically selects and runs the right diagnostics
  • Streamlit UI - no-code, interactive evaluation
  • Multiple Interfaces - two layers, each with the right scope:
    • Evaluation engine: Python API · CLI (ragvue-cli, ragvue-py) · FastAPI REST API
    • Interactive analysis: Streamlit UI (history, comparison, longitudinal, diagnostics)
  • Multi-LLM Judge Backend - OpenAI (default) or Anthropic Claude, switchable per run

18 Reference-Free Metrics

6 core evaluation metrics - retrieval, answer quality, grounding
6 calibration / stability metrics - judge agreement & sensitivity
6 complex failure-mode metrics - real-world RAG breakdown patterns

Optional: 4 Lightweight Local Metrics (no API calls)

Use it to:

  • pinpoint retrieval misses vs hallucinations
  • compare pipeline changes across iterations (before/after reports)
  • flag unstable judge signals via calibration before trusting a metric outcome

Installation

Install from source

git clone <repo-url> ragvue
cd ragvue
pip install -e .                  # core (OpenAI judge)
pip install -e ".[anthropic]"     # + Anthropic/Claude judge support
pip install -e ".[all]"           # everything (FastAPI + local metrics + Anthropic)

Set up API keys

RAGVue uses LLMs for evaluation. It supports OpenAI (default) and Anthropic Claude as judge backends.

Create a .env file in the root directory and add the key(s) for the provider you want to use:

# For OpenAI (default)
OPENAI_API_KEY=<your-key-here>

# For Claude/Anthropic (optional)
ANTHROPIC_API_KEY=<your-key-here>

To switch to Claude:

export RAGVUE_JUDGE_PROVIDER=anthropic   # or set in .env

Or select it in the Streamlit sidebar. Default provider is OpenAI (gpt-4o-mini); Claude uses claude-haiku-4-5-20251001 by default.
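
If you drive evaluations from Python rather than the shell, the same switch can be made in-process. A minimal sketch, assuming RAGVue reads RAGVUE_JUDGE_PROVIDER and the provider's API key from the environment when an evaluation run starts:

import os

# assumption: RAGVue picks these up from the environment at evaluation time
os.environ["RAGVUE_JUDGE_PROVIDER"] = "anthropic"
os.environ["ANTHROPIC_API_KEY"] = "<your-key-here>"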

The RAG Advisor chat offers a separate model selector with six options across both providers:

Model | Provider | Tier
gpt-4o-mini | OpenAI | fast (default)
gpt-4o | OpenAI | capable
gpt-3.5-turbo | OpenAI | budget
claude-haiku-4-5-20251001 | Anthropic | fast
claude-sonnet-4-6 | Anthropic | balanced
claude-opus-4-6 | Anthropic | powerful

๐Ÿ—๏ธ Interface Design

RAGVue has two distinct layers. Understanding the split helps you choose the right interface.

Layer | Interfaces | Purpose
Evaluation Engine | Python API, CLI (ragvue-cli, ragvue-py), FastAPI REST API | Headless computation: takes input items, runs metrics, returns scores. Fits into pipelines, CI/CD, notebooks, and custom tooling.
Interactive Analysis | Streamlit UI | Visualization, report history, run comparison, longitudinal tracking, chatbot diagnostics. Designed for humans, not pipelines.

Rule of thumb:

  • Automating or integrating into a system → use the API or CLI
  • Exploring results, debugging, comparing runs → use the Streamlit UI

By design, not every UI feature exists in the API: report history, longitudinal tracking, and the diagnostic agent are analysis-layer features that belong in the dashboard, not in a REST endpoint.


Usage

RAGVue can be used via:

  • A. Python API
  • B. CLI tools (ragvue-cli & ragvue-py)
  • C. Streamlit UI (no-code)
  • D. FastAPI REST API

A. Python API

from ragvue import evaluate, load_metrics

items = [
    {"question": "...", "answer": "...", "context": [...]}
]

metrics = load_metrics().keys()
report = evaluate(items, metrics=list(metrics))

print(report)
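
A slightly larger sketch: load items from a JSONL file (see Input Format below) and run only a subset of metrics. The snake_case identifiers are the same ones used in the REST examples later on; ragvue-cli list-metrics prints the full registry.

import json

from ragvue import evaluate

# read one evaluation item per line from a JSONL file
with open("your_data.jsonl", encoding="utf-8") as f:
    items = [json.loads(line) for line in f if line.strip()]

# run only a grounding-focused subset instead of every registered metric
report = evaluate(items, metrics=["answer_relevance", "strict_faithfulness"])
print(report)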

B. Command-Line Interface (CLI)

1. ragvue-cli (main CLI)

Help & List available metrics

ragvue-cli --help
ragvue-cli list-metrics

Manual mode

ragvue-cli eval \
  --inputs <your_data.jsonl> \
  --metrics <metric_name> \
  --out-base report_manual \
  --formats "json,md,csv"

Agentic mode

ragvue-cli agentic \
  --inputs <your_data.jsonl> \
  --out-base report_agentic \
  --formats "json,md,csv"

2. ragvue-py (lightweight Python runner)

Help

ragvue-py --help

Manual mode

ragvue-py \
  --input <your_data.jsonl> \
  --metrics <metrics> \
  --out-base report_manual \
  --skip-agentic

Agentic mode

ragvue-py \
  --input <your_data.jsonl> \
  --metrics <metrics> \
  --agentic-out report_agentic \
  --skip-manual

C. Streamlit UI (No-Code Interface)

Launch the UI:

streamlit run streamlit_app.py

Features

  • Upload JSONL files
  • Manual & Agentic metric selection
  • Judge provider selector - switch between OpenAI and Claude directly from the sidebar
  • API key input (OpenAI or Anthropic, depending on selected provider)
  • Global summary dashboard
  • Individual case-level diagnostic views
  • Multi-format export (JSON, Markdown, CSV, HTML)
  • Live progress bar - per-item progress during evaluation
  • Custom report labels - name a report before running for easy identification
  • Retrieval Only mode - evaluate the retrieval pipeline without requiring generated answers
  • Report history - last 10 reports saved automatically; view, filter, and delete from the Reports tab
  • Report comparison - select two saved reports side-by-side and inspect per-metric deltas (B − A)
  • Longitudinal tracking - persistent run registry with pipeline version and notes tags; themed metric trend chart (Plotly) and themed table across runs; automatic regression detection between the two most recent runs
  • Your RAG Advisor (Early Access - not yet validated on real datasets; feedback welcome at ragvue.license@gmail.com) - AI research thinking partner for RAG architecture advice; four sub-tabs:
    • Getting Started - mini walkthrough explaining what the advisor is, how a session works, what to share, and what it can/can't do
    • Chat - streaming responses (token-by-token), file/diagram upload, quick starters, export; active architecture profile shown as a status chip; auto-inject banner shares your latest evaluation results in one click; 6 model options across OpenAI and Anthropic tiers; two share modes: Summary (mean scores per metric) and Case Inspector (select any item from any saved report - injects the full question, answer, contexts, and per-metric diagnostic fields)
    • My Profile - save multiple architecture configurations (retriever, chunk size, embedding, LLM, framework, domain); full edit-in-place support; active profile is injected into every conversation automatically
    • Analysis Tools - Before/After report comparison, Hypothesis Testing, Guided Diagnosis, Failure Mode Scanner (identifies active RAG failure modes and prioritises top issues), Suggest Next Experiment (recommends the single highest-ROI next change with expected metric impact)
  • Three UI themes - Light (default), Dark (soft dim), and Beige; selectable from the Settings sidebar; all components including tables and charts adapt to the active theme

D. FastAPI REST API

Install with API dependencies:

pip install -e ".[all]"

Start the server:

ragvue-api                        # default: 0.0.0.0:8000
ragvue-api --port 9000            # custom port
ragvue-api --host 127.0.0.1       # custom host
ragvue-api --reload               # auto-reload for development

Endpoints

Method | Endpoint | Description
GET | /health | Server health check + loaded metric count
GET | /metrics | List all available metric names
POST | /evaluate | Evaluate a single item with chosen metrics
POST | /evaluate/batch | Evaluate multiple items (returns per-metric mean)
POST | /evaluate/agentic | Agentic evaluation - auto-selects metrics per item

Example requests

Health check:

curl http://localhost:8000/health

List metrics:

curl http://localhost:8000/metrics

Evaluate a single item:

curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "item": {
      "question": "What is the capital of France?",
      "answer": "The capital of France is Paris.",
      "contexts": ["Paris is the capital and largest city of France."]
    },
    "metrics": ["answer_relevance", "strict_faithfulness"]
  }'

Batch evaluation:

curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {"question": "...", "answer": "...", "contexts": ["..."]},
      {"question": "...", "answer": "...", "contexts": ["..."]}
    ],
    "metrics": ["answer_relevance", "clarity"]
  }'

Agentic evaluation (auto-selects metrics):

curl -X POST http://localhost:8000/evaluate/agentic \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {"question": "...", "answer": "...", "contexts": ["..."]}
    ]
  }'

The interactive API docs are available at http://localhost:8000/docs once the server is running.
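
The same endpoints can be called from any HTTP client. Below is a minimal Python sketch using only the standard library, assuming the server is running on the default host and port:

import json
import urllib.request

payload = {
    "item": {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "contexts": ["Paris is the capital and largest city of France."],
    },
    "metrics": ["answer_relevance", "strict_faithfulness"],
}

# POST /evaluate with a JSON body, mirroring the curl example above
request = urllib.request.Request(
    "http://localhost:8000/evaluate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))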


Input Format

RAGVue expects JSONL like:

{"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}

Metrics Overview

Inputs key: Q = Question, A = Answer, C = Contexts

Core Evaluation Metrics (6)

Category | Metric | Inputs | Description
Retrieval Metrics | Retrieval Relevance | Q, C | Evaluates how useful each retrieved chunk is for addressing the information needs of the question, based on per-chunk relevance scoring.
Retrieval Metrics | Retrieval Coverage | Q, C | Assesses whether the retrieved context collectively provides sufficient coverage for all sub-aspects required to answer the question.
Answer Metrics | Answer Relevance | Q, A | Measures how well the answer aligns with the intent and scope of the question, identifying missing, irrelevant, or off-topic content.
Answer Metrics | Answer Completeness | Q, A | Determines whether the answer fully addresses all aspects of the question without omissions.
Answer Metrics | Clarity | A | Evaluates the linguistic quality of the answer, including grammar, fluency, logical flow, coherence, and overall readability.
Grounding | Strict Faithfulness | A, C | Evaluates how many factual claims in the answer are directly supported by the retrieved context, enforcing strict evidence alignment (entity accuracy and temporal correctness).

Calibration Metrics (6)

Metric | Inputs | Description
Calibration: Retrieval Relevance | Q, C | Measures score stability of retrieval relevance across judge configurations.
Calibration: Retrieval Coverage | Q, C | Measures score stability of retrieval coverage across judge configurations.
Calibration: Answer Relevance | Q, A | Measures score stability of answer relevance across judge configurations.
Calibration: Answer Completeness | Q, A | Measures score stability of answer completeness across judge configurations.
Calibration: Clarity | A | Measures score stability of clarity across judge configurations.
Calibration: Strict Faithfulness | A, C | Measures score stability of strict faithfulness across judge configurations.

[NEW] Reference-Free Complex Failure Mode Metrics (6)

These metrics detect failure modes that the core metrics miss, with no reference answers required.

Category | Metric | Inputs | What it catches | Returns
Context Usage | Context Utilization | Q, A, C | Retrieved context is fetched but ignored: the answer doesn't actually use the evidence. | utilized_chunks, unused_chunks, justification
Answer Quality | Answer Conciseness | Q, A | Verbose, repetitive, or filler-heavy answers that obscure the core response. | redundant_parts, filler_detected, justification
Answer Quality | Coherence | Q, A | Internal self-contradictions, logical fallacies, non-sequiturs, and circular reasoning within the answer. | contradictions, logical_issues, justification
Unanswerable Handling | Negative Rejection | Q, A, C | System confidently answers when context doesn't support any answer (should say "I don't know"). | context_sufficient, answer_refuses, justification
Multi-Hop Reasoning | Multi-Hop Faithfulness | Q, A, C | Broken reasoning chains: each step may look fine individually, but the chain is invalid. | reasoning_chain, valid_hops, broken_hops, justification
Subtle Contradictions | Implicit Contradiction | Q, A, C | Subtle contradictions strict faithfulness misses: omitted qualifiers, shifted scope, negation flips, temporal misattribution. | contradictions, contradiction_types, justification

[NEW] Lightweight Local Metrics (4)

All 4 local metrics run entirely on your machine with zero API calls, zero cost, and near-instant execution. Each metric takes a dictionary with question, answer, and contexts fields and returns a dictionary with a name, score (0.0 to 1.0), and additional details.

  • Token Overlap
  • Answer Length
  • Context Similarity
  • Readability
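
Based on that description, the contract can be pictured as plain dictionaries; the sketch below is illustrative only, and the exact layout of the details may vary per metric:

# input: the same item shape used by every other RAGVue metric
item = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."],
}

# output (illustrative shape only): each local metric reports its name,
# a score between 0.0 and 1.0, and metric-specific details
# {"name": "<metric name>", "score": <0.0-1.0>, "details": {...}}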

Metric Selection Guide

By use case

Use case | Recommended metrics
Quick quality check | Answer Relevance, Strict Faithfulness, Clarity
Full evaluation | All core metrics
Hallucination audit | Strict Faithfulness, Implicit Contradiction, Multi-Hop Faithfulness
Retrieval pipeline debugging | Retrieval Relevance, Retrieval Coverage, Context Utilization
Production safety check | Negative Rejection, Strict Faithfulness, Coherence
Answer quality tuning | Clarity, Answer Conciseness, Coherence, Answer Completeness

By input availability

What you have | Metrics you can run
Q + C only | Retrieval Relevance, Retrieval Coverage
Q + A only | Answer Relevance, Answer Completeness, Clarity, Answer Conciseness, Coherence
Q + A + C | All metrics
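
For example, the "Quick quality check" row maps directly onto the Python API from section A; the three snake_case identifiers below are the same ones used in the REST examples above:

from ragvue import evaluate

items = [
    {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}
]

# quick quality check: relevance, grounding, and readability only
report = evaluate(items, metrics=["answer_relevance", "strict_faithfulness", "clarity"])
print(report)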

Licensing

RAGVue is released under the Apache License 2.0.

For full license text, see: https://www.apache.org/licenses/LICENSE-2.0

Contact

For questions, please contact: ragvue.license@gmail.com

Citation

Our demo paper has been accepted to EACL 2026 (Demo Track).

Title: RAGVue: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
Status: Accepted (EACL 2026 Demo Track)
Preprint: https://arxiv.org/abs/2601.04196
ACL Anthology: https://aclanthology.org/2026.eacl-demo.35/

If you use RAGVue in your research, please cite:

@inproceedings{murugaraj-etal-2026-ragvue,
    title = "{RAGVUE}: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation",
    author = "Murugaraj, Keerthana  and
      Lamsiyah, Salima  and
      Theobald, Martin",
    editor = "Croce, Danilo  and
      Leidner, Jochen  and
      Moosavi, Nafise Sadat",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 3: System Demonstrations)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-demo.35/",
    pages = "512--526",
    ISBN = "979-8-89176-382-1",
  }
