# trieval
RAG evaluation with exactly 6 metrics. Nothing more.
Every RAG system has 3 variables: **Q** (Question), **C** (Context), **A** (Answer).

3 variables → 6 ordered pairwise relationships → 6 metrics.
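The count follows from ordered pairs: 3 variables taken 2 at a time give 3 × 2 = 6 directed relationships. A quick way to enumerate them with the standard library:

```python
from itertools import permutations

# Q, C, A taken two at a time, order mattering, yield the 6 "X|Y" relationships.
variables = ["Q", "C", "A"]
relationships = [f"{x}|{y}" for x, y in permutations(variables, 2)]

print(relationships)
# ['Q|C', 'Q|A', 'C|Q', 'C|A', 'A|Q', 'A|C']
```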
## The 6 Metrics
| # | Metric | Notation | Evaluates |
|---|---|---|---|
| 1 | Context Relevance | C\|Q | Is the retrieved context relevant to the question? |
| 2 | Faithfulness | A\|C | Does the answer stick to the context? |
| 3 | Answer Relevance | A\|Q | Does the answer solve the user's question? |
| 4 | Context Support | C\|A | Does the context fully support the answer? |
| 5 | Answerability | Q\|C | Can this question be answered with this context? |
| 6 | Self-Containment | Q\|A | Can someone understand the question from the answer? |
## Install

```shell
pip install trieval
```
## Quick Start

```python
import asyncio

from trieval import Evaluator

evaluator = Evaluator(model="openai:gpt-4o-mini")

async def main() -> None:
    result = await evaluator.evaluate(
        question="What is photosynthesis?",
        context="Photosynthesis is the process by which plants convert sunlight into energy.",
        answer="Photosynthesis is how plants make food from sunlight.",
    )
    print(result.overall_score)     # 0.0–1.0
    print(result.retrieval_score)   # avg of C|Q and Q|C
    print(result.generation_score)  # avg of A|C and A|Q
    print(result.diagnose())        # ["All metrics healthy"] or failure categories

asyncio.run(main())
```
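The grouped scores are simple means over metric pairs. A quick illustration with invented per-metric values (the C|Q + Q|C and A|C + A|Q groupings follow the fields above; treating `overall_score` as the plain mean of all six is an assumption):

```python
# Hypothetical per-metric scores; the values are made up for illustration.
scores = {"C|Q": 0.9, "A|C": 0.8, "A|Q": 0.85, "C|A": 0.7, "Q|C": 0.95, "Q|A": 0.6}

retrieval_score = (scores["C|Q"] + scores["Q|C"]) / 2    # how well retrieval did
generation_score = (scores["A|C"] + scores["A|Q"]) / 2   # how well generation did
overall_score = sum(scores.values()) / len(scores)        # assumed: mean of all 6

print(retrieval_score, generation_score, overall_score)
```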
## Failure Diagnosis
When your RAG system fails, it's always one of these:
- Retrieval issues — C|Q and Q|C scores are low (wrong context retrieved)
- Generation issues — A|C and A|Q scores are low (bad answer generation)
- End-to-end mismatch — A|C is fine but C|A is low (faithful but unsupported)
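The three categories above can be read as threshold rules over the six scores. A minimal sketch, assuming a 0.5 cutoff (the threshold and the `diagnose` helper are hypothetical; only the category logic comes from the list above):

```python
def diagnose(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Map low metric scores to the three failure categories described above."""
    issues = []
    if scores["C|Q"] < threshold and scores["Q|C"] < threshold:
        issues.append("Retrieval issues")        # wrong context retrieved
    if scores["A|C"] < threshold and scores["A|Q"] < threshold:
        issues.append("Generation issues")       # bad answer generation
    if scores["A|C"] >= threshold and scores["C|A"] < threshold:
        issues.append("End-to-end mismatch")     # faithful but unsupported
    return issues or ["All metrics healthy"]

print(diagnose({"C|Q": 0.2, "Q|C": 0.3, "A|C": 0.8,
                "A|Q": 0.9, "C|A": 0.9, "Q|A": 0.7}))
# ['Retrieval issues']
```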
## Architecture
Built with pydantic-ai (LLM-based metric agents) and LangGraph (evaluation workflow orchestration).
```text
RAGInput(Q, C, A)
        ↓
Evaluator.evaluate()
        ↓
LangGraph: evaluate_metrics → diagnose_failures
        ↓
EvaluationResult (scores + diagnosis)
```
Each metric is a pydantic-ai Agent with a tailored system prompt. All 6 run concurrently via asyncio.gather inside the LangGraph evaluation node.
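The concurrency pattern described above can be sketched as six independent coroutines gathered in one node. The judge here is a stub standing in for a pydantic-ai Agent call; `score_metric` and its placeholder score are hypothetical:

```python
import asyncio

METRICS = ["C|Q", "A|C", "A|Q", "C|A", "Q|C", "Q|A"]

async def score_metric(name: str) -> tuple[str, float]:
    # Stand-in for an LLM-judge call; trieval would invoke a pydantic-ai
    # Agent with a metric-specific system prompt here.
    await asyncio.sleep(0)
    return name, 1.0  # placeholder score

async def evaluate_metrics() -> dict[str, float]:
    # All 6 metric coroutines run concurrently, as inside the LangGraph node.
    results = await asyncio.gather(*(score_metric(m) for m in METRICS))
    return dict(results)

scores = asyncio.run(evaluate_metrics())
print(scores)
```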
## Development

```shell
uv sync --group dev

pytest                              # run tests
pytest --cov=trieval --cov-branch   # with coverage
ruff check trieval/ tests/          # lint
ruff format trieval/ tests/         # format
mypy trieval/                       # type check
```
## Documentation

- **API Reference** — Full API for `Evaluator`, `EvaluationResult`, `MetricResult`, `RAGInput`, and individual metric functions
- **Changelog** — Version history
## License
MIT