The End-to-End LLM Evaluation Framework
RAG Evaluation Framework
A framework for evaluating Retrieval-Augmented Generation (RAG) pipelines with built-in tracing, logging and evaluation metrics.
Project Details
This project provides a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. It includes simple RAG implementations and a tracing system that logs every step of the pipeline's execution to an SQLite database for detailed analysis and debugging. It has built-in metrics for both in-depth and end-to-end evaluation of each block in a RAG pipeline: query, retriever, reranker, and generator. The core idea is to trace the key information for each RAG run (such as the user query, the retrieved context, and the final LLM-generated answer), store it systematically, and then evaluate it against ground truth.
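To make the tracing idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. The table name `rag_runs`, its columns, and the helper functions are hypothetical illustrations, not the package's actual schema or API.

```python
import json
import sqlite3


def init_trace_db(path=":memory:"):
    """Open a database and create a (hypothetical) table of RAG runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS rag_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               query TEXT NOT NULL,
               retrieved_context TEXT NOT NULL,
               answer TEXT NOT NULL
           )"""
    )
    return conn


def log_run(conn, query, retrieved_context, answer):
    """Store one run; the list of context chunks is serialized as JSON."""
    cur = conn.execute(
        "INSERT INTO rag_runs (query, retrieved_context, answer) VALUES (?, ?, ?)",
        (query, json.dumps(retrieved_context), answer),
    )
    conn.commit()
    return cur.lastrowid


def load_runs(conn):
    """Read all runs back in insertion order, ready for evaluation."""
    rows = conn.execute(
        "SELECT query, retrieved_context, answer FROM rag_runs ORDER BY id"
    ).fetchall()
    return [
        {"query": q, "retrieved_context": json.loads(c), "answer": a}
        for q, c, a in rows
    ]
```

Once runs are stored this way, each row can be paired with its ground truth and passed to whichever metric is being evaluated.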
Project Structure
.
├── src/
│   └── vero/
│       └── metrics/      # Main package; all the metrics are in here
└── tests/
    └── test_main.py      # File for all the tests
Deep Dive into Metrics
Generation Metrics
BERTScore
- Uses BERTScorer.
- Pass retrieved context and generated output.
- Returns precision, recall and f1-score.
ROUGE-L
- Uses ROUGE-L which focuses on the Longest Common Subsequence (LCS) between a generated summary and a reference summary.
- Pass retrieved context and generated output.
- Returns the ROUGE-L score: precision, recall and f1-score.
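As an illustration of how an LCS-based score yields precision, recall and F1, here is a minimal sketch. It assumes lowercase whitespace tokenization and treats the retrieved context as the reference; the function names are hypothetical and the package's own implementation may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(retrieved_context, generated_output):
    """ROUGE-L precision, recall and F1 over whitespace tokens (an assumption)."""
    ref = retrieved_context.lower().split()
    cand = generated_output.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)  # share of the output covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```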
SEMScore
- Computes embeddings of the retrieved context and the generated output, then calculates the cosine similarity between them.
- Pass retrieved context and generated output.
- Returns a single SEMScore.
| SEMScore | Inference |
|---|---|
| closer to 1 | more semantically similar |
| closer to 0 | unrelated |
| negative score | semantically opposite |
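The ranges in the table follow directly from the definition of cosine similarity. The sketch below shows the computation on plain Python lists; the vectors are toy values standing in for real embeddings, and the embedding model itself is outside the scope of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)
```

Vectors pointing the same way score near 1, orthogonal vectors score near 0, and opposite vectors score near -1.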
BleurtScore (Weighted Semantic Similarity)
- A unique implementation of BleurtScore: in addition to calculating the BleurtScore, it performs a weighted sum to produce a more nuanced score.
- This makes the metric flexible, as it can be used as both a generation metric and a retriever metric.
- As a generation metric, it shows which chunks play a major part in generating the output; those chunks receive higher weights than the others.
- As a retriever metric, it shows whether the retriever is good at capturing conceptual and semantic relationships, even if it misses the exact answer.
- It can be very useful for debugging, e.g.:
  - If Context Recall is low but the Weighted Semantic Similarity score is high, it tells the developer: "Your retriever is finding documents that are about the right topic, but it's failing to find the specific sentence or fact needed for the answer."
  - If both scores are low, the retriever is failing at a more fundamental level.
- Pass the retrieved context and either the generated output or the user query.
- Returns a single weighted BleurtScore.
| BleurtScore | Inference |
|---|---|
| closer to 1 | high semantic similarity |
| closer to 0 | low semantic similarity |
AlignScore
- Measures the faithfulness of generated answer to the retrieved context.
- Pass retrieved context and generated output.
- Returns a single AlignScore.
| AlignScore | Inference |
|---|---|
| closer to 1 | high factual consistency |
| closer to 0 | low factual consistency |
BartScore
- Uses BartScorer and is a comparison-type score.
- Pass retrieved context and generated output.
- Returns a BartScore.
- This score has no meaning on its own. It is meant for comparing two models or two versions of a RAG pipeline: the higher the score, the better that pipeline's generation capabilities relative to the other.
G-Eval
- A unique implementation of G-Eval that computes the sum of all possible scores weighted by their linear probabilities and uses the resulting weighted average as the final score.
- Prompting: you can either provide your own custom evaluation prompt, or pass the metric name and an optional metric description and the prompt will be generated for you.
- Polling: G-Eval can be run any given number of times (default is 5), with the average of the runs used as the final score.
- Pass the references and the candidate (optional: custom prompt, metric name, metric description, polling flag and polling number).
- Returns a final G-Eval score for the passed metric or prompt.
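The probability weighting and polling described above can be sketched as follows. This is only an illustration under stated assumptions: `score_probs` is a hypothetical mapping from each candidate score to the linear probability the LLM assigned to it, normalizing by the total probability mass is an assumption, and both function names are made up for this example.

```python
def weighted_score(score_probs):
    """Probability-weighted average over candidate scores.

    score_probs maps each possible score (e.g. 1..5) to its linear
    probability. Dividing by the total mass is an assumption made so that
    probabilities that do not sum to exactly 1 still give a valid average.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    return sum(score * p for score, p in score_probs.items()) / total


def polled_score(runs):
    """Average the weighted score over several independent evaluation runs."""
    if not runs:
        raise ValueError("at least one run is required")
    scores = [weighted_score(run) for run in runs]
    return sum(scores) / len(scores)
```

Compared with taking only the single most likely score, the weighted form produces a finer-grained, continuous value, and polling reduces run-to-run variance.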
Domain Overlap Score
- Calculates the domain specific overlap score.
- Pass key terms and generated output.
- Returns overlap score.
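The exact formula is not spelled out here, so the following is only one plausible reading: the fraction of key terms that appear in the generated output. The function name and the case-insensitive whole-term matching are assumptions for illustration.

```python
import re


def domain_overlap(key_terms, generated_output):
    """Fraction of distinct key terms found in the output (assumed definition)."""
    terms = {t.strip().lower() for t in key_terms if t.strip()}
    if not terms:
        return 0.0
    text = generated_output.lower()
    hits = 0
    for term in terms:
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text):
            hits += 1
    return hits / len(terms)
```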
Numerical Hallucination Score
- Calculates Numerical Hallucination Score.
- Pass retrieved context and generated output.
- Returns hallucination score.
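As a rough illustration of what a numerical hallucination check can look like, the sketch below reports the share of numbers in the generated output that never appear in the retrieved context. This definition, the score direction (higher means more hallucination), and the function names are assumptions; the regex ignores signs and thousands separators.

```python
import re

# Plain integers and decimals only; signs and "1,000"-style separators
# are deliberately not handled in this sketch.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values mentioned in the text."""
    return {float(match) for match in NUMBER.findall(text)}


def numerical_hallucination(retrieved_context, generated_output):
    """Share of output numbers that are unsupported by the context."""
    output_numbers = extract_numbers(generated_output)
    if not output_numbers:
        return 0.0  # nothing numeric to hallucinate
    supported = extract_numbers(retrieved_context)
    unsupported = output_numbers - supported
    return len(unsupported) / len(output_numbers)
```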
Ranking Metrics
Mean Reciprocal Rank (MRR)
- Direct implementation of MRR.
- Pass the reranked docs along with ground truth.
- Returns MRR.
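For reference, the standard MRR definition can be sketched as below. The function names and the use of document IDs as inputs are assumptions for illustration, not the package's interface.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, ground_truths):
    """Average reciprocal rank over queries (one ranking per query)."""
    if len(rankings) != len(ground_truths):
        raise ValueError("need one ground-truth list per ranking")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, ground_truths))
    return total / len(rankings)
```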
Mean Average Precision (MAP)
- Direct implementation of MAP.
- Pass the reranked docs along with ground truth.
- Returns MAP.
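A minimal sketch of the standard MAP computation follows. It assumes each ranking contains unique document IDs; the function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by all relevant docs penalizes relevant docs that were missed.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truths):
    """Average of per-query average precision."""
    if len(rankings) != len(ground_truths):
        raise ValueError("need one ground-truth list per ranking")
    if not rankings:
        return 0.0
    total = sum(average_precision(r, g) for r, g in zip(rankings, ground_truths))
    return total / len(rankings)
```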
Reranker NDCG@k
- Direct implementation of NDCG@k.
- Pass the reranked docs along with ground truth and k value.
- Returns the NDCG@k.
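The usual NDCG@k computation with a log2 discount can be sketched as follows. Passing the ground truth as a mapping from document ID to graded relevance is an assumption made for this example, as are the function names.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k for one ranking.

    relevance maps doc ID -> graded relevance; unlisted docs count as 0.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal
```

A ranking that orders documents by decreasing relevance scores 1.0; any other order scores lower.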
Cumulative NDCG
- Unique implementation of NDCG@k that can be used to evaluate the cumulative performance of retriever and reranker.
- Pass the reranked docs along with ground truth.
- Returns the NDCG.
Retrieval Metrics
Precision Score
- Calculates Precision Score.
- Pass the retrieved context and the ground truth context.
- Returns the context precision score.
Recall Score
- Calculates Recall Score.
- Pass the retrieved context and the ground truth context.
- Returns the context recall score.
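Precision and recall over retrieved context can be illustrated with simple set arithmetic. The sketch assumes chunks are compared by exact identity (for example, by chunk ID); the function name is hypothetical and the package may match chunks differently.

```python
def context_precision_recall(retrieved, ground_truth):
    """Set-based precision and recall of retrieved chunks vs. ground truth."""
    retrieved_set = set(retrieved)
    truth_set = set(ground_truth)
    true_positives = len(retrieved_set & truth_set)
    # Precision: how much of what was retrieved is actually relevant.
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    # Recall: how much of the relevant context was actually retrieved.
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall
```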
Context Sufficiency Score
- Calculates the sufficiency score of retrieved context for the user query.
- Uses LLM to score the metric.
- Returns the context sufficiency score.
Citation Score
- Calculates Citation Score of the retrieved context.
- Pass the cited context and ground truth citations.
- Returns the citation score.