The End-to-End LLM Evaluation Framework

RAG Evaluation Framework

A framework for evaluating Retrieval-Augmented Generation (RAG) pipelines with built-in tracing, logging and evaluation metrics.

Project Details

The project provides a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. It includes simple RAG implementations and a tracing system that logs every step of the pipeline's execution to an SQLite database for detailed analysis and debugging. It has built-in metrics for both in-depth and end-to-end evaluation of each block in a RAG pipeline, i.e. the query, retriever, reranker and generator. The core idea is to trace the key information for each RAG run (such as the user query, the retrieved context, and the final LLM-generated answer), store it systematically, and then evaluate it against ground truth.
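
To make the tracing idea concrete, here is a minimal sketch of storing one RAG run in SQLite and reading it back. It uses only Python's standard `sqlite3` module; the table name `rag_runs`, its columns, and both function names are hypothetical and are not taken from the package.

```python
import json
import sqlite3


def log_rag_run(conn, query, contexts, answer):
    """Store one RAG run. Table and column names are hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rag_runs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "query TEXT, contexts TEXT, answer TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO rag_runs (query, contexts, answer) VALUES (?, ?, ?)",
        # The list of retrieved chunks is serialized as JSON text.
        (query, json.dumps(contexts), answer),
    )
    conn.commit()
    return cur.lastrowid


def load_rag_run(conn, run_id):
    """Return the stored run as a dict, or None if the id is unknown."""
    row = conn.execute(
        "SELECT query, contexts, answer FROM rag_runs WHERE id = ?", (run_id,)
    ).fetchone()
    if row is None:
        return None
    query, contexts, answer = row
    return {"query": query, "contexts": json.loads(contexts), "answer": answer}
```

A stored run can later be paired with ground truth and passed to any of the metrics described below.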

Project Structure

.
├── src/
│   └── vero/
│       └── metrics/              # Main package for metrics
│           └── (all the metrics) # All the metrics are in here
└── tests/
    └── test_main.py              # File for all the testing

Deep Dive into metrics

Generation Metrics

BERTScore

Uses BERTScorer.
Pass retrieved context and generated output.
Returns precision, recall and f1-score.

ROUGE-L

Uses ROUGE-L which focuses on the Longest Common Subsequence (LCS) between a generated summary and a reference summary.
Pass retrieved context and generated output.
Returns the ROUGE-L score: precision, recall and f1-score.
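
As an illustration of the LCS idea, the sketch below computes ROUGE-L precision, recall and F1 in plain Python. It assumes lowercase whitespace tokenization and a balanced F1; the function names are illustrative and this is not the package's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L precision, recall and F1 over whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For "the cat sat on the mat" against "the cat is on the mat", the LCS is "the cat on the mat" (5 tokens), so precision, recall and F1 are all 5/6.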

SEMScore

Uses embeddings of the retrieved context and the generated output and calculates the cosine similarity between them.
Pass retrieved context and generated output.
Returns a single SEMScore.

SEMScore	Inference
closer to 1	more semantically similar
closer to 0	unrelated
negative score	semantically opposite
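
The score ranges in the table follow from the definition of cosine similarity. A minimal sketch, assuming the two embeddings are already available as equal-length lists of floats (the embedding model is not specified here, and the function name is illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vector scores 0
    return dot / (norm_u * norm_v)
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.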

BleurtScore (Weighted Semantic Similarity)

A unique implementation of BleurtScore that not only calculates the BleurtScore but also takes a weighted sum to give a more nuanced score.
This makes the metric flexible: it can be used as both a generation metric and a retriever metric.
- As a generation metric, it gives insight into which chunks play a major part in generating the output; those chunks receive higher weights than the others.
- As a retriever metric, it gives insight into whether the retriever is good at capturing conceptual and semantic relationships, even if it misses the exact answer.
  - It can be very useful for debugging, e.g.:
    - If Context Recall is low but the Weighted Semantic Similarity score is high, it tells the developer: "Your retriever is finding documents that are about the right topic, but it's failing to find the specific sentence or fact needed for the answer."
    - If both scores are low, the retriever is failing at a more fundamental level.
Pass retrieved context and generated output or user query.
Returns a single weighted BleurtScore.

BleurtScore	Inference
closer to 1	high semantic similarity
closer to 0	low semantic similarity

AlignScore

Measures the faithfulness of generated answer to the retrieved context.
Pass retrieved context and generated output.
Returns a single AlignScore.

AlignScore	Inference
closer to 1	high factual consistency
closer to 0	low factual consistency

BartScore

Uses BartScorer and is a comparison score.
Pass retrieved context and generated output.
Returns a BartScore.
This score does not hold any meaning in itself. It can be used to compare two models or versions of a RAG pipeline: the higher the score, the better the generation capabilities of that pipeline compared to the other.

G-Eval

A unique implementation of G-Eval that calculates the weighted sum of all the possible scores with their linear probabilities and takes the average as the final score.
You can provide your own custom prompt for evaluation, or pass the metric name and (optionally) a metric description and a prompt will be generated for you.
There is also a polling capability, which runs G-Eval a given number of times (default is 5) and returns the average score as the final score.
Pass the references and candidate (optional: custom prompt, metric name, metric description, polling flag and polling number).
Returns a final G-Eval score for the passed metric or prompt.
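
The probability weighting and polling can be sketched as follows. This assumes the judge LLM's log-probabilities for each candidate score token have already been collected into a dict; normalizing the probabilities so they sum to 1 is an assumption of this sketch, and the function names are hypothetical rather than the package's API.

```python
import math


def weighted_score(score_logprobs):
    """Expected score from {score: log-probability} for one judge call."""
    # Convert log-probabilities to linear probabilities.
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score")
    # Probability-weighted sum, normalized (assumption of this sketch).
    return sum(score * p for score, p in probs.items()) / total


def polled_score(runs):
    """Average the weighted score over several judge calls (polling)."""
    if not runs:
        raise ValueError("at least one run is required")
    scores = [weighted_score(run) for run in runs]
    return sum(scores) / len(scores)
```

If a judge puts half of its probability on 4 and half on 5, the weighted score is 4.5 rather than a hard 4 or 5, which is what makes the result finer-grained than a single sampled score.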

Domain Overlap Score

Calculates the domain specific overlap score.
Pass key terms and generated output.
Returns overlap score.
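
The exact formula is not stated here. One plausible reading, shown purely as an assumption, is the fraction of key terms that appear in the generated output (case-insensitive, whole-word match). The function name is hypothetical.

```python
import re


def domain_overlap(key_terms, generated_output):
    """Fraction of distinct key terms found in the output (assumed formula)."""
    terms = {t.strip().lower() for t in key_terms if t.strip()}
    if not terms:
        return 0.0
    text = generated_output.lower()
    found = sum(
        1 for t in terms if re.search(r"\b" + re.escape(t) + r"\b", text)
    )
    return found / len(terms)
```

With key terms "myocardial infarction", "troponin" and "ECG", an output mentioning only troponin and ECG scores 2/3 under this reading.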

Numerical Hallucination Score

Calculates Numerical Hallucination Score.
Pass retrieved context and generated output.
Returns hallucination score.
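
The scoring rule is not spelled out here. As an assumption for illustration, the sketch below treats a number in the generated output as hallucinated when it does not occur anywhere in the retrieved context, and returns the hallucinated fraction. Signs, units and number words are ignored, and the function name is hypothetical.

```python
import re

# Unsigned integers and decimals, e.g. "12", "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numerical_hallucination(context, generated_output):
    """Fraction of numbers in the output that are absent from the context."""
    context_numbers = {float(n) for n in NUMBER.findall(context)}
    output_numbers = [float(n) for n in NUMBER.findall(generated_output)]
    if not output_numbers:
        return 0.0  # nothing numeric to hallucinate
    unsupported = [n for n in output_numbers if n not in context_numbers]
    return len(unsupported) / len(output_numbers)
```

Under this reading, a lower score is better: 0 means every number in the answer can be found in the context.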

Ranking Metrics

Mean Reciprocal Rank (MRR)

Direct implementation of MRR.
Pass the reranked docs along with ground truth.
Returns MRR.
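
For reference, MRR is the mean over queries of 1/rank of the first relevant document. A minimal sketch, assuming documents are identified by hashable ids; the function names and input shapes are illustrative, not the package's API.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, ground_truths):
    """Average reciprocal rank over queries (one ranking per query)."""
    if len(rankings) != len(ground_truths):
        raise ValueError("need one ground-truth list per ranking")
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, truth)
        for ranked, truth in zip(rankings, ground_truths)
    )
    return total / len(rankings)
```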

Mean Average Precision (MAP)

Direct implementation of MAP.
Pass the reranked docs along with ground truth.
Returns MAP.
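
For reference, average precision sums precision@rank at each position where a relevant document appears and divides by the number of relevant documents; MAP averages this over queries. The sketch assumes binary relevance, hashable document ids, and no duplicate ids in a ranking; the names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query with binary relevance."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truths):
    """Mean of average precision over queries."""
    if len(rankings) != len(ground_truths):
        raise ValueError("need one ground-truth list per ranking")
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranked, truth)
        for ranked, truth in zip(rankings, ground_truths)
    )
    return total / len(rankings)
```

For the ranking a, b, c, d with a and c relevant, the precisions at the hits are 1/1 and 2/3, so the average precision is 5/6.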

Reranker NDCG@k

Direct implementation of NDCG@k.
Pass the reranked docs along with ground truth and k value.
Returns the NDCG@k.
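
For reference, NDCG@k divides the discounted cumulative gain of the ranking by that of the ideal ordering. The sketch uses the common rel / log2(rank + 1) discount and assumes the ground truth is a dict of graded relevance per document id; these choices and the function names are assumptions, not the package's API.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance values."""
    # Position i (0-based) has rank i + 1, so the discount is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance_by_id, k):
    """NDCG@k for one ranking; unknown ids count as relevance 0."""
    gains = [relevance_by_id.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg
```

A ranking that already lists documents in decreasing relevance scores 1.0; any other order scores lower.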

Cumulative NDCG

Unique implementation of NDCG@k that can be used to evaluate the cumulative performance of retriever and reranker.
Pass the reranked docs along with ground truth.
Returns the NDCG.

Retrieval Metrics

Precision Score

Calculates Precision Score.
Pass the retrieved context and the ground truth context.
Returns the context precision score.

Recall Score

Calculates Recall Score.
Pass the retrieved context and the ground truth context.
Returns the context recall score.
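
Precision and recall over retrieved context can be sketched with set operations. This assumes chunks are compared by exact id or exact text match; the function name and the returned dict shape are illustrative only.

```python
def context_precision_recall(retrieved, ground_truth):
    """Set-based precision and recall of retrieved chunks."""
    retrieved_set = set(retrieved)
    truth_set = set(ground_truth)
    hits = len(retrieved_set & truth_set)
    # Precision: how much of what was retrieved is relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: how much of the relevant context was retrieved.
    recall = hits / len(truth_set) if truth_set else 0.0
    return {"precision": precision, "recall": recall}
```

Retrieving a, b, c, d when a, c, e are relevant gives precision 2/4 and recall 2/3.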

Context Sufficiency Score

Calculates the sufficiency score of retrieved context for the user query.
Uses an LLM to score the metric.
Returns the context sufficiency score.

Citation Score

Calculates Citation Score of the retrieved context.
Pass the cited context and ground truth citations.
Returns the citation score.

Release history

Version	Release date
0.0.1.6	Nov 14, 2025
0.0.1.5	Nov 13, 2025
0.0.1.4	Nov 5, 2025
0.0.1.2	Oct 19, 2025
0.0.1.1 (this version)	Sep 11, 2025
0.0.1	Sep 11, 2025

Download files

Source Distribution

vero_eval-0.0.1.1.tar.gz (38.0 kB)

Uploaded Sep 11, 2025 Source

Built Distribution

vero_eval-0.0.1.1-py3-none-any.whl (52.2 kB)

Uploaded Sep 11, 2025 Python 3

File details

Details for the file vero_eval-0.0.1.1.tar.gz.

File metadata

Download URL: vero_eval-0.0.1.1.tar.gz
Upload date: Sep 11, 2025
Size: 38.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for vero_eval-0.0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b5db169ebd101b26d1c8d9c0e1608832b377d49b695892b04c7b33ff89338fac`
MD5	`bc636cffb37a15acfa30e057c5fdd8a1`
BLAKE2b-256	`bc3ab053a4644567441d5a77beb383b1387fcb152217b184e017e8f32fbc9393`

File details

Details for the file vero_eval-0.0.1.1-py3-none-any.whl.

File metadata

Download URL: vero_eval-0.0.1.1-py3-none-any.whl
Upload date: Sep 11, 2025
Size: 52.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for vero_eval-0.0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e86652d29e0131a44cf3ebd11122a331c6cb147d98e5be0e049601995f3ffccc`
MD5	`b4aa7f2486ee94e27ea4abf96bfef6da`
BLAKE2b-256	`3c64cc11c88c6dd49514aef5077e57d9efd6e066cb2f96e415a5ae39a7bb4431`

vero-eval 0.0.1.1
