
The End-to-End LLM Evaluation Framework


Vero

Join our Discord

Vero is an open platform for evaluating and continuously monitoring AI pipelines with real-world rigor. It goes beyond standard benchmarking by generating edge-case user personas and stress-testing models across challenging scenarios, helping teams identify risks early and build more reliable AI systems.


Key Features

  • Trace & Log Execution: Each query run through the RAG pipeline is logged into an SQLite database, capturing the user query, retrieved context, reranked items, and the model’s output.
  • Component-level Metrics: Evaluate intermediate pipeline stages using metrics such as Precision, Recall, Sufficiency, Citation, and Overlap, along with ranking metrics (e.g. MRR, MAP, NDCG).
  • Generation Metrics: Measure semantic, factual, and alignment quality of generated outputs using metrics such as BERTScore, ROUGE, SEMScore, AlignScore, BLEURT, and G-Eval.
  • Modular & Extensible: Easily plug in new metric classes or custom scoring logic; the framework is designed to grow with your needs.
  • End-to-End Evaluation: Combine component metrics to understand the holistic performance of your RAG system — not just individual parts.
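To illustrate the trace-and-log idea, here is a minimal sketch of recording one run into SQLite. The table name and columns below are hypothetical, chosen only to mirror the fields listed above; they are not Vero's actual schema:

```python
import sqlite3

# Hypothetical schema illustrating the kind of record traced per query.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE runs (
        query TEXT,
        retrieved_context TEXT,
        reranked_items TEXT,
        answer TEXT
    )"""
)
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    (
        "Who invented the transistor?",
        "chunk-12; chunk-7",
        "chunk-7; chunk-12",
        "The transistor was invented at Bell Labs in 1947.",
    ),
)
row = conn.execute("SELECT query, answer FROM runs").fetchone()
print(row[0])  # the logged user query
```

Because every stage is persisted per query, component-level and end-to-end metrics can later be computed offline over the same records.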

Flowchart

flowchart.png

Project Structure

.
├── src/
│   └── vero/
│       ├── evaluator/                    # Main evaluation package
│       ├── report_generation_workflow/   # Report generation workflow
│       ├── test_dataset_generator/       # Test dataset generator
│       └── metrics/                      # All metric implementations
└── tests/
    └── test_main.py                      # Test suite

Getting Started

Setup

Install via pip (recommended inside a virtualenv):

pip install vero-eval

Example Usage

from vero.rag import SimpleRAGPipeline
from vero.trace import TraceDB
from vero.eval import Evaluator

trace_db = TraceDB(db_path="runs.db")
pipeline = SimpleRAGPipeline(retriever="faiss", generator="openai", trace_db=trace_db)

# Run your pipeline
run = pipeline.run("Who invented the transistor?")
print("Answer:", run.answer)

# Later, compute metrics for all runs
evaluator = Evaluator(trace_db=trace_db)
results = evaluator.evaluate()
print(results)

Metrics Overview

The RAG Evaluation Framework supports three classes of metrics:

  • Generation Metrics — measure semantic/factual quality of answers.
  • Ranking Metrics — measure ranking quality of rerankers.
  • Retrieval Metrics — measure the quality and sufficiency of the retrieved context.
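To make the ranking-metric category concrete, here is a self-contained computation of MRR. The function name and inputs are illustrative, not Vero's API:

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: average over queries of 1 / rank of the first relevant item (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Two queries: first relevant hit at rank 2, then at rank 1 -> (0.5 + 1.0) / 2
mrr = mean_reciprocal_rank(
    [["c3", "c1", "c9"], ["c2", "c4"]],
    [{"c1"}, {"c2"}],
)
print(mrr)  # 0.75
```

MAP and NDCG follow the same pattern of comparing a ranked list against a ground-truth relevance set, which is why all three share the same input format in the reranker evaluation step below.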

Evaluator

Overview

  • The Evaluator is a convenience wrapper to run multiple metrics over model outputs and retrieval results.
  • It orchestrates generation evaluation (text-generation metrics), retrieval evaluation (precision/recall/sufficiency), and reranker evaluation (NDCG, MAP, MRR). It produces CSV summaries by default.

Quick notes

  • The Evaluator uses project metric classes (e.g., BartScore, BertScore, RougeScore, SemScore, PrecisionScore, RecallScore, MeanAP, MeanRR, RerankerNDCG, CumulativeNDCG, etc.). These metrics are in vero.metrics and are referenced internally.
  • Many methods expect particular CSV column names (see "Expected CSV schemas").
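Since several Evaluator methods expect specific column names, it can help to sanity-check an input file before evaluation. A minimal sketch using the generation-evaluation columns named in Step 1 (the file name and cell contents are just examples):

```python
import csv

# Write a minimal generation-evaluation CSV with the expected column names.
with open("testing.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Context Retrieved", "Answer"])
    writer.writeheader()
    writer.writerow({
        "Context Retrieved": "The transistor was invented at Bell Labs in 1947.",
        "Answer": "Bell Labs researchers invented the transistor in 1947.",
    })

# Verify the header matches what evaluate_generation expects.
with open("testing.csv", newline="") as f:
    header = next(csv.reader(f))
print(header)  # ['Context Retrieved', 'Answer']
```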

Steps to evaluate your pipeline

Step 1 - Generation evaluation

  • Input: a CSV with "Context Retrieved" and "Answer" columns.
  • Result: Generation_Scores.csv with columns such as SemScore, BertScore, RougeLScore, BARTScore, BLEURTScore, G-Eval (Faithfulness).

Example:

from vero.evaluator.evaluator import Evaluator

evaluator = Evaluator()
# data_path must point to a CSV with columns "Context Retrieved" and "Answer"
df_scores = evaluator.evaluate_generation(data_path='testing.csv')
print(df_scores.head())

Step 2 - Preparing reranker inputs (parse ground truth + retriever output)

  • Use parse_retriever_data to convert ground-truth chunk ids and retriever outputs into a ranked_chunks_data.csv suitable for reranker evaluation.

Example:

from vero.evaluator.evaluator import Evaluator

evaluator = Evaluator()
# ground_truth_path: dataset with 'Chunk IDs' and 'Less Relevant Chunk IDs' columns
# data_path: retriever output with 'Context Retrieved' containing "id='...'"
evaluator.parse_retriever_data(
    ground_truth_path='test_dataset_generator.csv',
    data_path='testing.csv'
)
# This will produce 'ranked_chunks_data.csv'
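For intuition on what the parsing step does, the "id='...'" occurrences inside a 'Context Retrieved' cell can be extracted with a regular expression. The cell contents below are a hypothetical example of that format, not actual Vero output:

```python
import re

# Hypothetical 'Context Retrieved' cell containing chunk ids in the id='...' format.
cell = "Document(id='chunk_12', text='...') Document(id='chunk_07', text='...')"

# Pull out every chunk id so it can be compared against ground-truth ids.
chunk_ids = re.findall(r"id='([^']+)'", cell)
print(chunk_ids)  # ['chunk_12', 'chunk_07']
```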

Step 3 - Retrieval evaluation (precision, recall, sufficiency)

  • Inputs:
    • retriever_data_path: a CSV that contains 'Retrieved Chunk IDs' and 'True Chunk IDs' columns (lists or strings).
    • data_path: the generation CSV with 'Context Retrieved' and 'Question' (for sufficiency).
  • Result: Retrieval_Scores.csv

Example:

from vero.evaluator.evaluator import Evaluator

evaluator = Evaluator()
df_retrieval_scores = evaluator.evaluate_retrieval(
    data_path='testing.csv',
    retriever_data_path='ranked_chunks_data.csv'
)
print(df_retrieval_scores.head())

Step 4 - Reranker evaluation (MAP, MRR, NDCG)

Example:

from vero.evaluator.evaluator import Evaluator

evaluator = Evaluator()
df_reranker_scores = evaluator.evaluate_reranker(
    ground_truth_path='test_dataset_generator.csv',
    retriever_data_path='ranked_chunks_data.csv'
)
print(df_reranker_scores)
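For intuition on what the NDCG columns measure, the score for a single ranked list can be sketched in pure Python (this is an illustration of the standard formula, not Vero's implementation):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalized DCG: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# The reranker placed a mid-relevance item first; the ideal order is [3, 2, 0].
score = ndcg([2, 3, 0])
print(round(score, 4))  # ~0.9134
```

A score of 1.0 means the reranker reproduced the ideal ordering exactly; lower values penalize relevant chunks pushed down the list.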

Lower-level metric usage

To run a single metric directly you can instantiate the metric class. For example, to compute BARTScore or BertScore per pair:

from vero.metrics import BartScore, BertScore

with BartScore() as bs:
    bart_results = [bs.evaluate(context, answer) for context, answer in zip(contexts, answers)]

with BertScore() as bert:
    bert_results = [bert.evaluate(context, answer) for context, answer in zip(contexts, answers)]

Test Dataset Generation

Overview

  • The Test Dataset Generation module creates high-quality question-answer pairs derived from your document collection. It generates challenging queries designed to reveal retrieval and reasoning failures in RAG systems, taking edge-case user personas into account.
  • Internally it chunks documents, clusters related chunks, and uses an LLM to produce QA items with ground-truth chunk IDs and metadata.

Example

from vero.test_dataset_generator import generate_and_save

# Generate 100 queries from PDFs stored in ./data/pdfs directory and save outputs in test_dataset directory
generate_and_save(
    data_path='./data/pdfs/',
    usecase='Vitamin chatbot catering to general users for their daily queries',
    save_path='test_dataset',
    n_queries=100
)

Report Generation

Overview

  • The Report Generation module consolidates evaluation outputs from generation, retrieval, and reranking into a final report.
  • It orchestrates a stateful workflow that processes CSV results from various evaluators and synthesizes comprehensive insights and recommendations.

Example Usage

from vero.report_generation_workflow import ReportGenerator

# Initialize the report generator
report_generator = ReportGenerator()

# Generate the final report by providing:
# - Pipeline configuration JSON file
# - Generation, Retrieval, and Reranker evaluation CSV files
report_generator.generate_report(
    'pipe_config_data.json',
    'Generation_Scores.csv',
    'Retrieval_Scores.csv',
    'Reranked_Scores.csv'
)
