
rotalabs-eval

Comprehensive LLM evaluation framework with statistical rigor. Run evaluations with confidence intervals, significance tests, and effect size analysis.

Overview

rotalabs-eval provides a complete toolkit for evaluating LLM outputs across multiple dimensions: lexical accuracy, semantic similarity, RAG quality, and LLM-as-judge assessments. It includes built-in statistical analysis so you can make data-driven decisions about model performance with proper uncertainty quantification.

Key Features

  • 20+ Built-in Metrics: Exact match, F1, BLEU, ROUGE-L, BERTScore, embedding similarity, LLM-as-judge, RAG faithfulness, and more
  • Statistical Rigor: Bootstrap confidence intervals, paired significance tests, effect sizes, and power analysis
  • Multiple LLM Providers: OpenAI and Ollama support out of the box
  • Agent Evaluation: Multi-turn trajectory scoring, tool use accuracy, and multi-agent debate analysis
  • Flexible Backends: Run locally, or scale with Spark, Dask, or Ray (optional)
  • Experiment Tracking: MLflow and Weights & Biases integrations
  • Response Caching: SQLite disk cache and in-memory LRU cache to avoid redundant API calls
  • Cost Tracking: Token counting and cost estimation for major LLM providers
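Cost tracking, for instance, comes down to multiplying token counts by per-token prices. A minimal sketch of the idea (the price table and helper below are illustrative, not part of the rotalabs-eval API):

```python
# Illustrative sketch of token-based cost estimation; the price table and
# helper function are hypothetical, not rotalabs-eval's actual API.
PRICES_PER_1K = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},  # example prices in USD
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost from token counts and a per-1K-token price table."""
    p = PRICES_PER_1K[model]
    return (prompt_tokens / 1000) * p["input"] + (completion_tokens / 1000) * p["output"]

cost = estimate_cost("gpt-4o-mini", prompt_tokens=1200, completion_tokens=300)
print(f"${cost:.6f}")
```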

Installation

Basic Installation

pip install rotalabs-eval

With Optional Dependencies

# OpenAI inference
pip install "rotalabs-eval[openai]"

# Local models via Ollama
pip install "rotalabs-eval[ollama]"

# Embedding-based metrics (BERTScore, semantic similarity)
pip install "rotalabs-eval[embeddings]"

# Distributed backends
pip install "rotalabs-eval[spark]"
pip install "rotalabs-eval[dask]"
pip install "rotalabs-eval[ray]"

# Experiment tracking
pip install "rotalabs-eval[tracking]"   # MLflow
pip install "rotalabs-eval[wandb]"      # Weights & Biases

# Visualization
pip install "rotalabs-eval[viz]"

# Everything
pip install "rotalabs-eval[all]"

# Development
pip install "rotalabs-eval[dev]"

Quick Start

Define and Run an Evaluation

import pandas as pd
from rotalabs_eval import ModelConfig, ModelProvider, MetricConfig, EvalTask
from rotalabs_eval.orchestrator import LocalOrchestrator

# Prepare your dataset
data = pd.DataFrame({
    "question": ["What is Python?", "What is Rust?"],
    "reference": ["A programming language", "A systems programming language"],
})

# Configure the model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o-mini",
    api_key="sk-...",  # or set OPENAI_API_KEY env var
    temperature=0.0,
)

# Define the evaluation task
task = EvalTask(
    task_id="qa_eval",
    prompt_template="Answer concisely: {question}",
    reference_column="reference",
)

# Choose metrics
metrics = [
    MetricConfig(name="exact_match"),
    MetricConfig(name="f1"),
    MetricConfig(name="rouge_l"),
]

# Run
orchestrator = LocalOrchestrator()
result = orchestrator.run(data, task, model_config, metrics)
print(result)

Use Individual Metrics

from rotalabs_eval.metrics.lexical import ExactMatchMetric, F1Metric, BLEUMetric

exact = ExactMatchMetric()
print(exact.compute("hello world", "hello world"))  # MetricResult(score=1.0)

f1 = F1Metric()
print(f1.compute("the cat sat", "the cat"))  # MetricResult(score=0.8)

bleu = BLEUMetric()
print(bleu.compute("the cat is on the mat", "the cat sat on the mat"))

Statistical Comparisons

import numpy as np
from rotalabs_eval.statistics import (
    bootstrap_ci,
    paired_ttest,
    cohens_d,
)

model_a_scores = np.array([0.82, 0.75, 0.91, 0.78, 0.85])
model_b_scores = np.array([0.79, 0.71, 0.88, 0.80, 0.83])

# Confidence interval for Model A's mean
ci = bootstrap_ci(model_a_scores, confidence_level=0.95)
print(f"Model A: {np.mean(model_a_scores):.3f} [{ci[0]:.3f}, {ci[1]:.3f}]")

# Is the difference significant?
sig = paired_ttest(model_a_scores, model_b_scores)
print(f"p-value: {sig.p_value:.4f}, significant: {sig.significant}")

# How large is the effect?
effect = cohens_d(model_a_scores, model_b_scores)
print(f"Cohen's d: {effect.value:.3f} ({effect.interpretation})")

Power Analysis

from rotalabs_eval.statistics.power import sample_size_for_mean_diff

# How many examples do I need to detect a 0.05 improvement?
result = sample_size_for_mean_diff(
    effect_size=0.05,
    std_dev=0.15,
    alpha=0.05,
    power=0.80,
)
print(f"Required sample size: {result.sample_size}")

Agent Evaluation

from rotalabs_eval.agents.trajectory import GoalCompletionMetric
from rotalabs_eval.agents.tool_use import ToolSelectionAccuracyMetric

# Evaluate goal completion from a trajectory
goal_metric = GoalCompletionMetric()
result = goal_metric.compute(
    trajectory="User: Book a flight to NYC\nAssistant: I've booked your flight to NYC for tomorrow.",
    reference="Book a flight",
)
print(f"Goal completion: {result.score}")

# Evaluate tool selection accuracy
tool_metric = ToolSelectionAccuracyMetric()
result = tool_metric.compute(
    predicted_tools=["search", "book_flight"],
    expected_tools=["search", "book_flight", "confirm"],
)
print(f"Tool selection accuracy: {result.score:.2f}")

Custom Metrics

from rotalabs_eval.metrics.custom import create_custom_metric

# Create a metric from a function
def word_count_ratio(prediction: str, reference: str) -> float:
    pred_words = len(prediction.split())
    ref_words = len(reference.split())
    return min(pred_words, ref_words) / max(pred_words, ref_words) if ref_words else 0.0

WordCountRatio = create_custom_metric("word_count_ratio", word_count_ratio)
metric = WordCountRatio()
print(metric.compute("hello world", "hello beautiful world"))

Caching Responses

from rotalabs_eval.cache import MemoryCache, DiskCache

# In-memory LRU cache
cache = MemoryCache(max_size=1000)
cache.put("key1", {"response": "cached value"})
print(cache.get("key1"))

# Persistent SQLite cache
disk_cache = DiskCache(cache_dir="./eval_cache")
disk_cache.put("key1", {"response": "persisted value"})
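Conceptually, an LRU cache like MemoryCache can be sketched with an OrderedDict: a get moves the key to the most-recently-used end, and a put evicts the least-recently-used entry once max_size is exceeded. (This is a conceptual sketch, not the library's implementation.)

```python
from collections import OrderedDict

class TinyLRUCache:
    """Conceptual LRU cache sketch (not rotalabs-eval's implementation)."""

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = TinyLRUCache(max_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # evicts "b", the least recently used entry
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```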

Available Metrics

Lexical Metrics

| Metric | Class | Description |
|---|---|---|
| exact_match | ExactMatchMetric | Exact string match (with optional normalization) |
| f1 | F1Metric | Token-level F1 score |
| bleu | BLEUMetric | BLEU score for n-gram overlap |
| rouge_l | ROUGELMetric | ROUGE-L using longest common subsequence |
| contains | ContainsMetric | Check if reference appears in prediction |
| length_ratio | LengthRatioMetric | Length ratio between prediction and reference |
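Token-level F1, for example, is the harmonic mean of precision and recall over whitespace tokens. A plain-Python sketch of the computation (not the library's code):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred = Counter(prediction.split())
    ref = Counter(reference.split())
    overlap = sum((pred & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat"))  # ≈ 0.8
```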

Semantic Metrics

| Metric | Class | Description |
|---|---|---|
| bert_score | BERTScoreMetric | Contextual embedding similarity |
| embedding_similarity | EmbeddingSimilarityMetric | Cosine similarity of sentence embeddings |
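Embedding similarity reduces to cosine similarity between sentence vectors. A numpy sketch with toy vectors (in real usage the texts are first embedded with a sentence encoder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real sentence vectors
pred_vec = np.array([0.2, 0.8, 0.1])
ref_vec = np.array([0.25, 0.7, 0.15])
print(f"{cosine_similarity(pred_vec, ref_vec):.3f}")
```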

LLM-as-Judge Metrics

| Metric | Class | Description |
|---|---|---|
| llm_judge | LLMJudge | Single-answer grading with LLM |
| pairwise_judge | PairwiseJudge | Pairwise comparison between two models |
| g_eval | GEval | G-Eval framework for multi-aspect evaluation |

RAG Metrics

| Metric | Class | Description |
|---|---|---|
| context_relevance | ContextRelevanceMetric | Relevance of retrieved context to query |
| faithfulness | FaithfulnessMetric | Whether answer is grounded in context |
| answer_relevance | AnswerRelevanceMetric | Relevance of answer to the question |

Statistical Analysis

Confidence Intervals

  • bootstrap_ci() - Percentile bootstrap
  • bootstrap_ci_bca() - Bias-corrected and accelerated bootstrap
  • analytical_ci_mean() - t-distribution CI for means
  • analytical_ci_proportion() - Wilson/Normal/Clopper-Pearson for proportions
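The percentile bootstrap resamples the scores with replacement, recomputes the mean for each resample, and takes empirical quantiles of the resulting distribution. A numpy sketch of the idea (not the library's implementation):

```python
import numpy as np

def percentile_bootstrap_ci(scores, confidence_level=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for the mean: resample, recompute, take quantiles."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Draw n_resamples bootstrap samples (with replacement) and compute each mean
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    alpha = 1 - confidence_level
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

lo, hi = percentile_bootstrap_ci([0.82, 0.75, 0.91, 0.78, 0.85])
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```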

Significance Tests

  • paired_ttest() - Paired t-test for continuous metrics
  • mcnemar_test() - McNemar's test for binary outcomes
  • wilcoxon_signed_rank() - Non-parametric alternative
  • bootstrap_significance() - Bootstrap permutation test
  • choose_test() - Auto-select appropriate test based on data
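A paired permutation test randomly flips the sign of each per-example score difference (valid under the null of no difference) and checks how often the permuted mean difference is at least as extreme as the observed one. A numpy sketch of this idea (the library's bootstrap_significance() may differ in details):

```python
import numpy as np

def paired_permutation_test(a, b, n_permutations=10_000, seed=0):
    """Sign-flip permutation test for paired scores; returns a two-sided p-value."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a) - np.asarray(b)
    observed = abs(diffs.mean())
    # Under the null, each pair's difference is equally likely to have either sign
    signs = rng.choice([-1, 1], size=(n_permutations, len(diffs)))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # +1 smoothing avoids a p-value of exactly zero
    return (np.sum(permuted >= observed) + 1) / (n_permutations + 1)

p = paired_permutation_test([0.82, 0.75, 0.91, 0.78, 0.85],
                            [0.79, 0.71, 0.88, 0.80, 0.83])
print(f"p-value: {p:.3f}")
```

With only five pairs there are just 32 sign patterns, so the p-value cannot be small; permutation tests need more examples to reach significance.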

Effect Sizes

  • cohens_d() - Standardized mean difference
  • hedges_g() - Small-sample corrected Cohen's d
  • odds_ratio() - Odds ratio for binary outcomes
  • relative_improvement() - Percentage improvement
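Cohen's d standardizes a mean difference by a standard deviation; for paired comparisons a common variant divides the mean of the per-pair differences by their standard deviation. A numpy sketch of that paired variant (the library's conventions may differ):

```python
import numpy as np

def paired_cohens_d(a, b):
    """Cohen's d for paired samples: mean of differences / std of differences."""
    diffs = np.asarray(a) - np.asarray(b)
    return diffs.mean() / diffs.std(ddof=1)

d = paired_cohens_d([0.82, 0.75, 0.91, 0.78, 0.85],
                    [0.79, 0.71, 0.88, 0.80, 0.83])
# Conventional rough thresholds: 0.2 small, 0.5 medium, 0.8 large
print(f"Cohen's d: {d:.2f}")
```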

Power Analysis

  • sample_size_for_mean_diff() - Required n for detecting mean differences
  • sample_size_for_proportion_diff() - Required n for proportion differences
  • compute_power() - Statistical power for a given sample size
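For a two-sided test on a mean difference, the textbook approximation is n = ((z_alpha + z_beta) * sigma / delta)^2. A sketch with hardcoded normal quantiles for alpha = 0.05 (z = 1.96) and power = 0.80 (z = 0.84), which is the idea behind sample_size_for_mean_diff():

```python
import math

def sample_size_mean_diff(effect_size, std_dev, z_alpha=1.96, z_beta=0.84):
    """Approximate required n: ((z_alpha + z_beta) * sigma / delta)^2, rounded up."""
    n = ((z_alpha + z_beta) * std_dev / effect_size) ** 2
    return math.ceil(n)

# Detect a 0.05 improvement with per-example std dev 0.15 at alpha=0.05, power=0.80
print(sample_size_mean_diff(effect_size=0.05, std_dev=0.15))  # -> 71
```

The quadratic dependence on sigma / delta is why halving the detectable effect size quadruples the required sample.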

Development

# Clone and install in development mode
git clone https://github.com/rotalabs/rotalabs-eval.git
cd rotalabs-eval
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black src/ tests/
ruff check src/ tests/

License

MIT License - see LICENSE for details.
