# rotalabs-eval
Comprehensive LLM evaluation framework with statistical rigor. Run evaluations with confidence intervals, significance tests, and effect size analysis.
## Overview
rotalabs-eval provides a complete toolkit for evaluating LLM outputs across multiple dimensions: lexical accuracy, semantic similarity, RAG quality, and LLM-as-judge assessments. It includes built-in statistical analysis so you can make data-driven decisions about model performance with proper uncertainty quantification.
## Key Features
- **20+ Built-in Metrics**: Exact match, F1, BLEU, ROUGE-L, BERTScore, embedding similarity, LLM-as-judge, RAG faithfulness, and more
- **Statistical Rigor**: Bootstrap confidence intervals, paired significance tests, effect sizes, and power analysis
- **Multiple LLM Providers**: OpenAI and Ollama support out of the box
- **Agent Evaluation**: Multi-turn trajectory scoring, tool use accuracy, and multi-agent debate analysis
- **Flexible Backends**: Run locally, or scale with Spark, Dask, or Ray (optional)
- **Experiment Tracking**: MLflow and Weights & Biases integrations
- **Response Caching**: SQLite disk cache and in-memory LRU cache to avoid redundant API calls
- **Cost Tracking**: Token counting and cost estimation for major LLM providers
## Installation

### Basic Installation

```bash
pip install rotalabs-eval
```
### With Optional Dependencies

```bash
# OpenAI inference
pip install rotalabs-eval[openai]

# Local models via Ollama
pip install rotalabs-eval[ollama]

# Embedding-based metrics (BERTScore, semantic similarity)
pip install rotalabs-eval[embeddings]

# Distributed backends
pip install rotalabs-eval[spark]
pip install rotalabs-eval[dask]
pip install rotalabs-eval[ray]

# Experiment tracking
pip install rotalabs-eval[tracking]  # MLflow
pip install rotalabs-eval[wandb]     # Weights & Biases

# Visualization
pip install rotalabs-eval[viz]

# Everything
pip install rotalabs-eval[all]

# Development
pip install rotalabs-eval[dev]
```
## Quick Start

### Define and Run an Evaluation
```python
import pandas as pd

from rotalabs_eval import ModelConfig, ModelProvider, MetricConfig, EvalTask
from rotalabs_eval.orchestrator import LocalOrchestrator

# Prepare your dataset
data = pd.DataFrame({
    "question": ["What is Python?", "What is Rust?"],
    "reference": ["A programming language", "A systems programming language"],
})

# Configure the model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o-mini",
    api_key="sk-...",  # or set OPENAI_API_KEY env var
    temperature=0.0,
)

# Define the evaluation task
task = EvalTask(
    task_id="qa_eval",
    prompt_template="Answer concisely: {question}",
    reference_column="reference",
)

# Choose metrics
metrics = [
    MetricConfig(name="exact_match"),
    MetricConfig(name="f1"),
    MetricConfig(name="rouge_l"),
]

# Run
orchestrator = LocalOrchestrator()
result = orchestrator.run(data, task, model_config, metrics)
print(result)
```
### Use Individual Metrics

```python
from rotalabs_eval.metrics.lexical import ExactMatchMetric, F1Metric, BLEUMetric

exact = ExactMatchMetric()
print(exact.compute("hello world", "hello world"))  # MetricResult(score=1.0)

f1 = F1Metric()
print(f1.compute("the cat sat", "the cat"))  # MetricResult(score=0.8)

bleu = BLEUMetric()
print(bleu.compute("the cat is on the mat", "the cat sat on the mat"))
```
### Statistical Comparisons

```python
import numpy as np

from rotalabs_eval.statistics import (
    bootstrap_ci,
    paired_ttest,
    cohens_d,
)

model_a_scores = np.array([0.82, 0.75, 0.91, 0.78, 0.85])
model_b_scores = np.array([0.79, 0.71, 0.88, 0.80, 0.83])

# Confidence interval for Model A's mean
ci = bootstrap_ci(model_a_scores, confidence_level=0.95)
print(f"Model A: {np.mean(model_a_scores):.3f} [{ci[0]:.3f}, {ci[1]:.3f}]")

# Is the difference significant?
sig = paired_ttest(model_a_scores, model_b_scores)
print(f"p-value: {sig.p_value:.4f}, significant: {sig.significant}")

# How large is the effect?
effect = cohens_d(model_a_scores, model_b_scores)
print(f"Cohen's d: {effect.value:.3f} ({effect.interpretation})")
```
### Power Analysis

```python
from rotalabs_eval.statistics.power import sample_size_for_mean_diff

# How many examples do I need to detect a 0.05 improvement?
result = sample_size_for_mean_diff(
    effect_size=0.05,
    std_dev=0.15,
    alpha=0.05,
    power=0.80,
)
print(f"Required sample size: {result.sample_size}")
```
## Agent Evaluation

```python
from rotalabs_eval.agents.trajectory import GoalCompletionMetric
from rotalabs_eval.agents.tool_use import ToolSelectionAccuracyMetric

# Evaluate goal completion from a trajectory
goal_metric = GoalCompletionMetric()
result = goal_metric.compute(
    trajectory="User: Book a flight to NYC\nAssistant: I've booked your flight to NYC for tomorrow.",
    reference="Book a flight",
)
print(f"Goal completion: {result.score}")

# Evaluate tool selection accuracy
tool_metric = ToolSelectionAccuracyMetric()
result = tool_metric.compute(
    predicted_tools=["search", "book_flight"],
    expected_tools=["search", "book_flight", "confirm"],
)
print(f"Tool selection accuracy: {result.score:.2f}")
```
## Custom Metrics

```python
from rotalabs_eval.metrics.custom import create_custom_metric

# Create a metric from a function
def word_count_ratio(prediction: str, reference: str) -> float:
    pred_words = len(prediction.split())
    ref_words = len(reference.split())
    return min(pred_words, ref_words) / max(pred_words, ref_words) if ref_words else 0.0

WordCountRatio = create_custom_metric("word_count_ratio", word_count_ratio)
metric = WordCountRatio()
print(metric.compute("hello world", "hello beautiful world"))
```
## Caching Responses

```python
from rotalabs_eval.cache import MemoryCache, DiskCache

# In-memory LRU cache
cache = MemoryCache(max_size=1000)
cache.put("key1", {"response": "cached value"})
print(cache.get("key1"))

# Persistent SQLite cache
disk_cache = DiskCache(cache_dir="./eval_cache")
disk_cache.put("key1", {"response": "persisted value"})
```
## Available Metrics

### Lexical Metrics

| Metric | Class | Description |
|---|---|---|
| `exact_match` | `ExactMatchMetric` | Exact string match (with optional normalization) |
| `f1` | `F1Metric` | Token-level F1 score |
| `bleu` | `BLEUMetric` | BLEU score for n-gram overlap |
| `rouge_l` | `ROUGELMetric` | ROUGE-L using longest common subsequence |
| `contains` | `ContainsMetric` | Check whether the reference appears in the prediction |
| `length_ratio` | `LengthRatioMetric` | Length ratio between prediction and reference |
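For intuition, token-level F1 can be sketched in a few lines of plain Python. This is a standalone illustration of the standard formulation (precision and recall over overlapping tokens), not the library's actual implementation:

```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count overlapping tokens, respecting multiplicity
    ref_counts: dict[str, int] = {}
    for t in ref_tokens:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f"{token_f1('the cat sat', 'the cat'):.2f}")  # 0.80
```

This matches the `F1Metric` example above: precision 2/3, recall 2/2, F1 = 0.8.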
### Semantic Metrics

| Metric | Class | Description |
|---|---|---|
| `bert_score` | `BERTScoreMetric` | Contextual embedding similarity |
| `embedding_similarity` | `EmbeddingSimilarityMetric` | Cosine similarity of sentence embeddings |
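At its core, `embedding_similarity` reduces to cosine similarity between two embedding vectors. A minimal NumPy sketch, with toy vectors standing in for real sentence embeddings (which would come from a sentence encoder installed via the `embeddings` extra):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the vectors divided by
    the product of their Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real sentence embeddings
v1 = np.array([0.2, 0.8, 0.1])
v2 = np.array([0.25, 0.7, 0.05])
print(f"{cosine_similarity(v1, v2):.3f}")
```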
### LLM-as-Judge Metrics

| Metric | Class | Description |
|---|---|---|
| `llm_judge` | `LLMJudge` | Single-answer grading with an LLM |
| `pairwise_judge` | `PairwiseJudge` | Pairwise comparison between two models |
| `g_eval` | `GEval` | G-Eval framework for multi-aspect evaluation |
### RAG Metrics

| Metric | Class | Description |
|---|---|---|
| `context_relevance` | `ContextRelevanceMetric` | Relevance of retrieved context to the query |
| `faithfulness` | `FaithfulnessMetric` | Whether the answer is grounded in the context |
| `answer_relevance` | `AnswerRelevanceMetric` | Relevance of the answer to the question |
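As a rough intuition for faithfulness: the library's metric checks whether the answer is grounded in the retrieved context (typically by having an LLM verify individual claims). A toy lexical proxy, shown only to illustrate the idea, measures the fraction of answer tokens that appear in the context:

```python
def lexical_grounding(answer: str, context: str) -> float:
    """Toy proxy for faithfulness: fraction of answer tokens that
    also appear in the retrieved context. Real faithfulness metrics
    verify claims semantically, not lexically."""
    answer_tokens = answer.lower().split()
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    grounded = sum(1 for t in answer_tokens if t in context_tokens)
    return grounded / len(answer_tokens)

print(lexical_grounding("python is a language", "python is a programming language"))  # 1.0
```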
## Statistical Analysis

### Confidence Intervals

- `bootstrap_ci()` - Percentile bootstrap
- `bootstrap_ci_bca()` - Bias-corrected and accelerated (BCa) bootstrap
- `analytical_ci_mean()` - t-distribution CI for means
- `analytical_ci_proportion()` - Wilson/Normal/Clopper-Pearson intervals for proportions
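The percentile bootstrap is simple enough to sketch directly: resample the scores with replacement many times, and take quantiles of the resampled means. A standalone NumPy sketch (not the library's implementation):

```python
import numpy as np

def percentile_bootstrap_ci(scores, confidence_level=0.95,
                            n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    then take the alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ])
    alpha = 1 - confidence_level
    return (np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2))

lo, hi = percentile_bootstrap_ci([0.82, 0.75, 0.91, 0.78, 0.85])
print(f"[{lo:.3f}, {hi:.3f}]")
```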
### Significance Tests

- `paired_ttest()` - Paired t-test for continuous metrics
- `mcnemar_test()` - McNemar's test for binary outcomes
- `wilcoxon_signed_rank()` - Non-parametric alternative to the paired t-test
- `bootstrap_significance()` - Bootstrap permutation test
- `choose_test()` - Auto-select an appropriate test based on the data
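The idea behind a resampling-based significance test can be sketched as a paired sign-flip permutation test: under the null hypothesis the sign of each per-example score difference is arbitrary, so flipping signs at random gives the null distribution of the mean difference. This is an independent illustration of the technique, not the library's `bootstrap_significance()` implementation:

```python
import numpy as np

def paired_permutation_test(a, b, n_permutations=10_000, seed=0):
    """Paired sign-flip permutation test. Returns an approximate
    two-sided p-value for the mean of the per-example differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a) - np.asarray(b)
    observed = abs(diffs.mean())
    # Randomly flip the sign of each difference in every permutation
    signs = rng.choice([-1, 1], size=(n_permutations, len(diffs)))
    permuted = np.abs((signs * diffs).mean(axis=1))
    return float((permuted >= observed).mean())

p = paired_permutation_test([0.82, 0.75, 0.91, 0.78, 0.85],
                            [0.79, 0.71, 0.88, 0.80, 0.83])
print(f"p = {p:.3f}")
```

With only five paired examples, no permutation test can reach a very small p-value, which is exactly the situation power analysis (below) helps you plan around.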
### Effect Sizes

- `cohens_d()` - Standardized mean difference
- `hedges_g()` - Small-sample-corrected Cohen's d
- `odds_ratio()` - Odds ratio for binary outcomes
- `relative_improvement()` - Percentage improvement
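One common paired-sample formulation of Cohen's d divides the mean per-example difference by the standard deviation of those differences. The library's exact definition may differ (e.g. pooled-SD for independent samples); this stdlib-only sketch shows the paired variant:

```python
import math

def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean of the per-example
    differences divided by the sample standard deviation of the
    differences (Bessel-corrected)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var)

d = cohens_d_paired([0.82, 0.75, 0.91, 0.78, 0.85],
                    [0.79, 0.71, 0.88, 0.80, 0.83])
print(f"d = {d:.2f}")  # d = 0.85
```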
### Power Analysis

- `sample_size_for_mean_diff()` - Required n for detecting mean differences
- `sample_size_for_proportion_diff()` - Required n for proportion differences
- `compute_power()` - Statistical power for a given sample size
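The textbook normal-approximation formula behind this kind of sample-size calculation is n = ((z_{1-α/2} + z_power) · σ / δ)². A stdlib-only sketch using `statistics.NormalDist` for the z-quantiles (the library's `sample_size_for_mean_diff()` may use a more refined calculation):

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean_diff(effect_size, std_dev, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided test of a mean
    difference: n = ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2."""
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(power)) * std_dev / effect_size) ** 2
    return ceil(n)

# Detecting a 0.05 improvement with sigma = 0.15, alpha = 0.05, power = 0.80
print(sample_size_mean_diff(effect_size=0.05, std_dev=0.15))  # 71
```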
## Development

```bash
# Clone and install in development mode
git clone https://github.com/rotalabs/rotalabs-eval.git
cd rotalabs-eval
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format and lint
black src/ tests/
ruff check src/ tests/
```
## License
AGPL-3.0 License - see LICENSE for details.
## Links
- PyPI: https://pypi.org/project/rotalabs-eval/
- GitHub: https://github.com/rotalabs/rotalabs-eval
- Documentation: https://rotalabs.github.io/rotalabs-eval/
- Website: https://rotalabs.ai
- Contact: research@rotalabs.ai