
rotalabs-eval

Comprehensive LLM evaluation framework with statistical rigor. Run evaluations with confidence intervals, significance tests, and effect size analysis.

Overview

rotalabs-eval provides a complete toolkit for evaluating LLM outputs across multiple dimensions: lexical accuracy, semantic similarity, RAG quality, and LLM-as-judge assessments. It includes built-in statistical analysis so you can make data-driven decisions about model performance with proper uncertainty quantification.

Key Features

  • 20+ Built-in Metrics: Exact match, F1, BLEU, ROUGE-L, BERTScore, embedding similarity, LLM-as-judge, RAG faithfulness, and more
  • Statistical Rigor: Bootstrap confidence intervals, paired significance tests, effect sizes, and power analysis
  • Multiple LLM Providers: OpenAI and Ollama support out of the box
  • Agent Evaluation: Multi-turn trajectory scoring, tool use accuracy, and multi-agent debate analysis
  • Flexible Backends: Run locally, or scale with Spark, Dask, or Ray (optional)
  • Experiment Tracking: MLflow and Weights & Biases integrations
  • Response Caching: SQLite disk cache and in-memory LRU cache to avoid redundant API calls
  • Cost Tracking: Token counting and cost estimation for major LLM providers
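Cost tracking, for instance, comes down to multiplying token counts by per-token prices. A minimal sketch of the idea (the price table and helper below are illustrative, not part of the rotalabs-eval API):

```python
# Illustrative sketch of token-based cost estimation; the price table and
# helper function are hypothetical, not rotalabs-eval's actual API.
PRICES_PER_1K = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},  # example prices in USD
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost from token counts and a per-1K-token price table."""
    p = PRICES_PER_1K[model]
    return (prompt_tokens / 1000) * p["input"] + (completion_tokens / 1000) * p["output"]

cost = estimate_cost("gpt-4o-mini", prompt_tokens=1200, completion_tokens=300)
print(f"${cost:.6f}")
```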

Installation

Basic Installation

pip install rotalabs-eval

With Optional Dependencies

# OpenAI inference
pip install "rotalabs-eval[openai]"

# Local models via Ollama
pip install "rotalabs-eval[ollama]"

# Embedding-based metrics (BERTScore, semantic similarity)
pip install "rotalabs-eval[embeddings]"

# Distributed backends
pip install "rotalabs-eval[spark]"
pip install "rotalabs-eval[dask]"
pip install "rotalabs-eval[ray]"

# Experiment tracking
pip install "rotalabs-eval[tracking]"   # MLflow
pip install "rotalabs-eval[wandb]"      # Weights & Biases

# Visualization
pip install "rotalabs-eval[viz]"

# Everything
pip install "rotalabs-eval[all]"

# Development
pip install "rotalabs-eval[dev]"

Quick Start

Define and Run an Evaluation

import pandas as pd
from rotalabs_eval import ModelConfig, ModelProvider, MetricConfig, EvalTask
from rotalabs_eval.orchestrator import LocalOrchestrator

# Prepare your dataset
data = pd.DataFrame({
    "question": ["What is Python?", "What is Rust?"],
    "reference": ["A programming language", "A systems programming language"],
})

# Configure the model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o-mini",
    api_key="sk-...",  # or set OPENAI_API_KEY env var
    temperature=0.0,
)

# Define the evaluation task
task = EvalTask(
    task_id="qa_eval",
    prompt_template="Answer concisely: {question}",
    reference_column="reference",
)

# Choose metrics
metrics = [
    MetricConfig(name="exact_match"),
    MetricConfig(name="f1"),
    MetricConfig(name="rouge_l"),
]

# Run
orchestrator = LocalOrchestrator()
result = orchestrator.run(data, task, model_config, metrics)
print(result)

Use Individual Metrics

from rotalabs_eval.metrics.lexical import ExactMatchMetric, F1Metric, BLEUMetric

exact = ExactMatchMetric()
print(exact.compute("hello world", "hello world"))  # MetricResult(score=1.0)

f1 = F1Metric()
print(f1.compute("the cat sat", "the cat"))  # MetricResult(score=0.8)

bleu = BLEUMetric()
print(bleu.compute("the cat is on the mat", "the cat sat on the mat"))

Statistical Comparisons

import numpy as np
from rotalabs_eval.statistics import (
    bootstrap_ci,
    paired_ttest,
    cohens_d,
)

model_a_scores = np.array([0.82, 0.75, 0.91, 0.78, 0.85])
model_b_scores = np.array([0.79, 0.71, 0.88, 0.80, 0.83])

# Confidence interval for Model A's mean
ci = bootstrap_ci(model_a_scores, confidence_level=0.95)
print(f"Model A: {np.mean(model_a_scores):.3f} [{ci[0]:.3f}, {ci[1]:.3f}]")

# Is the difference significant?
sig = paired_ttest(model_a_scores, model_b_scores)
print(f"p-value: {sig.p_value:.4f}, significant: {sig.significant}")

# How large is the effect?
effect = cohens_d(model_a_scores, model_b_scores)
print(f"Cohen's d: {effect.value:.3f} ({effect.interpretation})")

Power Analysis

from rotalabs_eval.statistics.power import sample_size_for_mean_diff

# How many examples do I need to detect a 0.05 improvement?
result = sample_size_for_mean_diff(
    effect_size=0.05,
    std_dev=0.15,
    alpha=0.05,
    power=0.80,
)
print(f"Required sample size: {result.sample_size}")

Agent Evaluation

from rotalabs_eval.agents.trajectory import GoalCompletionMetric
from rotalabs_eval.agents.tool_use import ToolSelectionAccuracyMetric

# Evaluate goal completion from a trajectory
goal_metric = GoalCompletionMetric()
result = goal_metric.compute(
    trajectory="User: Book a flight to NYC\nAssistant: I've booked your flight to NYC for tomorrow.",
    reference="Book a flight",
)
print(f"Goal completion: {result.score}")

# Evaluate tool selection accuracy
tool_metric = ToolSelectionAccuracyMetric()
result = tool_metric.compute(
    predicted_tools=["search", "book_flight"],
    expected_tools=["search", "book_flight", "confirm"],
)
print(f"Tool selection accuracy: {result.score:.2f}")

Custom Metrics

from rotalabs_eval.metrics.custom import create_custom_metric

# Create a metric from a function
def word_count_ratio(prediction: str, reference: str) -> float:
    pred_words = len(prediction.split())
    ref_words = len(reference.split())
    return min(pred_words, ref_words) / max(pred_words, ref_words) if ref_words else 0.0

WordCountRatio = create_custom_metric("word_count_ratio", word_count_ratio)
metric = WordCountRatio()
print(metric.compute("hello world", "hello beautiful world"))

Caching Responses

from rotalabs_eval.cache import MemoryCache, DiskCache

# In-memory LRU cache
cache = MemoryCache(max_size=1000)
cache.put("key1", {"response": "cached value"})
print(cache.get("key1"))

# Persistent SQLite cache
disk_cache = DiskCache(cache_dir="./eval_cache")
disk_cache.put("key1", {"response": "persisted value"})
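Conceptually, an LRU cache like MemoryCache can be sketched with an OrderedDict: a get moves the key to the most-recently-used end, and a put evicts the least-recently-used entry once max_size is exceeded. (This is a conceptual sketch, not the library's implementation.)

```python
from collections import OrderedDict

class TinyLRUCache:
    """Conceptual LRU cache sketch (not rotalabs-eval's implementation)."""

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = TinyLRUCache(max_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # evicts "b", the least recently used entry
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```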

Available Metrics

Lexical Metrics

| Metric | Class | Description |
|---|---|---|
| exact_match | ExactMatchMetric | Exact string match (with optional normalization) |
| f1 | F1Metric | Token-level F1 score |
| bleu | BLEUMetric | BLEU score for n-gram overlap |
| rouge_l | ROUGELMetric | ROUGE-L using longest common subsequence |
| contains | ContainsMetric | Check if reference appears in prediction |
| length_ratio | LengthRatioMetric | Length ratio between prediction and reference |
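Token-level F1, for example, is the harmonic mean of precision and recall over whitespace tokens. A plain-Python sketch of the computation (not the library's code):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred = Counter(prediction.split())
    ref = Counter(reference.split())
    overlap = sum((pred & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat"))  # ≈ 0.8
```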

Semantic Metrics

| Metric | Class | Description |
|---|---|---|
| bert_score | BERTScoreMetric | Contextual embedding similarity |
| embedding_similarity | EmbeddingSimilarityMetric | Cosine similarity of sentence embeddings |
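Embedding similarity reduces to cosine similarity between sentence vectors. A numpy sketch with toy vectors (in real usage the texts are first embedded with a sentence encoder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real sentence vectors
pred_vec = np.array([0.2, 0.8, 0.1])
ref_vec = np.array([0.25, 0.7, 0.15])
print(f"{cosine_similarity(pred_vec, ref_vec):.3f}")
```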

LLM-as-Judge Metrics

| Metric | Class | Description |
|---|---|---|
| llm_judge | LLMJudge | Single-answer grading with LLM |
| pairwise_judge | PairwiseJudge | Pairwise comparison between two models |
| g_eval | GEval | G-Eval framework for multi-aspect evaluation |

RAG Metrics

| Metric | Class | Description |
|---|---|---|
| context_relevance | ContextRelevanceMetric | Relevance of retrieved context to query |
| faithfulness | FaithfulnessMetric | Whether answer is grounded in context |
| answer_relevance | AnswerRelevanceMetric | Relevance of answer to the question |

Statistical Analysis

Confidence Intervals

  • bootstrap_ci() - Percentile bootstrap
  • bootstrap_ci_bca() - Bias-corrected and accelerated bootstrap
  • analytical_ci_mean() - t-distribution CI for means
  • analytical_ci_proportion() - Wilson/Normal/Clopper-Pearson for proportions
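The percentile bootstrap resamples the scores with replacement, recomputes the mean for each resample, and takes empirical quantiles of the resulting distribution. A numpy sketch of the idea (not the library's implementation):

```python
import numpy as np

def percentile_bootstrap_ci(scores, confidence_level=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for the mean: resample, recompute, take quantiles."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Draw n_resamples bootstrap samples (with replacement) and compute each mean
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    alpha = 1 - confidence_level
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

lo, hi = percentile_bootstrap_ci([0.82, 0.75, 0.91, 0.78, 0.85])
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```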

Significance Tests

  • paired_ttest() - Paired t-test for continuous metrics
  • mcnemar_test() - McNemar's test for binary outcomes
  • wilcoxon_signed_rank() - Non-parametric alternative
  • bootstrap_significance() - Bootstrap permutation test
  • choose_test() - Auto-select appropriate test based on data
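A paired permutation test randomly flips the sign of each per-example score difference (valid under the null of no difference) and checks how often the permuted mean difference is at least as extreme as the observed one. A numpy sketch of this idea (the library's bootstrap_significance() may differ in details):

```python
import numpy as np

def paired_permutation_test(a, b, n_permutations=10_000, seed=0):
    """Sign-flip permutation test for paired scores; returns a two-sided p-value."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a) - np.asarray(b)
    observed = abs(diffs.mean())
    # Under the null, each pair's difference is equally likely to have either sign
    signs = rng.choice([-1, 1], size=(n_permutations, len(diffs)))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # +1 smoothing avoids a p-value of exactly zero
    return (np.sum(permuted >= observed) + 1) / (n_permutations + 1)

p = paired_permutation_test([0.82, 0.75, 0.91, 0.78, 0.85],
                            [0.79, 0.71, 0.88, 0.80, 0.83])
print(f"p-value: {p:.3f}")
```

With only five pairs there are just 32 sign patterns, so the p-value cannot be small; permutation tests need more examples to reach significance.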

Effect Sizes

  • cohens_d() - Standardized mean difference
  • hedges_g() - Small-sample corrected Cohen's d
  • odds_ratio() - Odds ratio for binary outcomes
  • relative_improvement() - Percentage improvement
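Cohen's d standardizes a mean difference by a standard deviation; for paired comparisons a common variant divides the mean of the per-pair differences by their standard deviation. A numpy sketch of that paired variant (the library's conventions may differ):

```python
import numpy as np

def paired_cohens_d(a, b):
    """Cohen's d for paired samples: mean of differences / std of differences."""
    diffs = np.asarray(a) - np.asarray(b)
    return diffs.mean() / diffs.std(ddof=1)

d = paired_cohens_d([0.82, 0.75, 0.91, 0.78, 0.85],
                    [0.79, 0.71, 0.88, 0.80, 0.83])
# Conventional rough thresholds: 0.2 small, 0.5 medium, 0.8 large
print(f"Cohen's d: {d:.2f}")
```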

Power Analysis

  • sample_size_for_mean_diff() - Required n for detecting mean differences
  • sample_size_for_proportion_diff() - Required n for proportion differences
  • compute_power() - Statistical power for a given sample size
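For a two-sided test on a mean difference, the textbook approximation is n = ((z_alpha + z_beta) * sigma / delta)^2. A sketch with hardcoded normal quantiles for alpha = 0.05 (z = 1.96) and power = 0.80 (z = 0.84), which is the idea behind sample_size_for_mean_diff():

```python
import math

def sample_size_mean_diff(effect_size, std_dev, z_alpha=1.96, z_beta=0.84):
    """Approximate required n: ((z_alpha + z_beta) * sigma / delta)^2, rounded up."""
    n = ((z_alpha + z_beta) * std_dev / effect_size) ** 2
    return math.ceil(n)

# Detect a 0.05 improvement with per-example std dev 0.15 at alpha=0.05, power=0.80
print(sample_size_mean_diff(effect_size=0.05, std_dev=0.15))  # -> 71
```

The quadratic dependence on sigma / delta is why halving the detectable effect size quadruples the required sample.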

Development

# Clone and install in development mode
git clone https://github.com/rotalabs/rotalabs-eval.git
cd rotalabs-eval
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black src/ tests/
ruff check src/ tests/

License

MIT License - see LICENSE for details.
