Open evaluation kit for LLM systems
Project description
OpenEvalKit
Production-grade Python framework for evaluating LLM systems with traditional scorers, LLM judges (OpenAI, Anthropic, Ollama, 100+ models via LiteLLM), ensemble aggregation, and smart caching for cost-effective testing.
Table of Contents
- Why OpenEvalKit?
- Quick Start
- Installation
- Features
- Examples
- Configuration
- Built-in Scorers
- Supported Models
- Custom Scorers
- Contributing
- License
Why OpenEvalKit?
- Production Ready: Smart caching with LRU eviction, parallel execution, comprehensive error handling
- Cost Effective: Intelligent caching avoids redundant LLM API calls, saving you money
- Flexible Model Support: Works with 100+ models via LiteLLM - OpenAI, Anthropic, Google, local models via Ollama
- Reliable Evaluation: Ensemble judges with configurable aggregation methods (average, median, majority vote, unanimous)
- Developer Friendly: Clean API, extensive documentation, comprehensive type hints
- Battle Tested: Comprehensive test suite, proven in production environments
Quick Start
Evaluate your LLM outputs in just a few lines:
```python
from openevalkit import Dataset, Run, evaluate
from openevalkit.scorers import ExactMatch, BLEU, TokenF1

# Create your dataset
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="Capital of France?", output="Paris", reference="Paris"),
    Run(id="3", input="Translate hello", output="hola amigo", reference="hola"),
])

# Evaluate with multiple scorers
results = evaluate(
    dataset,
    scorers=[ExactMatch(), BLEU(max_n=2), TokenF1()]
)

print(results.aggregates)
# {'exact_match': 0.6667, 'bleu_mean': 0.8165, 'token_f1_mean': 0.8889, ...}
```
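For intuition: ExactMatch is binary, while TokenF1 gives partial credit as the harmonic mean of token precision and recall. A rough stdlib sketch of the idea (not OpenEvalKit's actual implementation, which may normalize tokens differently):

```python
from collections import Counter

def token_f1(output: str, reference: str) -> float:
    # Whitespace tokenization; real scorers may also normalize case/punctuation
    out_counts = Counter(output.split())
    ref_counts = Counter(reference.split())
    overlap = sum((out_counts & ref_counts).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

On run 3 above, "hola amigo" vs. "hola" gives precision 1/2 and recall 1, so F1 = 2/3; averaged with the two perfect runs, that reproduces the 0.8889 shown.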
With LLM Judges
```python
from openevalkit import Dataset, Run, evaluate
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

dataset = Dataset([
    Run(id="1", input="Explain Python", output="Python is a high-level programming language..."),
    Run(id="2", input="What is AI?", output="AI stands for Artificial Intelligence..."),
])

rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_overall_mean': 0.85, ...}
```
Installation
From PyPI
```bash
pip install openevalkit
```
Recommended: Use a virtual environment to avoid dependency conflicts:
```bash
python -m venv openevalkit_env
source openevalkit_env/bin/activate  # On Windows: openevalkit_env\Scripts\activate
pip install openevalkit
```
From Source
OpenEvalKit uses uv for fast dependency management:
```bash
# Clone the repository
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies (includes dev dependencies)
uv sync --dev
```
Traditional pip installation:
```bash
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit
pip install -e .
```
Features
- 17 Built-in Scorers - Text similarity, token-level, structural, semantic, and performance metrics
- LLM Judges - Evaluate quality with any LLM (100+ models supported)
- Ensemble Judges - Combine multiple judges for reliable evaluation
- Smart Caching - Automatic result caching with LRU eviction
- Parallel Execution - Fast evaluation with configurable concurrency
- Flexible Data Loading - JSONL, CSV, or in-memory datasets
- Comprehensive Configuration - Timeouts, retries, seed control, progress bars
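To illustrate the caching bullet: eviction with an LRU policy can be sketched in a few lines with `collections.OrderedDict` (illustrative only — OpenEvalKit's cache additionally enforces the size and age limits shown under Configuration):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: recently used keys survive, stale ones are evicted."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop least recently used
```

Keying such a cache on (model, prompt, rubric) is what lets repeated evaluations skip redundant API calls.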
Examples
Traditional Scorers
```python
from openevalkit import Dataset, evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, ContainsKeywords, BLEU, TokenF1

# Create dataset
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# Evaluate with multiple scorers
results = evaluate(
    dataset,
    scorers=[
        ExactMatch(),
        RegexMatch(pattern=r'\d+'),  # Contains numbers
        ContainsKeywords(keywords=["python", "code"], ignore_case=True),
        BLEU(),     # N-gram precision
        TokenF1(),  # Token-level F1
    ]
)

print(results.aggregates)
# {'exact_match': 0.85, 'bleu_mean': 0.72, 'token_f1_mean': 0.91, ...}
```
LLM Judges
```python
from openevalkit import evaluate
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Define what to evaluate
rubric = Rubric(
    criteria=["correctness", "clarity", "completeness"],
    scale="0-1",
    criteria_descriptions={
        "correctness": "Factually accurate with no errors",
        "clarity": "Easy to understand and well-structured",
        "completeness": "Addresses all aspects of the question"
    },
    weights={"correctness": 3.0, "clarity": 1.5, "completeness": 1.5}
)

# Use OpenAI
judge_gpt = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o", temperature=0.0),
    rubric=rubric
)

# Use Anthropic
judge_claude = LLMJudge(
    llm_config=LLMConfig(model="claude-3-5-sonnet-20241022"),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge_gpt, judge_claude])
```
Ensemble Evaluation
Combine multiple judges for more reliable scores:
```python
from openevalkit.judges import EnsembleJudge

ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",   # Options: "average", "median", "majority_vote", "unanimous"
    min_agreement=0.7,  # Warn if judges disagree
    n_jobs=3            # Parallel evaluation
)

results = evaluate(dataset, judges=[ensemble])
```
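To make the aggregation options concrete, here is a plain-Python sketch of how the four methods could combine per-judge scores (the function and its 0.5 vote threshold are illustrative assumptions, not OpenEvalKit's API):

```python
from statistics import mean, median

def aggregate(scores, method="average", vote_threshold=0.5):
    # scores: per-judge scores in [0, 1]
    votes = [s >= vote_threshold for s in scores]  # used by vote-based methods
    if method == "average":
        return mean(scores)
    if method == "median":
        return median(scores)
    if method == "majority_vote":
        return 1.0 if sum(votes) > len(votes) / 2 else 0.0
    if method == "unanimous":
        return 1.0 if all(votes) else 0.0
    raise ValueError(f"unknown method: {method!r}")
```

Median is more robust to a single outlier judge than the mean, while the vote-based methods reduce each run to a pass/fail decision.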
Using Local Models (Ollama)
Run evaluations completely offline with local models:
```python
# First, install and start Ollama:
#   curl -fsSL https://ollama.com/install.sh | sh
#   ollama pull llama3
#   ollama serve

judge = LLMJudge(
    llm_config=LLMConfig(
        model="ollama/llama3",
        api_base="http://localhost:11434"
    ),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge])
```
Loading Data
From JSONL:
```python
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected",
    metadata_fields=["user_id"],
    metrics_fields=["latency"]
)
```
From CSV:
```python
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected"
)
```
From code:
```python
from openevalkit import Dataset, Run

dataset = Dataset([
    Run(id="1", input="Q1", output="A1", reference="A1"),
    Run(id="2", input="Q2", output="A2", reference="A2"),
])
```
Configuration
```python
from openevalkit import EvalConfig

config = EvalConfig(
    # Execution
    concurrency=10,         # Parallel runs
    timeout=30.0,           # Timeout per run (seconds)

    # Reproducibility
    seed=42,                # For deterministic results

    # Caching
    cache_enabled=True,     # Enable smart caching
    cache_max_size_mb=500,  # Cache size limit
    cache_max_age_days=30,  # Auto-cleanup old entries

    # Error handling
    fail_fast=False,        # Continue on errors

    # Output
    verbose=True,           # Show detailed progress
    progress_bar=True,      # Show progress bars
)

results = evaluate(dataset, judges=[judge], config=config)
```
Built-in Scorers
Reference-Based
- ExactMatch - Exact string comparison with reference
Text Similarity
- LevenshteinDistance - Normalized edit distance (0-1)
- FuzzyMatch - Fuzzy string similarity via difflib
- BLEU - N-gram precision with brevity penalty
- ROUGE - ROUGE-L (longest common subsequence F1)
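As the list notes, FuzzyMatch builds on `difflib`; the underlying similarity is `SequenceMatcher.ratio()`:

```python
import difflib

def fuzzy_ratio(output: str, reference: str) -> float:
    # ratio() returns 2*M / T, where M is the number of matched
    # characters and T the total length of both strings
    return difflib.SequenceMatcher(None, output, reference).ratio()
```

The ratio is 1.0 for identical strings and degrades gracefully with small edits, which makes it more forgiving than ExactMatch for near-miss outputs.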
Token-Level
- TokenF1 - Token overlap F1 score (precision/recall)
- LengthRatio - Output-to-reference length ratio
Rule-Based
- RegexMatch - Pattern matching with regex
- ContainsKeywords - Check for required keywords
- JSONValid - Validate JSON output
- StartsWith - Check output prefix
- EndsWith - Check output suffix
- LengthCheck - Validate output length bounds
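Rule-based scorers are cheap and deterministic. JSONValid, for example, amounts to attempting a parse — a sketch of the idea, not the library's code:

```python
import json

def json_valid(output: str) -> float:
    # 1.0 if the output parses as JSON, else 0.0
    try:
        json.loads(output)
        return 1.0
    except (json.JSONDecodeError, TypeError):
        return 0.0
```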
Performance Metrics
- Latency - Response time from run.metrics
- Cost - API cost from run.metrics
- TokenCount - Token usage (exact or estimated)
Semantic
- CosineSimilarity - Embedding-based semantic similarity (via litellm)
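Once both texts are embedded (the embedding call goes through litellm), the comparison itself is plain cosine similarity:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1.0 means the output and reference are semantically close even when they share few exact tokens.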
Supported Models
Via LiteLLM, OpenEvalKit supports 100+ models:
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
- Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
- Google: gemini-pro, gemini-1.5-pro, gemini-1.5-flash
- Local (Ollama): llama3, mistral, phi, qwen
- Cohere, Replicate, HuggingFace, and more
Custom Scorers
Create your own scorers:
```python
from openevalkit import Score, evaluate
from openevalkit.scorers.base import Scorer

class SentimentScorer(Scorer):
    name = "sentiment"
    requires_reference = False
    cacheable = True  # Cache expensive computations

    def score(self, run):
        # Your scoring logic here
        sentiment = analyze_sentiment(run.output)  # Your function
        return Score(
            value=sentiment,
            reason=f"Detected sentiment: {sentiment}",
            metadata={"analyzer": "custom"}
        )

# Use it
results = evaluate(dataset, scorers=[SentimentScorer()])
```
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Quick start for contributors:
```bash
# Clone and setup
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit
uv sync --dev

# Run tests
uv run pytest tests/

# Lint code
ruff check .
```
License
MIT License - see LICENSE for details.
Made with love for the LLM and Agent evaluation community
Star us on GitHub if you find OpenEvalKit useful!
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file openevalkit-0.1.6.tar.gz.
File metadata
- Download URL: openevalkit-0.1.6.tar.gz
- Upload date:
- Size: 30.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcd74321ea49c621c76ab9994426171779bfca8a2302b1887ad0bc5b77133a50 |
| MD5 | 61a072a571e7a2a09e4135ae0be58925 |
| BLAKE2b-256 | b9c7bb1c5c39a108825b424583772e7921891924dc0b34f8bd69b0558ba43c3b |
File details
Details for the file openevalkit-0.1.6-py3-none-any.whl.
File metadata
- Download URL: openevalkit-0.1.6-py3-none-any.whl
- Upload date:
- Size: 42.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9a95dbe5b9a388dc397ace61a58cd70f98520fd41b24c2bd6c92ddb47d0a168 |
| MD5 | 5f48ec6c249402fea8f9e35886e2e145 |
| BLAKE2b-256 | 12931b2d4fc12b697d07dc6ac5fdfedd7202626f29e405935fff1f9b753f9023 |