# OpenEvalKit

Universal evaluation framework for LLM systems.

OpenEvalKit is a production-grade framework for evaluating LLM systems with traditional metrics, LLM-as-a-judge, and ensemble evaluation.

## Features
- 📊 Traditional Scorers - ExactMatch, Latency, Cost, TokenCount, RegexMatch, JSONValid, ContainsKeywords
- 🤖 LLM Judges - Use any LLM (OpenAI, Anthropic, Ollama, 100+ models) to evaluate quality
- 🎯 Ensemble Judges - Combine multiple judges for more reliable evaluation
- 💾 Smart Caching - Automatic caching with LRU eviction (saves API costs)
- ⚡ Parallel Execution - Fast evaluation with configurable concurrency
- 🔧 Flexible - Custom scorers, judges, and rubrics
## Installation

```bash
pip install openevalkit
```
## Quick Start

### Loading Datasets

```python
from openevalkit import Dataset

# From JSONL
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected",
)

# From CSV
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected",
    metadata_cols=["user_id"],
    metrics_cols=["latency"],
)

# From a list of Run objects
from openevalkit import Run

dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="What is 3+3?", output="6", reference="6"),
])
```
### Evaluate with Traditional Scorers

```python
from openevalkit.evaluate import evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, JSONValid, ContainsKeywords

# Exact match
results = evaluate(dataset, scorers=[ExactMatch()])
print(results.aggregates)
# {'exact_match': 1.0}

# Regex pattern matching
scorer = RegexMatch(pattern=r'\d+')  # Check if output contains numbers
results = evaluate(dataset, scorers=[scorer])

# JSON validation
json_scorer = JSONValid()
results = evaluate(dataset, scorers=[json_scorer])

# Keyword detection
keyword_scorer = ContainsKeywords(keywords=["python", "code"], ignore_case=True)
results = evaluate(dataset, scorers=[keyword_scorer])
```
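The aggregate reported for a scorer like `ExactMatch` is presumably the mean of the per-run scores. A minimal sketch of that idea in plain Python (illustrative data, not the library's internals):

```python
# Illustrative (output, reference) pairs for three runs
runs = [("4", "4"), ("6", "6"), ("7", "8")]

# Per-run exact-match score: 1.0 on a match, 0.0 otherwise
scores = [1.0 if out == ref else 0.0 for out, ref in runs]

# The aggregate is the mean across runs
exact_match = sum(scores) / len(scores)
print(round(exact_match, 3))  # 0.667
```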
### Evaluate with LLM Judge

```python
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Create dataset
dataset = Dataset([
    {"input": "Explain Python", "output": "Python is a programming language..."},
])

# Create rubric
rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0},
)

# Create judge
judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric,
)

# Evaluate
results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_score': 0.85, 'llm_judge_gpt-4o_helpfulness': 0.9, ...}
```
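With per-criterion weights, a natural way to combine criterion scores into the overall score is a weighted average. This is an assumption about how `Rubric` weights likely combine, not confirmed library behavior:

```python
# Hypothetical per-criterion scores returned by a judge (0-1 scale)
criterion_scores = {"helpfulness": 0.9, "accuracy": 0.8, "clarity": 1.0}
weights = {"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}

# Weighted average: sum(w_i * s_i) / sum(w_i)
total_weight = sum(weights.values())
overall = sum(weights[c] * s for c, s in criterion_scores.items()) / total_weight
print(round(overall, 3))  # 0.867
```

With these weights, accuracy moves the overall score three times as much as clarity does.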
### Ensemble Evaluation (Multiple Judges)

```python
from openevalkit.judges import EnsembleJudge

# Combine multiple judges for more reliable evaluation
ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # or "median", "majority_vote", "unanimous"
    n_jobs=3,  # Parallel execution
)

results = evaluate(dataset, judges=[ensemble])
```
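The aggregation methods plausibly work as follows: `average` and `median` combine the raw judge scores, while `majority_vote` and `unanimous` make more sense over per-judge pass/fail decisions (the thresholding step below is an assumption, not confirmed library behavior):

```python
from statistics import mean, median

# Hypothetical overall scores from three judges for one run
judge_scores = [0.9, 0.7, 0.8]

# "average" and "median" aggregate the raw scores
avg = mean(judge_scores)    # ≈ 0.8
med = median(judge_scores)  # 0.8

# "majority_vote" / "unanimous" over pass/fail decisions,
# e.g. after thresholding each judge's score
threshold = 0.75
votes = [s >= threshold for s in judge_scores]
majority = sum(votes) > len(votes) / 2  # True (2 of 3 judges pass)
unanimous = all(votes)                  # False (one judge scored 0.7)
```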
## Configuration

```python
from openevalkit import EvalConfig

config = EvalConfig(
    concurrency=10,         # Parallel runs
    cache_enabled=True,     # Cache results (saves API costs)
    cache_max_size_mb=500,  # Cache size limit
    timeout=30.0,           # Timeout per run
    seed=42,                # Reproducible results
    verbose=True,           # Show progress
)

results = evaluate(dataset, judges=[judge], config=config)
```
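The "smart caching with LRU eviction" presumably keys cached judgments by the judge inputs and evicts the least-recently-used entries when the cache exceeds its size budget. A toy sketch of the LRU idea in plain Python (not the library's implementation; cache keys and values here are illustrative):

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache: evicts the least-recently-used entry when full."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the oldest entry


cache = LRUCache(max_entries=2)
cache.put(("gpt-4o", "What is 2+2?"), 0.9)
cache.put(("gpt-4o", "Explain Python"), 0.8)
cache.get(("gpt-4o", "What is 2+2?"))     # touch -> now most recent
cache.put(("gpt-4o", "New prompt"), 0.7)  # evicts "Explain Python"
```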
## Built-in Scorers

### String Matching

- `ExactMatch` - Exact string comparison with the reference
- `RegexMatch` - Pattern matching with a regex
- `ContainsKeywords` - Check for required keywords

### Structure Validation

- `JSONValid` - Validate JSON output

### Performance Metrics

- `Latency` - Response time from `run.metrics`
- `Cost` - API cost from `run.metrics`
- `TokenCount` - Token usage (exact or estimated)
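The "estimated" path in `TokenCount` likely falls back to a heuristic when exact usage is not recorded in `run.metrics`. A common rule of thumb for English text is roughly 4 characters per token; this sketch is an assumption, not the library's exact heuristic:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


print(estimate_tokens("Python is a programming language."))  # 8
```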
## Supported Models

Via LiteLLM, OpenEvalKit supports 100+ models:

- OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
- Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
- Google: gemini-pro, gemini-1.5-pro
- Ollama: llama3, mistral, phi (local models)
- Cohere, Replicate, HuggingFace, and more
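LiteLLM typically addresses non-OpenAI providers with provider-prefixed model strings (e.g. `ollama/llama3`). Assuming `LLMConfig` forwards its model string to LiteLLM unchanged, a judge backed by a local Ollama model might look like this (the prefix convention is LiteLLM's; the pass-through is an assumption):

```python
# Assumes LLMConfig forwards the model string to LiteLLM unchanged
local_judge = LLMJudge(
    llm_config=LLMConfig(model="ollama/llama3"),  # LiteLLM's Ollama prefix
    rubric=rubric,
)
```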
## Custom Scorers

```python
from openevalkit.scorers.base import Scorer
from openevalkit import Score


class ContainsWord(Scorer):
    name = "contains_word"
    requires_reference = False

    def __init__(self, word: str):
        self.word = word

    def score(self, run):
        has_word = self.word.lower() in run.output.lower()
        return Score(
            value=1.0 if has_word else 0.0,
            reason=f"Word '{self.word}' {'found' if has_word else 'not found'}",
        )


results = evaluate(dataset, scorers=[ContainsWord("Python")])
```
## Why OpenEvalKit?

- Production Ready: Smart caching, parallel execution, error handling
- Cost Effective: Cache LLM judgments to avoid redundant API calls
- Flexible: Works with any LLM provider via LiteLLM
- Reliable: Ensemble judges with configurable aggregation
- Simple: Clean API, comprehensive documentation

## Documentation

Coming soon! For now, see the examples above and the docstrings.

## License

MIT

## Contributing

Contributions welcome! Please open an issue or PR.