
OpenEvalKit

Universal evaluation framework for LLM systems


OpenEvalKit is a production-grade framework for evaluating LLM systems with traditional metrics, LLM-as-a-judge, and ensemble evaluation.

Features

  • 📊 Traditional Scorers - ExactMatch, Latency, Cost, TokenCount, RegexMatch, JSONValid, ContainsKeywords
  • 🤖 LLM Judges - Use any LLM (OpenAI, Anthropic, Ollama, 100+ models) to evaluate quality
  • 🎯 Ensemble Judges - Combine multiple judges for more reliable evaluation
  • 💾 Smart Caching - Automatic caching with LRU eviction (saves API costs)
  • ⚡ Parallel Execution - Fast evaluation with configurable concurrency
  • 🔧 Flexible - Custom scorers, judges, and rubrics

Installation

pip install openevalkit

Quick Start

Loading Datasets

from openevalkit import Dataset

# From JSONL
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# From CSV
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected",
    metadata_cols=["user_id"],
    metrics_cols=["latency"]
)

# From list
from openevalkit import Run
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="What is 3+3?", output="6", reference="6"),
])

Evaluate with Traditional Scorers

from openevalkit.evaluate import evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, JSONValid, ContainsKeywords

# Exact match
results = evaluate(dataset, scorers=[ExactMatch()])
print(results.aggregates)
# {'exact_match': 1.0}

# Regex pattern matching
scorer = RegexMatch(pattern=r'\d+')  # Check if output contains numbers
results = evaluate(dataset, scorers=[scorer])

# JSON validation
json_scorer = JSONValid()
results = evaluate(dataset, scorers=[json_scorer])

# Keyword detection
keyword_scorer = ContainsKeywords(keywords=["python", "code"], ignore_case=True)
results = evaluate(dataset, scorers=[keyword_scorer])
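Conceptually, these scorers reduce to simple string checks. A rough sketch of what each one computes (illustrative only, not the library's actual implementation):

```python
import json
import re

def exact_match(output: str, reference: str) -> float:
    # 1.0 only when the output equals the reference exactly
    return 1.0 if output == reference else 0.0

def regex_match(output: str, pattern: str) -> float:
    # 1.0 when the pattern matches anywhere in the output
    return 1.0 if re.search(pattern, output) else 0.0

def json_valid(output: str) -> float:
    # 1.0 when the output parses as JSON
    try:
        json.loads(output)
        return 1.0
    except ValueError:
        return 0.0

def contains_keywords(output: str, keywords, ignore_case: bool = True) -> float:
    # 1.0 when every keyword appears in the output
    text = output.lower() if ignore_case else output
    kws = [k.lower() if ignore_case else k for k in keywords]
    return 1.0 if all(k in text for k in kws) else 0.0

print(exact_match("4", "4"))                                        # 1.0
print(regex_match("It is 42.", r"\d+"))                             # 1.0
print(json_valid('{"a": 1}'))                                       # 1.0
print(contains_keywords("Python code sample", ["python", "code"]))  # 1.0
```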

Evaluate with LLM Judge

from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Create dataset
dataset = Dataset([
    {"input": "Explain Python", "output": "Python is a programming language..."},
])

# Create rubric
rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

# Create judge
judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

# Evaluate
results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_score': 0.85, 'llm_judge_gpt-4o_helpfulness': 0.9, ...}
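A plausible reading of the weighted rubric is that the combined score is a weighted average over per-criterion scores. The per-criterion numbers below are hypothetical, and the actual aggregation formula is defined by the library:

```python
# Weights from the Rubric above; scores are a hypothetical judge output.
weights = {"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
scores = {"helpfulness": 0.9, "accuracy": 0.8, "clarity": 0.85}

# Weighted average: sum(w_i * s_i) / sum(w_i)
combined = sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
print(round(combined, 3))  # 0.842
```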

Ensemble Evaluation (Multiple Judges)

from openevalkit.judges import EnsembleJudge

# Combine multiple judges for more reliable evaluation
ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # or "median", "majority_vote", "unanimous"
    n_jobs=3  # Parallel execution
)

results = evaluate(dataset, judges=[ensemble])
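The aggregation methods can be sketched as follows, assuming 0-1 scores and a 0.5 pass threshold for the vote-based methods (both assumptions here, not documented behavior):

```python
import statistics

def aggregate(scores, method="average", threshold=0.5):
    if method == "average":
        return sum(scores) / len(scores)
    if method == "median":
        return statistics.median(scores)
    if method == "majority_vote":
        # 1.0 if more than half the judges score above the threshold
        passing = sum(s > threshold for s in scores)
        return 1.0 if passing > len(scores) / 2 else 0.0
    if method == "unanimous":
        # 1.0 only if every judge scores above the threshold
        return 1.0 if all(s > threshold for s in scores) else 0.0
    raise ValueError(f"unknown method: {method}")

scores = [0.9, 0.7, 0.4]
print(aggregate(scores, "average"))        # ≈ 0.667
print(aggregate(scores, "median"))         # 0.7
print(aggregate(scores, "majority_vote"))  # 1.0
print(aggregate(scores, "unanimous"))      # 0.0
```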

Configuration

from openevalkit import EvalConfig

config = EvalConfig(
    concurrency=10,           # Parallel runs
    cache_enabled=True,       # Cache results (saves API costs)
    cache_max_size_mb=500,    # Cache size limit
    timeout=30.0,             # Timeout per run
    seed=42,                  # Reproducible results
    verbose=True,             # Show progress
)

results = evaluate(dataset, judges=[judge], config=config)
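The caching presumably keys judgments by something like (model, run) so repeated evaluations skip redundant API calls. A minimal LRU cache sketch, using an item-count limit rather than the byte limit above (illustrative only; the library's keys and eviction policy may differ):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, max_items: int):
        self.max_items = max_items
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(max_items=2)
cache.put(("gpt-4o", "run-1"), 0.85)
cache.put(("gpt-4o", "run-2"), 0.90)
cache.get(("gpt-4o", "run-1"))         # touch run-1, so run-2 is now oldest
cache.put(("gpt-4o", "run-3"), 0.70)   # exceeds the limit, evicting run-2
print(cache.get(("gpt-4o", "run-2")))  # None
```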

Built-in Scorers

String Matching

  • ExactMatch - Exact string comparison with reference
  • RegexMatch - Pattern matching with regex
  • ContainsKeywords - Check for required keywords

Structure Validation

  • JSONValid - Validate JSON output

Performance Metrics

  • Latency - Response time from run.metrics
  • Cost - API cost from run.metrics
  • TokenCount - Token usage (exact or estimated)
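TokenCount's "estimated" mode likely falls back to a heuristic when no tokenizer is available. A common rule of thumb (an assumption here, not the library's documented formula) is roughly four characters per English token:

```python
def estimate_tokens(text: str) -> int:
    # Rule-of-thumb heuristic: ~4 characters per token, with a floor of 1.
    return max(1, len(text) // 4)

print(estimate_tokens("Python is a programming language."))
```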

Supported Models

Via LiteLLM, supports 100+ models:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
  • Google: gemini-pro, gemini-1.5-pro
  • Ollama: llama3, mistral, phi (local models)
  • Cohere, Replicate, HuggingFace, and more

Custom Scorers

from openevalkit.scorers.base import Scorer
from openevalkit import Score

class ContainsWord(Scorer):
    name = "contains_word"
    requires_reference = False
    
    def __init__(self, word: str):
        self.word = word
    
    def score(self, run):
        has_word = self.word.lower() in run.output.lower()
        return Score(
            value=1.0 if has_word else 0.0,
            reason=f"Word '{self.word}' {'found' if has_word else 'not found'}"
        )

results = evaluate(dataset, scorers=[ContainsWord("Python")])

Why OpenEvalKit?

  • Production Ready: Smart caching, parallel execution, error handling
  • Cost Effective: Cache LLM judgments to avoid redundant API calls
  • Flexible: Works with any LLM provider via LiteLLM
  • Reliable: Ensemble judges with configurable aggregation
  • Simple: Clean API, comprehensive documentation

Documentation

Full documentation is coming soon. For now, see the examples above and the package docstrings.

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.
