OpenEvalKit

Universal evaluation framework for LLM systems

OpenEvalKit is a production-grade framework for evaluating LLM systems with traditional metrics, LLM-as-a-judge, and ensemble evaluation.

Features

  • 📊 Traditional Scorers - ExactMatch, Latency, Cost, TokenCount, RegexMatch, JSONValid, ContainsKeywords
  • 🤖 LLM Judges - Use any LLM (OpenAI, Anthropic, Ollama, 100+ models) to evaluate quality
  • 🎯 Ensemble Judges - Combine multiple judges for more reliable evaluation
  • 💾 Smart Caching - Automatic caching with LRU eviction (saves API costs)
  • ⚡ Parallel Execution - Fast evaluation with configurable concurrency
  • 🔧 Flexible - Custom scorers, judges, and rubrics

Installation

pip install openevalkit

Quick Start

Loading Datasets

from openevalkit import Dataset

# From JSONL
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# From CSV
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected",
    metadata_cols=["user_id"],
    metrics_cols=["latency"]
)

# From list
from openevalkit import Run
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="What is 3+3?", output="6", reference="6"),
])

Evaluate with Traditional Scorers

from openevalkit.evaluate import evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, JSONValid, ContainsKeywords

# Exact match
results = evaluate(dataset, scorers=[ExactMatch()])
print(results.aggregates)
# {'exact_match': 1.0}

# Regex pattern matching
scorer = RegexMatch(pattern=r'\d+')  # Check if output contains numbers
results = evaluate(dataset, scorers=[scorer])

# JSON validation
json_scorer = JSONValid()
results = evaluate(dataset, scorers=[json_scorer])

# Keyword detection
keyword_scorer = ContainsKeywords(keywords=["python", "code"], ignore_case=True)
results = evaluate(dataset, scorers=[keyword_scorer])

Evaluate with LLM Judge

from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Create dataset
dataset = Dataset([
    {"input": "Explain Python", "output": "Python is a programming language..."},
])

# Create rubric
rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

# Create judge
judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

# Evaluate
results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_score': 0.85, 'llm_judge_gpt-4o_helpfulness': 0.9, ...}
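
The overall score presumably combines the per-criterion scores using the rubric's weights. A minimal sketch of a weighted average with the weights above (the per-criterion scores and the exact aggregation formula are assumptions for illustration, not necessarily OpenEvalKit's internals):

```python
# Hypothetical per-criterion scores on the rubric's 0-1 scale.
weights = {"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
scores = {"helpfulness": 0.9, "accuracy": 0.8, "clarity": 0.9}

# Weighted average: accuracy (weight 3.0) pulls the overall score
# down more than clarity (weight 1.0) pulls it up.
overall = sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
print(round(overall, 2))  # 0.85
```
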

Ensemble Evaluation (Multiple Judges)

from openevalkit.judges import EnsembleJudge

# Combine multiple judges for more reliable evaluation
ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # or "median", "majority_vote", "unanimous"
    n_jobs=3  # Parallel execution
)

results = evaluate(dataset, judges=[ensemble])
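
Each aggregation method can be read as a simple reduction over the individual judges' scores. A sketch of plausible semantics for the four methods, assuming each judge emits a score in [0, 1] (the `aggregate` helper and the 0.5 pass threshold are illustrative assumptions, not the library's API):

```python
from statistics import mean, median

def aggregate(scores, method="average", threshold=0.5):
    """Illustrative reductions for the ensemble methods named in the
    README; OpenEvalKit's exact semantics may differ."""
    if method == "average":
        return mean(scores)
    if method == "median":
        return median(scores)
    if method == "majority_vote":
        # Pass if more than half the judges score at or above threshold.
        votes = [s >= threshold for s in scores]
        return 1.0 if sum(votes) > len(votes) / 2 else 0.0
    if method == "unanimous":
        # Pass only if every judge scores at or above threshold.
        return 1.0 if all(s >= threshold for s in scores) else 0.0
    raise ValueError(f"unknown method: {method}")

judge_scores = [0.9, 0.7, 0.4]
print(aggregate(judge_scores, "majority_vote"))  # 1.0 (two of three pass)
print(aggregate(judge_scores, "unanimous"))      # 0.0 (one judge fails)
```

The trade-off: `average`/`median` preserve graded scores, while `majority_vote`/`unanimous` collapse to a binary pass/fail, with `unanimous` being the strictest.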

Configuration

from openevalkit import EvalConfig

config = EvalConfig(
    concurrency=10,           # Parallel runs
    cache_enabled=True,       # Cache results (saves API costs)
    cache_max_size_mb=500,    # Cache size limit
    timeout=30.0,             # Timeout per run
    seed=42,                  # Reproducible results
    verbose=True,             # Show progress
)

results = evaluate(dataset, judges=[judge], config=config)
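
Conceptually, the cache keys a judgment by something like (model, input, output), so re-running an evaluation skips the API call for unchanged runs, and LRU eviction bounds memory. A minimal sketch of that mechanism (the key scheme, `LRUCache` class, and `fake_judge` helper are assumptions for illustration, not OpenEvalKit internals):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least recently used entry once
    max_entries is exceeded."""
    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get_or_compute(self, key, compute):
        if key in self._store:
            self._store.move_to_end(key)   # mark as recently used
            return self._store[key]
        value = compute()                  # cache miss: pay for the API call
        self._store[key] = value
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return value

calls = []
def fake_judge():
    calls.append(1)          # stands in for an LLM API call
    return 0.85

cache = LRUCache(max_entries=2)
key = ("gpt-4o", "Explain Python", "Python is a programming language...")
cache.get_or_compute(key, fake_judge)
cache.get_or_compute(key, fake_judge)  # served from cache, no second call
print(len(calls))  # 1
```
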

Built-in Scorers

String Matching

  • ExactMatch - Exact string comparison with reference
  • RegexMatch - Pattern matching with regex
  • ContainsKeywords - Check for required keywords

Structure Validation

  • JSONValid - Validate JSON output

Performance Metrics

  • Latency - Response time from run.metrics
  • Cost - API cost from run.metrics
  • TokenCount - Token usage (exact or estimated)
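
When the provider doesn't report exact token counts in `run.metrics`, token counters commonly fall back to a heuristic. A sketch using the rough "about 4 characters per token" rule of thumb for English text (both the heuristic and the `token_count` helper are assumptions for illustration, not TokenCount's actual implementation):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, round(len(text) / 4))

def token_count(run_metrics: dict, output: str) -> int:
    # Prefer an exact count recorded in run metrics; otherwise estimate.
    if "tokens" in run_metrics:
        return run_metrics["tokens"]
    return estimate_tokens(output)

print(token_count({"tokens": 42}, "ignored"))   # 42 (exact, from metrics)
print(token_count({}, "Python is a language"))  # 5 (estimated: 20 chars / 4)
```
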

Supported Models

Via LiteLLM, OpenEvalKit supports 100+ models:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
  • Google: gemini-pro, gemini-1.5-pro
  • Ollama: llama3, mistral, phi (local models)
  • Cohere, Replicate, HuggingFace, and more

Custom Scorers

from openevalkit.scorers.base import Scorer
from openevalkit import Score

class ContainsWord(Scorer):
    name = "contains_word"
    requires_reference = False
    
    def __init__(self, word: str):
        self.word = word
    
    def score(self, run):
        has_word = self.word.lower() in run.output.lower()
        return Score(
            value=1.0 if has_word else 0.0,
            reason=f"Word '{self.word}' {'found' if has_word else 'not found'}"
        )

results = evaluate(dataset, scorers=[ContainsWord("Python")])

Why OpenEvalKit?

  • Production Ready: Smart caching, parallel execution, error handling
  • Cost Effective: Cache LLM judgments to avoid redundant API calls
  • Flexible: Works with any LLM provider via LiteLLM
  • Reliable: Ensemble judges with configurable aggregation
  • Simple: Clean API, comprehensive documentation

Documentation

Coming soon! For now, see the examples above and the docstrings.

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

