
OpenEvalKit

Universal evaluation framework for LLM systems


OpenEvalKit is a production-grade framework for evaluating LLM systems with traditional metrics, LLM-as-a-judge, and ensemble evaluation.

Features

  • 📊 Traditional Scorers - ExactMatch, Latency, Cost, TokenCount, RegexMatch, JSONValid, ContainsKeywords
  • 🤖 LLM Judges - Use any LLM (OpenAI, Anthropic, Ollama, 100+ models) to evaluate quality
  • 🎯 Ensemble Judges - Combine multiple judges for more reliable evaluation
  • 💾 Smart Caching - Automatic caching with LRU eviction (saves API costs)
  • ⚡ Parallel Execution - Fast evaluation with configurable concurrency
  • 🔧 Flexible - Custom scorers, judges, and rubrics

Installation

pip install openevalkit

Quick Start

Loading Datasets

from openevalkit import Dataset

# From JSONL
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# From CSV
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected",
    metadata_cols=["user_id"],
    metrics_cols=["latency"]
)

# From list
from openevalkit import Run
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="What is 3+3?", output="6", reference="6"),
])

Evaluate with Traditional Scorers

from openevalkit import evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, JSONValid, ContainsKeywords

# Exact match
results = evaluate(dataset, scorers=[ExactMatch()])
print(results.aggregates)
# {'exact_match': 1.0}

# Regex pattern matching
scorer = RegexMatch(pattern=r'\d+')  # Check if output contains numbers
results = evaluate(dataset, scorers=[scorer])

# JSON validation
json_scorer = JSONValid()
results = evaluate(dataset, scorers=[json_scorer])

# Keyword detection
keyword_scorer = ContainsKeywords(keywords=["python", "code"], ignore_case=True)
results = evaluate(dataset, scorers=[keyword_scorer])

Evaluate with LLM Judge

from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Create dataset
dataset = Dataset([
    {"input": "Explain Python", "output": "Python is a programming language..."},
])

# Create rubric
rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

# Create judge
judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

# Evaluate
results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_score': 0.85, 'llm_judge_gpt-4o_helpfulness': 0.9, ...}
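
The rubric weights presumably act as a weighted mean over the per-criterion scores. The exact aggregation isn't documented here, but a minimal sketch under that assumption:

# Hypothetical illustration of how rubric weights might combine per-criterion
# scores (assumed weighted mean; example values, not real judge output)
criterion_scores = {"helpfulness": 0.9, "accuracy": 0.8, "clarity": 0.85}
weights = {"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
overall = sum(weights[c] * s for c, s in criterion_scores.items()) / sum(weights.values())
print(round(overall, 3))
# 0.842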

Ensemble Evaluation (Multiple Judges)

from openevalkit.judges import EnsembleJudge

# Combine multiple judges for more reliable evaluation
ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # or "median", "majority_vote", "unanimous"
    n_jobs=3  # Parallel execution
)

results = evaluate(dataset, judges=[ensemble])

Configuration

from openevalkit import EvalConfig

config = EvalConfig(
    concurrency=10,           # Parallel runs
    cache_enabled=True,       # Cache results (saves API costs)
    cache_max_size_mb=500,    # Cache size limit
    timeout=30.0,             # Timeout per run
    seed=42,                  # Reproducible results
    verbose=True,             # Show progress
)

results = evaluate(dataset, judges=[judge], config=config)

Built-in Scorers

String Matching

  • ExactMatch - Exact string comparison with reference
  • RegexMatch - Pattern matching with regex
  • ContainsKeywords - Check for required keywords

Structure Validation

  • JSONValid - Validate JSON output

Performance Metrics

  • Latency - Response time from run.metrics
  • Cost - API cost from run.metrics
  • TokenCount - Token usage (exact or estimated)
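
A minimal sketch of the performance scorers. Zero-argument construction and the shape of the metrics dict are assumptions here; only the fact that values come from run.metrics is stated above.

from openevalkit import Dataset, Run, evaluate
from openevalkit.scorers import Latency, Cost, TokenCount

# Runs carrying pre-recorded metrics; the metrics keyword and dict keys are
# assumed, so check the docstrings for the exact field names
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4",
        metrics={"latency": 0.42, "cost": 0.0011}),
])

results = evaluate(dataset, scorers=[Latency(), Cost(), TokenCount()])
print(results.aggregates)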

Supported Models

Via LiteLLM, OpenEvalKit supports 100+ models:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
  • Google: gemini-pro, gemini-1.5-pro
  • Ollama: llama3, mistral, phi (local models)
  • Cohere, Replicate, HuggingFace, and more
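
Switching providers only means changing the model string in LLMConfig. A rough sketch, reusing the rubric from the Quick Start (the ollama/ prefix follows LiteLLM's naming convention):

from openevalkit.judges import LLMJudge, LLMConfig

# Same judge logic, different backends
cloud_judge = LLMJudge(llm_config=LLMConfig(model="gpt-4o-mini"), rubric=rubric)
local_judge = LLMJudge(llm_config=LLMConfig(model="ollama/llama3"), rubric=rubric)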

Custom Scorers

from openevalkit.scorers.base import Scorer
from openevalkit import Score

class ContainsWord(Scorer):
    name = "contains_word"
    requires_reference = False
    
    def __init__(self, word: str):
        self.word = word
    
    def score(self, run):
        has_word = self.word.lower() in run.output.lower()
        return Score(
            value=1.0 if has_word else 0.0,
            reason=f"Word '{self.word}' {'found' if has_word else 'not found'}"
        )

results = evaluate(dataset, scorers=[ContainsWord("Python")])

Why OpenEvalKit?

  • Production Ready: Smart caching, parallel execution, error handling
  • Cost Effective: Cache LLM judgments to avoid redundant API calls
  • Flexible: Works with any LLM provider via LiteLLM
  • Reliable: Ensemble judges with configurable aggregation
  • Simple: Clean API, comprehensive documentation

Documentation

Coming soon! For now, see the examples above and the docstrings.

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide.

Source Distribution

openevalkit-0.1.0.tar.gz (24.3 kB)

Built Distribution

openevalkit-0.1.0-py3-none-any.whl (34.4 kB)

File details

Details for the file openevalkit-0.1.0.tar.gz.

File metadata

  • Download URL: openevalkit-0.1.0.tar.gz
  • Upload date:
  • Size: 24.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.0

File hashes

Hashes for openevalkit-0.1.0.tar.gz:

  • SHA256: 08002c84456204e649267e9370927d6266cc44dabea141e24608f623bc52f7b7
  • MD5: 8ee9bf940875677e072683077a432a7e
  • BLAKE2b-256: 9abbf96124de196c78b656a54615eb3bab9ac51b49417706430df0247691b71f

See the pip documentation on hash-checking mode for more details on using hashes.
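
As a quick local sanity check, the published SHA256 digest can be verified with Python's hashlib (a minimal sketch; assumes the downloaded file sits in the current directory):

import hashlib

# Compare the local file's digest against the value published above
with open("openevalkit-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "08002c84456204e649267e9370927d6266cc44dabea141e24608f623bc52f7b7"
print("OK" if digest == expected else "hash mismatch")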

File details

Details for the file openevalkit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: openevalkit-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.0

File hashes

Hashes for openevalkit-0.1.0-py3-none-any.whl:

  • SHA256: 1896dc004d604abe971b1ed0951c2cc5f3f221f187ec92c708d3faf285d57764
  • MD5: e483522715ac1f5a9132ddf46913d02a
  • BLAKE2b-256: 82443d6d921302009a569e561085a56561f7950312578ad731c51d9de9f1a28f

See the pip documentation on hash-checking mode for more details on using hashes.
