
OpenEvalKit

Universal evaluation framework for LLM systems


OpenEvalKit is a production-grade framework for evaluating LLM systems with traditional metrics, LLM-as-a-judge, and ensemble evaluation.

Features

  • 📊 Traditional Scorers - ExactMatch, Latency, Cost, TokenCount, RegexMatch, JSONValid, ContainsKeywords
  • 🤖 LLM Judges - Use any LLM (OpenAI, Anthropic, Ollama, 100+ models) to evaluate quality
  • 🎯 Ensemble Judges - Combine multiple judges for more reliable evaluation
  • 💾 Smart Caching - Automatic caching with LRU eviction (saves API costs)
  • ⚡ Parallel Execution - Fast evaluation with configurable concurrency
  • 🔧 Flexible - Custom scorers, judges, and rubrics

Installation

pip install openevalkit

Quick Start

Loading Datasets

from openevalkit import Dataset

# From JSONL
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# From CSV
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected",
    metadata_cols=["user_id"],
    metrics_cols=["latency"]
)

# From list
from openevalkit import Run
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="What is 3+3?", output="6", reference="6"),
])

Evaluate with Traditional Scorers

from openevalkit import evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, JSONValid, ContainsKeywords

# Exact match
results = evaluate(dataset, scorers=[ExactMatch()])
print(results.aggregates)
# {'exact_match': 1.0}

# Regex pattern matching
scorer = RegexMatch(pattern=r'\d+')  # Check if output contains numbers
results = evaluate(dataset, scorers=[scorer])

# JSON validation
json_scorer = JSONValid()
results = evaluate(dataset, scorers=[json_scorer])

# Keyword detection
keyword_scorer = ContainsKeywords(keywords=["python", "code"], ignore_case=True)
results = evaluate(dataset, scorers=[keyword_scorer])

Evaluate with LLM Judge

from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Create dataset
dataset = Dataset([
    {"input": "Explain Python", "output": "Python is a programming language..."},
])

# Create rubric
rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

# Create judge
judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

# Evaluate
results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_score': 0.85, 'llm_judge_gpt-4o_helpfulness': 0.9, ...}
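
The rubric weights presumably act as a weighted mean over the per-criterion scores. The exact aggregation isn't documented here, but a minimal sketch under that assumption:

# Hypothetical illustration of how rubric weights might combine per-criterion
# scores (assumed weighted mean; example values, not real judge output)
criterion_scores = {"helpfulness": 0.9, "accuracy": 0.8, "clarity": 0.85}
weights = {"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
overall = sum(weights[c] * s for c, s in criterion_scores.items()) / sum(weights.values())
print(round(overall, 3))
# 0.842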

Ensemble Evaluation (Multiple Judges)

from openevalkit.judges import EnsembleJudge

# Combine multiple judges for more reliable evaluation
ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # or "median", "majority_vote", "unanimous"
    n_jobs=3  # Parallel execution
)

results = evaluate(dataset, judges=[ensemble])

Configuration

from openevalkit import EvalConfig

config = EvalConfig(
    concurrency=10,           # Parallel runs
    cache_enabled=True,       # Cache results (saves API costs)
    cache_max_size_mb=500,    # Cache size limit
    timeout=30.0,             # Timeout per run
    seed=42,                  # Reproducible results
    verbose=True,             # Show progress
)

results = evaluate(dataset, judges=[judge], config=config)

Built-in Scorers

String Matching

  • ExactMatch - Exact string comparison with reference
  • RegexMatch - Pattern matching with regex
  • ContainsKeywords - Check for required keywords

Structure Validation

  • JSONValid - Validate JSON output

Performance Metrics

  • Latency - Response time from run.metrics
  • Cost - API cost from run.metrics
  • TokenCount - Token usage (exact or estimated)
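
A minimal sketch of the performance scorers. Zero-argument construction and the shape of the metrics dict are assumptions here; only the fact that values come from run.metrics is stated above.

from openevalkit import Dataset, Run, evaluate
from openevalkit.scorers import Latency, Cost, TokenCount

# Runs carrying pre-recorded metrics; the metrics keyword and dict keys are
# assumed, so check the docstrings for the exact field names
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4",
        metrics={"latency": 0.42, "cost": 0.0011}),
])

results = evaluate(dataset, scorers=[Latency(), Cost(), TokenCount()])
print(results.aggregates)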

Supported Models

Via LiteLLM, OpenEvalKit supports 100+ models:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
  • Google: gemini-pro, gemini-1.5-pro
  • Ollama: llama3, mistral, phi (local models)
  • Cohere, Replicate, HuggingFace, and more
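
Switching providers only means changing the model string in LLMConfig. A rough sketch, reusing the rubric from the Quick Start (the ollama/ prefix follows LiteLLM's naming convention):

from openevalkit.judges import LLMJudge, LLMConfig

# Same judge logic, different backends
cloud_judge = LLMJudge(llm_config=LLMConfig(model="gpt-4o-mini"), rubric=rubric)
local_judge = LLMJudge(llm_config=LLMConfig(model="ollama/llama3"), rubric=rubric)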

Custom Scorers

from openevalkit.scorers.base import Scorer
from openevalkit import Score

class ContainsWord(Scorer):
    name = "contains_word"
    requires_reference = False
    
    def __init__(self, word: str):
        self.word = word
    
    def score(self, run):
        has_word = self.word.lower() in run.output.lower()
        return Score(
            value=1.0 if has_word else 0.0,
            reason=f"Word '{self.word}' {'found' if has_word else 'not found'}"
        )

results = evaluate(dataset, scorers=[ContainsWord("Python")])

Why OpenEvalKit?

  • Production Ready: Smart caching, parallel execution, error handling
  • Cost Effective: Cache LLM judgments to avoid redundant API calls
  • Flexible: Works with any LLM provider via LiteLLM
  • Reliable: Ensemble judges with configurable aggregation
  • Simple: Clean API, comprehensive documentation

Documentation

Coming soon! For now, see the examples above and the docstrings.

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide.

Source Distribution

openevalkit-0.1.0.tar.gz (24.3 kB)

Built Distribution

openevalkit-0.1.0-py3-none-any.whl (34.4 kB)

File details

Details for the file openevalkit-0.1.0.tar.gz.

File metadata

  • Download URL: openevalkit-0.1.0.tar.gz
  • Upload date:
  • Size: 24.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.0

File hashes

Hashes for openevalkit-0.1.0.tar.gz:

  • SHA256: 08002c84456204e649267e9370927d6266cc44dabea141e24608f623bc52f7b7
  • MD5: 8ee9bf940875677e072683077a432a7e
  • BLAKE2b-256: 9abbf96124de196c78b656a54615eb3bab9ac51b49417706430df0247691b71f

See the pip documentation on hash-checking mode for more details on using hashes.
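
As a quick local sanity check, the published SHA256 digest can be verified with Python's hashlib (a minimal sketch; assumes the downloaded file sits in the current directory):

import hashlib

# Compare the local file's digest against the value published above
with open("openevalkit-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "08002c84456204e649267e9370927d6266cc44dabea141e24608f623bc52f7b7"
print("OK" if digest == expected else "hash mismatch")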

File details

Details for the file openevalkit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: openevalkit-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.0

File hashes

Hashes for openevalkit-0.1.0-py3-none-any.whl:

  • SHA256: 1896dc004d604abe971b1ed0951c2cc5f3f221f187ec92c708d3faf285d57764
  • MD5: e483522715ac1f5a9132ddf46913d02a
  • BLAKE2b-256: 82443d6d921302009a569e561085a56561f7950312578ad731c51d9de9f1a28f

See the pip documentation on hash-checking mode for more details on using hashes.
