Open evaluation kit for LLM systems

Project description

OpenEvalKit

Production-grade Python framework for evaluating LLM systems with traditional scorers, LLM judges (OpenAI, Anthropic, Ollama, 100+ models via LiteLLM), ensemble aggregation, and smart caching for cost-effective testing.

Why OpenEvalKit?

  • Production Ready: Smart caching with LRU eviction, parallel execution, comprehensive error handling
  • Cost Effective: Intelligent caching avoids redundant LLM API calls, saving you money
  • Flexible Model Support: Works with 100+ models via LiteLLM - OpenAI, Anthropic, Google, local models via Ollama
  • Reliable Evaluation: Ensemble judges with configurable aggregation methods (average, median, majority vote, unanimous)
  • Developer Friendly: Clean API, extensive documentation, comprehensive type hints
  • Battle Tested: Comprehensive test suite, proven in production environments

Quick Start

Evaluate your LLM outputs in just a few lines:

from openevalkit import Dataset, Run, evaluate
from openevalkit.scorers import ExactMatch, BLEU, TokenF1

# Create your dataset
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="Capital of France?", output="Paris", reference="Paris"),
    Run(id="3", input="Translate hello", output="hola amigo", reference="hola"),
])

# Evaluate with multiple scorers
results = evaluate(
    dataset,
    scorers=[ExactMatch(), BLEU(max_n=2), TokenF1()]
)
print(results.aggregates)
# {'exact_match': 0.6667, 'bleu_mean': 0.8165, 'token_f1_mean': 0.8889, ...}

With LLM Judges

from openevalkit import Dataset, Run, evaluate
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

dataset = Dataset([
    Run(id="1", input="Explain Python", output="Python is a high-level programming language..."),
    Run(id="2", input="What is AI?", output="AI stands for Artificial Intelligence..."),
])

rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_overall_mean': 0.85, ...}

Installation

From PyPI

pip install openevalkit

Recommended: Use a virtual environment to avoid dependency conflicts:

python -m venv openevalkit_env
source openevalkit_env/bin/activate  # On Windows: openevalkit_env\Scripts\activate
pip install openevalkit

From Source

OpenEvalKit uses uv for fast dependency management:

# Clone the repository
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies (includes dev dependencies)
uv sync --dev

Traditional pip installation:

git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit
pip install -e .

Features

  • 17 Built-in Scorers - Reference-based, text similarity, token-level, rule-based, semantic, and performance metrics
  • LLM Judges - Evaluate quality with any LLM (100+ models supported)
  • Ensemble Judges - Combine multiple judges for reliable evaluation
  • Smart Caching - Automatic result caching with LRU eviction
  • Parallel Execution - Fast evaluation with configurable concurrency
  • Flexible Data Loading - JSONL, CSV, or in-memory datasets
  • Comprehensive Configuration - Timeouts, retries, seed control, progress bars

Examples

Traditional Scorers

from openevalkit import Dataset, evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, ContainsKeywords, BLEU, TokenF1

# Create dataset
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# Evaluate with multiple scorers
results = evaluate(
    dataset,
    scorers=[
        ExactMatch(),
        RegexMatch(pattern=r'\d+'),  # Contains numbers
        ContainsKeywords(keywords=["python", "code"], ignore_case=True),
        BLEU(),                      # N-gram precision
        TokenF1(),                   # Token-level F1
    ]
)

print(results.aggregates)
# {'exact_match': 0.85, 'bleu_mean': 0.72, 'token_f1_mean': 0.91, ...}

LLM Judges

from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Define what to evaluate
rubric = Rubric(
    criteria=["correctness", "clarity", "completeness"],
    scale="0-1",
    criteria_descriptions={
        "correctness": "Factually accurate with no errors",
        "clarity": "Easy to understand and well-structured",
        "completeness": "Addresses all aspects of the question"
    },
    weights={"correctness": 3.0, "clarity": 1.5, "completeness": 1.5}
)

# Use OpenAI
judge_gpt = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o", temperature=0.0),
    rubric=rubric
)

# Use Anthropic
judge_claude = LLMJudge(
    llm_config=LLMConfig(model="claude-3-5-sonnet-20241022"),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge_gpt, judge_claude])

Ensemble Evaluation

Combine multiple judges for more reliable scores:

from openevalkit.judges import EnsembleJudge

ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # Options: "average", "median", "majority_vote", "unanimous"
    min_agreement=0.7,  # Warn if judges disagree
    n_jobs=3  # Parallel evaluation
)

results = evaluate(dataset, judges=[ensemble])

Using Local Models (Ollama)

Run evaluations completely offline with local models:

# First, install and start Ollama:
# curl -fsSL https://ollama.com/install.sh | sh
# ollama pull llama3
# ollama serve

judge = LLMJudge(
    llm_config=LLMConfig(
        model="ollama/llama3",
        api_base="http://localhost:11434"
    ),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge])

Loading Data

From JSONL:

dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected",
    metadata_fields=["user_id"],
    metrics_fields=["latency"]
)
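
Each line of the JSONL file is one JSON object whose keys match the field names passed above. A minimal sketch of producing such a file (the record shape here is an assumption inferred from the arguments shown, not a fixed schema):

import json

# Hypothetical records whose keys match the field mapping above; adjust to your data.
records = [
    {"question": "What is 2+2?", "answer": "4", "expected": "4", "user_id": "u-001", "latency": 0.42},
    {"question": "Capital of France?", "answer": "Paris", "expected": "Paris", "user_id": "u-002", "latency": 0.37},
]

with open("data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")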

From CSV:

dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected"
)

From code:

from openevalkit import Dataset, Run

dataset = Dataset([
    Run(id="1", input="Q1", output="A1", reference="A1"),
    Run(id="2", input="Q2", output="A2", reference="A2"),
])

Configuration

from openevalkit import EvalConfig

config = EvalConfig(
    # Execution
    concurrency=10,           # Parallel runs
    timeout=30.0,             # Timeout per run (seconds)
    
    # Reproducibility
    seed=42,                  # For deterministic results
    
    # Caching
    cache_enabled=True,       # Enable smart caching
    cache_max_size_mb=500,    # Cache size limit
    cache_max_age_days=30,    # Auto-cleanup old entries
    
    # Error handling
    fail_fast=False,          # Continue on errors
    
    # Output
    verbose=True,             # Show detailed progress
    progress_bar=True,        # Show progress bars
)

results = evaluate(dataset, judges=[judge], config=config)

Built-in Scorers

Reference-Based

  • ExactMatch - Exact string comparison with reference

Text Similarity

  • LevenshteinDistance - Normalized edit distance (0-1)
  • FuzzyMatch - Fuzzy string similarity via difflib
  • BLEU - N-gram precision with brevity penalty
  • ROUGE - ROUGE-L (longest common subsequence F1)

Token-Level

  • TokenF1 - Token overlap F1 score (precision/recall)
  • LengthRatio - Output-to-reference length ratio

Rule-Based

  • RegexMatch - Pattern matching with regex
  • ContainsKeywords - Check for required keywords
  • JSONValid - Validate JSON output
  • StartsWith - Check output prefix
  • EndsWith - Check output suffix
  • LengthCheck - Validate output length bounds

Performance Metrics

  • Latency - Response time from run.metrics
  • Cost - API cost from run.metrics
  • TokenCount - Token usage (exact or estimated)

Semantic

  • CosineSimilarity - Embedding-based semantic similarity (via litellm)
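
Scorers from different categories can be combined in a single evaluate() call. Below is a short sketch pairing a rule-based scorer with the performance scorers; it assumes Latency and Cost take no constructor arguments and read from a metrics dict attached to each Run (the metrics keyword and key names are illustrative, not confirmed):

from openevalkit import Dataset, Run, evaluate
from openevalkit.scorers import JSONValid, Latency, Cost

dataset = Dataset([
    # Assumes Run accepts a metrics dict with the keys the performance scorers expect.
    Run(
        id="1",
        input="Return the user as JSON",
        output='{"name": "Ada", "role": "admin"}',
        metrics={"latency": 0.42, "cost": 0.0009},
    ),
])

results = evaluate(dataset, scorers=[JSONValid(), Latency(), Cost()])
print(results.aggregates)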

Supported Models

Via LiteLLM, OpenEvalKit supports 100+ models:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
  • Google: gemini-pro, gemini-1.5-pro, gemini-1.5-flash
  • Local (Ollama): llama3, mistral, phi, qwen
  • Cohere, Replicate, HuggingFace, and more
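
Switching providers is usually just a matter of changing the model string in LLMConfig. A sketch for a Gemini-backed judge, reusing the rubric defined earlier; the exact model string depends on your LiteLLM credential setup, so treat the provider prefix here as an assumption:

# Hypothetical Gemini judge; the "gemini/" prefix follows LiteLLM's
# provider-routing convention and may need adjusting for your setup.
judge_gemini = LLMJudge(
    llm_config=LLMConfig(model="gemini/gemini-1.5-pro", temperature=0.0),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge_gemini])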

Custom Scorers

Create your own scorers:

from openevalkit.scorers.base import Scorer
from openevalkit import Score

class SentimentScorer(Scorer):
    name = "sentiment"
    requires_reference = False
    cacheable = True  # Cache expensive computations
    
    def score(self, run):
        # Your scoring logic here
        sentiment = analyze_sentiment(run.output)  # Your function
        return Score(
            value=sentiment,
            reason=f"Detected sentiment: {sentiment}",
            metadata={"analyzer": "custom"}
        )

# Use it
results = evaluate(dataset, scorers=[SentimentScorer()])

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Quick start for contributors:

# Clone and setup
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit
uv sync --dev

# Run tests
uv run pytest tests/

# Lint code
uv run ruff check .

License

MIT License - see LICENSE for details.


Made with love for the LLM and Agent evaluation community

Star us on GitHub if you find OpenEvalKit useful!

Download files

Download the file for your platform.

Source Distribution

openevalkit-0.1.7.tar.gz (30.8 kB)

Built Distribution

openevalkit-0.1.7-py3-none-any.whl (42.6 kB)

File details

Details for the file openevalkit-0.1.7.tar.gz.

File metadata

  • Download URL: openevalkit-0.1.7.tar.gz
  • Upload date:
  • Size: 30.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.0

File hashes

Hashes for openevalkit-0.1.7.tar.gz

  • SHA256: c710ece85aff83856700be6f788cb95bc532fec7afd1203b3bc1e0b12ac76491
  • MD5: 5c8e994eb73ad2d3edc0db63e55709b4
  • BLAKE2b-256: ce324a6687c712fbf8ef6feb364d3c9f17def862c3840f9598776116b2c9ee71

File details

Details for the file openevalkit-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: openevalkit-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 42.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.0

File hashes

Hashes for openevalkit-0.1.7-py3-none-any.whl

  • SHA256: 97a002042b932dad399f07f75ee4da2a72b439c1ee64ed7c5e62682c8060072d
  • MD5: 420d7683cc0460722c4172bf14f0cabd
  • BLAKE2b-256: fd2ccadd4169b45e540b7556af208c91f2e1d7148cf2dc2900faec4bae5864c5
