promptmachine-eval

LLM Evaluation Framework
ELO ratings • Arena battles • Benchmark testing • Cost tracking

Documentation • Live Leaderboard • Arena • Frame.dev


Overview

promptmachine-eval is a Python toolkit for evaluating and comparing Large Language Models. Built by Frame.dev as part of PromptMachine.

Key Features

  • 🏆 ELO Rating System — Chess-style ratings for fair LLM comparisons
  • ⚔️ Arena Battles — Head-to-head comparisons with LLM-as-judge
  • 📊 Benchmarks — Run standard evals (MMLU, GSM8K, HumanEval)
  • 🎯 Smart Matchmaking — Monte Carlo sampling for informative pairings
  • 💰 Cost Tracking — Real-time token counting and spend estimation
  • 📈 Reports — Generate Markdown evaluation reports

Installation

pip install promptmachine-eval

For development:

pip install "promptmachine-eval[dev]"

Quick Start

CLI Usage

# Set your API keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...

# Test a prompt across models
pm-eval test "Explain quantum computing simply" \
  --models gpt-4o-mini,claude-3-5-haiku

# Run a head-to-head battle
pm-eval battle "Write a haiku about coding" \
  -a gpt-4o -b claude-3-5-sonnet

# Estimate costs before running
pm-eval cost "Your long prompt..." \
  --models gpt-4o,gpt-4o-mini,claude-3-5-sonnet

# List all supported models and pricing
pm-eval models

Python API

import asyncio
from promptmachine_eval import EloCalculator, BattleRunner, PromptTester

# --- ELO Calculations ---
elo = EloCalculator()

# Calculate rating changes after a battle
new_a, new_b = elo.update_ratings(
    rating_a=1200,
    rating_b=1000,
    score_a=1.0  # A wins
)
print(f"New ratings: A={new_a:.0f}, B={new_b:.0f}")

# --- Run Arena Battle ---
runner = BattleRunner(
    openai_api_key="sk-...",
    anthropic_api_key="sk-ant-..."
)

result = asyncio.run(runner.battle(
    prompt="Write a function to reverse a linked list",
    model_a="gpt-4o",
    model_b="claude-3-5-sonnet",
    judge_model="gpt-4o-mini"
))

print(f"Winner: {result.winner}")
print(f"Reasoning: {result.judgement.reasoning}")
print(f"Cost: ${result.total_cost:.4f}")

# --- Test Multiple Models ---
tester = PromptTester(openai_api_key="sk-...")

results = asyncio.run(tester.test(
    prompt="Explain recursion to a beginner",
    models=["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
))

for r in results:
    print(f"{r.model}: {r.latency_ms}ms, ${r.cost:.4f}")

Matchmaking

Select optimal battle pairings using Monte Carlo simulation:

from promptmachine_eval import MatchmakingService, ModelInfo

service = MatchmakingService()

models = [
    ModelInfo(id="gpt4o", rating=1200, sd=100, battles_count=50),
    ModelInfo(id="claude", rating=1180, sd=120, battles_count=40),
    ModelInfo(id="gemini", rating=1100, sd=200, battles_count=10),
]

# Get optimal pairing (balances competitiveness + uncertainty)
model_a, model_b = service.select_pair_for_battle(models)
print(f"Recommended battle: {model_a.id} vs {model_b.id}")

Configuration

Create promptmachine.yaml in your project:

version: 1

default_models:
  - gpt-4o-mini
  - claude-3-5-haiku

battle:
  judge_model: gpt-4o-mini
  temperature: 0.7

elo:
  k_factor: 32
  initial_rating: 1000

limits:
  max_cost_per_test: 0.10
  daily_budget: 5.00

Or use environment variables:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export OPENROUTER_API_KEY=sk-or-...

Supported Models

Provider     Models
OpenAI       gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1-preview, o1-mini
Anthropic    claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus
OpenRouter   gemini-pro-1.5, llama-3.1-70b, mistral-large, deepseek-coder, qwen-max, + more

Full pricing table:

Model               Input ($/1K)   Output ($/1K)
gpt-4o              $0.0025        $0.01
gpt-4o-mini         $0.00015       $0.0006
claude-3-5-sonnet   $0.003         $0.015
claude-3-5-haiku    $0.001         $0.005
gemini-pro-1.5      $0.00125       $0.005
llama-3.1-70b       $0.00052       $0.00075

ELO Rating System

We use a modified ELO system inspired by Chatbot Arena:

from promptmachine_eval import EloCalculator, EloConfig

# Custom configuration
config = EloConfig(
    k_factor=32,        # Rating volatility (higher = more change)
    initial_rating=1000,
    initial_sd=350,     # Uncertainty (decreases with more battles)
)

elo = EloCalculator(config)

# Expected win probability
prob = elo.expected_score(1200, 1000)
print(f"1200-rated has {prob:.1%} chance vs 1000-rated")
# Output: 1200-rated has 76.0% chance vs 1000-rated

# With uncertainty (Monte Carlo)
prob = elo.win_probability(1200, 1000, sd_a=100, sd_b=200)

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Clone the repo
git clone https://github.com/framersai/promptmachine-eval.git
cd promptmachine-eval

# Install for development
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .
black --check .
mypy src/

License

MIT License — see LICENSE for details.

Links

🌐 PromptMachine • 🏢 Frame.dev • 🐙 GitHub • 🐦 Twitter

Built with ❤️ by Frame.dev
Questions? team@frame.dev
