Duelboard

A high-performance Elo rating calculation library for tournaments and competitions, inspired by the Chatbot Arena rating system.

Features

  • Multiple Calculation Methods: Basic Elo, Bootstrap with confidence intervals, and Maximum Likelihood Estimation
  • High Performance: Optimized for large datasets with thousands of battles
  • Type Safe: Full type annotations and modern Python practices
  • Flexible Data Input: Support for pandas DataFrames, CSV/JSON files, or Battle objects
  • Comprehensive Analysis: Built-in tools for win rate prediction and pairwise analysis
  • Optional Visualization: Beautiful plots with Plotly (optional dependency)

Installation

# Basic installation
pip install duelboard

# With visualization support
pip install "duelboard[visualization]"  # quotes keep shells such as zsh from expanding the brackets

# Development installation
git clone https://github.com/jannchie/duelboard.git
cd duelboard
uv sync --extra visualization --group dev  # with visualization and dev tools

Quick Start

import pandas as pd
import duelboard as db

# Load your battle data
battles_df = pd.DataFrame([
    {"player_a": "gpt-4", "player_b": "claude-v1", "winner": "player_a"},
    {"player_a": "claude-v1", "player_b": "gpt-3.5-turbo", "winner": "player_a"},
    {"player_a": "gpt-4", "player_b": "gpt-3.5-turbo", "winner": "player_a"},
    # ... more battles
])

# Calculate Elo ratings
calculator = db.EloCalculator(k_factor=4)
ratings = calculator.calculate(battles_df)

# Get leaderboard
leaderboard = db.get_rating_summary(ratings)
print(leaderboard)

Calculation Methods

1. Basic Elo Calculator

Standard online Elo rating updates:

calculator = db.EloCalculator(
    k_factor=4,           # Lower = more stable ratings
    scale=400,            # Elo scale parameter
    initial_rating=1000   # Starting rating
)
ratings = calculator.calculate(battles_df)
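
To make the update rule concrete, here is a minimal pure-Python sketch of online Elo. It is not duelboard's implementation: the function names `expected_score` and `online_elo` and the tuple-based input are assumptions made for illustration, and only the parameter names (`k_factor`, `scale`, `initial_rating`) and the winner labels are taken from this README.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the base-10 logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def online_elo(battles, k_factor=4.0, scale=400.0, initial_rating=1000.0):
    """Process (player_a, player_b, winner) tuples in order and return ratings.

    `winner` is "player_a" or "player_b"; any other value is scored as a tie.
    """
    ratings = defaultdict(lambda: initial_rating)
    for player_a, player_b, winner in battles:
        ra, rb = ratings[player_a], ratings[player_b]
        expected_a = expected_score(ra, rb, scale)
        # Actual score for A: 1 for a win, 0 for a loss, 0.5 for a tie.
        score_a = {"player_a": 1.0, "player_b": 0.0}.get(winner, 0.5)
        # A gains exactly what B loses, so the total rating mass is conserved.
        delta = k_factor * (score_a - expected_a)
        ratings[player_a] = ra + delta
        ratings[player_b] = rb - delta
    return dict(ratings)


battles = [
    ("gpt-4", "claude-v1", "player_a"),
    ("claude-v1", "gpt-3.5-turbo", "player_a"),
    ("gpt-4", "gpt-3.5-turbo", "player_a"),
]
sketch_ratings = online_elo(battles)
for name, rating in sorted(sketch_ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.2f}")
```

Because each battle moves a rating by at most `k_factor` points, a small value such as 4 changes ratings slowly and dampens the effect of any single result. The outcome still depends on the order in which battles are processed.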

2. Bootstrap Elo Calculator

Provides confidence intervals through bootstrap sampling:

bootstrap_calc = db.BootstrapEloCalculator(
    k_factor=4,
    n_bootstrap=1000,     # Number of bootstrap samples
    confidence_level=0.95,
    random_seed=42
)
ratings = bootstrap_calc.calculate(battles_df)

# Access confidence intervals
for player, rating in ratings.items():
    print(f"{player}: {rating.rating:.0f} [{rating.confidence_interval[0]:.0f}, {rating.confidence_interval[1]:.0f}]")
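
The idea behind the confidence intervals can be sketched with the standard library alone. The code below assumes a percentile bootstrap (resample the battles with replacement, recompute ratings, read off quantiles). Whether duelboard computes its intervals exactly this way is not stated here, and `online_elo` and `bootstrap_intervals` are hypothetical helper names.

```python
import random


def online_elo(battles, k_factor=4.0, scale=400.0, initial_rating=1000.0):
    """Minimal online Elo over (player_a, player_b, score_a) tuples."""
    ratings = {}
    for a, b, score_a in battles:
        ra = ratings.get(a, initial_rating)
        rb = ratings.get(b, initial_rating)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / scale))
        delta = k_factor * (score_a - expected_a)
        ratings[a], ratings[b] = ra + delta, rb - delta
    return ratings


def bootstrap_intervals(battles, n_bootstrap=200, confidence_level=0.95, seed=42):
    """Percentile bootstrap: resample with replacement, re-rate, take quantiles."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_bootstrap):
        resample = rng.choices(battles, k=len(battles))
        for player, rating in online_elo(resample).items():
            samples.setdefault(player, []).append(rating)
    alpha = (1.0 - confidence_level) / 2.0
    intervals = {}
    for player, values in samples.items():
        values.sort()
        lower = values[int(alpha * (len(values) - 1))]
        upper = values[int(round((1.0 - alpha) * (len(values) - 1)))]
        median = values[len(values) // 2]
        intervals[player] = (lower, median, upper)
    return intervals


# Synthetic data: score_a is 1.0 when player_a wins and 0.0 when player_b wins.
battles = (
    [("A", "B", 1.0)] * 30 + [("A", "B", 0.0)] * 10
    + [("B", "C", 1.0)] * 25 + [("B", "C", 0.0)] * 15
)
intervals = bootstrap_intervals(battles)
for player, (lower, median, upper) in sorted(intervals.items()):
    print(f"{player}: {median:.0f} [{lower:.0f}, {upper:.0f}]")
```

Each bootstrap round repeats the full rating computation, so the runtime grows linearly with `n_bootstrap`.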

3. Maximum Likelihood Estimation

Fits all ratings at once with logistic regression instead of updating them battle by battle, which gives more stable ratings:

mle_calc = db.MLEEloCalculator(random_state=42)
ratings = mle_calc.calculate(battles_df)

# With bootstrap confidence intervals
ratings = mle_calc.calculate_with_bootstrap(battles_df, n_bootstrap=500)
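
As a rough illustration of what a maximum likelihood fit does, the sketch below maximizes the Bradley-Terry log-likelihood with plain gradient ascent. This swaps in a hand-written optimizer for a logistic regression solver, so it is not the library's code path. The name `mle_elo`, the tuple input, and the step-size choice are all assumptions.

```python
import math


def mle_elo(battles, scale=400.0, initial_rating=1000.0, n_iter=2000):
    """Fit Bradley-Terry strengths by gradient ascent on the log-likelihood.

    `battles` holds (player_a, player_b, score_a) with score_a in {1, 0.5, 0}.
    """
    players = sorted({p for a, b, _ in battles for p in (a, b)})
    theta = {p: 0.0 for p in players}    # log-strengths on the natural-log scale
    learning_rate = 1.0 / len(battles)   # conservative step for a concave objective
    for _ in range(n_iter):
        grad = {p: 0.0 for p in players}
        for a, b, score_a in battles:
            p_a = 1.0 / (1.0 + math.exp(theta[b] - theta[a]))
            grad[a] += score_a - p_a
            grad[b] -= score_a - p_a
        for p in players:
            theta[p] += learning_rate * grad[p]
    # Convert natural-log strengths to the base-10 Elo scale.
    return {p: initial_rating + scale * theta[p] / math.log(10) for p in players}


# A beats B in 3 of 4 battles, so the fitted win probability should be 0.75.
battles = [("A", "B", 1.0)] * 3 + [("A", "B", 0.0)]
fitted = mle_elo(battles)
print({player: round(rating, 1) for player, rating in fitted.items()})
```

Unlike the online update, this fit does not depend on the order of the battles. For the two-player data above, the rating gap converges to 400 * log10(3), roughly 190.8 points, which corresponds to a 75% win probability. Note that the likelihood has no finite maximum when a player wins or loses every battle, so a real implementation needs regularization or a similar safeguard.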

Analysis Tools

Win Rate Prediction

predictor = db.WinRatePredictor()

# Predict win probability
prob = predictor.predict_win_probability(1200, 1000)  # ratings
print(f"Win probability: {prob:.3f}")

# Create win rate matrix
win_matrix = predictor.create_win_rate_matrix(ratings)
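
For reference, the standard Elo win probability can be written in a few lines. The sketch assumes the usual base-10 logistic formula with a scale of 400. The helper names `win_probability` and `win_rate_matrix` are illustrative and are not part of duelboard.

```python
def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def win_rate_matrix(ratings: dict, scale: float = 400.0) -> dict:
    """Nested dict: matrix[a][b] is the predicted probability that a beats b."""
    names = sorted(ratings, key=ratings.get, reverse=True)
    return {
        a: {b: win_probability(ratings[a], ratings[b], scale) for b in names if b != a}
        for a in names
    }


print(f"{win_probability(1200, 1000):.3f}")  # 0.760

matrix = win_rate_matrix({"gpt-4": 1200.0, "claude-v1": 1100.0, "gpt-3.5-turbo": 1000.0})
print(f"{matrix['gpt-4']['gpt-3.5-turbo']:.3f}")  # 0.760
```

Under this model a 200-point gap corresponds to about a 76% win probability, and the two entries for any pair always sum to 1.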

Battle Statistics

analyzer = db.PairwiseAnalyzer()

# Basic battle statistics
stats = analyzer.compute_battle_statistics(battles_df)

# Pairwise win fractions
win_fractions = analyzer.compute_pairwise_win_fraction(battles_df)

# Battle count matrix
battle_counts = analyzer.visualize_battle_count_matrix(battles_df)
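
A pairwise win fraction is simply wins divided by battles played for each ordered pair. The standard-library sketch below shows one way to compute it. Counting ties in the denominator is an assumption of this example, and `pairwise_win_fraction` here is a standalone function, not the `PairwiseAnalyzer` method.

```python
from collections import defaultdict


def pairwise_win_fraction(battles):
    """Return {(player, opponent): fraction of their battles that player won}."""
    wins = defaultdict(int)    # (winner, loser) -> number of wins
    counts = defaultdict(int)  # (player, opponent) -> battles played, ties included
    for a, b, winner in battles:
        counts[(a, b)] += 1
        counts[(b, a)] += 1
        if winner == "player_a":
            wins[(a, b)] += 1
        elif winner == "player_b":
            wins[(b, a)] += 1
    return {pair: wins[pair] / n for pair, n in counts.items()}


battles = [
    ("player1", "player2", "player_a"),
    ("player2", "player3", "player_b"),
    ("player1", "player3", "tie"),
]
fractions = pairwise_win_fraction(battles)
for (player, opponent), fraction in sorted(fractions.items()):
    print(f"{player} vs {opponent}: {fraction:.2f}")
```

With ties in the denominator, the two fractions for a pair sum to less than 1 whenever that pair has tied.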

Visualization (Optional)

# Install with: pip install "duelboard[visualization]"
import duelboard.visualization as viz

# Plot leaderboard with confidence intervals
fig = viz.plot_leaderboard(ratings, show_confidence_intervals=True)
fig.show()

# Plot win rate matrix
win_matrix = predictor.create_win_rate_matrix(ratings)
fig = viz.plot_win_rate_matrix(win_matrix)
fig.show()

# Plot battle count matrix
battle_counts = analyzer.visualize_battle_count_matrix(battles_df)
fig = viz.plot_battle_count_matrix(battle_counts)
fig.show()

Data Formats

DataFrame Format

battles_df = pd.DataFrame({
    'player_a': ['player1', 'player2', 'player1'],
    'player_b': ['player2', 'player3', 'player3'],
    'winner': ['player_a', 'player_b', 'tie']
})

Battle Objects

# Recommended: Use intuitive win/tie methods
battles = [
    db.Battle.win('player1', 'player2'),    # player1 beats player2
    db.Battle.win('player3', 'player2'),    # player3 beats player2
    db.Battle.tie('player1', 'player3'),    # tie between player1 and player3
]

# Traditional API with outcome enums (for advanced use cases)
battles = [
    db.Battle('player1', 'player2', db.BattleOutcome.WIN_A),
    db.Battle('player2', 'player3', db.BattleOutcome.WIN_B),
    db.Battle('player1', 'player3', db.BattleOutcome.TIE)
]

Load from Files

# From CSV
battles = db.load_battles_from_csv('battles.csv')

# From JSON
battles = db.load_battles_from_json('battles.json')

Advanced Usage

Filter Anonymous Battles (Chatbot Arena Style)

# Keep only anonymous battles; `df` is a battles DataFrame with an `anony` column
anonymous_df = db.filter_anonymous_battles(df, anony_col='anony')

# Drop ties, keeping only battles that have a winner
no_ties_df = db.filter_non_tie_battles(df)

Even Sampling Across Model Pairs

bootstrap_calc = db.BootstrapEloCalculator()
ratings = bootstrap_calc.calculate_even_sample(
    battles_df,
    n_per_pair=50  # Sample 50 battles per model pair
)
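
Even sampling keeps heavily played pairs from dominating the ratings. A minimal sketch of the idea follows, assuming battles are grouped by unordered pair and drawn with replacement so that rare pairs still contribute `n_per_pair` rows. The function `even_sample` is hypothetical and may differ from what `calculate_even_sample` does internally.

```python
import random
from collections import defaultdict


def even_sample(battles, n_per_pair=50, seed=42):
    """Draw the same number of battles for every pair of players."""
    rng = random.Random(seed)
    by_pair = defaultdict(list)
    for battle in battles:
        player_a, player_b = battle[0], battle[1]
        # frozenset ignores which side a player was listed on.
        by_pair[frozenset((player_a, player_b))].append(battle)
    sampled = []
    for pair_battles in by_pair.values():
        # Sampling with replacement lets pairs with few battles reach n_per_pair.
        sampled.extend(rng.choices(pair_battles, k=n_per_pair))
    rng.shuffle(sampled)  # avoid feeding an order-sensitive calculator one pair at a time
    return sampled


battles = (
    [("A", "B", "player_a")] * 200
    + [("B", "A", "player_b")] * 5
    + [("A", "C", "tie")] * 3
)
balanced = even_sample(battles, n_per_pair=50)
print(len(balanced))  # 100: two distinct pairs, 50 battles each
```

Drawing with replacement repeats battles from sparse pairs, so the resulting ratings for those pairs rest on very little distinct evidence.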

Export Results

# Export to CSV
db.export_ratings_to_csv(ratings, 'ratings.csv')

# Get ranked player list
ranked_players = db.rank_players_by_rating(ratings)

Performance Tips

  • Use k_factor=4 for stable ratings (as used in Chatbot Arena)
  • For large datasets, consider filtering to anonymous battles only
  • Use the MLE calculator for the most stable results
  • Bootstrap calculations are slower but provide uncertainty estimates
  • Reduce n_bootstrap for faster computation during development

Development

# Install dependencies with visualization and dev tools
uv sync --extra visualization --group dev

# Run tests
pytest

# Run tests with coverage
pytest --cov=src/duelboard --cov-report=html

# Run examples
python examples/simple_example.py
python examples/basic_usage.py
python examples/visualization_example.py

# Lint code and apply automatic fixes
ruff check --fix

License

MIT License. See LICENSE file for details.

Citation

Inspired by the Chatbot Arena Elo rating system. If you use this library in academic work, please cite:

Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). 
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
