Duelboard
A high-performance Elo rating calculation library for tournaments and competitions, inspired by the Chatbot Arena rating system.
Features
- Multiple Calculation Methods: Basic Elo, Bootstrap with confidence intervals, and Maximum Likelihood Estimation
- High Performance: Optimized for large datasets with thousands of battles
- Type Safe: Full type annotations and modern Python practices
- Flexible Data Input: Support for pandas DataFrames, CSV/JSON files, or Battle objects
- Comprehensive Analysis: Built-in tools for win rate prediction and pairwise analysis
- Optional Visualization: Beautiful plots with Plotly (optional dependency)
Installation
# Basic installation
pip install duelboard
# With visualization support
pip install "duelboard[visualization]"
# Development installation
git clone https://github.com/jannchie/duelboard.git
cd duelboard
uv sync --extra visualization --group dev # with visualization and dev tools
Quick Start
import pandas as pd
import duelboard as db
# Load your battle data
battles_df = pd.DataFrame([
    {"player_a": "gpt-4", "player_b": "claude-v1", "winner": "player_a"},
    {"player_a": "claude-v1", "player_b": "gpt-3.5-turbo", "winner": "player_a"},
    {"player_a": "gpt-4", "player_b": "gpt-3.5-turbo", "winner": "player_a"},
    # ... more battles
])
# Calculate Elo ratings
calculator = db.EloCalculator(k_factor=4)
ratings = calculator.calculate(battles_df)
# Get leaderboard
leaderboard = db.get_rating_summary(ratings)
print(leaderboard)
Calculation Methods
1. Basic Elo Calculator
Standard online Elo rating updates:
calculator = db.EloCalculator(
    k_factor=4,           # Lower = more stable ratings
    scale=400,            # Elo scale parameter
    initial_rating=1000,  # Starting rating
)
ratings = calculator.calculate(battles_df)
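For intuition, the online update behind a calculator of this kind can be sketched in plain Python. This is a minimal illustration of the standard Elo formula, not duelboard's actual implementation: the names `expected_score` and `update_elo` are hypothetical, and scoring a tie as 0.5 is the usual convention rather than something the library documentation confirms.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b, scale=400):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(battles, k_factor=4, scale=400, initial_rating=1000):
    """Process (player_a, player_b, winner) tuples in order (hypothetical helper)."""
    ratings = defaultdict(lambda: float(initial_rating))
    for player_a, player_b, winner in battles:
        rating_a, rating_b = ratings[player_a], ratings[player_b]
        expected_a = expected_score(rating_a, rating_b, scale)
        # Any winner value other than "player_a"/"player_b" is scored as a tie.
        score_a = {"player_a": 1.0, "player_b": 0.0}.get(winner, 0.5)
        # The two updates are equal and opposite, so total rating is conserved.
        ratings[player_a] = rating_a + k_factor * (score_a - expected_a)
        ratings[player_b] = rating_b - k_factor * (score_a - expected_a)
    return dict(ratings)


battles = [
    ("gpt-4", "claude-v1", "player_a"),
    ("claude-v1", "gpt-3.5-turbo", "player_a"),
    ("gpt-4", "gpt-3.5-turbo", "player_a"),
]
ratings = update_elo(battles)
for player, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{player}: {rating:.1f}")
```

Because each battle moves the two ratings by the same amount in opposite directions, a smaller `k_factor` means every single result shifts the leaderboard less, which is why low values give more stable ratings.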
2. Bootstrap Elo Calculator
Provides confidence intervals through bootstrap sampling:
bootstrap_calc = db.BootstrapEloCalculator(
    k_factor=4,
    n_bootstrap=1000,  # Number of bootstrap samples
    confidence_level=0.95,
    random_seed=42,
)
ratings = bootstrap_calc.calculate(battles_df)
# Access confidence intervals
for player, rating in ratings.items():
    low, high = rating.confidence_interval
    print(f"{player}: {rating.rating:.0f} [{low:.0f}, {high:.0f}]")
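The idea behind bootstrap confidence intervals can be sketched without the library. The version below assumes the common percentile bootstrap: resample the battles with replacement, recompute online Elo on each resample, and read the interval off the sorted ratings. Whether duelboard resamples in exactly this way is an assumption, and `online_elo` and `bootstrap_intervals` are hypothetical names.

```python
import math
import random
from collections import defaultdict


def online_elo(battles, k_factor=4, scale=400, initial_rating=1000):
    """Plain online Elo over (player_a, player_b, winner) tuples."""
    ratings = defaultdict(lambda: float(initial_rating))
    for a, b, winner in battles:
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / scale))
        score_a = {"player_a": 1.0, "player_b": 0.0}.get(winner, 0.5)
        delta = k_factor * (score_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


def bootstrap_intervals(battles, n_bootstrap=1000, confidence_level=0.95,
                        random_seed=42):
    """Percentile intervals from ratings recomputed on resampled battles."""
    rng = random.Random(random_seed)
    samples = defaultdict(list)
    for _ in range(n_bootstrap):
        resampled = rng.choices(battles, k=len(battles))  # with replacement
        for player, rating in online_elo(resampled).items():
            samples[player].append(rating)
    alpha = (1 - confidence_level) / 2
    intervals = {}
    for player, values in samples.items():
        values.sort()
        last = len(values) - 1
        lower = values[math.floor(alpha * last)]
        upper = values[math.ceil((1 - alpha) * last)]
        intervals[player] = (lower, upper)
    return intervals


battles = [
    ("gpt-4", "claude-v1", "player_a"),
    ("claude-v1", "gpt-3.5-turbo", "player_a"),
    ("gpt-4", "gpt-3.5-turbo", "player_a"),
] * 20
intervals = bootstrap_intervals(battles, n_bootstrap=200)
for player, (low, high) in intervals.items():
    print(f"{player}: [{low:.1f}, {high:.1f}]")
```

Fixing the seed makes the intervals reproducible, and the cost grows linearly with `n_bootstrap` because every resample triggers a full rating pass.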
3. Maximum Likelihood Estimation
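Fits all ratings jointly with logistic regression instead of updating them battle by battle, which gives more stable ratings that do not depend on battle order: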
mle_calc = db.MLEEloCalculator(random_state=42)
ratings = mle_calc.calculate(battles_df)
# With bootstrap confidence intervals
ratings = mle_calc.calculate_with_bootstrap(battles_df, n_bootstrap=500)
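To show what a maximum likelihood fit looks like, here is a rough sketch of a Bradley-Terry model mapped onto the Elo scale. It deliberately swaps in plain gradient ascent with a small L2 penalty in place of a logistic regression solver, so it is an illustration of the technique rather than the library's method. The function name `fit_mle_elo`, the `(player_a, player_b, score_a)` tuple layout, the penalty, and the battle counts are all assumptions made for the example.

```python
import math


def fit_mle_elo(battles, scale=400, initial_rating=1000,
                l2=1e-4, learning_rate=1.0, n_iter=2000):
    """Fit Bradley-Terry strengths by gradient ascent, then map to the Elo scale.

    battles: list of (player_a, player_b, score_a), score_a in {1, 0.5, 0}.
    """
    players = sorted({p for a, b, _ in battles for p in (a, b)})
    theta = dict.fromkeys(players, 0.0)
    n = len(battles)
    for _ in range(n_iter):
        grad = dict.fromkeys(players, 0.0)
        for a, b, score_a in battles:
            p_a = 1 / (1 + math.exp(theta[b] - theta[a]))  # logistic win probability
            grad[a] += score_a - p_a
            grad[b] -= score_a - p_a
        for p in players:
            # Mean log-likelihood gradient minus a small L2 penalty, which
            # keeps the fit finite when a player never loses.
            theta[p] += learning_rate * (grad[p] / n - l2 * theta[p])
    # sigmoid(theta_a - theta_b) equals 1 / (1 + 10 ** ((R_b - R_a) / scale))
    # when R = initial_rating + theta * scale / ln(10).
    return {p: initial_rating + theta[p] * scale / math.log(10) for p in players}


# Made-up results: 10 battles per pair, with score_a = 1 when player_a wins.
battles = (
    [("gpt-4", "claude-v1", 1)] * 7 + [("gpt-4", "claude-v1", 0)] * 3
    + [("claude-v1", "gpt-3.5-turbo", 1)] * 6 + [("claude-v1", "gpt-3.5-turbo", 0)] * 4
    + [("gpt-4", "gpt-3.5-turbo", 1)] * 8 + [("gpt-4", "gpt-3.5-turbo", 0)] * 2
)
ratings = fit_mle_elo(battles)
for player, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{player}: {rating:.0f}")
```

Shuffling `battles` leaves the fitted ratings essentially unchanged, because the likelihood is a sum over battles; that is the practical difference from the online update.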
Analysis Tools
Win Rate Prediction
predictor = db.WinRatePredictor()
# Predict win probability
prob = predictor.predict_win_probability(1200, 1000) # ratings
print(f"Win probability: {prob:.3f}")
# Create win rate matrix
win_matrix = predictor.create_win_rate_matrix(ratings)
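The prediction itself follows from the Elo expected-score formula, P(A beats B) = 1 / (1 + 10^((R_B - R_A) / 400)). The sketch below applies it to every pair to build a win rate matrix. It assumes ratings are plain floats in a dict and uses illustrative rating values; the function names mirror the calls above but are standalone stand-ins, not the library's code.

```python
def predict_win_probability(rating_a, rating_b, scale=400):
    """Elo expected score of A against B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / scale))


def win_rate_matrix(ratings, scale=400):
    """Nested dict: matrix[a][b] is the predicted probability that a beats b."""
    players = sorted(ratings, key=ratings.get, reverse=True)
    return {
        a: {
            b: predict_win_probability(ratings[a], ratings[b], scale)
            for b in players
            if b != a
        }
        for a in players
    }


# Illustrative ratings only, not measured results.
ratings = {"gpt-4": 1200.0, "claude-v1": 1100.0, "gpt-3.5-turbo": 1000.0}
matrix = win_rate_matrix(ratings)
print(f"{matrix['gpt-4']['gpt-3.5-turbo']:.3f}")  # prints 0.760
```

A 200-point gap therefore corresponds to roughly a 76% win rate, and `matrix[a][b] + matrix[b][a]` is always 1.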
Battle Statistics
analyzer = db.PairwiseAnalyzer()
# Basic battle statistics
stats = analyzer.compute_battle_statistics(battles_df)
# Pairwise win fractions
win_fractions = analyzer.compute_pairwise_win_fraction(battles_df)
# Battle count matrix
battle_counts = analyzer.visualize_battle_count_matrix(battles_df)
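As a rough picture of what a pairwise win fraction is, the following sketch counts, for each ordered pair, the share of decided battles the first player won. Skipping ties is an assumption made here for simplicity (the library may treat them differently), and `pairwise_win_fraction` is a hypothetical standalone function.

```python
from collections import Counter


def pairwise_win_fraction(battles):
    """fraction[(a, b)] = share of non-tie battles between a and b won by a."""
    wins = Counter()
    totals = Counter()
    for player_a, player_b, winner in battles:
        if winner not in ("player_a", "player_b"):
            continue  # ties are skipped in this sketch
        if winner == "player_a":
            victor, loser = player_a, player_b
        else:
            victor, loser = player_b, player_a
        wins[(victor, loser)] += 1
        # Count the battle for both orderings so each has a denominator.
        totals[(victor, loser)] += 1
        totals[(loser, victor)] += 1
    return {pair: wins[pair] / count for pair, count in totals.items()}


battles = [
    ("player1", "player2", "player_a"),
    ("player2", "player1", "player_a"),
    ("player1", "player2", "player_a"),
    ("player1", "player3", "tie"),
]
fractions = pairwise_win_fraction(battles)
print(fractions[("player1", "player2")])  # 2 wins out of 3 decided battles
```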
Visualization (Optional)
# Install with: pip install "duelboard[visualization]"
import duelboard.visualization as viz
# Plot leaderboard with confidence intervals
fig = viz.plot_leaderboard(ratings, show_confidence_intervals=True)
fig.show()
# Plot win rate matrix
win_matrix = predictor.create_win_rate_matrix(ratings)
fig = viz.plot_win_rate_matrix(win_matrix)
fig.show()
# Plot battle count matrix
battle_counts = analyzer.visualize_battle_count_matrix(battles_df)
fig = viz.plot_battle_count_matrix(battle_counts)
fig.show()
Data Formats
DataFrame Format
battles_df = pd.DataFrame({
    'player_a': ['player1', 'player2', 'player1'],
    'player_b': ['player2', 'player3', 'player3'],
    'winner': ['player_a', 'player_b', 'tie'],
})
Battle Objects
# Recommended: Use intuitive win/tie methods
battles = [
    db.Battle.win('player1', 'player2'),  # player1 beats player2
    db.Battle.win('player3', 'player2'),  # player3 beats player2
    db.Battle.tie('player1', 'player3'),  # tie between player1 and player3
]
# Traditional API with outcome enums (for advanced use cases)
battles = [
    db.Battle('player1', 'player2', db.BattleOutcome.WIN_A),
    db.Battle('player2', 'player3', db.BattleOutcome.WIN_B),
    db.Battle('player1', 'player3', db.BattleOutcome.TIE),
]
Load from Files
# From CSV
battles = db.load_battles_from_csv('battles.csv')
# From JSON
battles = db.load_battles_from_json('battles.json')
Advanced Usage
Filter Anonymous Battles (Chatbot Arena Style)
# Keep only anonymous battles (df is a battles DataFrame that has the column named by anony_col)
anonymous_df = db.filter_anonymous_battles(df, anony_col='anony')
# Filter out ties
no_ties_df = db.filter_non_tie_battles(df)
Even Sampling Across Model Pairs
bootstrap_calc = db.BootstrapEloCalculator()
ratings = bootstrap_calc.calculate_even_sample(
    battles_df,
    n_per_pair=50,  # Sample 50 battles per model pair
)
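Even sampling exists because pairs that battled often would otherwise dominate the ratings. A minimal sketch of the idea, assuming battles are drawn with replacement so that rarely matched pairs can still reach `n_per_pair` (the library's exact sampling scheme is not documented here), with `even_sample` as a hypothetical helper:

```python
import random
from collections import defaultdict


def even_sample(battles, n_per_pair=50, random_seed=42):
    """Draw the same number of battles for every unordered player pair."""
    rng = random.Random(random_seed)
    by_pair = defaultdict(list)
    for battle in battles:
        player_a, player_b = battle[0], battle[1]
        # frozenset makes (a, b) and (b, a) the same pair.
        by_pair[frozenset((player_a, player_b))].append(battle)
    sampled = []
    for pair_battles in by_pair.values():
        sampled.extend(rng.choices(pair_battles, k=n_per_pair))  # with replacement
    rng.shuffle(sampled)  # avoid feeding an online calculator pair by pair
    return sampled


battles = (
    [("player1", "player2", "player_a")] * 100
    + [("player1", "player3", "player_b")] * 3
)
sample = even_sample(battles, n_per_pair=5)
print(len(sample))  # 2 pairs x 5 battles each
```

The balanced sample can then be passed to any rating calculation in place of the raw battle list.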
Export Results
# Export to CSV
db.export_ratings_to_csv(ratings, 'ratings.csv')
# Get ranked player list
ranked_players = db.rank_players_by_rating(ratings)
Performance Tips
- Use k_factor=4 for stable ratings (as used in Chatbot Arena)
- For large datasets, consider filtering to anonymous battles only
- Use the MLE calculator for the most stable results
- Bootstrap calculations are slower but provide uncertainty estimates
- Reduce n_bootstrap for faster computation during development
Development
# Install dependencies with visualization and dev tools
uv sync --extra visualization --group dev
# Run tests
pytest
# Run tests with coverage
pytest --cov=src/duelboard --cov-report=html
# Run examples
python examples/simple_example.py
python examples/basic_usage.py
python examples/visualization_example.py
# Lint code and apply automatic fixes
ruff check --fix
License
MIT License. See LICENSE file for details.
Citation
Inspired by the Chatbot Arena Elo rating system. If you use this library in academic work, please cite:
Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023).
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
File details
Details for the file duelboard-0.1.0.tar.gz.
File metadata
- Download URL: duelboard-0.1.0.tar.gz
- Upload date:
- Size: 6.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5d6ae678b424257dc563219aacd96a0d8be473af23739bc91d131992c61e1a1 |
| MD5 | 4d16c1a058e362cc7e5d390d125c4441 |
| BLAKE2b-256 | 10a21cde3ac1650d66a683938049864ce83255cae996285e31771c438941e813 |
File details
Details for the file duelboard-0.1.0-py3-none-any.whl.
File metadata
- Download URL: duelboard-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 31e1d7f0da3f6a87642cd5b3c409a04b4f529e37f24297fee57620e6cf746e32 |
| MD5 | 094f469351f4ab3ec9499f220490726a |
| BLAKE2b-256 | f1d789fd2e871d3d68121fad3649030ef1af1f61111b3de57e73ce301e67334e |