A framework for evaluating LLMs using zero-sum multiplayer simulations
ZeroSumEval: An extensible framework for evaluating LLMs using games! ⚔
Table of Contents
- Overview
- Project Structure
- Installation
- Usage
- Games
- Configuration
- Acknowledgments
- Citation
- Contributing
- License
Overview
ZeroSumEval is a dynamic evaluation benchmark for LLMs using competitive scenarios that scales with model capabilities (i.e. as models get better, the benchmark gets harder). Instead of fixed evaluation benchmarks or subjective judging criteria, ZeroSumEval uses multi-agent simulations with clear win conditions to pit models against each other.
The framework tests various model capabilities, including knowledge, reasoning, and planning. In addition, ZeroSumEval uses DSPy optimization to test the self-improvement capability of models and ensure the competition between models is fair.
The eval suite consists of a growing number of simulations, including text-based challenges, board games, and Capture The Flag (CTF) competitions.
[Figure: Performance comparison of different LLMs across various games in ZeroSumEval (March 9, 2025)]
Key features:
- One-click evals on the existing suite of games
- Easily extendable abstractions for new game implementations
- Integration with DSPy for automated prompt optimization
- Comprehensive logging and analysis tools
Project Structure
The project is organized as follows:
- zero_sum_eval/: Main package containing the core framework
  - analysis/: Modules for analyzing game performance and calculating ratings
  - core/: Core game-related components, including player and game state management
  - games/: Individual game implementations
  - managers/: Game and match management classes
  - utils/: Utility functions for logging, configuration, checkpointing, and type definitions
  - main.py: Entry point for running games and matches
- data/: Game-specific data and examples
- configs/: Configuration files for different games and scenarios
Installation
1. Use pip to install ZeroSumEval:
pip install zero-sum-eval
2. Test the installation:
zseval --help
Usage
You can run a single game or a pool of matches, either with or without a detailed config file.
Running without a config file
Single game:
zseval -g chess -p "white=openai/gpt-4o" "black=openai/gpt-4o"
Pool of matches:
zseval --pool -g chess -p "white=openai/gpt-4o" "black=openai/gpt-4o"
Running from a config file
Single game:
zseval -c configs/chess.yaml
Pool of matches:
zseval --pool -c configs/pool/chess.yaml
Rating calculation
Add the --calculate_ratings flag to output Elo ratings for the models after a pool of matches:
zseval --pool -c configs/pool/chess.yaml --calculate_ratings
Or directly calculate the ratings from a given match pool log directory:
zseval --calculate_ratings --output_dir match_pool_log/
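As background for what --calculate_ratings reports, the sketch below implements the textbook Elo update. It is an illustration of the standard formula, not necessarily ZeroSumEval's exact rating procedure; the function names and K-factor are assumptions for the example.

```python
# Standard Elo update (illustrative; ZeroSumEval's exact rating
# procedure may differ from this textbook version).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models: a win moves the winner up by k/2.
print(update_elo(1500, 1500, 1.0))  # -> (1516.0, 1484.0)
```

The key property for a dynamic benchmark is that upsets (a low-rated model beating a high-rated one) move ratings much more than expected results do.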
Games
ZeroSumEval currently supports the following games:
- Chess
- Debate
- Gandalf (Password Guessing)
- Liar's Dice
- Math Quiz
- Poker (Simple Texas Hold'em)
- PyJail (Capture The Flag)
Each game is implemented as a separate module in the zero_sum_eval/games/ directory.
Configuration
Game configurations are defined in YAML files located in the configs/ directory. These files specify:
- Logging settings
- Manager settings
- Game parameters
- Player configurations
- LLM settings
Example Configuration (chess.yaml):
```yaml
logging:
  output_dir: ../output/chess_game
manager:
  args:
    max_player_attempts: 5
    max_rounds: 200
game:
  name: chess
  args:
    players:
      white:
        class: chess_player
        args:
          id: llama3.1 70b white
          actions:
            - name: MakeMove
              optimize: true
              metric: chess_move_validation_metric
              dataset: chess_dataset
              dataset_args:
                filename: ./data/chess/stockfish_examples.jsonl
                player_key: white
                num_examples: 10
          lm:
            model: openrouter/meta-llama/llama-3.3-70b-instruct
          optimizer: BootstrapFewshot
          optimizer_args:
            max_bootstrapped_demos: 1
          max_tries: 5
      black:
        class: chess_player
        args:
          id: llama3.3 70b black
          lm:
            model: openrouter/meta-llama/llama-3.3-70b-instruct
          max_tries: 5
```
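Before launching a long match pool, it can be worth sanity-checking a config programmatically. The snippet below mirrors the chess.yaml example as a plain Python dict and verifies both sides are configured; the key paths follow the example above and are illustrative, not a formal schema.

```python
# Minimal sanity-check of a chess config; nesting mirrors the
# chess.yaml example above (illustrative, not a formal schema).

config = {
    "logging": {"output_dir": "../output/chess_game"},
    "manager": {"args": {"max_player_attempts": 5, "max_rounds": 200}},
    "game": {
        "name": "chess",
        "args": {
            "players": {
                "white": {
                    "class": "chess_player",
                    "args": {"lm": {"model": "openrouter/meta-llama/llama-3.3-70b-instruct"}},
                },
                "black": {
                    "class": "chess_player",
                    "args": {"lm": {"model": "openrouter/meta-llama/llama-3.3-70b-instruct"}},
                },
            }
        },
    },
}

players = config["game"]["args"]["players"]
missing = {"white", "black"} - players.keys()
assert not missing, f"missing player configs: {sorted(missing)}"
for side, spec in players.items():
    print(f"{side}: {spec['args']['lm']['model']}")
```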
Acknowledgments
Many thanks to Hisham Alyahya, Yazeed Alnumay, Colton Ritchie, and M Saiful Bari for their active contributions to the project. Because the project moved repositories, we were unable to preserve the commit history of the original repository.
Citation
If you use ZeroSumEval in your work, please cite the following papers:
Paper:
@misc{khan2025zerosumevalscalingllmevaluation,
title={ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition},
author={Haidar Khan and Hisham A. Alyahya and Yazeed Alnumay and M Saiful Bari and Bülent Yener},
year={2025},
eprint={2504.12562},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2504.12562},
}
Demo Paper:
@misc{alyahya2025zerosumevalextensibleframeworkscaling,
title={ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition},
author={Hisham A. Alyahya and Haidar Khan and Yazeed Alnumay and M Saiful Bari and Bülent Yener},
year={2025},
eprint={2503.10673},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.10673},
}
Contributing
Contributions to ZeroSumEval are welcome! Please follow the contribution guidelines and open a pull request or issue on the GitHub repository.
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
ZeroSumEval collects model outputs in its logs for analysis and evaluation purposes. Each model's outputs are subject to the terms and conditions of the model's license and should be used in accordance with those terms.