
A framework for evaluating LLMs using zero-sum multiplayer simulations


ZeroSumEval: An extensible framework for evaluating LLMs using games! ⚔


Overview

ZeroSumEval is a dynamic evaluation benchmark for LLMs using competitive scenarios that scales with model capabilities (i.e. as models get better, the benchmark gets harder). Instead of fixed evaluation benchmarks or subjective judging criteria, ZeroSumEval uses multi-agent simulations with clear win conditions to pit models against each other.

The framework tests various model capabilities, including knowledge, reasoning, and planning. In addition, ZeroSumEval uses DSPy optimization to test the self-improvement capability of models and ensure the competition between models is fair.

The eval suite consists of a growing number of simulations, including text-based challenges, board games, and Capture The Flag (CTF) competitions.


Performance comparison of different LLMs across various games in ZeroSumEval (March 9, 2025)

Key features:

  • One-click evals on the existing suite of games
  • Easily extendable abstractions for new game implementations
  • Integration with DSPy for automated prompt optimization
  • Comprehensive logging and analysis tools

Project Structure

The project is organized as follows:

  • zero_sum_eval/: Main package containing the core framework
    • analysis/: Modules for analyzing game performance and calculating ratings
    • core/: Core game-related components, including player and game state management
    • games/: Individual game implementations
    • managers/: Game and match management classes
    • utils/: Utility functions for logging, configuration, checkpointing, and type definitions
    • main.py: Entry point for running games and matches
  • data/: Game-specific data and examples
  • configs/: Configuration files for different games and scenarios

Installation

  1. Use pip to install ZeroSumEval:

    pip install zero-sum-eval
    
  2. Test the installation:

    zseval --help
    

Usage

You can run a single game or a pool of matches, with or without a detailed config file.

Running without a config file

Single game:

zseval -g chess -p "white=openai/gpt-4o" "black=openai/gpt-4o"

Pool of matches:

zseval --pool -g chess -p "white=openai/gpt-4o" "black=openai/gpt-4o"

Running from a config file

Single game:

zseval -c configs/chess.yaml

Pool of matches:

zseval --pool -c configs/pool/chess.yaml

Rating calculation

Add the --calculate_ratings flag to output Elo ratings for the models after a pool of matches:

zseval --pool -c configs/pool/chess.yaml --calculate_ratings

Or directly calculate the ratings from a given match pool log directory:

zseval --calculate_ratings --output_dir match_pool_log/
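The ratings reported by --calculate_ratings follow the standard Elo scheme: each model's rating moves toward or away from its opponent's based on the gap between its actual and expected score. The sketch below shows the underlying arithmetic; the function names are illustrative, not ZeroSumEval's actual API.

```python
# Standard Elo update, the kind of rating a pool of matches produces.
# Function names here are illustrative, not ZeroSumEval's actual API.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B (logistic in rating gap)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Return new ratings after one game; score_a is 1 for a win, 0.5 draw, 0 loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Two equally rated models: a win moves the winner up by K/2.
print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```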

Games

ZeroSumEval currently supports the following games:

  1. Chess
  2. Debate
  3. Gandalf (Password Guessing)
  4. Liar's Dice
  5. Math Quiz
  6. Poker (Simple Texas Hold'em)
  7. PyJail (Capture The Flag)

Each game is implemented as a separate module in the zero_sum_eval/games/ directory.
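To give a feel for the pattern these game modules follow, here is a self-contained sketch of a turn-based game with move validation and a clear win condition. This is purely illustrative; the actual base classes and registration hooks live in zero_sum_eval/core/ and differ from this sketch.

```python
# Illustrative sketch (not ZeroSumEval's actual API) of the shape of a
# game implementation: alternating players, validated moves, and an
# unambiguous win condition.

class NimGame:
    """Simple Nim: players alternate taking 1-3 stones; taking the last stone wins."""

    def __init__(self, stones: int = 10):
        self.stones = stones
        self.players = ["player_1", "player_2"]
        self.turn = 0
        self.winner = None

    def current_player(self) -> str:
        return self.players[self.turn % 2]

    def make_move(self, take: int) -> None:
        if self.winner is not None:
            raise RuntimeError("game is over")
        if not 1 <= take <= min(3, self.stones):
            raise ValueError(f"invalid move: {take}")
        self.stones -= take
        if self.stones == 0:
            self.winner = self.current_player()  # took the last stone
        self.turn += 1

game = NimGame(stones=4)
game.make_move(3)   # player_1 leaves 1 stone
game.make_move(1)   # player_2 takes the last stone and wins
print(game.winner)  # player_2
```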

Configuration

Game configurations are defined in YAML files located in the configs/ directory. These files specify:

  • Logging settings
  • Manager settings
  • Game parameters
  • Player configurations
  • LLM settings

Example configuration (chess.yaml):

logging:
  output_dir: ../output/chess_game
manager:
  args:
    max_player_attempts: 5
    max_rounds: 200
game:
  name: chess
  args:
    players:
      white:
        class: chess_player
        args:
          id: llama3.3 70b white
          actions:
            - name: MakeMove
              optimize: true
              metric: chess_move_validation_metric
              dataset: chess_dataset
              dataset_args:
                filename: ./data/chess/stockfish_examples.jsonl
                player_key: white
                num_examples: 10
          lm:
            model: openrouter/meta-llama/llama-3.3-70b-instruct
          optimizer: BootstrapFewshot
          optimizer_args:
            max_bootstrapped_demos: 1
          max_tries: 5
      black:
        class: chess_player
        args:
          id: llama3.3 70b black
          lm:
            model: openrouter/meta-llama/llama-3.3-70b-instruct
          max_tries: 5

Acknowledgements

Many thanks to Hisham Alyahya, Yazeed Alnumay, Colton Ritchie, and M Saiful Bari for their active contributions to the project. Because the project moved repositories, we were unable to preserve the commit history of the original repository.

Citation

If you use ZeroSumEval in your work, please cite the following papers:

Paper:

@misc{khan2025zerosumevalscalingllmevaluation,
      title={ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition},
      author={Haidar Khan and Hisham A. Alyahya and Yazeed Alnumay and M Saiful Bari and Bülent Yener},
      year={2025},
      eprint={2504.12562},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.12562},
}

Demo Paper:

@misc{alyahya2025zerosumevalextensibleframeworkscaling,
      title={ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition},
      author={Hisham A. Alyahya and Haidar Khan and Yazeed Alnumay and M Saiful Bari and Bülent Yener},
      year={2025},
      eprint={2503.10673},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.10673},
}

Contributing

Contributions to ZeroSumEval are welcome! Please follow the contribution guidelines and open a pull request or issue on the GitHub repository.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

ZeroSumEval collects model outputs in its logs for analysis and evaluation purposes. Each model's outputs are subject to the terms and conditions of the model's license and should be used in accordance with those terms.

