A framework for evaluating LLMs using zero-sum multiplayer simulations
ZeroSumEval (ZSEval) is a framework for evaluating the reasoning abilities of Large Language Models (LLMs) using zero-sum multiplayer simulations. ZSEval uses DSPy for automatic prompt optimization to ensure evaluations are fair.
Table of Contents
- Overview
- Project Structure
- Installation
- Usage
- Games
- Configuration
- Contributing
- License
Overview
ZeroSumEval aims to create a robust evaluation framework for LLMs using competitive scenarios. Instead of fixed evaluation benchmarks or model-based judging, ZSEval uses multiplayer simulations/games with clear win conditions to pit models against each other.
The framework tests various model capabilities, including knowledge, reasoning, and planning. In addition, ZSEval uses DSPy optimization to test the self-improvement capability of models and ensure the competition between models is fair.
The eval suite consists of a growing number of simulations, including text-based challenges, board games, and Capture The Flag (CTF) competitions.
Key features:
- One-click evals on the existing suite of games
- Easily extendable abstractions for new game implementations
- Integration with DSPy for automated prompt optimization
- Comprehensive logging and analysis tools
Project Structure
The project is organized as follows:
- zero_sum_eval/: Main package containing the core framework
  - games/: Individual game implementations
  - managers/: Game and match management classes
- data/: Game-specific data and examples
- configs/: Configuration files for different games and scenarios
- run_game.py: Script to run individual games
- run_matches.py: Script to run a series of matches
Installation
1. Clone the repository:

   ```shell
   git clone https://github.com/your-username/ZeroSumEval.git
   cd ZeroSumEval
   ```

2. Install the required dependencies:

   ```shell
   pip install -r requirements.txt
   ```
Usage
To run a game:

```shell
python run_game.py -c configs/chess.yaml
```

To run a series of matches:

```shell
python run_matches.py -c configs/mathquiz.yaml
```
Games
ZeroSumEval currently supports the following games:
- Chess
- Math Quiz
- Gandalf (Password Guessing)
- PyJail (Capture The Flag)
Each game is implemented as a separate module in the zero_sum_eval/games/ directory.
Configuration
Game configurations are defined in YAML files located in the configs/ directory. These files specify:
- Logging settings
- Game parameters
- Player configurations
- LLM settings
Example Configuration (chess.yaml):

```yaml
logging:
  output_dir: ../output/chess_game
manager:
  args:
    max_rounds: 200
    win_conditions:
      - Checkmate
    draw_conditions:
      - Stalemate
      - ThreefoldRepetition
      - FiftyMoveRule
      - InsufficientMaterial
game:
  name: chess
  players:
    - name: chess_player
      args:
        id: gpt4 white
        roles:
          - White
        optimize: false
        dataset: chess_dataset
        dataset_args:
          filename: ./data/chess/stockfish_examples.jsonl
          role: White
        optimizer: MIPROv2
        optimizer_args:
          num_candidates: 5
          minibatch_size: 20
          minibatch_full_eval_steps: 10
        compilation_args:
          max_bootstrapped_demos: 1
          max_labeled_demos: 1
        metric: chess_move_validation_metric
        lm:
          type: AzureOpenAI
          args:
            api_base: https://allam-swn-gpt-01.openai.azure.com/
            api_version: 2023-07-01-preview
            deployment_id: gpt-4o-900ptu
            max_tokens: 800
            temperature: 0.8
            top_p: 0.95
            frequency_penalty: 0
            presence_penalty: 0
        max_tries: 5
    - name: chess_player
      args:
        id: gpt4 black
        roles:
          - Black
        optimize: false
        dataset: chess_dataset
        dataset_args:
          filename: ./data/chess/stockfish_examples.jsonl
          role: Black
        optimizer: MIPROv2
        optimizer_args:
          num_candidates: 5
          minibatch_size: 20
          minibatch_full_eval_steps: 10
        compilation_args:
          max_bootstrapped_demos: 1
          max_labeled_demos: 1
        metric: chess_move_validation_metric
        lm:
          type: AzureOpenAI
          args:
            api_base: https://allam-swn-gpt-01.openai.azure.com/
            api_version: 2023-07-01-preview
            deployment_id: gpt-4o-900ptu
            max_tokens: 800
            temperature: 0.8
            top_p: 0.95
            frequency_penalty: 0
            presence_penalty: 0
        max_tries: 5
```
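Once parsed (for example with a YAML loader such as PyYAML), a config like this is just nested dictionaries and lists. The sketch below navigates a trimmed-down dict mirroring the structure above to pull out the pieces a runner would need; the `summarize` helper is illustrative, not part of ZeroSumEval's API.

```python
# Minimal sketch: navigating a parsed chess.yaml config, represented here as
# the plain dict a YAML loader would produce. Only a subset of keys from the
# example configuration is included; `summarize` is a hypothetical helper.

config = {
    "manager": {
        "args": {"max_rounds": 200, "win_conditions": ["Checkmate"]},
    },
    "game": {
        "name": "chess",
        "players": [
            {"name": "chess_player",
             "args": {"id": "gpt4 white", "roles": ["White"], "max_tries": 5}},
            {"name": "chess_player",
             "args": {"id": "gpt4 black", "roles": ["Black"], "max_tries": 5}},
        ],
    },
}


def summarize(cfg: dict) -> tuple[str, list[str], int]:
    """Return (game name, all player roles, round cap) from a config dict."""
    game = cfg["game"]
    roles = [role for player in game["players"] for role in player["args"]["roles"]]
    return game["name"], roles, cfg["manager"]["args"]["max_rounds"]
```

Keeping each player's dataset, optimizer, and LM settings in its own block is what lets the two sides of a match run different models or optimization strategies from a single file.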
Contributing
Contributions to ZeroSumEval are welcome! Please open a pull request with your proposed changes.
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.