A framework for evaluating LLMs using zero-sum multiplayer simulations

Project description

ZeroSumEval is a framework for evaluating the reasoning abilities of Large Language Models (LLMs) using zero-sum multiplayer simulations. ZSEval uses DSPy for automatic prompt optimization to ensure evaluations are fair.

Overview

ZeroSumEval aims to create a robust evaluation framework for LLMs using competitive scenarios. Instead of fixed evaluation benchmarks or model-based judging, ZSEval uses multiplayer simulations/games with clear win conditions to pit models against each other.

The framework tests various model capabilities, including knowledge, reasoning, and planning. In addition, ZSEval uses DSPy optimization to test the self-improvement capability of models and ensure the competition between models is fair.

The eval suite consists of a growing number of simulations, including text-based challenges, board games, and Capture The Flag (CTF) competitions.

Key features:

  • One-click evals on the existing suite of games
  • Easily extendable abstractions for new game implementations
  • Integration with DSPy for automated prompt optimization
  • Comprehensive logging and analysis tools
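To make the "easily extendable abstractions" point concrete, here is a minimal sketch of the kind of game a new implementation would model: a zero-sum two-player game with legal moves and a clear win condition. The class below is purely illustrative (a toy Nim variant) and does not use the framework's actual base classes, whose names and signatures may differ.

```python
from dataclasses import dataclass, field

@dataclass
class NimState:
    """Toy zero-sum game: players alternately take 1-3 stones; taking the last stone wins."""
    stones: int = 10
    turn: int = 0                                   # index of the player to move
    history: list = field(default_factory=list)     # (player, move) pairs

    def legal_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def apply(self, move):
        assert move in self.legal_moves(), f"illegal move: {move}"
        self.stones -= move
        self.history.append((self.turn, move))
        if self.stones > 0:
            self.turn = 1 - self.turn               # pass the turn

    def winner(self):
        # The player who took the last stone (still the current `turn`) wins.
        return self.turn if self.stones == 0 else None

# Play out a game with a trivial greedy "player" for both sides.
state = NimState()
while state.winner() is None:
    state.apply(max(state.legal_moves()))
print(state.winner())  # → 1
```

In the framework, an LLM player would replace the greedy move selection, and the manager would enforce the win/draw conditions declared in the game's config.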

Project Structure

The project is organized as follows:

  • zero_sum_eval/: Main package containing the core framework
    • games/: Individual game implementations
    • managers/: Game and match management classes
  • data/: Game-specific data and examples
  • configs/: Configuration files for different games and scenarios
  • run_game.py: Script to run individual games
  • run_matches.py: Script to run a series of matches

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/ZeroSumEval.git
    cd ZeroSumEval
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    

Usage

To run a game:

python run_game.py -c configs/chess.yaml

To run a series of matches:

python run_matches.py -c configs/mathquiz.yaml

Games

ZeroSumEval currently supports the following games:

  1. Chess
  2. Math Quiz
  3. Gandalf (Password Guessing)
  4. PyJail (Capture The Flag)

Each game is implemented as a separate module in the zero_sum_eval/games/ directory.

Configuration

Game configurations are defined in YAML files located in the configs/ directory. These files specify:

  • Logging settings
  • Game parameters
  • Player configurations
  • LLM settings
Example Configuration (chess.yaml):

logging:
  output_dir: ../output/chess_game
manager:
  args:
    max_rounds: 200
    win_conditions: 
      - Checkmate
    draw_conditions:
      - Stalemate
      - ThreefoldRepetition
      - FiftyMoveRule
      - InsufficientMaterial
game:
  name: chess
  players:
    - name: chess_player
      args:
        id: gpt4 white
        roles: 
          - White
        optimize: false
        dataset: chess_dataset
        dataset_args:
          filename: ./data/chess/stockfish_examples.jsonl
          role: White
        optimizer: MIPROv2
        optimizer_args:
          num_candidates: 5
          minibatch_size: 20
          minibatch_full_eval_steps: 10
        compilation_args:
          max_bootstrapped_demos: 1
          max_labeled_demos: 1
        metric: chess_move_validation_metric
        lm:
          type: AzureOpenAI
          args:
            api_base: https://allam-swn-gpt-01.openai.azure.com/
            api_version: 2023-07-01-preview
            deployment_id: gpt-4o-900ptu
            max_tokens: 800
            temperature: 0.8
            top_p: 0.95
            frequency_penalty: 0
            presence_penalty: 0
        max_tries: 5
    - name: chess_player
      args:
        id: gpt4 black
        roles: 
          - Black
        optimize: false
        dataset: chess_dataset
        dataset_args:
          filename: ./data/chess/stockfish_examples.jsonl
          role: Black
        optimizer: MIPROv2
        optimizer_args:
          num_candidates: 5
          minibatch_size: 20
          minibatch_full_eval_steps: 10
        compilation_args:
          max_bootstrapped_demos: 1
          max_labeled_demos: 1
        metric: chess_move_validation_metric
        lm:
          type: AzureOpenAI
          args:
            api_base: https://allam-swn-gpt-01.openai.azure.com/
            api_version: 2023-07-01-preview
            deployment_id: gpt-4o-900ptu
            max_tokens: 800
            temperature: 0.8
            top_p: 0.95
            frequency_penalty: 0
            presence_penalty: 0
        max_tries: 5
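These configs are standard YAML, so a quick way to inspect one is with PyYAML (`pip install pyyaml`; the framework's own config loader may do additional validation). The snippet below parses a trimmed-down version of the config above, inlined as a string for self-containment:

```python
import yaml  # PyYAML

# Trimmed-down excerpt of the chess.yaml example above.
config_text = """
game:
  name: chess
  players:
    - name: chess_player
      args:
        id: gpt4 white
        roles: [White]
        optimizer: MIPROv2
"""

config = yaml.safe_load(config_text)
print(config["game"]["name"])                          # → chess
print(config["game"]["players"][0]["args"]["roles"])   # → ['White']
```

`yaml.safe_load` is preferable to `yaml.load` here since it refuses to construct arbitrary Python objects from the file.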

Contributing

Contributions to ZeroSumEval are welcome! Please open a pull request.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.


Download files

Download the file for your platform.

Source Distribution

zero_sum_eval-0.1.0.tar.gz (45.4 kB)

Built Distribution

zero_sum_eval-0.1.0-py3-none-any.whl (60.4 kB)

File details

Details for the file zero_sum_eval-0.1.0.tar.gz.

File metadata

  • Download URL: zero_sum_eval-0.1.0.tar.gz
  • Upload date:
  • Size: 45.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for zero_sum_eval-0.1.0.tar.gz:

  • SHA256: e7ace94fb9f795931d73993337a0aa2a3b8e952cfd125d9fd91435e9a4476777
  • MD5: ee5077d8165c3d81196d9f4aedc8cbf7
  • BLAKE2b-256: 857adb279e062c67d6c82299c4dd55e37de77c9be139056134d5170ff6e3dfd9

File details

Details for the file zero_sum_eval-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: zero_sum_eval-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 60.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for zero_sum_eval-0.1.0-py3-none-any.whl:

  • SHA256: e8cea705870bd610041f12d6844a094c8000639fee7518a6bbeb2b91da917f60
  • MD5: fab7f651b67ddcd4a1beb512a30e35f2
  • BLAKE2b-256: 930aa2113a41c45555141843002b62f7ab3c877510a7232949635a36ca28d154
