ATP plugin for game-theoretic agent evaluation
atp-games
Overview
atp-games bridges the standalone game-environments library with the ATP Platform, enabling game-theoretic evaluation of AI agents through the standard ATP testing pipeline. It provides:
- GameRunner -- orchestrates multi-agent game execution via ATP protocol
- Protocol mapping -- converts game observations to ATP requests and responses back to game actions
- Game-theoretic evaluators -- payoff analysis, exploitability, cooperation metrics, equilibrium distance
- YAML game suites -- declarative game evaluation definitions
- Tournament & cross-play -- round-robin, elimination brackets, agent comparison matrices
Installation
cd atp-games
uv sync
# Or install as a dependency from the local checkout
uv add ./atp-games
Dependencies
- atp-platform (parent package)
- game-environments (game library)
- numpy (for Nash solver and exploitability analysis)
Quick Start
Run a built-in game suite
# Evaluate two strategies on Prisoner's Dilemma
uv run atp test --suite=game:prisoners_dilemma.yaml
Programmatic usage
import asyncio

from game_envs import PrisonersDilemma, PDConfig, TitForTat, AlwaysDefect
from atp_games import (
    GameRunner, GameRunConfig, BuiltinAdapter,
)


async def main():
    # Create game
    game = PrisonersDilemma(PDConfig(num_rounds=50))

    # Wrap strategies as ATP-compatible adapters
    agents = {
        "player_0": BuiltinAdapter(TitForTat()),
        "player_1": BuiltinAdapter(AlwaysDefect()),
    }

    # Run evaluation
    runner = GameRunner()
    result = await runner.run_game(
        game=game,
        agents=agents,
        config=GameRunConfig(episodes=20, base_seed=42),
    )

    # Analyze results
    print(f"Episodes: {result.num_episodes}")
    print(f"Average payoffs: {result.average_payoffs}")
    for stat in result.player_statistics():
        print(
            f" {stat.player_id}: "
            f"mean={stat.mean:.2f} "
            f"95% CI=[{stat.ci_lower:.2f}, {stat.ci_upper:.2f}]"
        )

    # Compare agents (Welch's t-test)
    for cmp in result.agent_comparisons():
        print(
            f" {cmp.player_a} vs {cmp.player_b}: "
            f"p={cmp.p_value:.4f} "
            f"{'significant' if cmp.is_significant else 'not significant'}"
        )


asyncio.run(main())
Game Suite YAML Format
Game suites define complete evaluation scenarios in YAML:
type: game_suite
name: PD Cooperation Test
version: "1.0"
game:
  type: prisoners_dilemma
  variant: repeated # "one_shot" or "repeated"
  config:
    num_rounds: 100
    noise: 0.0
    discount_factor: 1.0
agents:
  - name: my_agent
    adapter: http
    endpoint: ${AGENT_ENDPOINT} # Variable substitution for CI
  - name: baseline_tft
    adapter: builtin
    strategy: tit_for_tat
evaluation:
  episodes: 50
  metrics:
    - type: average_payoff
      weight: 1.0
    - type: exploitability
      weight: 0.5
      config:
        epsilon: 0.15
    - type: cooperation
      weight: 0.5
  thresholds:
    average_payoff:
      min: 1.0
reporting:
  strategy_profile: true
  payoff_matrix: true
  round_by_round: true
  export_formats:
    - json
    - csv
YAML Reference
game section
| Field | Type | Description |
|---|---|---|
| type | string | Game name from registry (prisoners_dilemma, auction, colonel_blotto, congestion, public_goods) |
| variant | string | "one_shot" or "repeated" |
| config | dict | Game-specific config (passed to game constructor) |
agents section
Each agent entry:
| Field | Type | Description |
|---|---|---|
| name | string | Display name |
| adapter | string | "builtin", "http", "cli", "docker" |
| strategy | string | For builtin adapter: strategy name from registry |
| endpoint | string | For http adapter: URL |
| config | dict | Additional adapter configuration |
evaluation section
| Field | Type | Description |
|---|---|---|
| episodes | int | Number of game episodes to run |
| metrics | list | Evaluator metrics to compute |
| thresholds | dict | Pass/fail thresholds per metric |
Metric types: average_payoff, exploitability, cooperation, equilibrium.
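The thresholds field maps a metric name to bounds that decide pass or fail. How atp-games applies them internally is not documented here, so the following is only a sketch of the idea: `check_thresholds` is a hypothetical helper, and support for a `max` bound alongside `min` is an assumption.

```python
def check_thresholds(values, thresholds):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for metric, bounds in thresholds.items():
        value = values[metric]
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{metric}={value} is below min {bounds['min']}")
        # Assumption: a "max" bound is handled symmetrically.
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{metric}={value} is above max {bounds['max']}")
    return failures


# Mirrors the `thresholds` block of the example suite.
thresholds = {"average_payoff": {"min": 1.0}}
print(check_thresholds({"average_payoff": 2.4}, thresholds))
print(check_thresholds({"average_payoff": 0.7}, thresholds))
```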
Variable substitution
Use ${VAR_NAME} for environment variable substitution (useful for CI):
agents:
  - name: my_agent
    adapter: http
    endpoint: ${AGENT_ENDPOINT}
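As a rough illustration of what `${VAR_NAME}` substitution involves, the sketch below expands placeholders in raw suite text before parsing. The `substitute` helper is hypothetical, and raising an error for an undefined variable is an assumption; the loader may behave differently.

```python
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(text, env):
    """Replace each ${NAME} in `text` with env[NAME]."""

    def repl(match):
        name = match.group(1)
        if name not in env:
            # Assumption: an undefined variable is an error, not an empty string.
            raise KeyError(f"undefined variable: {name}")
        return env[name]

    return _VAR.sub(repl, text)


line = "endpoint: ${AGENT_ENDPOINT}"
print(substitute(line, {"AGENT_ENDPOINT": "http://localhost:8000"}))
```

In a CI job, `os.environ` would be passed as `env`.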
Suite inheritance
Extend a base suite:
extends: base_pd.yaml
evaluation:
  episodes: 100 # Override episode count
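One plausible way to resolve `extends` is a recursive dictionary merge in which the child suite wins on conflicts. The sketch below assumes that behavior; `deep_merge` is a hypothetical helper, and replacing lists wholesale (rather than concatenating them) is also an assumption.

```python
def deep_merge(base, override):
    """Return a new dict: `override` on top of `base`, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the child replace the base value.
            merged[key] = value
    return merged


base = {
    "game": {"type": "prisoners_dilemma"},
    "evaluation": {"episodes": 50, "metrics": [{"type": "average_payoff"}]},
}
child = {"evaluation": {"episodes": 100}}
print(deep_merge(base, child))
```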
Evaluators
Four game-theoretic evaluators integrate with the ATP scoring pipeline:
PayoffEvaluator
Evaluates game outcomes based on payoff metrics.
Checks:
- Average payoff per player (with min/max thresholds)
- Payoff distribution (min, max, median, percentiles)
- Social welfare (sum of average payoffs)
- Pareto efficiency
metrics:
  - type: average_payoff
    weight: 1.0
    config:
      min_payoff:
        player_0: 2.0
      min_social_welfare: 4.0
      pareto_check: true
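Social welfare and Pareto efficiency are simple to state in code. The following sketch uses the standard definitions with the Prisoner's Dilemma payoffs that appear elsewhere in this README; the function names are hypothetical and do not reflect the evaluator's internals.

```python
def social_welfare(average_payoffs):
    """Sum of the players' average payoffs."""
    return sum(average_payoffs.values())


def is_pareto_efficient(outcome, candidates):
    """True if no candidate is at least as good for everyone and better for someone."""
    for other in candidates:
        no_worse = all(o >= x for o, x in zip(other, outcome))
        better = any(o > x for o, x in zip(other, outcome))
        if no_worse and better:
            return False
    return True


# Payoff pairs (player_0, player_1) for the four one-shot PD outcomes.
outcomes = [(3, 3), (0, 5), (5, 0), (1, 1)]
print(social_welfare({"player_0": 3.0, "player_1": 3.0}))
print([o for o in outcomes if is_pareto_efficient(o, outcomes)])
```

Mutual defection, (1, 1), is the only outcome here that is not Pareto efficient, because (3, 3) is better for both players.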
ExploitabilityEvaluator
Measures how exploitable an agent's strategy is.
Checks:
- Per-player exploitability (best-response payoff gap)
- Total exploitability
- Empirical strategy extraction
metrics:
  - type: exploitability
    weight: 0.5
    config:
      epsilon: 0.15 # Max exploitability for pass
      payoff_matrix_1: [[3, 0], [5, 1]]
      payoff_matrix_2: [[3, 5], [0, 1]]
      action_names_1: ["cooperate", "defect"]
      action_names_2: ["cooperate", "defect"]
A Nash equilibrium strategy has exploitability of approximately 0. A dominated strategy (e.g., AlwaysCooperate in PD) has high exploitability.
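To make the best-response payoff gap concrete, the sketch below computes it for a two-player matrix game using the payoff matrices from the config above. It uses one common definition (each player's gain from switching to a best response while the opponent's strategy stays fixed); whether ExploitabilityEvaluator uses exactly this definition is an assumption, and the function names are hypothetical. Here `m1[i][j]` and `m2[i][j]` are the payoffs to players 1 and 2 when player 1 plays action `i` and player 2 plays action `j`.

```python
def expected_payoff(matrix, x, y):
    """Expected payoff when player 1 mixes with x and player 2 mixes with y."""
    return sum(
        x[i] * matrix[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
    )


def best_response_gaps(m1, m2, x, y):
    """Return (gap_1, gap_2): what each player gains by deviating to a best response."""
    best_1 = max(sum(m1[i][j] * y[j] for j in range(len(y))) for i in range(len(x)))
    best_2 = max(sum(x[i] * m2[i][j] for i in range(len(x))) for j in range(len(y)))
    return best_1 - expected_payoff(m1, x, y), best_2 - expected_payoff(m2, x, y)


m1 = [[3, 0], [5, 1]]
m2 = [[3, 5], [0, 1]]
cooperate, defect = [1.0, 0.0], [0.0, 1.0]

print(best_response_gaps(m1, m2, defect, defect))        # mutual defection
print(best_response_gaps(m1, m2, cooperate, cooperate))  # mutual cooperation
```

Mutual defection yields gaps of zero for both players, consistent with it being the Nash equilibrium; mutual cooperation yields a gap of 2 for each player.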
CooperationEvaluator
Measures cooperative behavior patterns.
Checks:
- Cooperation rate per player (with thresholds)
- Conditional cooperation: P(C|C) and P(C|D)
- Reciprocity index (cooperation correlation between players)
metrics:
  - type: cooperation
    weight: 0.5
    config:
      min_cooperation_rate:
        player_0: 0.6
      min_reciprocity: 0.3
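The cooperation rate and the conditional probabilities P(C|C) and P(C|D) can be estimated directly from action histories. The sketch below assumes actions are recorded as "C" or "D" per round and conditions on the opponent's previous-round action; both the encoding and the helper names are assumptions for illustration.

```python
def cooperation_rate(actions):
    """Fraction of rounds in which the player cooperated."""
    return sum(a == "C" for a in actions) / len(actions)


def conditional_cooperation(own, opponent, given):
    """Estimate P(own plays C at round t | opponent played `given` at round t-1)."""
    following = [own[t] for t in range(1, len(own)) if opponent[t - 1] == given]
    if not following:
        return None  # the condition never occurred
    return sum(a == "C" for a in following) / len(following)


# Tit-for-tat against an opponent that alternates C, D, C, D, ...
opponent = ["C", "D", "C", "D", "C", "D"]
tit_for_tat = ["C", "C", "D", "C", "D", "C"]

print(cooperation_rate(tit_for_tat))
print(conditional_cooperation(tit_for_tat, opponent, "C"))
print(conditional_cooperation(tit_for_tat, opponent, "D"))
```

For tit-for-tat, P(C|C) is 1 and P(C|D) is 0, which is the signature of a fully reciprocal strategy.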
EquilibriumEvaluator
Measures proximity to Nash equilibrium.
Checks:
- L1 distance to nearest Nash equilibrium
- Equilibrium classification (pure/mixed)
- Convergence detection over time
metrics:
  - type: equilibrium
    weight: 0.5
    config:
      max_nash_distance: 0.5
      convergence_window: 20
      convergence_threshold: 0.1
      payoff_matrix_1: [[3, 0], [5, 1]]
      payoff_matrix_2: [[3, 5], [0, 1]]
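The L1 distance to the nearest equilibrium is the sum of absolute differences between an empirical mixed strategy and the closest known equilibrium strategy. The sketch below shows that calculation plus one possible convergence test; the convergence rule (every strategy in the trailing window stays within the threshold of the most recent one) is an assumption, as are the function names.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_equilibrium_distance(strategy, equilibria):
    """Smallest L1 distance from `strategy` to any listed equilibrium."""
    return min(l1_distance(strategy, eq) for eq in equilibria)


def has_converged(history, window, threshold):
    """Assumed rule: the last `window` strategies all lie within `threshold` of the latest."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(l1_distance(s, recent[-1]) <= threshold for s in recent)


# Probabilities over (cooperate, defect); always-defect is the PD equilibrium.
equilibria = [[0.0, 1.0]]
print(nearest_equilibrium_distance([0.2, 0.8], equilibria))
print(has_converged([[0.5, 0.5], [0.1, 0.9], [0.05, 0.95], [0.05, 0.95]], 3, 0.1))
```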
Tournament Mode
Round-Robin
Every agent plays every other agent:
from atp_games import GameRunConfig, run_round_robin

# Run inside an async function; game and the adapters are created as in Quick Start
result = await run_round_robin(
    game=game,
    agents={"tft": tft_adapter, "allc": allc_adapter, "alld": alld_adapter},
    config=GameRunConfig(episodes=20),
)
print(result.standings) # Sorted by total payoff
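A round-robin schedule is the set of all unordered agent pairs, and standings are the agents sorted by total payoff. The sketch below illustrates both with the standard library; the helper names and the payoff totals are made up for the example and are not taken from `run_round_robin`.

```python
from itertools import combinations


def round_robin_pairs(names):
    """Every unordered pair of distinct agents, each listed once."""
    return list(combinations(names, 2))


def standings(total_payoffs):
    """Agents sorted by total payoff, highest first."""
    return sorted(total_payoffs.items(), key=lambda item: item[1], reverse=True)


pairs = round_robin_pairs(["tft", "allc", "alld"])
print(pairs)

# Hypothetical totals accumulated over all matches.
print(standings({"tft": 41.0, "allc": 30.0, "alld": 47.0}))
```

With n agents this produces n(n-1)/2 matches, so three agents play three matches.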
Single Elimination
from atp_games import run_single_elimination

result = await run_single_elimination(
    game=game,
    agents=agents,
    config=config,
)
print(result.bracket)
print(result.winner)
Double Elimination
from atp_games import run_double_elimination

result = await run_double_elimination(
    game=game,
    agents=agents,
    config=config,
)
Cross-Play Matrix
Run every agent pair (including self-play) and generate a payoff heatmap:
from atp_games import run_cross_play

result = await run_cross_play(
    game=game,
    agents=agents,
    config=config,
)
# result contains per-pair payoff statistics
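Unlike round-robin, cross-play also pits each agent against itself, so the result forms a full square matrix. The sketch below builds such a matrix for two fixed-action agents using one-shot Prisoner's Dilemma payoffs; `cross_play_matrix`, `play`, and the use of ordered pairs are illustrative assumptions rather than the behavior of `run_cross_play`.

```python
from itertools import product

# One-shot Prisoner's Dilemma payoffs for the row player.
ROW_PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
FIXED_ACTION = {"allc": "C", "alld": "D"}  # hypothetical fixed-action agents


def cross_play_matrix(names, play):
    """Evaluate every ordered pair, including each agent against itself."""
    return {(a, b): play(a, b) for a, b in product(names, repeat=2)}


def play(a, b):
    """Row player's payoff when agent `a` meets agent `b`."""
    return ROW_PAYOFF[(FIXED_ACTION[a], FIXED_ACTION[b])]


names = ["allc", "alld"]
matrix = cross_play_matrix(names, play)
for a in names:
    print(a, [matrix[(a, b)] for b in names])
```

Each printed row is one row of the payoff heatmap: the row agent's payoff against every column agent.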
Stress Testing
Test agent robustness against best-response oracles:
from atp_games import run_stress_test

result = await run_stress_test(
    game=game,
    agent=agent_adapter,
    config=config,
)
print(f"Exploitability under stress: {result.exploitability}")
Architecture
atp_games/
├── models.py # GameResult, EpisodeResult, PlayerStats, comparisons
├── plugin.py # ATP plugin registration
├── mapping/
│ ├── observation_mapper.py # Observation → ATPRequest
│ └── action_mapper.py # ATPResponse → GameAction
├── runner/
│ ├── game_runner.py # GameRunner orchestrator
│ ├── action_validator.py # Validation with retry logic
│ └── builtin_adapter.py # Wraps Strategy as ATP adapter
├── evaluators/
│ ├── payoff_evaluator.py
│ ├── exploitability_evaluator.py
│ ├── cooperation_evaluator.py
│ └── equilibrium_evaluator.py
└── suites/
    ├── models.py # GameSuiteConfig, GameAgentConfig
    ├── game_suite_loader.py # YAML parser with inheritance
    ├── schema.py # JSON Schema validation
    ├── tournament.py # Round-robin, elimination
    ├── cross_play.py # Agent comparison matrix
    ├── stress_test.py # Adversarial testing
    └── builtin/ # Built-in suite YAMLs
        ├── prisoners_dilemma.yaml
        └── auction_battery.yaml
Data Flow
YAML Suite → GameSuiteLoader → Game + Agents (from registries)
    ↓
GameRunner.run_game()
    ↓
Per-Episode Loop:
    Game.observe() → Observation
    ObservationMapper → ATPRequest
    AgentAdapter.execute() → ATPResponse
    ActionMapper → GameAction
    ActionValidator → validated action
    Game.step(actions) → StepResult
    ↓
GameResult (aggregated)
    ↓
Evaluators → EvalResult → Score
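The per-episode loop can be pictured with a toy, synchronous stand-in. The sketch below does not use the atp-games or game-environments APIs: every function is a hypothetical placeholder for the component named in its docstring, requests and responses are plain dicts, and falling back to "D" after repeated invalid actions is an assumption.

```python
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
VALID_ACTIONS = {"C", "D"}


def observation_to_request(player_id, history):
    """Stand-in for ObservationMapper: wrap the observation in a request dict."""
    return {"player": player_id, "history": list(history)}


def response_to_action(response):
    """Stand-in for ActionMapper: pull the action out of the response dict."""
    return response.get("action")


def validated_action(agent, request, max_retries=2, fallback="D"):
    """Stand-in for ActionValidator: retry on invalid actions, then fall back."""
    for _ in range(max_retries + 1):
        action = response_to_action(agent(request))
        if action in VALID_ACTIONS:
            return action
    return fallback  # assumption: a default action after exhausting retries


def run_episode(agents, num_rounds):
    """Play one episode and return each player's total payoff."""
    history, totals = [], {pid: 0 for pid in agents}
    for _ in range(num_rounds):
        actions = {
            pid: validated_action(agent, observation_to_request(pid, history))
            for pid, agent in agents.items()
        }
        payoffs = PAYOFFS[(actions["player_0"], actions["player_1"])]
        for pid, payoff in zip(("player_0", "player_1"), payoffs):
            totals[pid] += payoff
        history.append(actions)
    return totals


def tit_for_tat(request):
    """Cooperate first, then copy the opponent's previous action."""
    if not request["history"]:
        return {"action": "C"}
    opponent = "player_1" if request["player"] == "player_0" else "player_0"
    return {"action": request["history"][-1][opponent]}


def always_defect(request):
    return {"action": "D"}


print(run_episode({"player_0": tit_for_tat, "player_1": always_defect}, num_rounds=5))
# → {'player_0': 4, 'player_1': 9}
```

Tit-for-tat is exploited only in the first round (0 vs 5); the remaining four rounds are mutual defection at 1 point each.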
Development
cd atp-games
# Install dev dependencies
uv sync --group dev
# Run tests
uv run pytest tests/ -v --cov=atp_games
# Format and lint
uv run ruff format .
uv run ruff check .
License
MIT License -- see the parent project's LICENSE for details.