The game console for AI agents: behavioral research through game scenarios
AgentDeck 🎮
The game console for AI agents.
A research platform for analyzing AI agent behavior through game scenarios.
🎯 Purpose & Vision
AgentDeck helps you turn a behavioral question into a concrete study: define a game or reuse an existing one, run seeded matches across models and controllers, replay every decision, and export artifacts you can validate and compare.
It is useful when static prompt-response evaluation is not enough. By putting agents inside structured games, AgentDeck makes state, incentives, and resource tradeoffs explicit so behavior is easier to observe, compare, replay, and explain.
🚦 Current Capabilities
AgentDeck currently supports:
- Core match execution through the `AgentDeck` facade
- Provider-backed and mock-player experiments
- Recording, replay, and event-driven observability
- Native fairness controls for paired side-swap and explicit first-player policies
- Matrix-based research packages
- Research export, artifact validation, behavioral profiles, and post-hoc analysis workflows
- Curated replay/viewer workflows for selected studies
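The paired side-swap control listed above can be pictured with a small scheduling sketch. This is not AgentDeck's implementation: `paired_side_swap_schedule` and the tuple layout are hypothetical names chosen only to illustrate the idea that each seed is played twice, once per seat order.

```python
from typing import List, Tuple


def paired_side_swap_schedule(
    player_a: str, player_b: str, base_seed: int, pairs: int
) -> List[Tuple[int, str, str]]:
    """Return (seed, first_player, second_player) entries.

    Hypothetical helper: every seed appears twice, once per seat order,
    so any first-mover advantage is shared evenly between the players.
    """
    schedule = []
    for offset in range(pairs):
        seed = base_seed + offset
        schedule.append((seed, player_a, player_b))  # A moves first
        schedule.append((seed, player_b, player_a))  # same seed, seats swapped
    return schedule


schedule = paired_side_swap_schedule("SameModel-AO", "SameModel-RC", base_seed=42, pairs=2)
for seed, first, second in schedule:
    print(seed, first, "vs", second)
```

Comparing results within each seed pair separates seat effects from agent effects, which is the purpose of a side-swap design.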
Spec-Driven and AI-First by Design
AgentDeck is human-led and AI-written: a codebase built with AI agents, designed for humans and AI agents, and validated through tests, replayable experiments, research artifacts, and blind QA rounds performed by autonomous agents.
Specs are the source of truth. They define intent, contracts, boundaries, and expected behavior. Code, tests, docs, examples, and research workflows derive from that specification layer and are validated through execution.
AgentDeck is therefore designed to be legible to both humans and AI agents, treating AI agents as first-class users, contributors, evaluators, and research operators.
🎮 Why Games?
Most LLM benchmarks measure knowledge through static questions. AgentDeck focuses on behavior: maintaining state, adapting over time, and making tradeoffs inside explicit rules.
Game scenarios work well because they make the important variables legible:
- Constrained environments – Isolate specific variables (for example, resource scarcity or turn order)
- Iterative decision making – Agents live with consequences, testing longer-horizon behavior
- Social dynamics – Multiplayer games reveal cooperation, betrayal, and negotiation patterns
- Measurable outcomes – Win/loss results provide a clean signal for cost/quality trade-offs
🔎 Flagship Evidence
The Agentic Edge study uses AgentDeck to test whether agent design can overcome model-tier gaps in sequential decision games.
In FixedDamage, the same lower-tier model moves from failure to a tier inversion as the agent wrapper changes:
| Agent configuration | Opponent | Result |
|---|---|---|
| FlashLite S0 action-only | GPT-4o-mini S0 action-only | 0/48 wins (0.0%) |
| FlashLite S1 reasoning controller | GPT-4o-mini S0 action-only | 34/48 wins (70.8%) |
| FlashLite S3 reasoning + HP grounding | GPT-4o-mini S0 action-only | 38/48 wins (79.2%) |
The VariableDamage transfer result is more cautious: the adapted risk-grounded stack wins its same-model mechanism test, but the cross-tier result is seat-sensitive and not statistically strong. That caveat is the point: AgentDeck is built to expose behavior, not hide messy evidence.
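The percentages in the FixedDamage table follow directly from the win counts, and a confidence interval helps judge how strong a 48-match result is. The sketch below recomputes the rates and adds a Wilson score interval. The Wilson interval is an assumption made for illustration; the source does not say which statistics the study itself reports.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, matches: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate."""
    p = wins / matches
    denom = 1 + z * z / matches
    center = (p + z * z / (2 * matches)) / denom
    half = z * math.sqrt(p * (1 - p) / matches + z * z / (4 * matches * matches)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Win counts copied from the FixedDamage table (48 matches per row).
rows = {
    "FlashLite S0 action-only": 0,
    "FlashLite S1 reasoning controller": 34,
    "FlashLite S3 reasoning + HP grounding": 38,
}
for label, wins in rows.items():
    low, high = wilson_interval(wins, 48)
    print(f"{label}: {wins / 48:.1%} (95% interval {low:.1%} to {high:.1%})")
```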
🚀 Quick Start
Install:
`pip install agentdeck-ai` (import as `agentdeck`)
AI-first prompt: Ask Claude, Codex, or your coding agent: “Learn AgentDeck from the README, create a tiny tic-tac-toe game, run a few matches, then analyze the recorded behavior.”
Installation
PyPI install (recommended):
# Latest release on PyPI
pip install agentdeck-ai
# With provider SDKs
# Quote the extras so shells such as zsh do not treat [] as a glob pattern
pip install "agentdeck-ai[openai]" # OpenAI SDK
pip install "agentdeck-ai[anthropic]" # Anthropic SDK
pip install "agentdeck-ai[google]" # Google Gen AI SDK (Vertex mode)
pip install "agentdeck-ai[providers]" # All provider SDKs
# With research stack (statistics/plotting)
pip install "agentdeck-ai[research]"
# Development install
pip install "agentdeck-ai[dev]"
Source install (for contributors):
git clone https://github.com/agentdeck/agentdeck.git
cd agentdeck
pip install -e ".[dev]"
Your First Experiment
from agentdeck import (
    ActionOnlyController,
    AgentDeck,
    FixedDamageGame,
    GPTPlayer,
    ReasoningController,
)
# 1. Create a game
game = FixedDamageGame(
    max_health=100,
    attack_damage=20,
    potion_heal=30,
    starting_potions=3,
    information_level="full",  # use "partial" to hide opponent HP/potions
)
# 2. Create AI players: same model, different behavioral interface
players = [
    GPTPlayer(
        name="SameModel-AO",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ActionOnlyController(),
    ),
    GPTPlayer(
        name="SameModel-RC",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ReasoningController(),
    ),
]
# Models must be provided explicitly for every provider-backed player.
# 3. Run experiment
with AgentDeck(game=game) as deck:
    results = deck.play(
        players=players,
        matches=1,
        seed=42,  # Reproducible!
    )
# 4. Analyze results
print(f"Win rates: {results.win_rates}")
🔒 Models are explicit
Provider-backed players never fall back to defaults; pass `model=` for every GPT/Claude/Gemini player.
ℹ️ Provider credentials
Set the provider-specific environment variables before running examples (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `VERTEX_PROJECT_ID`/`VERTEX_LOCATION` for Gemini). For Gemini on Vertex, AgentDeck also supports `GOOGLE_APPLICATION_CREDENTIALS_B64` for base64-encoded service-account JSON. Start from `.env.example` for local setup.
📝 `.env` loading policy
AgentDeck does not auto-load `.env` at the library level. Source it in your shell or load it in your entry script. In `bash`/`zsh`, a simple local setup is: `set -a; source .env; set +a`
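For the "load it in your entry script" option, a standard-library sketch is shown below. `load_env_file` is a hypothetical helper, not part of AgentDeck, and it handles only simple `KEY=VALUE` lines; a dedicated dotenv library may be a better fit for anything more elaborate.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = False) -> dict:
    """Load KEY=VALUE lines from a file into os.environ (hypothetical helper).

    Handles blank lines, '#' comments, an optional 'export ' prefix, and
    matching single or double quotes around the value. Nothing more.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        # Existing environment variables win unless override=True.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded


# Call this before constructing any provider-backed player.
# Only the variable names are printed, never the secret values.
print(sorted(load_env_file(".env")))
```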
✅ First real provider-backed run
Start with `matches=1` so you can confirm credentials, recordings, and replay before scaling up.
🎮 FixedDamageGame information level
`information_level="full"` shows both players' HP and potion counts. `information_level="partial"` hides the opponent's HP and potions while still showing last actions.
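The difference between the two levels can be sketched as a view filter. The state layout and the `build_view` function below are hypothetical and only mirror the behavior described above; they are not taken from `FixedDamageGame`.

```python
from typing import Any, Dict


def build_view(
    state: Dict[str, Any], viewer: str, information_level: str = "full"
) -> Dict[str, Any]:
    """Return what `viewer` may see (hypothetical state layout)."""
    if information_level not in ("full", "partial"):
        raise ValueError(f"unknown information_level: {information_level!r}")
    view: Dict[str, Any] = {"you": viewer, "players": {}}
    for name, info in state["players"].items():
        visible = {"last_action": info["last_action"]}  # shown at both levels
        if name == viewer or information_level == "full":
            visible["hp"] = info["hp"]
            visible["potions"] = info["potions"]
        view["players"][name] = visible
    return view


state = {
    "players": {
        "Alice": {"hp": 80, "potions": 2, "last_action": "ATTACK"},
        "Bob": {"hp": 60, "potions": 3, "last_action": "POTION"},
    }
}
print(build_view(state, "Alice", "partial"))
```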
Try AgentDeck Without API Keys
- Run `python examples/mock_demo.py`
- Uses `MockPlayer` (deterministic) so no LLM providers are needed
- Shows live reporting + progress + stats, and saves recordings under `agentdeck_runs/mock_demo/<session>/records/`
Recommended Learning Path
- `examples/mock_demo.py` - verify the install with a zero-provider run
- `examples/first_game_walkthrough.py` - build a tiny game and replay it
- `examples/minimal_experiment.py` - run the smallest real provider-backed experiment
- `examples/spectator_example.py` and `examples/replay_minimal.py` - add monitoring and replay workflows
For the full ladder, see examples/README.md.
Walkthroughs & Docs
- Build your first game + replay tour: `examples/first_game_walkthrough.py`
- Examples index: examples/README.md
- End-to-end study workflow: docs/how-to-run-a-study.md
- Package-owned behavioral scoring: keep `scripts/behavioral_scorer.py` in your research package and run `agentdeck-research-score` after export to populate the targeted `results.json.behavioral_profile` (`artifacts/<cell>/results.json` for matrix studies, top-level `results.json` for direct packages)
Artifacts (Recordings + Logs)
After you run a batch, AgentDeck writes artifacts under `agentdeck_runs/<session_id>/` (or your configured `run_dir`):
- `records/` contains a `batch_<batch_id>.json` summary plus one `match_*.json` per match
- `logs/` contains `info.log` and `debug.log` by default
Tip: open `batch_<batch_id>.json` first for the high-level batch summary, then open `match_*.json` for the full audit trail, replay source, prompts, raw responses, parsed actions, costs, and event timeline.
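A quick way to get oriented in a `records/` folder is to list each file with its top-level JSON keys, batch summary first. The sketch below uses only the standard library and makes no assumption about the record schema; `summarize_records` is a hypothetical helper, and the demo files it creates are placeholders, not real AgentDeck recordings.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def summarize_records(records_dir: str) -> Dict[str, List[str]]:
    """Map each record file name to the sorted top-level keys of its JSON."""
    summary: Dict[str, List[str]] = {}
    root = Path(records_dir)
    # Batch summaries first, then per-match records.
    for pattern in ("batch_*.json", "match_*.json"):
        for path in sorted(root.glob(pattern)):
            with path.open(encoding="utf-8") as handle:
                data = json.load(handle)
            summary[path.name] = sorted(data) if isinstance(data, dict) else []
    return summary


# Demo with placeholder files; point summarize_records() at a real
# records/ directory from one of your own sessions instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "batch_demo.json").write_text(json.dumps({"placeholder": True}))
    (Path(tmp) / "match_0001.json").write_text(json.dumps({"placeholder": True}))
    demo_summary = summarize_records(tmp)

for name, keys in demo_summary.items():
    print(name, keys)
```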
Parallel Execution (Workload-Dependent Speedups)
from agentdeck import AgentDeck, AgentDeckConfig, LogLevel
# Configure parallel execution with real-time monitoring
config = AgentDeckConfig(
    seed=42,
    concurrency=10,  # Run 10 matches in parallel
    log_level=LogLevel.INFO,
)
# Run 100 matches with automatic progress tracking
# (reuses `game` and `players` from the first experiment above)
with AgentDeck(game=game, session=config) as deck:
    results = deck.play(players=players, matches=100)
# ProgressMonitor is auto-attached when concurrency > 1 (unless monitors=[] is provided)
Performance depends on provider rate limits and workload. For a determinism + concurrency comparison, see
examples/test_parallel_execution.py.
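One common way to keep a seeded batch deterministic under concurrency is to derive an independent seed per match and collect results in submission order. The sketch below illustrates that idea with the standard library only; `match_seed`, `play_mock_match`, and `run_batch` are hypothetical, and the source does not state that AgentDeck schedules matches this way.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def match_seed(base_seed: int, match_index: int) -> int:
    """Derive one seed per match (hypothetical scheme)."""
    return base_seed * 1_000_003 + match_index


def play_mock_match(seed: int) -> str:
    """Stand-in for a match: the winner depends only on the seed."""
    rng = random.Random(seed)  # private RNG, no shared global state
    return rng.choice(["Alice", "Bob"])


def run_batch(base_seed: int, matches: int, concurrency: int) -> List[str]:
    seeds = [match_seed(base_seed, i) for i in range(matches)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() yields results in submission order, not completion order.
        return list(pool.map(play_mock_match, seeds))


serial = run_batch(base_seed=42, matches=20, concurrency=1)
parallel = run_batch(base_seed=42, matches=20, concurrency=10)
print("same results:", serial == parallel)
```

Because each match owns its random generator, the worker count changes wall-clock time but not outcomes. With real LLM players this holds only for game-level randomness, not for model outputs.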
🔬 Research Program
This repo ships release-facing benchmark packages, arc summaries, and a cross-game synthesis layer alongside the engine.
Start here:
- The Agentic Edge - Flagship study: strategy-stack effects, FixedDamage tier inversion, VariableDamage caveats, and public replay artifacts
- How To Run A Study - Supported end-to-end workflow for creating, running, exporting, and validating a study
Supporting arcs:
- FixedDamage Arc 1 - Deterministic flagship arc: diagnosis, intervention ladder, and final carry-forward stack
- VariableDamage Arc 1 - Uncertainty arc: risk-band metrics, transfer failures, and premium ceiling check
- Cross-Game Comparison 1 - What transferred, what broke, and why the metrics had to evolve
Deeper references:
- Research Guide - How experiment packages are organized
- Research Index - Registry of experiments and status
- Research Schema - Contract for manifests, results, and validation
- Research Templates - Boilerplate for new experiment packages
⚙️ Architecture
The Console Metaphor
AgentDeck follows a gaming console metaphor with clean separation of concerns:
┌─────────────────────────────────────┐
│ AgentDeck (Facade) │ ← You interact here
├─────────────────────────────────────┤
│ Console (Orchestrator) │ ← Manages lifecycle
├─────────────┬───────────────────────┤
│ Game │ EventBus │ ← Game logic + Events
├─────────────┼───────────────────────┤
│ Players │ Spectators │ ← AI agents + Observers
└─────────────┴───────────────────────┘
Single Turn Flow
Core Components
Games define rules and state
- Required properties: `instructions`, `allowed_actions`, `default_handshake_template`
- Core methods: `setup()`, `get_view()`, `update()`, `status()`
- State is JSON-serializable dicts (no complex objects)
- Examples: FixedDamageGame and ArchivistChoiceGame
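To make the shape of a game concrete, here is a toy sketch that uses the property and method names listed above. It deliberately does not subclass AgentDeck's game base class, and every signature, return value, and the game itself are assumptions made for illustration; consult the specs for the real contract.

```python
import json
from typing import Any, Dict, List


class CountdownGame:
    """Toy game: players alternately take 1 or 2; whoever reaches 0 wins.

    Hypothetical sketch. Method signatures are guesses, not AgentDeck's API.
    """

    instructions = "Reply with TAKE1 or TAKE2. The player who reaches 0 wins."
    allowed_actions = ["TAKE1", "TAKE2"]
    default_handshake_template = "Reply OK if you understand the rules."

    def setup(self, players: List[str], start: int = 5) -> Dict[str, Any]:
        # State is a plain dict so it can be recorded as JSON.
        return {"players": players, "remaining": start, "turn": 0, "winner": None}

    def get_view(self, state: Dict[str, Any], player: str) -> Dict[str, Any]:
        return {"you": player, "remaining": state["remaining"]}

    def update(self, state: Dict[str, Any], player: str, action: str) -> Dict[str, Any]:
        if action not in self.allowed_actions:
            raise ValueError(f"illegal action: {action!r}")
        take = 1 if action == "TAKE1" else 2
        remaining = max(0, state["remaining"] - take)
        new_state = dict(state, remaining=remaining, turn=state["turn"] + 1)
        if remaining == 0:
            new_state["winner"] = player
        return new_state

    def status(self, state: Dict[str, Any]) -> str:
        return "finished" if state["winner"] else "running"


game = CountdownGame()
state = game.setup(["Alice", "Bob"])
while game.status(state) == "running":
    player = state["players"][state["turn"] % 2]
    state = game.update(state, player, "TAKE2")
print(json.dumps(state))
```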
Players are AI agents making decisions
- Three-phase lifecycle: Handshake → Turn → Conclusion
- Built-in: `GPTPlayer`, `ClaudePlayer`, `GeminiPlayer`, `MockPlayer`
- Composable prompt templates via `PromptBuilder`
Controllers parse AI responses into actions
- `ActionOnlyController` - extracts single action token
- `ReasoningController` - extracts reasoning + action
- Handshake validation is built into the base `Controller` (default accepts exactly `OK`)
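The two parsing styles can be sketched with plain functions. The reply format `REASONING: ... ACTION: <token>` is an assumption, as are the function names; the built-in controllers may expect a different layout and normalize input differently.

```python
import re
from typing import Dict, Optional, Sequence

ALLOWED_ACTIONS = ("ATTACK", "POTION")


def parse_action_only(
    response: str, allowed: Sequence[str] = ALLOWED_ACTIONS
) -> Optional[str]:
    """Accept a reply that is exactly one allowed action token."""
    token = response.strip().upper()
    return token if token in allowed else None


def parse_reasoning(
    response: str, allowed: Sequence[str] = ALLOWED_ACTIONS
) -> Optional[Dict[str, str]]:
    """Parse 'REASONING: ... ACTION: <token>' (hypothetical reply format)."""
    match = re.search(
        r"REASONING:\s*(.*?)\s*ACTION:\s*(\w+)\s*$",
        response,
        re.DOTALL | re.IGNORECASE,
    )
    if not match:
        return None
    action = match.group(2).upper()
    if action not in allowed:
        return None
    return {"reasoning": match.group(1), "action": action}


def handshake_ok(response: str) -> bool:
    """Mirror the documented default: accept exactly 'OK'."""
    return response == "OK"


print(parse_action_only(" attack "))
print(parse_reasoning("REASONING: low HP, heal first.\nACTION: POTION"))
print(handshake_ok("OK"), handshake_ok("Sure, OK!"))
```

Returning `None` for unparseable replies keeps the failure visible to the caller, which matters when malformed outputs are themselves a behavior worth recording.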
Renderers format game state for AI consumption
- `TextRenderer` - human-readable text format
- Custom renderers can provide JSON, images, etc.
Spectators observe and analyze matches
- `MatchReporter` - turn-by-turn reporting
- `MatchCurator` - sidecar metadata for replay viewer curation
- `ProgressDisplay` - real-time progress with ETA
- `TokenUsageTracker` - cost tracking per player/model
- `StatsTracker` - win rates and performance metrics
Recording & Replay
- `Recorder` - captures complete match data to JSON
- `ReplayEngine` - reconstructs matches with event parity guarantee
💡 Key Features
1. Event-Driven Observation
Everything is observable through events - no modifications needed to games:
from agentdeck import AgentDeck
from agentdeck.spectators import MatchReporter, TokenUsageTracker
# Add spectators for observation
with AgentDeck(game=game, spectators=[
    MatchReporter(),  # Turn-by-turn reporting
    TokenUsageTracker(),  # Cost tracking
]) as deck:
    results = deck.play(players, matches=1)
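The general pattern behind spectators is publish/subscribe: the match loop emits events and observers react without the game knowing they exist. The sketch below is a generic illustration, not AgentDeck's `EventBus`; the class names and the `match_end` event name are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class TinyEventBus:
    """Minimal publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: Dict[str, Any]) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)


class WinCounter:
    """Spectator-style observer: counts wins without touching game code."""

    def __init__(self, bus: TinyEventBus) -> None:
        self.wins: DefaultDict[str, int] = defaultdict(int)
        bus.subscribe("match_end", self.on_match_end)

    def on_match_end(self, payload: Dict[str, Any]) -> None:
        self.wins[payload["winner"]] += 1


bus = TinyEventBus()
counter = WinCounter(bus)
for winner in ["Alice", "Bob", "Alice"]:
    bus.emit("match_end", {"winner": winner})
print(dict(counter.wins))
```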
2. Complete Recording & Replay
Every match is automatically recorded with full metadata:
from pathlib import Path
from agentdeck import AgentDeck, MatchReporter
with AgentDeck(game=game) as deck:
    results = deck.play(players, matches=3, seed=7)
    # Replay from memory (no file I/O)
    deck.replay(match=results[0], spectators=[MatchReporter()], speed=0.0)
    # Or replay from disk (recorded under records/)
    record_dir = Path(deck.session.record_directory)
    match_path = sorted(record_dir.glob("match_*.json"))[0]
    deck.replay(path=match_path, spectators=[MatchReporter()], speed=0.0)
Replay Parity Guarantee: Replay emits the same event stream as live execution, including the complete three-phase lifecycle (handshake → gameplay → conclusion).
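Replay parity can be stated as a simple check: the events produced by replaying a recording equal the events produced live. The sketch below demonstrates the property with a made-up event format; `run_live`, `replay`, and the event fields are hypothetical and unrelated to AgentDeck's actual recording schema.

```python
import json
from typing import Any, Dict, List

Event = Dict[str, Any]


def run_live(actions: List[str]) -> List[Event]:
    """Pretend live match: emits handshake, turn, and conclusion events."""
    events: List[Event] = [{"type": "handshake", "player": "Alice", "response": "OK"}]
    for turn, action in enumerate(actions, start=1):
        events.append({"type": "turn", "turn": turn, "action": action})
    events.append({"type": "conclusion", "turns": len(actions)})
    return events


def replay(recording: str) -> List[Event]:
    """Re-emit events from a JSON recording without calling any player."""
    return [dict(event) for event in json.loads(recording)["events"]]


live_events = run_live(["ATTACK", "POTION", "ATTACK"])
recording = json.dumps({"events": live_events})  # what a recorder would persist
replayed_events = replay(recording)
print("parity holds:", replayed_events == live_events)
```

Because the replay reads only the recording, no provider call is needed to reproduce the stream, which is what makes recorded LLM matches auditable after the fact.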
3. Reproducible Experiments
Seeding makes game-level randomness reproducible (player ordering, RNG) and guarantees recording/replay parity. However, LLM outputs are not guaranteed to be deterministic across runs, even with a fixed seed.
from agentdeck import AgentDeck, AgentDeckConfig, MockPlayer
config = AgentDeckConfig(seed=42)
players = [
    MockPlayer(name="Alice", actions=["ATTACK", "POTION"]),
    MockPlayer(name="Bob", actions=["POTION", "ATTACK"]),
]
with AgentDeck(game=game, session=config) as deck:
    results = deck.play(players=players, matches=10)
4. Three-Phase Player Lifecycle
Players go through structured interaction phases:
- Handshake (Mandatory): Player acknowledges rules and format
- Turn (Gameplay): Player makes decisions each turn
- Conclusion (Optional): Player reflects on match outcome
This provides rich data for analyzing AI behavior patterns.
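The three phases can be sketched as a small driver loop around a scripted player. Everything here is hypothetical (`ScriptedPlayer`, `run_lifecycle`, the transcript format); it only illustrates that a failed mandatory handshake stops the match before any turn is played, under the assumption that the handshake must be exactly `OK`.

```python
from itertools import cycle
from typing import Dict, Iterable, List


class ScriptedPlayer:
    """Hypothetical stand-in for a player; replies come from a fixed script."""

    def __init__(self, handshake_reply: str, actions: Iterable[str]) -> None:
        self.handshake_reply = handshake_reply
        self._actions = cycle(actions)  # repeat the script if turns outnumber it

    def handshake(self, instructions: str) -> str:
        return self.handshake_reply

    def take_turn(self, view: Dict[str, int]) -> str:
        return next(self._actions)

    def conclude(self, outcome: str) -> str:
        return f"Reflection on: {outcome}"


def run_lifecycle(player: ScriptedPlayer, turns: int) -> List[Dict[str, str]]:
    transcript = []
    reply = player.handshake("Reply OK if you understand the rules.")
    transcript.append({"phase": "handshake", "reply": reply})
    if reply != "OK":
        return transcript  # mandatory phase failed: no gameplay
    for turn in range(turns):
        transcript.append({"phase": "turn", "reply": player.take_turn({"turn": turn})})
    transcript.append({"phase": "conclusion", "reply": player.conclude("match over")})
    return transcript


transcript_ok = run_lifecycle(ScriptedPlayer("OK", ["ATTACK", "POTION"]), turns=2)
transcript_bad = run_lifecycle(ScriptedPlayer("Sure!", ["ATTACK"]), turns=2)
print([step["phase"] for step in transcript_ok])
print([step["phase"] for step in transcript_bad])
```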
📚 Documentation
- Documentation Index - Main docs entry point
- CONTRIBUTING.md - Workflow, local setup, tests
- Specs - Specification index (source of truth)
- Examples - Runnable examples and tutorials
- Security Policy - Vulnerability reporting process
AI Assistants
Project assistants for exploration, development, and research:
🎯 Design Principles
- Spec-Driven: Every component has a rigorous specification
- Observable: Every decision is captured and analyzable
- Reproducible: Everything we control is reproducible (seeding + recordings + replay parity)
- Composable: Mix and match components freely
- Research-First: Built by researchers, for researchers
📝 License
MIT License (see LICENSE).
Built with ❤️ for AI researchers
Spec-Driven Architecture for AI Behavioral Research