AgentDeck 🎮

A research platform for studying AI behavior through game scenarios

Status: v0.1.0 (Pre-release) - Core functionality complete, polish in progress

Test Coverage: 311 tests passing, ~75% coverage

Note: This is a work-in-progress repository. The first public release in the fresh repository will be tagged v0.1.0.


A GPT assistant and a Gemini Gem are available to help with exploration, development, contribution, and research:

  • GPT Assistant
  • Gemini Gem

🎯 Purpose & Vision

AgentDeck Overview

AgentDeck is a research platform for studying AI behavior through game scenarios. It enables researchers to run controlled experiments where AI agents interact in well-defined environments, providing comprehensive data collection for analysis of prompting strategies, decision-making patterns, and model capabilities.

Why Games?

Most LLM benchmarks measure knowledge (answering static questions). But real-world utility requires agency: maintaining state, forming strategies, and adapting over time.

Games are the perfect "behavioral wind tunnel" for testing these capabilities:

  • Constrained environments – Isolate specific variables (e.g., "Does the model understand resource scarcity?")
  • Iterative decision making – Agents live with consequences, testing long-term planning
  • Social dynamics – Multiplayer games reveal cooperation, betrayal, and negotiation patterns
  • Measurable outcomes – Win/lose provides clear signal for cost/quality trade-offs

The Console Metaphor

AgentDeck is architected like a video game console to keep experiments modular and clean:

  • 🎮 Console (AgentDeck) – The engine that orchestrates sessions, manages seeding, and enforces rules
  • 💾 Game (Cartridge) – Pure logic defining rules and state transitions; swap games without changing agents
  • 🤖 Player – The AI agent (GPT-4, Claude, Gemini) that "holds the controller"
  • 🕹️ Controller – Translates the AI's text response into valid game actions
  • 📺 Renderer – "Draws" the game state into text the AI can understand
  • 👁️ Spectator – The audience watching the live stream (stats, narration, cost tracking)
  • 📹 Recorder – The "DVR" capturing every event for perfect replay and analysis

By separating these concerns, AgentDeck ensures your research is reproducible, observable, and easy to modify.
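
To make the division of labor concrete, here is a minimal sketch of one turn passing through each role. Every function name, the `ATTACK`/`WAIT` action set, and the 20-point damage value are assumptions made for illustration; this is not AgentDeck's internal code.

```python
import json

# Every name here is a hypothetical stand-in, not an AgentDeck class.
ALLOWED_ACTIONS = {"ATTACK", "WAIT"}


def render(state: dict, player: str) -> str:
    """Renderer role: draw the state as text the agent can read."""
    health = json.dumps(state, sort_keys=True)
    return f"You are {player}. Health: {health}. Reply ATTACK or WAIT."


def parse_action(reply: str) -> str:
    """Controller role: turn the agent's text into one valid action."""
    action = reply.strip().upper()
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"invalid action: {reply!r}")
    return action


def apply_action(state: dict, player: str, action: str) -> dict:
    """Game role: pure state transition that returns a new dict."""
    new_state = dict(state)
    if action == "ATTACK":
        for name in new_state:
            if name != player:
                new_state[name] -= 20  # illustrative damage value
    return new_state


def play_turn(state: dict, player: str, agent, recording: list) -> dict:
    """Console role: run one turn and log it for the recorder."""
    prompt = render(state, player)
    reply = agent(prompt)  # Player role: any callable mapping str -> str
    action = parse_action(reply)
    recording.append({"type": "gameplay", "player": player, "action": action})
    return apply_action(state, player, action)


recording = []
state = play_turn(
    {"Alice": 100, "Bob": 100}, "Alice", lambda prompt: " attack\n", recording
)
print(state)
```

Because each role is a separate function, swapping the lambda for a real model call or the game for a different rule set leaves the other pieces untouched, which is the point of the metaphor.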

Core Capabilities:

  • Run experiments with GPT-5, Claude, Gemini in ~10 lines of code
  • Parallel execution - 10× speedup with worker-based concurrency
  • Complete observability - every decision, timing, and reasoning captured
  • Real-time monitoring - live progress tracking with ETA and cost estimates
  • Perfect replay - reconstruct exact match conditions from recordings
  • Reproducible research - deterministic experiments via seeded randomness

โš™๏ธ Architecture

AgentDeck follows a gaming console metaphor with clean separation of concerns:

┌─────────────────────────────────────┐
│         AgentDeck (Facade)          │  ← You interact here
├─────────────────────────────────────┤
│         Console (Orchestrator)      │  ← Manages lifecycle
├─────────────┬───────────────────────┤
│    Game     │     EventBus          │  ← Game logic + Events
├─────────────┼───────────────────────┤
│   Players   │     Spectators        │  ← AI agents + Observers
└─────────────┴───────────────────────┘

Single Turn Flow

Core Components

Games define rules and state

  • Implement 4 methods: setup(), get_view(), update(), status()
  • State is JSON-serializable dicts (no complex objects)
  • Example: FixedDamageGame
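
As a rough illustration of the four-method shape, the sketch below defines a toy game whose state is a plain dict. The method names come from the list above, but the argument lists and return types are assumptions; check the Game specification for the real signatures. `CountdownGame` is a made-up example, not part of AgentDeck.

```python
import json


class CountdownGame:
    """Hypothetical game: whoever brings the counter to zero wins."""

    def setup(self, players: list[str]) -> dict:
        # Initial state: only dicts, lists, strings, numbers, and None.
        return {"players": list(players), "counter": 5, "turn": 0, "winner": None}

    def get_view(self, state: dict, player: str) -> dict:
        # What a given player is allowed to see.
        return {"you": player, "counter": state["counter"]}

    def update(self, state: dict, player: str, action: str) -> dict:
        # Pure transition: build a new dict instead of mutating the old one.
        step = {"ONE": 1, "TWO": 2}[action]
        counter = max(0, state["counter"] - step)
        return {
            **state,
            "counter": counter,
            "turn": state["turn"] + 1,
            "winner": player if counter == 0 else None,
        }

    def status(self, state: dict) -> str:
        return "finished" if state["winner"] else "running"


game = CountdownGame()
state = game.setup(["Alice", "Bob"])
for player, action in [("Alice", "TWO"), ("Bob", "TWO"), ("Alice", "ONE")]:
    state = game.update(state, player, action)

# JSON round trip: the state survives serialization unchanged.
restored = json.loads(json.dumps(state))
```

Keeping state JSON-serializable is what lets a recorder write it to disk and a replay engine read it back without custom encoders.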

Players are AI agents making decisions

  • Three-phase lifecycle: Handshake → Turn → Conclusion
  • Built-in: GPTPlayer, ClaudePlayer, GeminiPlayer, MockPlayer
  • Composable prompt templates via PromptBuilder

Controllers parse AI responses into actions

  • ActionOnlyController - extracts single action token
  • ReasoningController - extracts reasoning + action
  • AcceptOKHandshakeController - validates handshake acceptance
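
The two parsing styles can be sketched with plain functions. This is a simplified stand-in, not the built-in controllers: the `REASONING: ... ACTION: ...` reply format is an assumption made for this example, and the action names are borrowed from the sample narration later in this README.

```python
import re

# Action names taken from the sample narration; the set is illustrative.
ALLOWED = {"ATTACK", "POTION"}

# Assumed reply format: "REASONING: <free text> ACTION: <token>".
_REPLY = re.compile(
    r"REASONING:\s*(?P<reasoning>.*?)\s*ACTION:\s*(?P<action>\w+)",
    re.IGNORECASE | re.DOTALL,
)


def parse_action_only(reply: str) -> str:
    """Accept a reply that is exactly one action token."""
    token = reply.strip().upper()
    if token not in ALLOWED:
        raise ValueError(f"expected one of {sorted(ALLOWED)}, got {reply!r}")
    return token


def parse_reasoning(reply: str) -> tuple[str, str]:
    """Split a reply into (reasoning, action) and validate the action."""
    match = _REPLY.search(reply)
    if match is None:
        raise ValueError("reply must contain 'REASONING: ... ACTION: <token>'")
    return match["reasoning"], parse_action_only(match["action"])


reasoning, action = parse_reasoning("Reasoning: Bob is low on health.\nAction: attack")
print(reasoning, "->", action)
```

Raising on anything outside the allowed set keeps malformed replies from silently becoming game actions; how AgentDeck itself handles invalid replies is not covered here.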

Renderers format game state for AI consumption

  • TextRenderer - human-readable text format
  • Custom renderers can provide JSON, images, etc.

Spectators observe and analyze matches

  • MatchNarrator - turn-by-turn commentary
  • ProgressDisplay - real-time progress with ETA
  • TokenUsageTracker - cost tracking per player/model
  • StatsTracker - win rates and performance metrics

Recording & Replay

  • Recorder - captures complete match data to JSON
  • ReplayEngine - reconstructs matches with event parity guarantee

🚀 Quick Start

Installation

Source install (recommended for v0.1.0):

# Clone repository
git clone https://github.com/agentdeck/agentdeck.git
cd agentdeck

# Install with dependencies
pip install -e .

# Or install with dev tools
pip install -e ".[dev]"

# Optional provider extras
pip install -e ".[openai]"      # OpenAI SDK
pip install -e ".[anthropic]"   # Anthropic SDK
pip install -e ".[google]"      # Google Vertex SDK
pip install -e ".[providers]"   # All provider SDKs

# Minimal replay-only install (no providers)
pip install -e .

📦 PyPI release coming soon: Once v0.1.0 is ready, install will be pip install agentdeck

Your First Experiment

from agentdeck import AgentDeck, GPTPlayer, FixedDamageGame, ActionOnlyController

# 1. Create a game
game = FixedDamageGame(
    max_health=100,
    attack_damage=20,
    potion_heal=30,
    starting_potions=1,
)

# 2. Create AI players
players = [
    GPTPlayer(
        name="Player-1",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ActionOnlyController(),
    ),
    GPTPlayer(
        name="Player-2",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ActionOnlyController(),
    ),
]

# 3. Run experiment
with AgentDeck(game=game) as deck:
    results = deck.play(
        players=players,
        matches=10,
        seed=42,  # Reproducible!
    )

# 4. Analyze results
print(f"Win rates: {results.win_rates}")

โ„น๏ธ API keys required
The built-in LLM players expect environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY). Set the ones you use before running the example.

Try AgentDeck Without API Keys

  • Run python examples/mock_demo.py
  • Uses MockPlayer (deterministic) so no LLM providers are needed
  • Shows live commentary + progress + stats, and saves recordings under agentdeck_runs/mock_demo/<session>/records/

What You’ll See (Artifacts & Output)

  • Live progress (ProgressDisplay):
    [Batch test] Match 2/3 | ETA: 5.1s | Rate: 0.6 matches/sec
    [Batch test] Match 3/3 | ETA: 0.0s | Rate: 0.7 matches/sec
    
  • Narration and results (MatchNarrator/Stats):
    Turn 1: Alice → ATTACK (Bob: 45 HP) | Bob → POTION (65 HP)
    Turn 2: Alice → ATTACK (Bob: 50 HP) | Bob → ATTACK (Alice: 45 HP)
    Winner: Alice in 4 turns
    
  • Recording snippet (agentdeck_runs/.../records/match_001.json):
    {
      "match_id": "match_001",
      "seed": 7,
      "events": [
        {"type": "player_handshake_start", "player": "Alice"},
        {"type": "gameplay", "turn_number": 1, "prompt_text": "..."},
        {"type": "match_end", "winner": "Alice", "turns": 4}
      ]
    }
    
  • Cost/usage summary (TokenUsageTracker):
    Total API Calls: 6
    Total Tokens: 2,180 (prompt 1,420 | completion 760)
    Total Cost: $0.0421
    

Output:

Configuration:
  Default Game: FixedDamageGame
  Seed: 42

Player Details:
  Player-1:
    Model: gpt-4o-mini
    Controller: ActionOnlyController
  Player-2:
    Model: gpt-4o-mini
    Controller: ActionOnlyController

✓ Player-1 handshake: OK
✓ Player-2 handshake: OK

Match 1/10:
  Turn 1: Player-1 → ATTACK
  Turn 2: Player-2 → ATTACK
  ...
  Winner: Player-1 (11 turns)

Win rates: {'Player-1': 0.6, 'Player-2': 0.4}

Parallel Execution (10× Speedup)

from agentdeck import AgentDeck, AgentDeckConfig
from agentdeck.core.types import LogLevel

# Configure parallel execution with real-time monitoring
config = AgentDeckConfig(
    seed=42,
    concurrency=10,      # Run 10 matches in parallel
    log_level=LogLevel.INFO
)

# Run 100 matches with automatic progress tracking
with AgentDeck(game=game, session=config) as deck:
    results = deck.play(players=players, matches=100)

# ProgressMonitor auto-attached - shows real-time ETA and cost tracking

Output:

[ProgressMonitor] Batch Progress: 10/100 (10.0%) | ETA: 2m 15s | Rate: 4.4 matches/sec
[ProgressMonitor] Batch Progress: 50/100 (50.0%) | ETA: 1m 08s | Rate: 4.6 matches/sec
[ProgressMonitor] Batch Progress: 100/100 (100.0%) | Completed in 2m 52s

Validated Performance: 10.26× speedup with concurrency=10, deterministic replay parity guaranteed.
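
One way parallel runs can stay deterministic is to give each match its own seed derived from the session seed and the match index, then collect results in index order. The sketch below shows that idea with the standard library only. It is an assumption about the general technique, not a description of AgentDeck's worker implementation, and the helper names and seed formula are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def play_match(index: int, seed: int) -> str:
    """Hypothetical match: the outcome depends only on (seed, index)."""
    rng = random.Random(seed * 1_000_003 + index)  # private per-match RNG
    return rng.choice(["Player-1", "Player-2"])


def run_batch(seed: int, matches: int, concurrency: int) -> list[str]:
    """Run matches on a thread pool and return winners in match order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        return list(pool.map(lambda i: play_match(i, seed), range(matches)))


sequential = run_batch(seed=42, matches=20, concurrency=1)
parallel = run_batch(seed=42, matches=20, concurrency=10)
print(sequential == parallel)
```

Threads suit this workload when each match spends most of its time waiting on a model API; no match reads shared random state, so the worker count cannot change the outcome.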


💡 Key Features

1. Event-Driven Observation

Everything is observable through events - no modifications needed to games:

from agentdeck import AgentDeck
from agentdeck.spectators import MatchNarrator, TokenUsageTracker

# Add spectators for observation
with AgentDeck(game=game, spectators=[
    MatchNarrator(),      # Turn-by-turn commentary
    TokenUsageTracker()   # Cost tracking
]) as deck:
    results = deck.play(players, matches=10)
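
The pattern behind this is publish/subscribe: the engine emits events, and observers register for the types they care about. Below is a toy version for intuition only. `TinyEventBus` and `WinCounter` are hypothetical names; the real EventBus and spectator interfaces may differ. The `match_end` event type and its `winner` field mirror the recording snippet shown earlier.

```python
from collections import defaultdict


class TinyEventBus:
    """Hypothetical minimal bus: maps an event type to its handlers."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, **payload) -> None:
        event = {"type": event_type, **payload}
        # .get avoids creating empty entries for types nobody listens to.
        for handler in self._handlers.get(event_type, []):
            handler(event)


class WinCounter:
    """Hypothetical spectator that tallies winners from match_end events."""

    def __init__(self) -> None:
        self.wins = defaultdict(int)

    def on_match_end(self, event: dict) -> None:
        self.wins[event["winner"]] += 1


bus = TinyEventBus()
counter = WinCounter()
bus.subscribe("match_end", counter.on_match_end)

for winner in ["Alice", "Bob", "Alice"]:
    bus.emit("match_end", winner=winner)
bus.emit("gameplay", turn_number=1)  # no subscriber, silently ignored

print(dict(counter.wins))
```

The emitter never needs to know who is listening, so adding a new observer requires no change to game code.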

2. Complete Recording & Replay

Every match is automatically recorded with full metadata:

import json
from pathlib import Path

from agentdeck import AgentDeck, Recorder
from agentdeck.core.replay import ReplayEngine
from agentdeck.spectators import MatchNarrator

# Record matches to JSON
recorder = Recorder(output_dir="agentdeck_records")
with AgentDeck(game=game, spectators=[recorder]) as deck:
    deck.play(players, matches=3, seed=7)

# Load the most recent recording
recording_path = sorted(Path("agentdeck_records").glob("session_*/match_*.json"))[-1]
with recording_path.open("r", encoding="utf-8") as handle:
    match_data = json.load(handle)

# Replay with new spectators (exact parity)
engine = ReplayEngine(match_data)
engine.replay(spectators=[MatchNarrator()], speed=0.0)

Replay Parity Guarantee: Replay emits identical event stream as live execution, including complete three-phase lifecycle (handshake → gameplay → conclusion).
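
What parity means in practice can be checked with a small experiment: record the live event stream, serialize it, replay it to a fresh observer, and compare. The sketch below uses events shaped like the recording snippet above. `ListSpectator`, `run_live`, and `replay` are hypothetical helpers, not the `Recorder` or `ReplayEngine` classes.

```python
import json


class ListSpectator:
    """Hypothetical observer that stores every event it receives."""

    def __init__(self) -> None:
        self.seen = []

    def on_event(self, event: dict) -> None:
        self.seen.append(event)


def run_live(spectators: list) -> None:
    """Stand-in for a live match: emit a fixed three-phase event stream."""
    events = [
        {"type": "player_handshake_start", "player": "Alice"},
        {"type": "gameplay", "turn_number": 1, "prompt_text": "..."},
        {"type": "match_end", "winner": "Alice", "turns": 4},
    ]
    for event in events:
        for spectator in spectators:
            spectator.on_event(event)


def replay(recording: dict, spectators: list) -> None:
    """Feed recorded events back to observers in their original order."""
    for event in recording["events"]:
        for spectator in spectators:
            spectator.on_event(event)


recorder, live_viewer = ListSpectator(), ListSpectator()
run_live([recorder, live_viewer])

# Serialize the way a recording file would be stored, then load it back.
recording_json = json.dumps({"match_id": "match_001", "seed": 7, "events": recorder.seen})
replay_viewer = ListSpectator()
replay(json.loads(recording_json), [replay_viewer])

print(replay_viewer.seen == live_viewer.seen)
```

A spectator written against the event stream therefore behaves the same whether it is attached to a live match or to a stored recording.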

3. Reproducible Experiments

Deterministic seeding ensures exact reproducibility:

# Same seed → same results
with AgentDeck(game=game, seed=42) as deck:
    results1 = deck.play(players, matches=100)
    results2 = deck.play(players, matches=100)

assert results1.win_rates == results2.win_rates
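
The mechanism behind seeded reproducibility can be sketched without any LLM: derive one seed per match from a master seed and give each match a private random generator. This illustrates the general technique under assumed names (`play_match`, `run_batch`) and says nothing about how AgentDeck derives its seeds. Note that seeding controls only the framework's own randomness; sampling inside a hosted model at non-zero temperature is a separate source of variation.

```python
import random


def play_match(match_seed: int) -> str:
    """Hypothetical duel with random damage; returns the winner's name."""
    rng = random.Random(match_seed)  # private RNG, global state untouched
    health = {"Player-1": 100, "Player-2": 100}
    attacker, defender = "Player-1", "Player-2"
    while True:
        health[defender] -= rng.randint(10, 30)
        if health[defender] <= 0:
            return attacker
        attacker, defender = defender, attacker


def run_batch(seed: int, matches: int) -> dict:
    """Derive per-match seeds from one master seed, then compute win rates."""
    master = random.Random(seed)
    match_seeds = [master.randrange(2**32) for _ in range(matches)]
    winners = [play_match(s) for s in match_seeds]
    return {name: winners.count(name) / matches for name in ("Player-1", "Player-2")}


rates_a = run_batch(seed=42, matches=100)
rates_b = run_batch(seed=42, matches=100)
print(rates_a == rates_b)
```

Deriving per-match seeds up front also means any single match can be rerun in isolation from its own seed.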

4. Three-Phase Player Lifecycle

Players go through structured interaction phases:

  1. Handshake (Mandatory): Player acknowledges rules and format
  2. Turn (Gameplay): Player makes decisions each turn
  3. Conclusion (Optional): Player reflects on match outcome

This provides rich data for analyzing AI behavior patterns.
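
The three phases can be pictured as three calls on a player object, with the handshake acting as a gate before any gameplay. In the sketch below, the method names (`handshake`, `take_turn`, `conclude`) and the `OK` acceptance reply are assumptions for illustration; the actual player interface is defined in the Player specification.

```python
class ScriptedPlayer:
    """Hypothetical player that answers from a fixed script (no LLM calls)."""

    def __init__(self, name: str, actions: list[str]) -> None:
        self.name = name
        self._actions = iter(actions)

    def handshake(self, rules: str) -> str:
        return "OK"  # acknowledge the rules and the reply format

    def take_turn(self, view: str) -> str:
        return next(self._actions)

    def conclude(self, outcome: str) -> str:
        return f"{self.name} reflects on: {outcome}"


def run_lifecycle(player: ScriptedPlayer, turns: int) -> list[tuple[str, str]]:
    """Drive one player through handshake, gameplay turns, and conclusion."""
    transcript = []
    reply = player.handshake("Reply OK to accept the rules.")
    transcript.append(("handshake", reply))
    if reply.strip().upper() != "OK":
        return transcript  # mandatory phase failed, so no gameplay happens
    for turn in range(1, turns + 1):
        transcript.append(("turn", player.take_turn(f"Turn {turn}")))
    transcript.append(("conclusion", player.conclude("match over")))
    return transcript


transcript = run_lifecycle(ScriptedPlayer("Alice", ["ATTACK", "POTION"]), turns=2)
phases = [phase for phase, _ in transcript]
print(phases)
```

Logging each phase separately is what makes it possible to study, for example, whether a model's stated reflection in the conclusion matches the choices it made during play.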


📊 What's Actually Implemented

AgentDeck v0.1.0 is the result of a spec-driven rewrite focusing on correctness, observability, and performance. Here's what's ready:

✅ Complete & Tested

  • Core Execution: Console, EventBus, three-phase lifecycle
  • Parallel Execution: Worker-based concurrency with deterministic replay parity (10× speedup validated)
  • Monitor System: Real-time progress tracking with ProgressMonitor (auto-attached for parallel runs)
  • LLM Integration: GPTPlayer, ClaudePlayer, GeminiPlayer (full lifecycle support with clone())
  • Controllers: ActionOnlyController, ReasoningController (parser bug fixed), AcceptOKHandshakeController
  • Renderers: TextRenderer (generic, works with any game)
  • Games: FixedDamageGame example with information levels
  • Spectators: MatchNarrator, ProgressDisplay, TokenUsageTracker, StatsTracker
  • Recording: Recorder with complete metadata capture (parallel-compatible)
  • Replay: ReplayEngine with full lifecycle parity (R1 guarantee)
  • Prompt Composition: PromptBuilder with template system
  • Reproducibility: Deterministic seeding and exact replay (validated in production)
  • Test Suite: 311 tests passing (~75% coverage)

🚧 Coming Soon (See ROADMAP.md)

  • Research Module: Statistical comparison tools (Phase 2)
  • Advanced Examples: Auction game, Prisoner's Dilemma
  • Extension Templates: AI-assisted game/player/spectator creation (Phase 3)
  • Documentation: Game authoring guide, spectator guide (Phase 3)

🔬 Current Milestone

v0.1.0 (Pre-release): Core Functionality Complete

  • ✅ Worker-based parallel execution with deterministic replay parity (SPEC-PARALLEL v1.0.0)
  • ✅ Monitor system for real-time progress tracking (SPEC-MONITOR v1.0.0)
  • ✅ Production validation: 4 experiments, 40× faster than estimated
  • ✅ 311/311 tests passing, ~75% coverage
  • ✅ Validated with OpenAI GPT-4o-mini and GPT-4o

Next: Pre-release polish (packaging, docs, validation) → Public v0.1.0 in fresh repository


๐Ÿ› ๏ธ Development

Running Tests

# Install dependencies
pip install -e ".[dev]"

# Run test suite
pytest

# Run with coverage
pytest --cov=src/agentdeck --cov-report=html

Running Examples

# Set your API key
export OPENAI_API_KEY="sk-..."

# Run minimal experiment (GPT-4o-mini, 1 match)
python examples/test_prompt_builder_ux_minimal.py

# Run replay example
python examples/replay_minimal.py

# See all examples
ls examples/*.py

Project Structure

agentdeck/
├── src/agentdeck/
│   ├── core/                 # Console, EventBus, Recorder, Replay
│   ├── players/              # GPT, Claude, Gemini, Mock
│   ├── controllers/          # ActionOnly, Reasoning, Handshake
│   ├── renderers/            # Text renderer
│   ├── spectators/           # Narrator, Progress, TokenUsage, Stats
│   └── games/examples/       # FixedDamageGame
├── tests/                    # Unit + integration tests
├── examples/                 # Working examples
└── specs/                    # Component specifications

📚 Documentation

Component Specifications

All components follow rigorous specifications with numbered invariants. The specifications live in the specs/ directory of the repository.


🎯 Design Principles

  1. Spec-Driven: Every component has a rigorous specification
  2. Observable: Every decision is captured and analyzable
  3. Reproducible: Deterministic with seeded randomness
  4. Composable: Mix and match components freely
  5. Research-First: Built by researchers, for researchers

๐Ÿ“ License

MIT License - Free for research and commercial use.


Built with ❤️ for AI researchers

AgentDeck v0.1.0 - Spec-Driven Architecture for AI Behavioral Research
