Research platform for studying AI behavior through game scenarios

AgentDeck 🎮

A research platform for studying AI behavior through game scenarios

Status: v0.1.0 (Pre-release). Core functionality complete, polish in progress.
Test Coverage: 311 tests passing, ~75% coverage
Note: This is a work-in-progress repository. The first public release in the fresh repository will be tagged v0.1.0.

GPT and Gem AI assistants are available for exploration, development, contribution, and research.

🎯 Purpose & Vision

AgentDeck is a research platform for studying AI behavior through game scenarios. It enables researchers to run controlled experiments where AI agents interact in well-defined environments, providing comprehensive data collection for analysis of prompting strategies, decision-making patterns, and model capabilities.

Why Games?

Most LLM benchmarks measure knowledge (answering static questions). But real-world utility requires agency: maintaining state, forming strategies, and adapting over time.

Games are the perfect "behavioral wind tunnel" for testing these capabilities:

Constrained environments – Isolate specific variables (e.g., "Does the model understand resource scarcity?")
Iterative decision making – Agents live with consequences, testing long-term planning
Social dynamics – Multiplayer games reveal cooperation, betrayal, and negotiation patterns
Measurable outcomes – Win/lose provides clear signal for cost/quality trade-offs

The Console Metaphor

AgentDeck is architected like a video game console to keep experiments modular and clean:

🎮 Console (AgentDeck) – The engine that orchestrates sessions, manages seeding, and enforces rules
💾 Game (Cartridge) – Pure logic defining rules and state transitions; swap games without changing agents
🤖 Player – The AI agent (GPT-4, Claude, Gemini) that "holds the controller"
🕹️ Controller – Translates the AI's text response into valid game actions
📺 Renderer – "Draws" the game state into text the AI can understand
👁️ Spectator – The audience watching the live stream (stats, narration, cost tracking)
📹 Recorder – The "DVR" capturing every event for perfect replay and analysis

By separating these concerns, AgentDeck ensures your research is reproducible, observable, and easy to modify.

Core Capabilities:

Run experiments with GPT-5, Claude, and Gemini in ~10 lines of code
Parallel execution - ~10× speedup with worker-based concurrency (concurrency=10)
Complete observability - every decision, timing, and reasoning captured
Real-time monitoring - live progress tracking with ETA and cost estimates
Perfect replay - reconstruct exact match conditions from recordings
Reproducible research - deterministic experiments via seeded randomness

⚙️ Architecture

AgentDeck follows a gaming console metaphor with clean separation of concerns:

┌─────────────────────────────────────┐
│         AgentDeck (Facade)          │  ← You interact here
├─────────────────────────────────────┤
│         Console (Orchestrator)      │  ← Manages lifecycle
├─────────────┬───────────────────────┤
│    Game     │     EventBus          │  ← Game logic + Events
├─────────────┼───────────────────────┤
│   Players   │     Spectators        │  ← AI agents + Observers
└─────────────┴───────────────────────┘

Single Turn Flow
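A single turn chains the components from the console metaphor: the game produces a view, the renderer turns it into a prompt, the player answers, the controller parses the answer, the game applies the action, and an event is published for spectators. The sketch below shows only that orchestration order; `run_turn` and every callable it receives are hypothetical stand-ins, not AgentDeck's API.

```python
from typing import Callable


def run_turn(
    state: dict,
    player: str,
    get_view: Callable[[dict, str], dict],     # Game: what this player may see
    render: Callable[[dict], str],             # Renderer: view -> prompt text
    respond: Callable[[str], str],             # Player: prompt -> raw AI text
    parse: Callable[[str], str],               # Controller: raw text -> action
    update: Callable[[dict, str, str], dict],  # Game: apply action -> new state
    publish: Callable[[dict], None],           # EventBus: notify spectators
) -> dict:
    """Hypothetical orchestration of one turn; each argument is one component."""
    view = get_view(state, player)
    prompt = render(view)
    raw = respond(prompt)
    action = parse(raw)
    new_state = update(state, player, action)
    publish({"type": "gameplay", "turn_number": state["turn"], "player": player,
             "prompt_text": prompt, "response": raw, "action": action})
    return new_state


def apply_action(s: dict, p: str, a: str) -> dict:
    """Toy game rule: ATTACK removes 20 HP from the other player."""
    hp = dict(s["hp"])
    if a == "ATTACK":
        foe = next(name for name in hp if name != p)
        hp[foe] -= 20
    return {"turn": s["turn"] + 1, "hp": hp}


events: list = []
state = {"turn": 1, "hp": {"Alice": 100, "Bob": 100}}
state = run_turn(
    state, "Alice",
    get_view=lambda s, p: {"you": p, "hp": s["hp"]},
    render=lambda v: f"You are {v['you']}. HP: {v['hp']}. Reply ATTACK or POTION.",
    respond=lambda prompt: "ATTACK",      # scripted player
    parse=lambda raw: raw.strip().upper(),
    update=apply_action,
    publish=events.append,
)
print(state)  # {'turn': 2, 'hp': {'Alice': 100, 'Bob': 80}}
```

Passing components in as plain callables mirrors the separation of concerns described above: any one of them can be swapped without touching the others.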

Core Components

Games define rules and state

Implement 4 methods: setup(), get_view(), update(), status()
State is a JSON-serializable dict (no complex objects)
Example: FixedDamageGame
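To make the four-method contract concrete, here is a toy game loosely modeled on FixedDamageGame's Quick Start parameters. It is a sketch under stated assumptions: `ToyDuelGame` is hypothetical, and the method signatures and the shape of the `status()` result are guesses rather than the contract in SPEC-GAME.md.

```python
import json


class ToyDuelGame:
    """Hypothetical game using the four method names listed above.
    Signatures and return shapes are assumptions, not AgentDeck's contract."""

    def __init__(self, max_health=100, attack_damage=20, potion_heal=30, starting_potions=1):
        self.max_health = max_health
        self.attack_damage = attack_damage
        self.potion_heal = potion_heal
        self.starting_potions = starting_potions

    def setup(self, players: list[str], seed: int) -> dict:
        # State is a plain, JSON-serializable dict.
        return {
            "turn": 1,
            "seed": seed,
            "players": {n: {"hp": self.max_health, "potions": self.starting_potions}
                        for n in players},
        }

    def get_view(self, state: dict, player: str) -> dict:
        # Information level: a player sees its own potions but only opponents' HP.
        me = state["players"][player]
        foes = {n: {"hp": p["hp"]} for n, p in state["players"].items() if n != player}
        return {"turn": state["turn"], "you": dict(me), "opponents": foes}

    def update(self, state: dict, player: str, action: str) -> dict:
        new = json.loads(json.dumps(state))  # deep copy; never mutate the input
        me = new["players"][player]
        if action == "ATTACK":
            for name, p in new["players"].items():
                if name != player:
                    p["hp"] = max(0, p["hp"] - self.attack_damage)
        elif action == "POTION" and me["potions"] > 0:
            me["potions"] -= 1
            me["hp"] = min(self.max_health, me["hp"] + self.potion_heal)
        else:
            raise ValueError(f"illegal action: {action}")
        new["turn"] += 1
        return new

    def status(self, state: dict) -> dict:
        alive = [n for n, p in state["players"].items() if p["hp"] > 0]
        if len(alive) == 1:
            return {"is_over": True, "winner": alive[0]}
        return {"is_over": False, "winner": None}


game = ToyDuelGame()
state = game.setup(["Alice", "Bob"], seed=42)
for _ in range(5):
    state = game.update(state, "Alice", "ATTACK")
print(game.status(state))  # {'is_over': True, 'winner': 'Alice'}
```

Returning a fresh dict from `update()` keeps transitions pure, which is what makes JSON recording and replay straightforward.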

Players are AI agents making decisions

Three-phase lifecycle: Handshake → Turn → Conclusion
Built-in: GPTPlayer, ClaudePlayer, GeminiPlayer, MockPlayer
Composable prompt templates via PromptBuilder
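Composable prompts let an experiment vary one prompt section at a time. The sketch below illustrates the idea with the standard library's `string.Template`; the section names and `build_prompt` are hypothetical and are not PromptBuilder's actual interface.

```python
from string import Template

# Hypothetical prompt sections (not AgentDeck's PromptBuilder API).
SECTIONS = {
    "rules": Template("You are $name, playing $game. $rules"),
    "state": Template("Current state:\n$state_text"),
    "format": Template("Reply with exactly one action: $actions."),
}


def build_prompt(order: list[str], **values: str) -> str:
    """Render the chosen sections in order, separated by blank lines."""
    return "\n\n".join(SECTIONS[key].substitute(values) for key in order)


prompt = build_prompt(
    ["rules", "state", "format"],
    name="Player-1",
    game="FixedDamageGame",
    rules="Reduce your opponent to 0 HP.",
    state_text="Your HP: 100\nOpponent HP: 80",
    actions="ATTACK or POTION",
)
print(prompt)
```

Dropping or reordering entries in `order` produces a prompt variant without editing any template text.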

Controllers parse AI responses into actions

ActionOnlyController - extracts single action token
ReasoningController - extracts reasoning + action
AcceptOKHandshakeController - validates handshake acceptance
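The three controller roles can be approximated with small parsers. This is a sketch: the response formats it expects (a single action token, `REASONING: ... ACTION: X`, and a reply starting with `OK`) are assumptions about how such controllers might work, not the formats AgentDeck's controllers actually require.

```python
import re

ACTIONS = ("ATTACK", "POTION")  # legal actions for a FixedDamageGame-style duel


def parse_action_only(text: str) -> str:
    """Accept a response containing exactly one legal action token."""
    found = [t for t in re.findall(r"[A-Z]+", text.upper()) if t in ACTIONS]
    if len(found) != 1:
        raise ValueError(f"expected one action, got {found!r}")
    return found[0]


def parse_reasoning(text: str) -> tuple[str, str]:
    """Split 'REASONING: ... ACTION: X' into (reasoning, action)."""
    match = re.search(r"REASONING:\s*(.*?)\s*ACTION:\s*([A-Za-z]+)",
                      text, re.DOTALL | re.IGNORECASE)
    if not match:
        raise ValueError("missing REASONING/ACTION sections")
    action = match.group(2).upper()
    if action not in ACTIONS:
        raise ValueError(f"illegal action: {action}")
    return match.group(1), action


def accepts_handshake(text: str) -> bool:
    """Treat a reply whose first word is OK as accepting the rules."""
    words = text.strip().split()
    return bool(words) and words[0].strip(".!,").upper() == "OK"


print(parse_action_only("I choose attack."))                        # ATTACK
print(parse_reasoning("REASONING: Bob is low on HP.\nACTION: attack"))
```

Raising on ambiguous output (for example a reply naming both actions) keeps parse failures visible in the recording instead of silently picking one.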

Renderers format game state for AI consumption

TextRenderer - human-readable text format
Custom renderers can provide JSON, images, etc.
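A renderer that "works with any game" can walk the view dict generically instead of knowing game-specific fields. The functions below are a hypothetical sketch of that approach (they are not TextRenderer's implementation), with a JSON variant shown for comparison.

```python
import json


def render_text(view: dict, indent: int = 0) -> str:
    """Hypothetical generic renderer: nested dicts become indented 'key: value' lines."""
    lines = []
    pad = "  " * indent
    for key, value in view.items():
        label = str(key).replace("_", " ")
        if isinstance(value, dict) and value:
            lines.append(f"{pad}{label}:")
            lines.append(render_text(value, indent + 1))
        else:
            lines.append(f"{pad}{label}: {value}")
    return "\n".join(lines)


def render_json(view: dict) -> str:
    """Alternative format: the same view as pretty-printed JSON."""
    return json.dumps(view, indent=2)


view = {"turn": 3, "you": {"hp": 60, "potions": 1}, "opponent_hp": 40}
print(render_text(view))
# turn: 3
# you:
#   hp: 60
#   potions: 1
# opponent hp: 40
```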

Spectators observe and analyze matches

MatchNarrator - turn-by-turn commentary
ProgressDisplay - real-time progress with ETA
TokenUsageTracker - cost tracking per player/model
StatsTracker - win rates and performance metrics
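Spectators only read events, so a win-rate tracker reduces to counting winners in `match_end` events. The class below is a hypothetical sketch: the `on_event` hook name is an assumption, and the event fields are modeled on the `match_end` entry in the recording format, not on StatsTracker's code.

```python
from collections import Counter


class WinRateSpectator:
    """Hypothetical spectator that derives win rates from match_end events."""

    def __init__(self) -> None:
        self.wins = Counter()
        self.matches = 0

    def on_event(self, event: dict) -> None:
        if event.get("type") != "match_end":
            return  # ignore everything except finished matches
        self.matches += 1
        if event.get("winner"):  # a draw counts as a match but not a win
            self.wins[event["winner"]] += 1

    def win_rates(self, players: list[str]) -> dict[str, float]:
        if self.matches == 0:
            return {p: 0.0 for p in players}
        return {p: self.wins[p] / self.matches for p in players}


tracker = WinRateSpectator()
for winner in ["Player-1"] * 6 + ["Player-2"] * 4:
    tracker.on_event({"type": "gameplay", "turn_number": 1})  # ignored
    tracker.on_event({"type": "match_end", "winner": winner})
print(tracker.win_rates(["Player-1", "Player-2"]))  # {'Player-1': 0.6, 'Player-2': 0.4}
```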

Recording & Replay

Recorder - captures complete match data to JSON
ReplayEngine - reconstructs matches with event parity guarantee

🚀 Quick Start

Installation

Source install (recommended for v0.1.0):

# Clone repository
git clone https://github.com/agentdeck/agentdeck.git
cd agentdeck

# Install with dependencies
pip install -e .

# Or install with dev tools
pip install -e ".[dev]"

# Optional provider extras
pip install -e ".[openai]"      # OpenAI SDK
pip install -e ".[anthropic]"   # Anthropic SDK
pip install -e ".[google]"      # Google Vertex SDK
pip install -e ".[providers]"   # All provider SDKs

# Minimal replay-only install (no providers)
pip install -e .

📦 PyPI: the distribution is published as agentdeck-ai. Once v0.1.0 is released, install it with pip install agentdeck-ai (pre-releases require pip install --pre agentdeck-ai)

Your First Experiment

from agentdeck import AgentDeck, GPTPlayer, FixedDamageGame, ActionOnlyController

# 1. Create a game
game = FixedDamageGame(
    max_health=100,
    attack_damage=20,
    potion_heal=30,
    starting_potions=1,
)

# 2. Create AI players
players = [
    GPTPlayer(
        name="Player-1",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ActionOnlyController(),
    ),
    GPTPlayer(
        name="Player-2",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ActionOnlyController(),
    ),
]

# 3. Run experiment
with AgentDeck(game=game) as deck:
    results = deck.play(
        players=players,
        matches=10,
        seed=42,  # Reproducible!
    )

# 4. Analyze results
print(f"Win rates: {results.win_rates}")

ℹ️ API keys required
The built-in LLM players expect environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY). Set the ones you use before running the example.
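A quick pre-flight check can catch a missing key before any match starts. This is a sketch: the environment variable names come from the note above, but the mapping from player class to variable is inferred from the provider names and is an assumption, and `missing_keys` is a hypothetical helper.

```python
import os
from collections.abc import Mapping

# Assumed mapping from built-in player class to the key it reads.
REQUIRED_KEYS = {
    "GPTPlayer": "OPENAI_API_KEY",
    "ClaudePlayer": "ANTHROPIC_API_KEY",
    "GeminiPlayer": "GOOGLE_API_KEY",
}


def missing_keys(player_types: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return env vars the requested player types need but are unset or empty."""
    needed = {REQUIRED_KEYS[t] for t in player_types if t in REQUIRED_KEYS}
    return sorted(k for k in needed if not env.get(k))


print(missing_keys(["GPTPlayer", "ClaudePlayer"], env={"OPENAI_API_KEY": "sk-test"}))
# ['ANTHROPIC_API_KEY']
```

Player types without an entry, such as MockPlayer, need no key and are skipped.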

Try AgentDeck Without API Keys

Run python examples/mock_demo.py
Uses MockPlayer (deterministic) so no LLM providers are needed
Shows live commentary + progress + stats, and saves recordings under agentdeck_runs/mock_demo/<session>/records/

What You’ll See (Artifacts & Output)

Live progress (ProgressDisplay):

[Batch test] Match 2/3 | ETA: 5.1s | Rate: 0.6 matches/sec
[Batch test] Match 3/3 | ETA: 0.0s | Rate: 0.7 matches/sec

Narration and results (MatchNarrator/Stats):

Turn 1: Alice → ATTACK (Bob: 45 HP) | Bob → POTION (65 HP)
Turn 2: Alice → ATTACK (Bob: 50 HP) | Bob → ATTACK (Alice: 45 HP)
Winner: Alice in 4 turns

Recording snippet (agentdeck_runs/.../records/match_001.json):

{
  "match_id": "match_001",
  "seed": 7,
  "events": [
    {"type": "player_handshake_start", "player": "Alice"},
    {"type": "gameplay", "turn_number": 1, "prompt_text": "..."},
    {"type": "match_end", "winner": "Alice", "turns": 4}
  ]
}

Cost/usage summary (TokenUsageTracker):

Total API Calls: 6
Total Tokens: 2,180 (prompt 1,420 | completion 760)
Total Cost: $0.0421
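The summary is a sum over per-call token counts multiplied by per-model prices. The sketch below reproduces the token totals shown above (6 calls, 1,420 prompt + 760 completion tokens) but uses placeholder prices for a hypothetical `example-model`, so its dollar figure intentionally differs from the sample; `Usage` and `summarize` are illustrative names, not TokenUsageTracker's API.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    model: str
    prompt_tokens: int
    completion_tokens: int


# Placeholder prices in USD per 1M tokens (NOT real provider pricing).
PRICES = {"example-model": {"prompt": 10.0, "completion": 30.0}}


def summarize(calls: list[Usage]) -> dict:
    """Aggregate token counts and estimated cost across API calls."""
    prompt = sum(c.prompt_tokens for c in calls)
    completion = sum(c.completion_tokens for c in calls)
    cost = sum(
        c.prompt_tokens * PRICES[c.model]["prompt"] / 1_000_000
        + c.completion_tokens * PRICES[c.model]["completion"] / 1_000_000
        for c in calls
    )
    return {"calls": len(calls), "prompt": prompt, "completion": completion,
            "total": prompt + completion, "cost_usd": round(cost, 4)}


calls = [Usage("example-model", 230, 120)] * 4 + [Usage("example-model", 250, 140)] * 2
summary = summarize(calls)
print(summary["calls"], summary["total"], summary["prompt"], summary["completion"])
# 6 2180 1420 760
```

Pricing prompt and completion tokens separately matters because providers typically charge different rates for each.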

Output (from Your First Experiment):

Configuration:
  Default Game: FixedDamageGame
  Seed: 42

Player Details:
  Player-1:
    Model: gpt-4o-mini
    Controller: ActionOnlyController
  Player-2:
    Model: gpt-4o-mini
    Controller: ActionOnlyController

✓ Player-1 handshake: OK
✓ Player-2 handshake: OK

Match 1/10:
  Turn 1: Player-1 → ATTACK
  Turn 2: Player-2 → ATTACK
  ...
  Winner: Player-1 (11 turns)

Win rates: {'Player-1': 0.6, 'Player-2': 0.4}

Parallel Execution (10× Speedup)

from agentdeck import AgentDeck, AgentDeckConfig
from agentdeck.core.types import LogLevel

# Configure parallel execution with real-time monitoring
config = AgentDeckConfig(
    seed=42,
    concurrency=10,      # Run 10 matches in parallel
    log_level=LogLevel.INFO
)

# Run 100 matches with automatic progress tracking
with AgentDeck(game=game, session=config) as deck:
    results = deck.play(players=players, matches=100)

# ProgressMonitor auto-attached - shows real-time ETA and cost tracking

Output:

[ProgressMonitor] Batch Progress: 10/100 (10.0%) | ETA: 2m 15s | Rate: 4.4 matches/sec
[ProgressMonitor] Batch Progress: 50/100 (50.0%) | ETA: 1m 08s | Rate: 4.6 matches/sec
[ProgressMonitor] Batch Progress: 100/100 (100.0%) | Completed in 2m 52s

Validated Performance: 10.26× speedup with concurrency=10, deterministic replay parity guaranteed.
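One common way to keep parallel runs deterministic is to give every match its own RNG seeded from the base seed and the match index, then collect results in submission order. The sketch below illustrates that technique with a thread pool (a reasonable fit for I/O-bound API calls); the seed-derivation scheme and `play_match` are hypothetical and not a description of AgentDeck's worker implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def play_match(match_index: int, base_seed: int) -> dict:
    """Stand-in for one match: all randomness comes from a per-match RNG."""
    rng = random.Random(f"{base_seed}:{match_index}")  # hypothetical seed derivation
    winner = rng.choice(["Player-1", "Player-2"])
    return {"match": match_index, "winner": winner}


def play_batch(matches: int, base_seed: int, concurrency: int) -> list[dict]:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() yields results in submission order, regardless of completion order
        return list(pool.map(play_match, range(matches), [base_seed] * matches))


serial = play_batch(100, base_seed=42, concurrency=1)
parallel = play_batch(100, base_seed=42, concurrency=10)
assert serial == parallel  # same per-match seeds give the same outcomes
```

Because no match draws from a shared RNG, the outcome of match N does not depend on which worker ran it or when.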

💡 Key Features

1. Event-Driven Observation

Everything is observable through events; games need no modifications:

from agentdeck import AgentDeck
from agentdeck.spectators import MatchNarrator, TokenUsageTracker

# Add spectators for observation
with AgentDeck(game=game, spectators=[
    MatchNarrator(),      # Turn-by-turn commentary
    TokenUsageTracker()   # Cost tracking
]) as deck:
    results = deck.play(players, matches=10)
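Spectators can be added without touching game code because they only consume published events. A minimal publish/subscribe sketch of that idea follows; the `EventBus`, `subscribe`, and `emit` names and the `"*"` wildcard are hypothetical, not AgentDeck's EventBus interface.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical minimal event bus: the engine emits, spectators subscribe."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event: dict) -> None:
        # Deliver to handlers for this event type, then to wildcard ("*") handlers.
        for handler in self._handlers[event["type"]] + self._handlers["*"]:
            handler(event)


bus = EventBus()
log: list[str] = []
bus.subscribe("gameplay", lambda e: log.append(f"Turn {e['turn_number']}: {e['player']} → {e['action']}"))
bus.subscribe("*", lambda e: log.append(f"[recorded] {e['type']}"))

bus.emit({"type": "gameplay", "turn_number": 1, "player": "Alice", "action": "ATTACK"})
bus.emit({"type": "match_end", "winner": "Alice", "turns": 4})
print(log)
```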

2. Complete Recording & Replay

Every match is automatically recorded with full metadata:

import json
from pathlib import Path

from agentdeck import AgentDeck, Recorder
from agentdeck.core.replay import ReplayEngine
from agentdeck.spectators import MatchNarrator

# Record matches to JSON
recorder = Recorder(output_dir="agentdeck_records")
with AgentDeck(game=game, spectators=[recorder]) as deck:
    deck.play(players, matches=3, seed=7)

# Load the most recent recording
recording_path = sorted(Path("agentdeck_records").glob("session_*/match_*.json"))[-1]
with recording_path.open("r", encoding="utf-8") as handle:
    match_data = json.load(handle)

# Replay with new spectators (exact parity)
engine = ReplayEngine(match_data)
engine.replay(spectators=[MatchNarrator()], speed=0.0)

Replay Parity Guarantee: Replay emits the same event stream as live execution, including the complete three-phase lifecycle (handshake → gameplay → conclusion).
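Parity can be checked directly: record the live event stream, replay the saved JSON into a fresh spectator, and compare the two streams. This sketch uses hypothetical `ListRecorder` and `replay` helpers (not Recorder or ReplayEngine), with event types taken from the recording snippet shown earlier.

```python
import json


class ListRecorder:
    """Hypothetical spectator that keeps every event it sees."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def on_event(self, event: dict) -> None:
        self.events.append(event)


def replay(recording: str, spectators: list) -> None:
    """Re-emit recorded events, in order, to fresh spectators."""
    for event in json.loads(recording)["events"]:
        for spectator in spectators:
            spectator.on_event(event)


# "Live" run: a recorder observes the event stream as it happens.
live = ListRecorder()
for event in [
    {"type": "player_handshake_start", "player": "Alice"},
    {"type": "gameplay", "turn_number": 1, "player": "Alice", "action": "ATTACK"},
    {"type": "match_end", "winner": "Alice", "turns": 4},
]:
    live.on_event(event)

saved = json.dumps({"match_id": "match_001", "seed": 7, "events": live.events})

# Replay into a new spectator and verify parity with the live stream.
replayed = ListRecorder()
replay(saved, [replayed])
assert replayed.events == live.events
```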

3. Reproducible Experiments

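Deterministic seeding makes game-side randomness reproducible. With deterministic players such as MockPlayer, the same seed reproduces the same results; hosted LLM responses can still vary between runs, especially at non-zero temperature:

# Same seed → same results (with deterministic players)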
with AgentDeck(game=game, seed=42) as deck:
    results1 = deck.play(players, matches=100)
    results2 = deck.play(players, matches=100)

assert results1.win_rates == results2.win_rates

4. Three-Phase Player Lifecycle

Players go through structured interaction phases:

Handshake (Mandatory): Player acknowledges rules and format
Turn (Gameplay): Player makes decisions each turn
Conclusion (Optional): Player reflects on match outcome

This provides rich data for analyzing AI behavior patterns.
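The lifecycle can be sketched as a function that enforces the mandatory handshake, collects one response per turn, and optionally asks for a reflection. `run_lifecycle`, the prompts, and the exact `OK` acceptance rule are hypothetical illustrations of the phases described above, not SPEC-PLAYER.md's contract.

```python
from typing import Callable


def run_lifecycle(respond: Callable[[str], str], turns: int,
                  want_conclusion: bool = True) -> dict:
    """Hypothetical three-phase lifecycle: handshake, turns, optional conclusion."""
    transcript: dict = {"handshake": None, "turns": [], "conclusion": None}

    # Phase 1 (mandatory): the player must accept the rules before play starts.
    reply = respond("Rules: reply ATTACK or POTION each turn. Reply OK to accept.")
    transcript["handshake"] = reply
    if reply.strip().upper() != "OK":
        raise RuntimeError(f"handshake rejected: {reply!r}")

    # Phase 2: one decision per turn.
    for turn in range(1, turns + 1):
        transcript["turns"].append(respond(f"Turn {turn}: your move?"))

    # Phase 3 (optional): reflection on the outcome.
    if want_conclusion:
        transcript["conclusion"] = respond("The match is over. What did you learn?")
    return transcript


def scripted(prompt: str) -> str:
    """Scripted player used to exercise each phase."""
    if prompt.startswith("Rules"):
        return "OK"
    if prompt.startswith("Turn"):
        return "ATTACK"
    return "Attacking every turn worked."


result = run_lifecycle(scripted, turns=3)
print(result["turns"])  # ['ATTACK', 'ATTACK', 'ATTACK']
```

Keeping each phase's responses in separate fields is what lets later analysis compare, for example, a model's stated plan in the conclusion with its actual turn-by-turn choices.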

📊 What's Actually Implemented

AgentDeck v0.1.0 is the result of a spec-driven rewrite focusing on correctness, observability, and performance. Here's what's ready:

✅ Complete & Tested

Core Execution: Console, EventBus, three-phase lifecycle
Parallel Execution: Worker-based concurrency with deterministic replay parity (10× speedup validated)
Monitor System: Real-time progress tracking with ProgressMonitor (auto-attached for parallel runs)
LLM Integration: GPTPlayer, ClaudePlayer, GeminiPlayer (full lifecycle support with clone())
Controllers: ActionOnlyController, ReasoningController (parser bug fixed), AcceptOKHandshakeController
Renderers: TextRenderer (generic, works with any game)
Games: FixedDamageGame example with information levels
Spectators: MatchNarrator, ProgressDisplay, TokenUsageTracker, StatsTracker
Recording: Recorder with complete metadata capture (parallel-compatible)
Replay: ReplayEngine with full lifecycle parity (R1 guarantee)
Prompt Composition: PromptBuilder with template system
Reproducibility: Deterministic seeding and exact replay (validated in production)
Test Suite: 167 tests passing (66% coverage)

🚧 Coming Soon (See ROADMAP.md)

Research Module: Statistical comparison tools (Phase 2)
Advanced Examples: Auction game, Prisoner's Dilemma
Extension Templates: AI-assisted game/player/spectator creation (Phase 3)
Documentation: Game authoring guide, spectator guide (Phase 3)

🔬 Current Milestone

v0.1.0 (Pre-release): Core Functionality Complete

✅ Worker-based parallel execution with deterministic replay parity (SPEC-PARALLEL v1.0.0)
✅ Monitor system for real-time progress tracking (SPEC-MONITOR v1.0.0)
✅ Production validation: 4 experiments, 40× faster than estimated
✅ 167/167 tests passing, 66% coverage
✅ Validated with OpenAI GPT-4o-mini and GPT-4o

Next: Pre-release polish (packaging, docs, validation) → Public v0.1.0 in fresh repository

🛠️ Development

Running Tests

# Install dependencies
pip install -e ".[dev]"

# Run test suite
pytest

# Run with coverage
pytest --cov=src/agentdeck --cov-report=html

Running Examples

# Set your API key
export OPENAI_API_KEY="sk-..."

# Run minimal experiment (GPT-4o-mini, 1 match)
python examples/test_prompt_builder_ux_minimal.py

# Run replay example
python examples/replay_minimal.py

# See all examples
ls examples/*.py

Project Structure

agentdeck/
├── src/agentdeck/
│   ├── core/                 # Console, EventBus, Recorder, Replay
│   ├── players/              # GPT, Claude, Gemini, Mock
│   ├── controllers/          # ActionOnly, Reasoning, Handshake
│   ├── renderers/            # Text renderer
│   ├── spectators/           # Narrator, Progress, TokenUsage, Stats
│   └── games/examples/       # FixedDamageGame
├── tests/                    # 116 tests (unit + integration)
├── examples/                 # Working examples
└── specs/                    # Component specifications

📚 Documentation

Architecture Spec - High-level system design and component navigation
Component Specs - Detailed specifications for each component
ROADMAP.md - Implementation progress and future plans
Examples - Working code examples

Component Specifications

All components follow rigorous specifications with numbered invariants:

SPEC-GAME.md - Game author contract
SPEC-PLAYER.md - Three-phase player lifecycle
SPEC-CONTROLLER.md - Response parsing
SPEC-RENDERER.md - State formatting
SPEC-SPECTATOR.md - Observation interface
SPEC-RECORDER.md - Match persistence
SPEC-REPLAY.md - Exact replay with parity guarantee
SPEC-CONSOLE.md - Execution engine
SPEC-OBSERVABILITY.md - Event system

🎯 Design Principles

Spec-Driven: Every component has a rigorous specification
Observable: Every decision is captured and analyzable
Reproducible: Deterministic with seeded randomness
Composable: Mix and match components freely
Research-First: Built by researchers, for researchers

📝 License

MIT License - Free for research and commercial use.

Built with ❤️ for AI researchers

AgentDeck v0.1.0 - Spec-Driven Architecture for AI Behavioral Research

Release history

0.1.2 (May 8, 2026)
0.1.1 (Apr 22, 2026)
0.1.0 (Apr 22, 2026)
0.1.0rc2, pre-release (Dec 13, 2025)
0.1.0rc1, pre-release (Nov 24, 2025), this version

Download files

Source Distribution

agentdeck_ai-0.1.0rc1.tar.gz (165.4 kB view details)

Uploaded Nov 24, 2025 Source

Built Distribution

agentdeck_ai-0.1.0rc1-py3-none-any.whl (189.7 kB view details)

Uploaded Nov 24, 2025 Python 3

File details

Details for the file agentdeck_ai-0.1.0rc1.tar.gz.

File metadata

Download URL: agentdeck_ai-0.1.0rc1.tar.gz
Upload date: Nov 24, 2025
Size: 165.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for agentdeck_ai-0.1.0rc1.tar.gz
Algorithm	Hash digest
SHA256	`7353295b8d046a617712ab008a5e889a95bbf641449b12d3e73864dc1fcc7a7f`
MD5	`0f8162705979c76977f6c343d8dbb524`
BLAKE2b-256	`28e1723dffa9e2cc1460f24ab48f45dcf25e883295f279e8d0392f704a9674e6`

File details

Details for the file agentdeck_ai-0.1.0rc1-py3-none-any.whl.

File metadata

Download URL: agentdeck_ai-0.1.0rc1-py3-none-any.whl
Upload date: Nov 24, 2025
Size: 189.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for agentdeck_ai-0.1.0rc1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab54427a5e286cc69407d891e8e43296a15281e9a9475df4705515ade4e2ed92`
MD5	`46b8b9a58a7ae998205d88d4f4da0426`
BLAKE2b-256	`c28a4310ed964379f3514ad6c16e7d35f5945d5843f463ddc7fd694eb5cd1256`
