The game console for AI agents: behavioral research through game scenarios

AgentDeck 🎮

The game console for AI agents.

A research platform for analyzing AI agent behavior through game scenarios.

🎯 Purpose & Vision

AgentDeck helps you turn a behavioral question into a concrete study: define a game or reuse an existing one, run seeded matches across models and controllers, replay every decision, and export artifacts you can validate and compare.

It is useful when static prompt-response evaluation is not enough. By putting agents inside structured games, AgentDeck makes state, incentives, and resource tradeoffs explicit so behavior is easier to observe, compare, replay, and explain.

AgentDeck Overview

🎬 Run, Record, Replay

Run AI-agent matches from Python, record every turn as structured artifacts, then replay the decisions in a browser viewer for inspection and storytelling.

AgentDeck CLI and Replay Viewer

🎮 Why Games?

Most LLM benchmarks measure knowledge through static questions. AgentDeck focuses on behavior: maintaining state, adapting over time, and making tradeoffs inside explicit rules.

Game scenarios work well because they make the important variables legible:

Constrained environments – Isolate specific variables (for example, resource scarcity or turn order)
Iterative decision making – Agents live with consequences, testing longer-horizon behavior
Social dynamics – Multiplayer games reveal cooperation, betrayal, and negotiation patterns
Measurable outcomes – Win/loss results provide a clean signal for cost/quality trade-offs

🔎 Flagship Evidence

The Agentic Edge study uses AgentDeck to test whether agent design can overcome model-tier gaps in sequential decision games.

In FixedDamage, the same lower-tier model moves from failure to a tier inversion as the agent wrapper changes:

Agent configuration	Opponent	Result
FlashLite S0 action-only	GPT-4o-mini S0 action-only	0/48 wins (0.0%)
FlashLite S1 reasoning controller	GPT-4o-mini S0 action-only	34/48 wins (70.8%)
FlashLite S3 reasoning + HP grounding	GPT-4o-mini S0 action-only	38/48 wins (79.2%)
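
The percentages in the table follow directly from the win counts. As a quick sanity check, the sketch below recomputes them and adds a 95% Wilson score interval for each row. The interval is an illustration added here to show how much uncertainty 48 matches leave; it is not a statistic reported by the study.

```python
import math

# Win counts copied from the FixedDamage table above (48 matches per row).
ROWS = [
    ("FlashLite S0 action-only", 0, 48),
    ("FlashLite S1 reasoning controller", 34, 48),
    ("FlashLite S3 reasoning + HP grounding", 38, 48),
]


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95% Wilson score interval for a win rate."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for label, wins, n in ROWS:
    low, high = wilson_interval(wins, n)
    print(f"{label}: {wins}/{n} = {wins / n:.1%} (95% CI {low:.1%} to {high:.1%})")
```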

The VariableDamage transfer result is more cautious: the adapted risk-grounded stack wins its same-model mechanism test, but the cross-tier result is seat-sensitive and not statistically strong. That caveat is the point: AgentDeck is built to expose behavior, not hide messy evidence.

Study artifacts are mirrored on Hugging Face: dataset + recordings · curated replay viewer

🚀 Quick Start

Install: pip install agentdeck-ai (import as agentdeck)

AI-first prompt: Ask Claude, Codex, or your coding agent: “Learn AgentDeck from the README, create a tiny tic-tac-toe game, run a few matches, then analyze the recorded behavior.”

Installation

PyPI install (recommended):

# Latest release on PyPI
pip install agentdeck-ai

# With provider SDKs (extras are quoted so zsh does not expand the brackets)
pip install "agentdeck-ai[openai]"      # OpenAI SDK
pip install "agentdeck-ai[anthropic]"   # Anthropic SDK
pip install "agentdeck-ai[google]"      # Google Gen AI SDK (Vertex mode)
pip install "agentdeck-ai[providers]"   # All provider SDKs

# With research stack (statistics/plotting)
pip install "agentdeck-ai[research]"

# Development install
pip install "agentdeck-ai[dev]"

Source install (for contributors):

git clone https://github.com/agentdeck/agentdeck-core.git
cd agentdeck-core
pip install -e ".[dev]"

Your First Experiment

from agentdeck import (
    ActionOnlyController,
    AgentDeck,
    FixedDamageGame,
    GPTPlayer,
    ReasoningController,
)

# 1. Create a game
game = FixedDamageGame(
    max_health=100,
    attack_damage=20,
    potion_heal=30,
    starting_potions=3,
    information_level="full",  # use "partial" to hide opponent HP/potions
)

# 2. Create AI players: same model, different behavioral interface
players = [
    GPTPlayer(
        name="SameModel-AO",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ActionOnlyController(),
    ),
    GPTPlayer(
        name="SameModel-RC",
        model="gpt-4o-mini",
        temperature=0.7,
        controller=ReasoningController(),
    ),
]

# Models must be provided explicitly for every provider-backed player.

# 3. Run experiment
with AgentDeck(game=game) as deck:
    results = deck.play(
        players=players,
        matches=1,
        seed=42,  # Seeds game-level randomness; LLM outputs may still vary
    )

# 4. Analyze results
print(f"Win rates: {results.win_rates}")

🔒 Models are explicit
Provider-backed players never fall back to defaults; pass model= for every GPT/Claude/Gemini player.

ℹ️ Provider credentials
Set the provider-specific environment variables before running examples (OPENAI_API_KEY, ANTHROPIC_API_KEY, and VERTEX_PROJECT_ID/VERTEX_LOCATION for Gemini). For Gemini on Vertex, AgentDeck also supports GOOGLE_APPLICATION_CREDENTIALS_B64 for base64-encoded service-account JSON. Start from .env.example for local setup.

📝 .env loading policy
AgentDeck does not auto-load .env at the library level. Source it in your shell or load it in your entry script. In bash/zsh, a simple local setup is: set -a; source .env; set +a
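
If you prefer to load .env from the entry script rather than the shell, a small standard-library loader is enough for simple files. The sketch below is not part of AgentDeck; the function name load_env_file is hypothetical, and it only handles plain KEY=VALUE lines (optionally prefixed with export), not multi-line values or variable expansion.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = False) -> dict[str, str]:
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Returns only the variables this call actually set.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the shell win unless override=True.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Call load_env_file() at the top of the entry script, before constructing any provider-backed player. The python-dotenv package offers a more complete parser if your .env file uses quoting or interpolation.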

✅ First real provider-backed run
Start with matches=1 so you can confirm credentials, recordings, and replay before scaling up.

🎮 FixedDamageGame information level
information_level="full" shows both players' HP and potion counts. information_level="partial" hides the opponent's HP and potions while still showing last actions.
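
To make the difference between the two information levels concrete, here is a minimal sketch of view filtering. The state layout and the view_for helper are assumptions made for illustration; AgentDeck's real FixedDamageGame state and view code may look different.

```python
from copy import deepcopy

# Hypothetical state layout; the real FixedDamageGame state may differ.
STATE = {
    "players": {
        "Alice": {"hp": 80, "potions": 2, "last_action": "ATTACK"},
        "Bob": {"hp": 60, "potions": 3, "last_action": "POTION"},
    }
}


def view_for(state: dict, viewer: str, information_level: str = "full") -> dict:
    """Return the slice of state that `viewer` is allowed to see."""
    if information_level not in ("full", "partial"):
        raise ValueError(f"unknown information_level: {information_level!r}")
    view = deepcopy(state)  # never mutate the authoritative state
    if information_level == "partial":
        for name, info in view["players"].items():
            if name != viewer:
                # Hide the opponent's HP and potions, keep the last action.
                info.pop("hp", None)
                info.pop("potions", None)
    return view


print(view_for(STATE, "Alice", "partial")["players"]["Bob"])
# {'last_action': 'POTION'}
```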

Try AgentDeck Without API Keys

Run python examples/mock_demo.py
Uses MockPlayer (deterministic) so no LLM providers are needed
Shows live reporting + progress + stats, and saves recordings under agentdeck_runs/mock_demo/<session>/records/

Recommended Learning Path

examples/mock_demo.py — verify the install with a zero-provider run
examples/first_game_walkthrough.py — build a tiny game and replay it
examples/minimal_experiment.py — run the smallest real provider-backed experiment
examples/spectator_example.py and examples/replay_minimal.py — add monitoring and replay workflows

For the full ladder, see examples/README.md.

Walkthroughs & Docs

Build your first game + replay tour: examples/first_game_walkthrough.py
Examples index: examples/README.md
End-to-end study workflow: docs/how-to-run-a-study.md
Package-owned behavioral scoring: keep scripts/behavioral_scorer.py in your research package and run agentdeck-research-score after export. It populates the behavioral_profile field of the targeted results.json (artifacts/<cell>/results.json for matrix studies, top-level results.json for direct packages)

Artifacts (Recordings + Logs)

After you run a batch, AgentDeck writes artifacts under agentdeck_runs/<session_id>/ (or your configured run_dir):

records/ contains a batch_<batch_id>.json summary plus one match_*.json per match
logs/ contains info.log and debug.log by default

Tip: open batch_<batch_id>.json first for the high-level batch summary, then open match_*.json for the full audit trail, replay source, prompts, raw responses, parsed actions, costs, and event timeline.
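
A short script can take inventory of a run directory before you open anything by hand. This sketch relies only on the file naming described above (records/batch_*.json and records/match_*.json); the summarize_run helper is hypothetical, and it reports top-level JSON keys rather than assuming any particular recording schema.

```python
import json
from pathlib import Path


def _top_level_keys(path: Path) -> list[str]:
    """Return sorted top-level keys if the file holds a JSON object."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return sorted(data) if isinstance(data, dict) else []


def summarize_run(run_dir: str) -> dict:
    """List the batch summaries and match recordings in one run directory."""
    records = Path(run_dir) / "records"
    batch_files = sorted(records.glob("batch_*.json"))
    match_files = sorted(records.glob("match_*.json"))
    return {
        "batches": [p.name for p in batch_files],
        "matches": [p.name for p in match_files],
        # Only key names are reported, so no schema is assumed.
        "batch_keys": {p.name: _top_level_keys(p) for p in batch_files},
    }
```

Pass the session directory, for example summarize_run("agentdeck_runs/<session_id>"), substituting your own session ID.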

Parallel Execution (Workload-Dependent Speedups)

from agentdeck import AgentDeck, AgentDeckConfig, LogLevel

# Reuses `game` and `players` from "Your First Experiment" above.
# Configure parallel execution with real-time monitoring
config = AgentDeckConfig(
    seed=42,
    concurrency=10,      # Run up to 10 matches in parallel
    log_level=LogLevel.INFO,
)

# Run 100 matches with automatic progress tracking
with AgentDeck(game=game, session=config) as deck:
    results = deck.play(players=players, matches=100)

# ProgressMonitor is auto-attached when concurrency > 1 (unless monitors=[] is provided)

Performance depends on provider rate limits and workload. For a determinism + concurrency comparison, see examples/test_parallel_execution.py.
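
One common way to keep seeded results independent of scheduling is to derive a separate seed per match and collect results by match index. The sketch below illustrates that idea with the standard library only; it is an assumption about how determinism and concurrency can coexist, not a description of AgentDeck's internals, and match_seed, play_match, and run_batch are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def match_seed(base_seed: int, match_index: int) -> int:
    """Derive a per-match seed so results do not depend on scheduling order."""
    return base_seed * 1_000_003 + match_index


def play_match(base_seed: int, match_index: int) -> str:
    """Stand-in for a real match: the winner depends only on the derived seed."""
    rng = random.Random(match_seed(base_seed, match_index))
    return rng.choice(["Alice", "Bob"])


def run_batch(base_seed: int, matches: int, concurrency: int) -> list[str]:
    """Run all matches on a thread pool and return winners by match index."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() yields results in submission order, not completion order.
        return list(pool.map(lambda i: play_match(base_seed, i), range(matches)))


serial = run_batch(base_seed=42, matches=100, concurrency=1)
parallel = run_batch(base_seed=42, matches=100, concurrency=10)
print(serial == parallel)  # True
```

Because each match owns its random generator, no shared state is touched across threads, which is what makes the comparison hold regardless of worker count.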

🔬 Research Program

This repo ships release-facing benchmark packages, arc summaries, and a cross-game synthesis layer alongside the engine.

Start here:

The Agentic Edge - Flagship study: strategy-stack effects, FixedDamage tier inversion, VariableDamage caveats, and public replay artifacts
How To Run A Study - Supported end-to-end workflow for creating, running, exporting, and validating a study

Supporting arcs:

FixedDamage Arc 1 - Deterministic flagship arc: diagnosis, intervention ladder, and final carry-forward stack
VariableDamage Arc 1 - Uncertainty arc: risk-band metrics, transfer failures, and premium ceiling check
Cross-Game Comparison 1 - What transferred, what broke, and why the metrics had to evolve

Deeper references:

Research Guide - How experiment packages are organized
Research Index - Registry of experiments and status
Research Schema - Contract for manifests, results, and validation
Research Templates - Boilerplate for new experiment packages

⚙️ Architecture

The Console Metaphor

AgentDeck follows a gaming console metaphor with clean separation of concerns:

┌─────────────────────────────────────┐
│         AgentDeck (Facade)          │  ← You interact here
├─────────────────────────────────────┤
│         Console (Orchestrator)       │  ← Manages lifecycle
├─────────────┬───────────────────────┤
│    Game     │     EventBus          │  ← Game logic + Events
├─────────────┼───────────────────────┤
│   Players   │     Spectators        │  ← AI agents + Observers
└─────────────┴───────────────────────┘

Single Turn Flow

Core Components

Games define rules and state

Required properties: instructions, allowed_actions, default_handshake_template
Core methods: setup(), get_view(), update(), status()
State is JSON-serializable dicts (no complex objects)
Examples: FixedDamageGame and ArchivistChoiceGame
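
The shape of a game can be sketched without the library. The class below is a standalone toy that mirrors the property and method names listed above; it does not subclass any AgentDeck base class, and every method signature, the state layout, the healing cap at max_health, and the "wasted turn" rule for an empty potion supply are assumptions made for illustration.

```python
import json


class TinyDuel:
    """Standalone toy game; names mirror the list above, signatures are assumed."""

    instructions = "Reduce the opponent's HP to 0. Reply with ATTACK or POTION."
    allowed_actions = ["ATTACK", "POTION"]
    default_handshake_template = "Reply with exactly OK if you understand the rules."

    def __init__(self, max_health=100, attack_damage=20, potion_heal=30, starting_potions=3):
        self.max_health = max_health
        self.attack_damage = attack_damage
        self.potion_heal = potion_heal
        self.starting_potions = starting_potions

    def setup(self, players: list[str]) -> dict:
        """Build the initial state as a plain, JSON-serializable dict."""
        return {
            "turn": 0,
            "order": list(players),
            "hp": {p: self.max_health for p in players},
            "potions": {p: self.starting_potions for p in players},
        }

    def get_view(self, state: dict, player: str) -> dict:
        """Return what `player` sees (full information in this toy)."""
        return {"you": player, "hp": dict(state["hp"]), "potions": dict(state["potions"])}

    def update(self, state: dict, player: str, action: str) -> dict:
        """Apply one action and return a new state without mutating the old one."""
        if action not in self.allowed_actions:
            raise ValueError(f"illegal action: {action!r}")
        new = json.loads(json.dumps(state))  # copy; also fails on non-JSON state
        opponent = next(p for p in new["order"] if p != player)
        if action == "ATTACK":
            new["hp"][opponent] = max(0, new["hp"][opponent] - self.attack_damage)
        elif new["potions"][player] > 0:
            new["potions"][player] -= 1
            new["hp"][player] = min(self.max_health, new["hp"][player] + self.potion_heal)
        # A POTION with no potions left is treated as a wasted turn (assumption).
        new["turn"] += 1
        return new

    def status(self, state: dict) -> dict:
        """Report whether the match is over and who won."""
        losers = [p for p, hp in state["hp"].items() if hp <= 0]
        if not losers:
            return {"finished": False, "winner": None}
        winner = next(p for p in state["order"] if p not in losers)
        return {"finished": True, "winner": winner}


game = TinyDuel()
state = game.setup(["Alice", "Bob"])
while not game.status(state)["finished"]:
    mover = state["order"][state["turn"] % 2]
    state = game.update(state, mover, "ATTACK")
print(game.status(state), state["hp"])
# {'finished': True, 'winner': 'Alice'} {'Alice': 20, 'Bob': 0}
```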

Players are AI agents making decisions

Three-phase lifecycle: Handshake → Turn → Conclusion
Built-in: GPTPlayer, ClaudePlayer, GeminiPlayer, MockPlayer
Composable prompt templates via PromptBuilder

Controllers parse AI responses into actions

ActionOnlyController - extracts single action token
ReasoningController - extracts reasoning + action
Handshake validation is built into the base Controller (default accepts exactly OK)
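
The two parsing styles can be sketched as plain functions. The "REASONING: ... ACTION: X" reply format, the function names, and the case-insensitive matching are assumptions chosen for illustration; the actual prompt and response formats used by ActionOnlyController and ReasoningController are defined in the project specs.

```python
import re

ALLOWED = ["ATTACK", "POTION"]


def parse_action_only(response: str, allowed=ALLOWED) -> str:
    """Accept a response that is exactly one allowed action token."""
    token = response.strip().upper()
    if token not in allowed:
        raise ValueError(f"expected one of {allowed}, got {response!r}")
    return token


def parse_reasoning(response: str, allowed=ALLOWED) -> tuple[str, str]:
    """Split an assumed 'REASONING: ... ACTION: X' reply into (reasoning, action)."""
    match = re.search(
        r"REASONING:\s*(.*?)\s*ACTION:\s*(\w+)\s*$", response, re.S | re.I
    )
    if match is None:
        raise ValueError("response does not match the assumed format")
    reasoning, action = match.group(1), match.group(2).upper()
    if action not in allowed:
        raise ValueError(f"expected one of {allowed}, got {action!r}")
    return reasoning, action


def handshake_ok(response: str) -> bool:
    """Mirror the documented default: accept exactly OK."""
    return response == "OK"


print(parse_action_only(" attack\n"))
# ATTACK
print(parse_reasoning("REASONING: Low HP, heal first.\nACTION: potion"))
# ('Low HP, heal first.', 'POTION')
```

Keeping the reasoning text alongside the parsed action is what lets later analysis compare what an agent said with what it did.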

Renderers format game state for AI consumption

TextRenderer - human-readable text format
Custom renderers can provide JSON, images, etc.

Spectators observe and analyze matches

MatchReporter - turn-by-turn reporting
MatchCurator - sidecar metadata for replay viewer curation
ProgressDisplay - real-time progress with ETA
TokenUsageTracker - cost tracking per player/model
StatsTracker - win rates and performance metrics

Recording & Replay

Recorder - captures complete match data to JSON
ReplayEngine - reconstructs matches with event parity guarantee

💡 Key Features

1. Event-Driven Observation

Everything is observable through events, so games need no modification:

from agentdeck import AgentDeck
from agentdeck.spectators import MatchReporter, TokenUsageTracker

# Add spectators for observation
with AgentDeck(game=game, spectators=[
    MatchReporter(),      # Turn-by-turn reporting
    TokenUsageTracker()   # Cost tracking
]) as deck:
    results = deck.play(players, matches=1)
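
The pattern behind spectators is ordinary publish/subscribe. The following standalone sketch shows why observers can be added without touching game code; the EventBus and TurnCounter classes and the "turn" event name are hypothetical and are not AgentDeck's actual event API.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Tiny publish/subscribe hub; event names here are illustrative."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: dict) -> None:
        for handler in self._handlers[event]:
            handler(payload)


class TurnCounter:
    """Spectator that counts actions per player without touching game code."""

    def __init__(self, bus: EventBus) -> None:
        self.counts: dict[str, int] = defaultdict(int)
        bus.subscribe("turn", self.on_turn)

    def on_turn(self, payload: dict) -> None:
        self.counts[payload["player"]] += 1


bus = EventBus()
counter = TurnCounter(bus)
for player, action in [("Alice", "ATTACK"), ("Bob", "POTION"), ("Alice", "ATTACK")]:
    bus.emit("turn", {"player": player, "action": action})
print(dict(counter.counts))  # {'Alice': 2, 'Bob': 1}
```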

2. Complete Recording & Replay

Every match is automatically recorded with full metadata:

from pathlib import Path

from agentdeck import AgentDeck, MatchReporter

with AgentDeck(game=game) as deck:
    results = deck.play(players, matches=3, seed=7)

    # Replay from memory (no file I/O)
    deck.replay(match=results[0], spectators=[MatchReporter()], speed=0.0)

    # Or replay from disk (recorded under records/)
    record_dir = Path(deck.session.record_directory)
    match_path = sorted(record_dir.glob("match_*.json"))[0]
    deck.replay(path=match_path, spectators=[MatchReporter()], speed=0.0)

Replay Parity Guarantee: Replay emits the same event stream as live execution, including the complete three-phase lifecycle (handshake → gameplay → conclusion).
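
Event parity is straightforward to check once events are plain data: capture the live stream, persist it, replay it, and compare. The sketch below demonstrates the idea with hypothetical run_live and replay functions and made-up event shapes; it does not use AgentDeck's Recorder or ReplayEngine.

```python
import json


def run_live(actions: list[str], emit) -> None:
    """Stand-in for a live match: handshake, one event per turn, conclusion."""
    emit({"type": "handshake", "reply": "OK"})
    for turn, action in enumerate(actions):
        emit({"type": "turn", "turn": turn, "action": action})
    emit({"type": "conclusion"})


def replay(recording_json: str, emit) -> None:
    """Re-emit recorded events in their original order."""
    for event in json.loads(recording_json):
        emit(event)


live_events: list[dict] = []
run_live(["ATTACK", "POTION", "ATTACK"], live_events.append)

recording = json.dumps(live_events)  # what a recorder might persist

replayed_events: list[dict] = []
replay(recording, replayed_events.append)

print(replayed_events == live_events)  # True
```

The same comparison works as a regression test: any spectator that behaves differently on replay than on live execution indicates a break in parity.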

3. Reproducible Experiments

Seeding makes game-level randomness reproducible (player ordering, RNG) and guarantees recording/replay parity. However, LLM outputs are not guaranteed to be deterministic across runs, even with a fixed seed.

from agentdeck import AgentDeck, AgentDeckConfig, MockPlayer

config = AgentDeckConfig(seed=42)
players = [
    MockPlayer(name="Alice", actions=["ATTACK", "POTION"]),
    MockPlayer(name="Bob", actions=["POTION", "ATTACK"]),
]

with AgentDeck(game=game, session=config) as deck:
    results = deck.play(players=players, matches=10)
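
The part of a run that seeding controls can be shown in isolation. In this sketch a single seeded generator decides seating order for each match; seating_order is a hypothetical helper, and shuffling seats per match is an assumed example of "game-level randomness", not a statement about how AgentDeck orders players.

```python
import random


def seating_order(players: list[str], seed: int, matches: int) -> list[list[str]]:
    """Shuffle seating once per match from a single seeded generator."""
    rng = random.Random(seed)
    orders = []
    for _ in range(matches):
        order = list(players)
        rng.shuffle(order)
        orders.append(order)
    return orders


first = seating_order(["Alice", "Bob"], seed=42, matches=10)
second = seating_order(["Alice", "Bob"], seed=42, matches=10)
print(first == second)  # True: same seed, same sequence of seatings
```

Anything a provider-backed player returns sits outside this guarantee, which is why recordings rather than reruns are the reference for what an LLM actually did.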

4. Three-Phase Player Lifecycle

Players go through structured interaction phases:

Handshake (Mandatory): Player acknowledges rules and format
Turn (Gameplay): Player makes decisions each turn
Conclusion (Optional): Player reflects on match outcome

This provides rich data for analyzing AI behavior patterns.
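
The phase ordering can be sketched as a small driver loop. Everything here is hypothetical (the run_lifecycle function, the prompts, and the scripted player); it only illustrates that the handshake gates the match while the conclusion is optional.

```python
from typing import Callable


def run_lifecycle(
    ask: Callable[[str, str], str],
    turns: int,
    with_conclusion: bool = True,
) -> list[tuple[str, str]]:
    """Drive one player through handshake, turns, and an optional conclusion."""
    transcript: list[tuple[str, str]] = []
    reply = ask("handshake", "Reply with exactly OK to accept the rules.")
    transcript.append(("handshake", reply))
    if reply != "OK":
        # The handshake is mandatory, so a failed one ends the match early.
        return transcript
    for turn in range(turns):
        transcript.append(("turn", ask("turn", f"Turn {turn}: ATTACK or POTION?")))
    if with_conclusion:
        transcript.append(("conclusion", ask("conclusion", "Reflect on the match.")))
    return transcript


def scripted_player(phase: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM-backed player."""
    replies = {"handshake": "OK", "turn": "ATTACK", "conclusion": "I attacked every turn."}
    return replies[phase]


for phase, reply in run_lifecycle(scripted_player, turns=2):
    print(f"{phase}: {reply}")
```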

📚 Documentation

Documentation Index - Main docs entry point
CONTRIBUTING.md - Workflow, local setup, tests
Specs - Specification index (source of truth)
Examples - Runnable examples and tutorials
Security Policy - Vulnerability reporting process

🎯 Design Principles

Spec-Driven: Every component has a rigorous specification
Observable: Every decision is captured and analyzable
Reproducible: Everything we control is reproducible (seeding + recordings + replay parity)
Composable: Mix and match components freely
Research-First: Built by researchers, for researchers

Spec-Driven and AI-First by Design

AgentDeck is human-led and AI-written: a codebase built with AI agents, designed for humans and AI agents, and validated through tests, replayable experiments, research artifacts, and blind QA rounds performed by autonomous agents.

Specs are the source of truth. They define intent, contracts, boundaries, and expected behavior. Code, tests, docs, examples, and research workflows derive from that specification layer and are validated through execution.

AgentDeck is therefore designed to be legible to both humans and AI agents, treating AI agents as first-class users, contributors, evaluators, and research operators.

📝 License

MIT License (see LICENSE).

Built with ❤️ for AI researchers

Release history

Version	Released	Notes
0.2.0	Jul 13, 2026	This version
0.1.3	Jul 13, 2026	Yanked. Superseded by 0.2.0. Version 0.1.3 unintentionally used a patch version for a package containing schema/API changes.
0.1.2	May 8, 2026	
0.1.1	Apr 22, 2026	
0.1.0	Apr 22, 2026	
0.1.0rc2	Dec 13, 2025	Pre-release
0.1.0rc1	Nov 24, 2025	Pre-release

Download files

Source Distribution

agentdeck_ai-0.2.0.tar.gz (245.7 kB)

Uploaded Jul 13, 2026 Source

Built Distribution

agentdeck_ai-0.2.0-py3-none-any.whl (277.9 kB)

Uploaded Jul 13, 2026 Python 3

File details

Details for the file agentdeck_ai-0.2.0.tar.gz.

File metadata

Download URL: agentdeck_ai-0.2.0.tar.gz
Upload date: Jul 13, 2026
Size: 245.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for agentdeck_ai-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`841121c8c2db83c2f6bb0764196f5629913dbf45aa2e8dd6dcb01477034a2b59`
MD5	`5144b2a823faedba99862d74f4676062`
BLAKE2b-256	`890f1018ad69a41dedf96fabff8d5fa6d0813d7b690f4f164508aad13206d448`
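
To check a downloaded archive against the published digest, hashing the file locally is enough. This is a generic standard-library sketch with hypothetical helper names; the expected value is the SHA256 digest listed above for agentdeck_ai-0.2.0.tar.gz.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for agentdeck_ai-0.2.0.tar.gz.
EXPECTED_SHA256 = "841121c8c2db83c2f6bb0764196f5629913dbf45aa2e8dd6dcb01477034a2b59"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Call verify("agentdeck_ai-0.2.0.tar.gz") after downloading; a False result means the file differs from the one published.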

File details

Details for the file agentdeck_ai-0.2.0-py3-none-any.whl.

File metadata

Download URL: agentdeck_ai-0.2.0-py3-none-any.whl
Upload date: Jul 13, 2026
Size: 277.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for agentdeck_ai-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7b43d18c76a37bbc9bd636f690ad24f1ab3c96aa69e6b66332b554b2e70cc6bb`
MD5	`bfd27ecc2a0777e5045c13dcfbef383b`
BLAKE2b-256	`a9fe49edb7b5226a88482e380ca16200f8ba0cf5f2b9157f9d1ffc062f691e12`
