Framework-agnostic platform for testing and evaluating AI agents

ATP — Agent Test Platform

The framework-agnostic platform for testing and evaluating AI agents.

Python 3.12+ | License: MIT

Why ATP?

  • Framework-agnostic — test any agent (LangGraph, CrewAI, AutoGen, HTTP endpoint, CLI, container, cloud) through a single unified protocol. No vendor lock-in.
  • Game-theoretic evaluation — the only platform with built-in multi-agent game evaluation: Prisoner's Dilemma, Public Goods, Auction, Colonel Blotto, Congestion Game, and more. Measure strategic reasoning, cooperation, and equilibrium play.
  • Statistical rigor — multiple runs per test, 95% confidence intervals, Welch's t-test regression detection, and Elo ratings. Know when a change is real, not noise.
  • Production-ready — web dashboard, SQLite/PostgreSQL storage, JUnit XML for CI/CD, HTML reports, cost tracking, and security evaluation out of the box.
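
The Elo ratings mentioned above follow a well-known update rule. The sketch below shows the standard formula only. The K-factor of 32, the 400-point scale, and the function names are illustrative assumptions, not ATP's actual configuration or API.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.5 for a draw, and 0.0 if A lost.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta
```

Two agents that start level at 1500 move to 1516 and 1484 after a single decisive game under these assumed parameters.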

Quick Start

uv add atp-platform
atp quickstart

See the Quick Start Guide for a full walkthrough.

Quick Start (from source)

git clone https://github.com/andrei-shtanakov/atp-platform.git
cd atp-platform
uv sync --group dev
uv run pytest tests/ -v  # verify installation

Your First Test Suite

Create a test suite file my_tests.yaml:

test_suite: "my_first_suite"
version: "1.0"
description: "My first ATP test suite"

defaults:
  runs_per_test: 3
  timeout_seconds: 180

agents:
  - name: "my-agent"
    type: "http"
    config:
      endpoint: "http://localhost:8000"

tests:
  - id: "test-001"
    name: "Basic file creation test"
    tags: ["smoke", "basic"]
    task:
      description: "Create a file named output.txt with content 'Hello, ATP!'"
      expected_artifacts: ["output.txt"]
    constraints:
      max_steps: 5
      timeout_seconds: 60
    assertions:
      - type: "artifact_exists"
        config:
          path: "output.txt"
      - type: "llm_eval"
        config:
          criteria: "completeness"
          threshold: 0.8

Features

Core Platform

Test Runner - Full test orchestration with parallel execution

  • Single test and suite execution
  • Configurable parallelism (--parallel)
  • Timeout enforcement (soft and hard)
  • Progress reporting and fail-fast mode
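
Bounded parallelism with a per-test timeout can be sketched with asyncio alone. This is not ATP's runner. The names run_all and fake_test are made up for illustration, and only a single hard timeout is shown rather than the soft/hard pair listed above.

```python
import asyncio


async def run_all(jobs, parallel=2, timeout=1.0):
    """Run jobs with bounded concurrency and a per-job timeout.

    jobs is a list of zero-argument callables that each return a coroutine.
    Results come back in input order; a timed-out job yields "timeout".
    """
    gate = asyncio.Semaphore(parallel)

    async def guarded(job):
        async with gate:  # at most `parallel` jobs run at once
            try:
                return await asyncio.wait_for(job(), timeout)
            except asyncio.TimeoutError:
                return "timeout"

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_test(name, seconds):
    """Stand-in for a real test run that takes `seconds` to finish."""
    await asyncio.sleep(seconds)
    return f"{name}: ok"
```

The semaphore plays the role of a --parallel limit, and asyncio.wait_for cancels any job that overruns its budget.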

Agent Adapters - Connect to any agent type

  • HTTPAdapter - REST/SSE endpoints
  • ContainerAdapter - Docker-based agents
  • CLIAdapter - Command-line agents
  • LangGraphAdapter - Native LangGraph integration
  • CrewAIAdapter - CrewAI framework support
  • AutoGenAdapter - AutoGen framework support
  • MCPAdapter - Model Context Protocol (MCP) tools/resources
  • BedrockAdapter - AWS Bedrock integration
  • VertexAdapter - Google Vertex AI integration
  • AzureOpenAIAdapter - Azure OpenAI integration
  • SDKAdapter - Pull-model adapter for SDK-based benchmark participants

Evaluators - Multi-level result assessment

  • ArtifactEvaluator - File existence, content, schema validation
  • BehaviorEvaluator - Tool usage, step limits, error checks
  • LLMJudgeEvaluator - Semantic evaluation via Claude
  • CodeExecEvaluator - Run generated code (pytest, npm, custom)
  • SecurityEvaluator - PII detection, secret leaks, code safety, prompt injection
  • FactualityEvaluator - Claim extraction, citation checking, hallucination detection
  • StyleEvaluator - Tone analysis, readability, formatting compliance
  • FilesystemEvaluator - Workspace file existence, content, directory checks
  • PerformanceEvaluator - Latency, throughput, regression detection
  • CompositeEvaluator - Boolean logic (AND/OR/NOT) over nested assertions
  • GitCommitEvaluator - Git commit message and diff analysis
  • GuardrailsEvaluator - Custom guardrails enforcement
  • ContainerEvaluator - Isolated code execution via Docker/Podman with resource limits
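
The boolean logic behind a composite assertion can be illustrated with a small recursive evaluator. The tree layout below (dicts keyed by "and", "or", "not", with strings as leaves) is a hypothetical encoding chosen for this sketch, not CompositeEvaluator's actual schema. The leaf name has_errors is likewise invented.

```python
def evaluate(node, results):
    """Evaluate a nested AND/OR/NOT tree against leaf assertion results.

    node is either a string naming a leaf assertion, or a dict with a single
    key "and", "or", or "not". results maps leaf names to booleans.
    """
    if isinstance(node, str):
        return results[node]
    (op, arg), = node.items()  # exactly one operator per dict
    if op == "and":
        return all(evaluate(child, results) for child in arg)
    if op == "or":
        return any(evaluate(child, results) for child in arg)
    if op == "not":
        return not evaluate(arg, results)
    raise ValueError(f"unknown operator: {op}")


# Pass if the artifact exists AND (the LLM check passed OR no errors occurred).
tree = {"and": ["artifact_exists", {"or": ["llm_eval", {"not": "has_errors"}]}]}
```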

Reporters - Multiple output formats

  • Console - Colored terminal output with progress
  • JSON - Structured results for automation
  • HTML - Self-contained visual reports with charts
  • JUnit XML - CI/CD integration (Jenkins, GitHub, GitLab)
  • GameReporter / GameHTMLReporter - Game-theoretic evaluation results

Advanced Features

Statistical Analysis - Reliable metrics

  • Multiple runs per test
  • Mean, std, median, min/max
  • 95% confidence intervals (t-distribution)
  • Stability assessment
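
A 95% confidence interval based on the t-distribution can be computed with the standard library for small run counts. The sketch below is independent of ATP's statistics module. The function name is an assumption, and the critical values are the usual two-sided table entries for 1 to 10 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def confidence_interval_95(scores):
    """Return (mean, low, high) for a sample of 2 to 11 scores."""
    n = len(scores)
    if not 2 <= n <= 11:
        raise ValueError("this sketch covers 2 to 11 runs only")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n)  # sample std / sqrt(n)
    margin = T_CRIT_95[n - 1] * stderr
    return mean, mean - margin, mean + margin
```

With three runs scoring 0.8, 0.9, and 0.7, the interval is roughly 0.55 to 1.05, which shows how wide the uncertainty remains when only a few runs are available.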

Baseline & Regression Detection

  • Save baseline results
  • Compare runs with Welch's t-test
  • Detect regressions (p < 0.05)
  • Visual diff in console/JSON
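
Welch's t-test compares two samples without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom using the standard library. The function name is an assumption and this is not ATP's implementation. Turning the statistic into a p-value requires the t-distribution's CDF, which a library call such as scipy.stats.ttest_ind(a, b, equal_var=False) provides.

```python
import math
import statistics


def welch_t(baseline, candidate):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(baseline), len(candidate)
    # Squared standard error contributed by each sample.
    v1 = statistics.variance(baseline) / n1
    v2 = statistics.variance(candidate) / n2
    diff = statistics.mean(baseline) - statistics.mean(candidate)
    t = diff / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

For baseline scores [0.80, 0.82, 0.78, 0.81, 0.79] and candidate scores [0.70, 0.72, 0.68, 0.71, 0.69], this gives t of about 10 with about 8 degrees of freedom, well past the two-sided 95% critical value of 2.306, so the drop would count as a regression at p < 0.05.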

CI/CD Integration

  • GitHub Actions workflow
  • GitLab CI template
  • Azure Pipelines, CircleCI, Jenkins examples
  • Exit codes: 0=success, 1=failures, 2=error
  • Deploy pipeline (.github/workflows/deploy.yml) — SSH deploy via [deploy] tag in commit message or workflow_dispatch
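
A CI step written in Python could branch on the exit codes listed above. The wrapper below is a sketch. The helper names are hypothetical, and the command line simply reuses the HTTP adapter invocation shown under CLI Commands.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {0: "success", 1: "failures", 2: "error"}


def describe_exit_code(code: int) -> str:
    """Map a process exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_suite(suite_path: str) -> str:
    """Run a suite through the CLI and describe the outcome."""
    cmd = [
        "uv", "run", "atp", "test", suite_path,
        "--adapter=http",
        "--adapter-config=endpoint=http://localhost:8000",
    ]
    # check=False: a non-zero exit is an expected outcome, not an exception.
    completed = subprocess.run(cmd, check=False)
    return describe_exit_code(completed.returncode)
```

Separating "failures" (tests ran and some failed) from "error" (the run itself broke) lets a pipeline retry infrastructure errors without masking genuine test failures.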

Web Dashboard

  • FastAPI backend with HTMX + Pico CSS frontend at /ui/
  • Results storage (SQLite/PostgreSQL)
  • Working UI pages: Benchmarks (upload + create), Runs (list + detail page with HTMX auto-refresh), Leaderboard (benchmark filter), Games (registry + tournaments), Suites (upload YAML), Analytics (stats + agent rankings)
  • GitHub OAuth login, Device Flow for CLI auth, JWT tokens, RBAC

Platform API & SDK

  • Benchmark API (/api/v1/benchmarks, /api/v1/runs) - Pull-model benchmark execution with leaderboard
  • Tournament API (/api/v1/tournaments) - Game-theoretic tournament management
  • Auth - GitHub OAuth (OIDC) + Device Flow for CLI login + JWT tokens
  • RBAC - Role-based access control with auto-admin for first user
  • Python SDK v2.0.0 (atp-platform-sdk on PyPI):
    • AsyncATPClient plus a synchronous ATPClient wrapper
    • BenchmarkRun async/sync iteration with submit_sync()/status_sync()/cancel_sync()/emit_sync()
    • next_batch(n) batch API and emit() event streaming
    • Exponential-backoff retry
    • Device Flow auth
  • Dashboard UI - HTMX + Pico CSS frontend at /ui/ (benchmarks, games, runs, leaderboard, suites, analytics)
  • YAML Upload (POST /api/suite-definitions/upload) - upload and validate test suites server-side
  • Rate Limiting - Per-endpoint HTTP rate limiting via slowapi (configurable via ATP_RATE_LIMIT_* env vars)
  • Webhooks - HTTP POST notifications on run completion/failure with SSRF protection and retry
  • Event Streaming (POST /api/v1/runs/{id}/events) - Append events to running benchmark runs (max 1000/run)
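
Exponential-backoff retry, as mentioned for the SDK, generally looks like the sketch below. The base delay, growth factor, cap, and function names are all assumptions made for illustration. The SDK's real schedule and error handling are not documented in this README.

```python
import time


def backoff_delays(attempts, base=0.5, factor=2.0, cap=8.0):
    """Delays in seconds before each retry, growing geometrically up to a cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retry(func, attempts=4, sleep=time.sleep, **backoff):
    """Call func(); on an exception, wait and retry up to `attempts` times."""
    last_error = None
    # The first call happens immediately, then one call per backoff delay.
    for delay in [0.0] + backoff_delays(attempts, **backoff):
        sleep(delay)
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise last_error
```

Passing sleep as a parameter keeps the helper testable without real waiting. Production clients usually also add random jitter so that many clients do not retry in lockstep.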

Project Structure

atp-platform/
├── atp/                      # Main package
│   ├── cli/                  # CLI commands (test, validate, baseline, dashboard, game, etc.)
│   ├── core/                 # Config, exceptions, security
│   ├── protocol/             # ATP Request/Response/Event models
│   ├── loader/               # YAML/JSON test parsing
│   ├── runner/               # Test orchestration, sandbox
│   ├── adapters/             # Agent adapters (HTTP, Docker, CLI, LangGraph, CrewAI, AutoGen, MCP, Bedrock, Vertex, Azure OpenAI, SDK/pull-model)
│   ├── evaluators/           # Result evaluation (artifact, behavior, LLM, code, security, factuality, style, performance, git-commit, guardrails, container)
│   ├── scoring/              # Score aggregation
│   ├── statistics/           # Statistical analysis
│   ├── baseline/             # Baseline management, regression detection
│   ├── reporters/            # Output formatting (console, JSON, HTML, JUnit, game)
│   ├── streaming/            # Event streaming support
│   ├── mock_tools/           # Mock tool server for testing
│   ├── performance/          # Profiling, caching, optimization
│   ├── dashboard/            # Web interface (FastAPI)
│   ├── analytics/            # Cost tracking and analytics
│   ├── benchmarks/           # Benchmark suites
│   ├── chaos/                # Chaos testing
│   ├── generator/            # Test suite generation
│   ├── plugins/              # Plugin ecosystem management
│   ├── sdk/                  # Python SDK for programmatic use
│   ├── tracing/              # Agent replay and trace management
│   └── tui/                  # Terminal user interface (optional)
├── packages/                  # Extracted packages (uv workspace members)
│   ├── atp-core/             # Protocol, core, loader, scoring, statistics
│   ├── atp-adapters/         # All agent adapters
│   ├── atp-dashboard/        # Web dashboard + benchmark/tournament API
│   └── atp-sdk/              # Python SDK for benchmark platform participants
├── game-environments/        # Standalone game theory library (Phase 5)
│   └── game_envs/            # Games, strategies, analysis (Nash, exploitability)
├── atp-games/                # ATP plugin for game-theoretic evaluation (Phase 5)
│   └── atp_games/            # GameRunner, evaluators, YAML suites, tournaments
├── docs/                     # Documentation
├── examples/                 # Example test suites and CI configs
│   ├── test_suites/          # Sample test suites
│   ├── games/                # Game-theoretic evaluation examples
│   ├── docker/               # Docker deployment examples
│   └── ci/                   # CI/CD templates
├── tests/                    # Test suite (80%+ coverage)
│   ├── unit/                 # Unit tests
│   ├── integration/          # Integration tests
│   ├── contract/             # Protocol contract tests
│   ├── e2e/                  # End-to-end tests
│   └── fixtures/             # Test fixtures
├── spec/                     # Working directory for specifications (managed by /spec-generator-skill)
│   ├── requirements.md       # Phase 4 feature requirements (REQ-XXX)
│   ├── phase5-requirements.md # Phase 5 game-theoretic requirements
│   ├── design.md             # Phase 4 technical design (DESIGN-XXX)
│   ├── phase5-design.md      # Phase 5 technical design
│   ├── tasks.md              # Phase 4 implementation tasks (TASK-XXX)
│   ├── phase5-tasks.md       # Phase 5 implementation tasks
│   └── WORKFLOW.md           # Task management workflow guide
└── pyproject.toml            # Project configuration

CLI Commands

# Run tests with CLI adapter
uv run atp test <suite.yaml> --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]'

# Run tests with HTTP adapter
uv run atp test <suite.yaml> --adapter=http \
  --adapter-config='endpoint=http://localhost:8000'

# Run with multiple iterations and parallel execution
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --runs=5 --parallel=4

# Filter by tags
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --tags=smoke,core

# Output formats
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=json --output-file=results.json

uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=junit --output-file=results.xml

# Pass environment variables (for API keys)
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --adapter-config='inherit_environment=true' \
  --adapter-config='allowed_env_vars=["OPENAI_API_KEY","ANTHROPIC_API_KEY"]'

# Validate test definitions
uv run atp validate --suite=suite.yaml

# Baseline management
uv run atp baseline save suite.yaml -o baseline.json --runs=5
uv run atp baseline compare suite.yaml -b baseline.json

# Utilities
uv run atp list-agents          # List available adapters
uv run atp version              # Show version
uv run atp list suite.yaml      # List tests in a suite

# Additional commands
uv run atp init                 # Initialize ATP project
uv run atp generate             # Generate test suites
uv run atp benchmark            # Run benchmarks
uv run atp budget               # Budget management
uv run atp experiment           # Run experiments
uv run atp plugins              # Manage plugins
uv run atp game suite.yaml      # Game-theoretic evaluation
uv run atp catalog              # Browse and run tests from the catalog
uv run atp tui                  # Terminal user interface
uv run atp compare              # Multi-model comparison
uv run atp estimate             # Cost estimation
uv run atp traces               # Trace management
uv run atp replay               # Replay agent traces
uv run atp trend                # Cross-run trend analysis (regression detection)

# Suite sync (push/pull/sync YAML test suites to/from remote server)
uv run atp push suite.yaml --server=https://atp.example.com  # Upload YAML to server
uv run atp pull --server=https://atp.example.com             # Download suites from server
uv run atp sync                                               # Sync local suites with remote

Examples

See examples/ for:

  • Test Suites - Sample test definitions
  • Game Examples - Game-theoretic evaluation (see the README there; no API keys needed):
    • basic_usage.py - Run games, strategies, and tournaments
    • custom_game.py - Create a new game from scratch
    • llm_agent_eval.py - Evaluate agents on game battery
    • population_dynamics.py - Evolutionary simulation
  • CI/CD Templates - GitHub Actions, GitLab CI, Jenkins, Azure, CircleCI
  • Demo Agents - Ready-to-run example agents:
    • demo_agent.py - Simple file operations agent (no API keys needed)
    • openai_agent.py - OpenAI-powered agent with tool calling
    • run_demo.sh / run_openai_demo.sh - Quick start scripts

Development

Commands

# Testing
uv run pytest tests/ -v --cov=atp --cov-report=term-missing  # All tests with coverage
uv run pytest tests/unit -v                                   # Unit tests only
uv run pytest tests/ -v -m "not slow"                        # Fast tests

# Code quality
uv run ruff format .               # Format code
uv run ruff check .                # Lint check
uv run ruff check . --fix          # Auto-fix lint issues
uv run pyrefly check               # Type checking

# Task management
python task.py list                # List all tasks
python task.py next                # Show ready tasks

Code Style

  • Python 3.12+
  • Type hints required for all code
  • Line length: 88 characters
  • Use Pydantic for data models
  • Docstrings for public APIs
  • Test coverage ≥80%

See CLAUDE.md for detailed development guidelines.

macOS launchers

For Mac users who prefer double-clicking over the CLI, see scripts/macos/ — double-clickable .command files that install dependencies and run bundled game suites (Prisoner's Dilemma, Auction, El Farol).

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Write tests for new functionality
  4. Ensure all tests pass and code is formatted
  5. Submit a pull request

See CLAUDE.md for code style and development workflow.

License

MIT License - see LICENSE for details.

Phase 5: Game-Theoretic Evaluation

ATP includes a game-theoretic evaluation framework for testing agent strategic reasoning, cooperation, and equilibrium play in multi-agent games.

Packages

Package            Description                                            Docs
game-environments  Standalone game theory library (zero ATP dependency)   README
atp-games          ATP plugin for game-theoretic evaluation               README
atp-platform-sdk   Python SDK for benchmark participants                  README

Built-in Games

Eight canonical games with known Nash equilibria for rigorous evaluation:

  • Prisoner's Dilemma -- cooperation vs defection with configurable payoff matrix
  • Stag Hunt -- trust vs safety, two pure Nash equilibria
  • Battle of the Sexes -- coordination under conflicting preferences
  • Public Goods Game -- N-player contribution with multiplier and optional punishment
  • Auction -- first-price and second-price sealed-bid with private values
  • Colonel Blotto -- resource allocation across multiple battlefields
  • Congestion Game -- network routing with latency-dependent costs
  • El Farol Bar -- bounded rationality and minority game dynamics
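
What "known Nash equilibria" means for the two-player games above can be shown with a brute-force check for pure-strategy equilibria. The sketch does not use game-environments. The payoff numbers are common textbook values, assumed here rather than taken from ATP's defaults.

```python
from itertools import product


def pure_nash(payoffs, actions=("C", "D")):
    """Return all pure-strategy Nash equilibria of a two-player game.

    payoffs maps (row_action, col_action) to (row_payoff, col_payoff).
    A cell is an equilibrium if neither player gains by deviating alone.
    """
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(x, b)][0] for x in actions)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, y)][1] for y in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


# Textbook payoffs (assumed). "C" = cooperate, "D" = defect.
PRISONERS_DILEMMA = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
                     ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
# For Stag Hunt, read "C" as hunting stag and "D" as hunting hare.
STAG_HUNT = {("C", "C"): (4, 4), ("C", "D"): (0, 3),
             ("D", "C"): (3, 0), ("D", "D"): (3, 3)}
```

With these payoffs the Prisoner's Dilemma has the single equilibrium of mutual defection, while Stag Hunt has two, matching the description in the list above.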

Game-Theoretic Evaluators

  • PayoffEvaluator -- average payoff, distribution, social welfare, Pareto efficiency
  • ExploitabilityEvaluator -- best-response gap, empirical strategy extraction
  • CooperationEvaluator -- cooperation rate, conditional cooperation, reciprocity
  • EquilibriumEvaluator -- Nash distance, convergence detection, equilibrium classification
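
Two of the cooperation metrics named above can be defined in a few lines. These are plain illustrative definitions with hypothetical function names. CooperationEvaluator may define or normalize them differently.

```python
def cooperation_rate(actions):
    """Fraction of moves that are "C" (cooperate)."""
    return sum(a == "C" for a in actions) / len(actions)


def conditional_cooperation(own, opponent):
    """Rate at which `own` cooperates right after the opponent cooperated.

    Returns None if the opponent never cooperated before a later round.
    """
    # Pair the opponent's move in round i with our own move in round i + 1.
    follow_ups = [mine for prev, mine in zip(opponent, own[1:]) if prev == "C"]
    if not follow_ups:
        return None
    return cooperation_rate(follow_ups)
```

An agent that plays C, C, D, C against an opponent playing C, D, C, C has an overall cooperation rate of 0.75 but a conditional cooperation of 1.0, the signature of a reciprocating strategy.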

Quick Start (Games)

# Run a built-in game suite
uv run atp test --suite=game:prisoners_dilemma.yaml

# Or use programmatically
from game_envs import PrisonersDilemma, PDConfig, TitForTat, AlwaysDefect
from atp_games import GameRunner, GameRunConfig, BuiltinAdapter
import asyncio

async def main():
    game = PrisonersDilemma(PDConfig(num_rounds=50))
    agents = {
        "player_0": BuiltinAdapter(TitForTat()),
        "player_1": BuiltinAdapter(AlwaysDefect()),
    }
    runner = GameRunner()
    result = await runner.run_game(
        game=game, agents=agents,
        config=GameRunConfig(episodes=20, base_seed=42),
    )
    print(result.average_payoffs)

asyncio.run(main())
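
To build intuition for the Tit-for-Tat versus Always-Defect pairing above, the dynamics can be reproduced without any ATP packages. The simulation below is a stand-alone sketch. The payoff values are textbook assumptions, and its totals are not guaranteed to match what result.average_payoffs reports, since that depends on the configured payoff matrix and on how payoffs are averaged.

```python
# Textbook Prisoner's Dilemma payoffs (assumed): (row player, column player).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(opponent_history):
    """Defect regardless of history."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Play a repeated game and return the total payoffs (a, b)."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        # Each strategy sees only the other player's past moves.
        move_a = strategy_a(hist_b)
        move_b = strategy_b(hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b
```

Over 50 rounds, Tit-for-Tat is exploited once in the first round and then defects for the remaining 49, so play(tit_for_tat, always_defect, 50) returns (49, 54) under these payoffs.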

See examples/games/ for more examples.

Status

Current Status: GA (General Availability)

All core features implemented:

  • ✅ MVP: Protocol, Adapters, Runner, Evaluators, Reporters, CLI
  • ✅ Beta: Framework adapters, Statistics, LLM-Judge, Baseline, HTML reports, CI/CD
  • ✅ GA: Dashboard, Security hardening, Performance optimization
  • ✅ Phase 5: Game-theoretic evaluation (game-environments + atp-games)
  • ✅ Platform API & SDK: Benchmark/Tournament REST API, GitHub OAuth, Device Flow, Python SDK (atp-platform-sdk)

Specifications Directory

The spec/ directory is a working directory for current development specifications, managed by the /spec-generator-skill Claude skill. It contains:

  • requirements.md — Feature requirements in Kiro-style format (REQ-XXX)
  • design.md — Technical design and architecture (DESIGN-XXX)
  • tasks.md — Implementation tasks with dependencies (TASK-XXX)
  • WORKFLOW.md — Task management and executor workflow guide

Specifications evolve with the project. See spec/tasks.md for current task status.
