Framework-agnostic platform for testing and evaluating AI agents

These details have not been verified by PyPI

Project description

Agent Test Platform (ATP)

Framework-agnostic platform for testing and evaluating AI agents

Overview

ATP (Agent Test Platform) is a framework-agnostic platform for testing and evaluating AI agents. It provides a unified protocol and infrastructure for testing agents regardless of their implementation framework (LangGraph, CrewAI, AutoGen, custom, etc.).

Key principle: Agent = black box with a contract (input → output + events via ATP Protocol).

The Problem

Modern AI agents are complex systems with non-deterministic behavior, multi-step logic, and dependencies on external tools. Traditional software testing approaches don't work for agents:

Stochasticity: same prompt yields different results
Emergent behavior: system behavior isn't the sum of components
Decision chains: early errors manifest later
Framework dependency: each team uses different stack

The Solution

ATP provides:

Unified Protocol: Standard interface for all agents
Declarative Testing: YAML-based test definitions
Multi-Level Evaluation: Artifact checks → behavior analysis → LLM-as-judge
Statistical Reliability: Multiple runs with confidence intervals
Framework Agnostic: Works with any agent implementation
CI/CD Ready: JUnit XML, HTML reports, GitHub Actions integration

Quick Start

Installation

# Clone repository
git clone https://github.com/yourusername/atp-platform.git
cd atp-platform

# Install dependencies (requires uv)
uv sync

# Verify installation
uv run pytest tests/ -v

Run Your First Test

# Quick demo - run file operations agent (no API keys required)
uv run atp test examples/test_suites/demo_file_agent.yaml \
  --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["examples/demo_agent.py"]' \
  -v

# Run OpenAI-powered agent (requires OPENAI_API_KEY)
export OPENAI_API_KEY='sk-...'
uv run atp test examples/test_suites/openai_agent.yaml \
  --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["examples/openai_agent.py"]' \
  --adapter-config='inherit_environment=true' \
  --adapter-config='allowed_env_vars=["OPENAI_API_KEY","OPENAI_MODEL"]' \
  -v

# Run with multiple iterations for statistical reliability
uv run atp test suite.yaml --adapter=http \
  --adapter-config='endpoint=http://localhost:8000' \
  --runs=5

# Run specific tags
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --tags=smoke

# Generate JSON report
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=json --output-file=results.json

Your First Test Suite

Create a test suite file my_tests.yaml:

test_suite: "my_first_suite"
version: "1.0"
description: "My first ATP test suite"

defaults:
  runs_per_test: 3
  timeout_seconds: 180

agents:
  - name: "my-agent"
    type: "http"
    config:
      endpoint: "http://localhost:8000"

tests:
  - id: "test-001"
    name: "Basic file creation test"
    tags: ["smoke", "basic"]
    task:
      description: "Create a file named output.txt with content 'Hello, ATP!'"
      expected_artifacts: ["output.txt"]
    constraints:
      max_steps: 5
      timeout_seconds: 60
    assertions:
      - type: "artifact_exists"
        config:
          path: "output.txt"
      - type: "llm_eval"
        config:
          criteria: "completeness"
          threshold: 0.8

Features

Core Platform

✅ Test Runner - Full test orchestration with parallel execution

Single test and suite execution
Configurable parallelism (--parallel)
Timeout enforcement (soft and hard)
Progress reporting and fail-fast mode

✅ Agent Adapters - Connect to any agent type

HTTPAdapter - REST/SSE endpoints
ContainerAdapter - Docker-based agents
CLIAdapter - Command-line agents
LangGraphAdapter - Native LangGraph integration
CrewAIAdapter - CrewAI framework support
AutoGenAdapter - AutoGen framework support
MCPAdapter - Model Context Protocol (MCP) tools/resources
BedrockAdapter - AWS Bedrock integration
VertexAdapter - Google Vertex AI integration
AzureOpenAIAdapter - Azure OpenAI integration

✅ Evaluators - Multi-level result assessment

ArtifactEvaluator - File existence, content, schema validation
BehaviorEvaluator - Tool usage, step limits, error checks
LLMJudgeEvaluator - Semantic evaluation via Claude
CodeExecEvaluator - Run generated code (pytest, npm, custom)
SecurityEvaluator - PII detection, secret leaks, code safety, prompt injection
FactualityEvaluator - Claim extraction, citation checking, hallucination detection
StyleEvaluator - Tone analysis, readability, formatting compliance
PerformanceEvaluator - Latency, throughput, regression detection

✅ Reporters - Multiple output formats

Console - Colored terminal output with progress
JSON - Structured results for automation
HTML - Self-contained visual reports with charts
JUnit XML - CI/CD integration (Jenkins, GitHub, GitLab)
GameReporter / GameHTMLReporter - Game-theoretic evaluation results

Advanced Features

✅ Statistical Analysis - Reliable metrics

Multiple runs per test
Mean, std, median, min/max
95% confidence intervals (t-distribution)
Stability assessment

✅ Baseline & Regression Detection

Save baseline results
Compare runs with Welch's t-test
Detect regressions (p < 0.05)
Visual diff in console/JSON

✅ CI/CD Integration

GitHub Actions workflow
GitLab CI template
Azure Pipelines, CircleCI, Jenkins examples
Exit codes: 0=success, 1=failures, 2=error

✅ Web Dashboard (optional)

FastAPI backend
Results storage (SQLite/PostgreSQL)
Historical trends
Agent comparison

Project Structure

atp-platform/
├── atp/                      # Main package
│   ├── cli/                  # CLI commands (test, validate, baseline, dashboard, game, etc.)
│   ├── core/                 # Config, exceptions, security
│   ├── protocol/             # ATP Request/Response/Event models
│   ├── loader/               # YAML/JSON test parsing
│   ├── runner/               # Test orchestration, sandbox
│   ├── adapters/             # Agent adapters (HTTP, Docker, CLI, LangGraph, CrewAI, AutoGen, MCP, Bedrock, Vertex, Azure OpenAI)
│   ├── evaluators/           # Result evaluation (artifact, behavior, LLM, code, security, factuality, style, performance)
│   ├── scoring/              # Score aggregation
│   ├── statistics/           # Statistical analysis
│   ├── baseline/             # Baseline management, regression detection
│   ├── reporters/            # Output formatting (console, JSON, HTML, JUnit, game)
│   ├── streaming/            # Event streaming support
│   ├── mock_tools/           # Mock tool server for testing
│   ├── performance/          # Profiling, caching, optimization
│   ├── dashboard/            # Web interface (FastAPI)
│   ├── analytics/            # Cost tracking and analytics
│   ├── benchmarks/           # Benchmark suites
│   ├── chaos/                # Chaos testing
│   ├── generator/            # Test suite generation
│   ├── plugins/              # Plugin ecosystem management
│   └── tui/                  # Terminal user interface (optional)
├── game-environments/        # Standalone game theory library (Phase 5)
│   └── game_envs/            # Games, strategies, analysis (Nash, exploitability)
├── atp-games/                # ATP plugin for game-theoretic evaluation (Phase 5)
│   └── atp_games/            # GameRunner, evaluators, YAML suites, tournaments
├── docs/                     # Documentation
├── examples/                 # Example test suites and CI configs
│   ├── test_suites/          # Sample test suites
│   ├── games/                # Game-theoretic evaluation examples
│   ├── docker/               # Docker deployment examples
│   └── ci/                   # CI/CD templates
├── tests/                    # Test suite (80%+ coverage)
│   ├── unit/                 # Unit tests
│   ├── integration/          # Integration tests
│   ├── contract/             # Protocol contract tests
│   ├── e2e/                  # End-to-end tests
│   └── fixtures/             # Test fixtures
├── spec/                     # Working directory for specifications (managed by /spec-generator-skill)
│   ├── requirements.md       # Phase 4 feature requirements (REQ-XXX)
│   ├── phase5-requirements.md # Phase 5 game-theoretic requirements
│   ├── design.md             # Phase 4 technical design (DESIGN-XXX)
│   ├── phase5-design.md      # Phase 5 technical design
│   ├── tasks.md              # Phase 4 implementation tasks (TASK-XXX)
│   ├── phase5-tasks.md       # Phase 5 implementation tasks
│   └── WORKFLOW.md           # Task management workflow guide
└── pyproject.toml            # Project configuration

CLI Commands

# Run tests with CLI adapter
uv run atp test <suite.yaml> --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]'

# Run tests with HTTP adapter
uv run atp test <suite.yaml> --adapter=http \
  --adapter-config='endpoint=http://localhost:8000'

# Run with multiple iterations and parallel execution
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --runs=5 --parallel=4

# Filter by tags
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --tags=smoke,core

# Output formats
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=json --output-file=results.json

uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --output=junit --output-file=results.xml

# Pass environment variables (for API keys)
uv run atp test suite.yaml --adapter=cli \
  --adapter-config='command=python' \
  --adapter-config='args=["agent.py"]' \
  --adapter-config='inherit_environment=true' \
  --adapter-config='allowed_env_vars=["OPENAI_API_KEY","ANTHROPIC_API_KEY"]'

# Validate test definitions
uv run atp validate --suite=suite.yaml

# Baseline management
uv run atp baseline save suite.yaml -o baseline.json --runs=5
uv run atp baseline compare suite.yaml -b baseline.json

# Utilities
uv run atp list-agents          # List available adapters
uv run atp version              # Show version
uv run atp list suite.yaml      # List tests in a suite

# Additional commands
uv run atp init                 # Initialize ATP project
uv run atp generate             # Generate test suites
uv run atp benchmark            # Run benchmarks
uv run atp budget               # Budget management
uv run atp experiment           # Run experiments
uv run atp plugins              # Manage plugins
uv run atp game suite.yaml      # Game-theoretic evaluation
uv run atp tui                  # Terminal user interface

Documentation

Getting Started

Installation Guide - Setup and dependencies
Quick Start Guide - First test suite
Basic Usage - Common workflows

Reference

Test Format Reference - YAML structure specification
Adapter Configuration - Configure agent adapters
Configuration Reference - All config options
API Reference - Python API
Dashboard API Reference - REST API for comparison, leaderboard, timeline
Troubleshooting - Common issues and solutions

Architecture

Vision & Goals - Project vision
Requirements - Functional requirements
Architecture - System architecture
ATP Protocol - Protocol specification
Evaluation System - Metrics and evaluation
Integration Guide - Agent integration
Roadmap - Project roadmap and milestones
CI/CD Integration - CI/CD setup
Security - Security model
Architecture Decision Records - Key design decisions

Game-Theoretic Evaluation

game-environments README - Game library: API, game dev guide, strategies, analysis tools
atp-games README - ATP plugin: quick start, YAML reference, evaluators, tournaments

Examples

See examples/ for:

Test Suites - Sample test definitions
Game Examples - Game-theoretic evaluation (README, no API keys needed):
- basic_usage.py - Run games, strategies, and tournaments
- custom_game.py - Create a new game from scratch
- llm_agent_eval.py - Evaluate agents on game battery
- population_dynamics.py - Evolutionary simulation
CI/CD Templates - GitHub Actions, GitLab CI, Jenkins, Azure, CircleCI
Demo Agents - Ready-to-run example agents:
- demo_agent.py - Simple file operations agent (no API keys needed)
- openai_agent.py - OpenAI-powered agent with tool calling
- run_demo.sh / run_openai_demo.sh - Quick start scripts

Development

Commands

# Testing
uv run pytest tests/ -v --cov=atp --cov-report=term-missing  # All tests with coverage
uv run pytest tests/unit -v                                   # Unit tests only
uv run pytest tests/ -v -m "not slow"                        # Fast tests

# Code quality
uv run ruff format .               # Format code
uv run ruff check .                # Lint check
uv run ruff check . --fix          # Auto-fix lint issues
pyrefly check                      # Type checking

# Task management
python task.py list                # List all tasks
python task.py next                # Show ready tasks

Code Style

Python 3.12+
Type hints required for all code
Line length: 88 characters
Use Pydantic for data models
Docstrings for public APIs
Test coverage ≥80%

See CLAUDE.md for detailed development guidelines.

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Write tests for new functionality
Ensure all tests pass and code is formatted
Submit a pull request

See CLAUDE.md for code style and development workflow.

License

MIT License - see LICENSE for details.

Support

Issues: GitHub Issues
Documentation: docs/
Examples: examples/

Phase 5: Game-Theoretic Evaluation

ATP includes a game-theoretic evaluation framework for testing agent strategic reasoning, cooperation, and equilibrium play in multi-agent games.

Packages

Package	Description	Docs
`game-environments`	Standalone game theory library (zero ATP dependency)	README
`atp-games`	ATP plugin for game-theoretic evaluation	README

Built-in Games

Five canonical games with known Nash equilibria for rigorous evaluation:

Prisoner's Dilemma -- cooperation vs defection with configurable payoff matrix
Public Goods Game -- N-player contribution with multiplier and optional punishment
Auction -- first-price and second-price sealed-bid with private values
Colonel Blotto -- resource allocation across multiple battlefields
Congestion Game -- network routing with latency-dependent costs

Game-Theoretic Evaluators

PayoffEvaluator -- average payoff, distribution, social welfare, Pareto efficiency
ExploitabilityEvaluator -- best-response gap, empirical strategy extraction
CooperationEvaluator -- cooperation rate, conditional cooperation, reciprocity
EquilibriumEvaluator -- Nash distance, convergence detection, equilibrium classification

Quick Start (Games)

# Run a built-in game suite
uv run atp test --suite=game:prisoners_dilemma.yaml

# Or use programmatically

from game_envs import PrisonersDilemma, PDConfig, TitForTat, AlwaysDefect
from atp_games import GameRunner, GameRunConfig, BuiltinAdapter
import asyncio

async def main():
    game = PrisonersDilemma(PDConfig(num_rounds=50))
    agents = {
        "player_0": BuiltinAdapter(TitForTat()),
        "player_1": BuiltinAdapter(AlwaysDefect()),
    }
    runner = GameRunner()
    result = await runner.run_game(
        game=game, agents=agents,
        config=GameRunConfig(episodes=20, base_seed=42),
    )
    print(result.average_payoffs)

asyncio.run(main())

See examples/games/ for more examples.

Status

Current Status: GA (General Availability)

All core features implemented:

✅ MVP: Protocol, Adapters, Runner, Evaluators, Reporters, CLI
✅ Beta: Framework adapters, Statistics, LLM-Judge, Baseline, HTML reports, CI/CD
✅ GA: Dashboard, Security hardening, Performance optimization
✅ Phase 5: Game-theoretic evaluation (game-environments + atp-games)

Specifications Directory

The spec/ directory is a working directory for current development specifications, managed by the /spec-generator-skill Claude skill. It contains:

requirements.md — Feature requirements in Kiro-style format (REQ-XXX)
design.md — Technical design and architecture (DESIGN-XXX)
tasks.md — Implementation tasks with dependencies (TASK-XXX)
WORKFLOW.md — Task management and executor workflow guide

Specifications evolve with the project. See spec/tasks.md for current task status.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.1.0

Jun 12, 2026

2.0.0

May 16, 2026

This version

1.0.0

Feb 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

atp_platform-1.0.0.tar.gz (2.1 MB view details)

Uploaded Feb 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

atp_platform-1.0.0-py3-none-any.whl (765.5 kB view details)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file atp_platform-1.0.0.tar.gz.

File metadata

Download URL: atp_platform-1.0.0.tar.gz
Upload date: Feb 13, 2026
Size: 2.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for atp_platform-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`28e9f254057780df4a669a098fca1d11687e291f9969261f3baec934b79d1513`
MD5	`a55888cdb4c52d81f95131cf636aea39`
BLAKE2b-256	`6af20347f205f591ac817feef8251522f7f64f4bdfb6d15cb95f58d7183539fd`

See more details on using hashes here.

File details

Details for the file atp_platform-1.0.0-py3-none-any.whl.

File metadata

Download URL: atp_platform-1.0.0-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 765.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for atp_platform-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`add64e2b06eb45d60b5d8acd616907392548470a27555e24b521e9112a437d73`
MD5	`ee95dbc04cbc7883137cc45f7c981899`
BLAKE2b-256	`279b7daa99180177f7799e011b68b3632ecb5f918ed0a99ead81fdcca4ad68b6`

See more details on using hashes here.

atp-platform 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Agent Test Platform (ATP)

Overview

The Problem

The Solution

Quick Start

Installation

Run Your First Test

Your First Test Suite

Features

Core Platform

Advanced Features

Project Structure

CLI Commands

Documentation

Getting Started

Reference

Architecture

Game-Theoretic Evaluation

Examples

Development

Commands

Code Style

Contributing

License

Support

Phase 5: Game-Theoretic Evaluation

Packages

Built-in Games

Game-Theoretic Evaluators

Quick Start (Games)

Status

Specifications Directory

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes