Skip to main content

Test how well AI agents interact with CLI tools

Project description

AgentProbe

Python License GitHub Stars GitHub Issues

Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and provides actionable insights to improve Agent Experience (AX) - helping CLI developers make their tools more AI-friendly.

AgentProbe

Quick Start

# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development
uv sync
uv run agentprobe test git --scenario status

Authentication

AgentProbe supports multiple authentication methods to avoid environment pollution:

Get an OAuth Token

First, obtain your OAuth token using Claude Code:

claude setup-token

This will guide you through the OAuth flow and provide a token for authentication.

Method 1: Token File (Recommended)

# Store token in a file (replace with your actual token from claude setup-token)
echo "your-oauth-token-here" > ~/.agentprobe-token

# Use with commands
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --oauth-token-file ~/.agentprobe-token

Method 2: Config Files

Create a config file in one of these locations (checked in priority order):

# Global user config (replace with your actual token from claude setup-token)
mkdir -p ~/.agentprobe
echo "your-oauth-token-here" > ~/.agentprobe/config

# Project-specific config (add to .gitignore)
echo "your-oauth-token-here" > .agentprobe
echo ".agentprobe" >> .gitignore

# Then run normally without additional flags
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

Method 3: Environment Variables (Legacy)

# Replace with your actual token from claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN="your-oauth-token-here"
# Note: This may affect other Claude CLI processes

Recommendation: Use token files or config files for better process isolation.

What It Does

AgentProbe launches Claude Code to test CLI tools and provides Agent Experience (AX) insights on:

  • AX Score (A-F) based on turn count and success rate
  • CLI Friction Points - specific issues that confuse agents
  • Actionable Improvements - concrete changes to reduce agent friction
  • Real-time Progress - see agent progress with live turn counts

Community Benchmark

Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.

Tool Scenarios Passing Failing Success Rate Last Updated
vercel 9 7 2 77.8% 2025-01-20
gh 1 1 0 100% 2025-01-20
docker 1 1 0 100% 2025-01-20

View detailed results →

Commands

Test Individual Scenarios

# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With authentication token file
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --oauth-token-file ~/.agentprobe-token

# Test multiple runs for consistency analysis
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --runs 5

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging (disables progress indicators)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

# ⚠️ DANGEROUS: Run without permission prompts (use only in safe environments)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --yolo

Benchmark Tools

# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all scenarios with authentication
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel --oauth-token-file ~/.agentprobe-token

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

# ⚠️ DANGEROUS: Run all benchmarks without permission prompts
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all --yolo

Reports

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md

Debugging and Verbose Output

The --verbose flag provides detailed insights into how Claude Code interacts with your CLI:

# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Verbose output includes:

  • Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
  • Message content and tool usage
  • SDK object attributes and debugging information
  • Full conversation trace between Claude and your CLI

⚠️ YOLO Mode (Use with Extreme Caution)

The --yolo flag enables autonomous execution without permission prompts, allowing Claude to run ANY command without user approval:

# WARNING: Only use in isolated, safe environments
agentprobe test docker --scenario build-app --yolo

Security Considerations:

  • ONLY use in containerized or sandboxed environments
  • Claude can execute arbitrary commands including rm -rf, network calls, system modifications
  • No safety guardrails - Claude has full system access
  • Intended for CI/CD pipelines, testing environments, or research purposes
  • NEVER use on production systems or with sensitive data

This mode is equivalent to running Claude Code with --dangerously-skip-permissions.

Example Output

Single Run (Default)

⠋ Agent running... (Turn 3, 12s)
╭───────────────────────────── AgentProbe Results ─────────────────────────────╮
│ Tool: vercel | Scenario: deploy                                               │
│ AX Score: B (12 turns, 80% success rate)                                      │
│                                                                               │
│ Agent Experience Summary:                                                     │
│ Agent completed deployment but needed extra turns due to unclear progress     │
│ feedback and ambiguous success indicators.                                    │
│                                                                               │
│ CLI Friction Points:                                                          │
│ • No progress feedback during build process                                   │
│ • Deployment URL returned before actual completion                            │
│ • Success status ambiguous ("building" vs "deployed")                        │
│                                                                               │
│ Top Improvements for CLI:                                                     │
│ 1. Add --status flag to check deployment progress                             │
│ 2. Include completion status in deployment output                             │
│ 3. Provide structured --json output for programmatic usage                    │
│                                                                               │
│ Expected turns: 5-8 | Duration: 23.4s | Cost: $0.012                         │
│                                                                               │
│ Use --verbose for full trace analysis                                         │
╰───────────────────────────────────────────────────────────────────────────────╯

Multiple Runs (Aggregate)

╭──────────────────────── AgentProbe Aggregate Results ────────────────────────╮
│ Tool: vercel | Scenario: deploy                                               │
│ AX Score: C (14.2 avg turns, 60% success rate) | Runs: 5                      │
│                                                                               │
│ Consistency Analysis:                                                         │
│ • Turn variance: 8-22 turns                                                   │
│ • Success consistency: 60% of runs succeeded                                  │
│ • Agent confusion points: 18 total friction events                            │
│                                                                               │
│ Consistent CLI Friction Points:                                               │
│ • Permission errors lack clear remediation steps                              │
│ • No progress feedback during deployment                                      │
│ • Build failures don't suggest next steps                                     │
│                                                                               │
│ Priority Improvements for CLI:                                                │
│ 1. Add deployment status polling with vercel status                           │
│ 2. Include troubleshooting hints in error messages                            │
│ 3. Provide progress indicators during long operations                          │
│                                                                               │
│ Avg duration: 45.2s | Total cost: $0.156                                      │
╰───────────────────────────────────────────────────────────────────────────────╯

Contributing Scenarios

We welcome scenario contributions! Help us test more CLI tools:

  1. Fork this repository
  2. Add your scenarios under scenarios/<tool-name>/
  3. Run the tests and update the benchmark table
  4. Submit a PR with your results

Scenario Format

Simple Text Format

Create simple text files with clear prompts:

# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.

Enhanced YAML Format (Recommended)

Use YAML frontmatter for better control and metadata:

# scenarios/vercel/deploy-complex.txt
---
version: 2
created: 2025-01-22
tool: vercel
permission_mode: acceptEdits
allowed_tools: [Read, Write, Bash]
model: opus
max_turns: 15
complexity: complex
expected_turns: 8-12
description: "Production deployment with environment setup"
---
Deploy this Next.js application to production using Vercel CLI.
Configure production environment variables and ensure the deployment
is successful with proper domain configuration.

YAML Frontmatter Options:

  • model: Override default model (sonnet, opus)
  • max_turns: Limit agent interactions
  • permission_mode: Set permissions (acceptEdits, default, plan, bypassPermissions)
  • allowed_tools: Specify tools ([Read, Write, Bash])
  • expected_turns: Range for AX scoring comparison
  • complexity: Scenario difficulty (simple, medium, complex)

Running Benchmark Tests

# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate report (placeholder)
uv run agentprobe report --format markdown

Architecture

AgentProbe follows a simple 4-component architecture:

  1. CLI Layer (cli.py) - Typer-based command interface with progress indicators
  2. Runner (runner.py) - Executes scenarios via Claude Code SDK with YAML frontmatter support
  3. Analyzer (analyzer.py) - AI-powered analysis using Claude to identify friction points
  4. Reporter (reporter.py) - AX-focused output for CLI developers

Agent Experience (AX) Analysis

AgentProbe uses Claude itself to analyze agent interactions, providing:

  • Intelligent Analysis: Claude analyzes execution traces to identify specific friction points
  • AX Scoring: Automatic scoring based on turn efficiency and success patterns
  • Contextual Recommendations: Actionable improvements tailored to each CLI tool
  • Consistency Tracking: Multi-run analysis to identify systematic issues

This approach avoids hardcoded patterns and provides nuanced, tool-specific insights that help CLI developers understand exactly where their tools create friction for AI agents.

Prompt Management & Versioning

AgentProbe uses externalized Jinja2 templates for analysis prompts:

  • Template-based Prompts: Analysis prompts are stored in prompts/analysis.jinja2 for easy editing and iteration
  • Version Tracking: Each analysis includes prompt version metadata for reproducible results
  • Dynamic Variables: Templates support contextual variables (scenario, tool, trace data)
  • Historical Comparison: Version tracking enables comparing results across prompt iterations
# Prompt templates are automatically loaded from prompts/ directory
# Version information is tracked in prompts/metadata.json
# Analysis results include prompt_version field for tracking

Requirements

  • Python 3.10+
  • uv package manager
  • Claude Code SDK (automatically installed)

Key Features

🎯 Agent Experience (AX) Focus

  • AX Scores (A-F) based on turn efficiency and success rate
  • Friction Point Analysis identifies specific CLI pain points
  • Actionable Recommendations for CLI developers

📊 Progress & Feedback

  • Real-time Progress with live turn count and elapsed time
  • Consistency Analysis across multiple runs
  • Expected vs Actual turn comparison using YAML metadata

🔧 Advanced Scenario Control

  • YAML Frontmatter for model selection, permissions, turn limits
  • Multiple Authentication methods with process isolation
  • Flexible Tool Configuration per scenario

Available Scenarios

Current test scenarios included:

  • GitHub CLI (gh/)
    • create-pr.txt - Create pull requests
  • Vercel (vercel/)
    • deploy.txt - Deploy applications to production
    • preview-deploy.txt - Deploy to preview environment
    • init-project.txt - Initialize new project with template
    • env-setup.txt - Configure environment variables
    • list-deployments.txt - List recent deployments
    • domain-setup.txt - Add custom domain configuration
    • rollback.txt - Rollback to previous deployment
    • logs.txt - View deployment logs
    • build-local.txt - Build project locally
    • ax-test.txt - Simple version check (AX demo)
    • yaml-options-test.txt - YAML frontmatter demo
  • Docker (docker/)
    • run-nginx.txt - Run nginx containers
  • Wrangler (Cloudflare) (wrangler/)
    • Multiple deployment and development scenarios

Browse all scenarios →

Development

# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest

See TASKS.md for the development roadmap and task tracking.

Programmatic Usage

import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentprobe-0.3.13.tar.gz (370.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agentprobe-0.3.13-py3-none-any.whl (61.6 kB view details)

Uploaded Python 3

File details

Details for the file agentprobe-0.3.13.tar.gz.

File metadata

  • Download URL: agentprobe-0.3.13.tar.gz
  • Upload date:
  • Size: 370.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.20

File hashes

Hashes for agentprobe-0.3.13.tar.gz
Algorithm Hash digest
SHA256 7a340bc460fd26e233ca96e461820f7c02e9e1272342113770f27bc5c44c6955
MD5 c40468640bb5d50dc4dc2d9743404ddf
BLAKE2b-256 961c91e530de96528bd7637514aa4906db5d8d372a971e7819a98ab964396dfc

See more details on using hashes here.

File details

Details for the file agentprobe-0.3.13-py3-none-any.whl.

File metadata

File hashes

Hashes for agentprobe-0.3.13-py3-none-any.whl
Algorithm Hash digest
SHA256 1607956e739f50c7013465530996261dc5f50926481f9e1baac4682b1a067bd4
MD5 a0452697869958a2852bfda4125e8696
BLAKE2b-256 5c7074c9e653ec3b7af9becf0d021a8e5d70b316e0d0c03e99a52876cf1688a9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page