Test how well AI agents interact with CLI tools

AgentProbe

Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and reports actionable insights for improving Agent Experience (AX), helping CLI developers make their tools more AI-friendly.

Quick Start

# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development (run inside a clone of the repository)
uv sync
uv run agentprobe test git --scenario status

Authentication

AgentProbe supports multiple authentication methods, so you can supply a token without exporting it into your shell environment, where it could affect other Claude CLI processes:

Get an OAuth Token

First, obtain your OAuth token using Claude Code:

claude setup-token

This will guide you through the OAuth flow and provide a token for authentication.

Method 1: Token File (Recommended)

# Store token in a file (replace with your actual token from claude setup-token)
echo "your-oauth-token-here" > ~/.agentprobe-token

# Use with commands
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --oauth-token-file ~/.agentprobe-token
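
The echo command above creates the token file with your shell's default permissions. As a minimal sketch, assuming you want the file readable by its owner only, the same step can be done from Python. The helper name write_token_file is hypothetical and is not part of AgentProbe.

```python
import os
from pathlib import Path


def write_token_file(token: str, path: Path) -> Path:
    """Write the token to `path`, readable and writable by the owner only."""
    # The 0o600 mode applies when the file is created; the chmod below also
    # covers a file that already existed with looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token.strip() + "\n")
    os.chmod(path, 0o600)
    return path


# Example (hypothetical token value):
# write_token_file("your-oauth-token-here", Path.home() / ".agentprobe-token")
```

The resulting file can then be passed to --oauth-token-file as shown above.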

Method 2: Config Files

Create a config file in one of these locations (checked in priority order):

# Global user config (replace with your actual token from claude setup-token)
mkdir -p ~/.agentprobe
echo "your-oauth-token-here" > ~/.agentprobe/config

# Project-specific config (add to .gitignore)
echo "your-oauth-token-here" > .agentprobe
echo ".agentprobe" >> .gitignore

# Then run normally without additional flags
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

Method 3: Environment Variables (Legacy)

# Replace with your actual token from claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN="your-oauth-token-here"
# Note: This may affect other Claude CLI processes

Recommendation: Use token files or config files for better process isolation.
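
To make the three methods concrete, here is a rough sketch of how a token lookup could work. The order used here (explicit token file, then global config, then project config, then the environment variable) is an assumption based on the order in which the methods are listed above, and resolve_token is a hypothetical helper, not AgentProbe's actual implementation.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_token(
    token_file: Optional[Path] = None,
    home: Optional[Path] = None,
    project_dir: Optional[Path] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty token found, or None if there is none."""
    home = Path.home() if home is None else home
    project_dir = Path.cwd() if project_dir is None else project_dir
    env = os.environ if env is None else env

    # Assumed lookup order: explicit file, global config, project config.
    candidates = [home / ".agentprobe" / "config", project_dir / ".agentprobe"]
    if token_file is not None:
        candidates.insert(0, token_file.expanduser())

    for path in candidates:
        if path.is_file():
            token = path.read_text().strip()
            if token:
                return token

    # Assumed last resort: the legacy environment variable.
    return env.get("CLAUDE_CODE_OAUTH_TOKEN") or None
```

Reading the token from a file at call time keeps it out of the process environment, which is the isolation benefit the recommendation above refers to.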

What It Does

AgentProbe launches Claude Code to test CLI tools and provides Agent Experience (AX) insights on:

AX Score (A-F) based on turn count and success rate
CLI Friction Points - specific issues that confuse agents
Actionable Improvements - concrete changes to reduce agent friction
Real-time Progress - see agent progress with live turn counts

Community Benchmark

Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.

Tool	Scenarios	Passing	Failing	Success Rate	Last Updated
vercel	9	7	2	77.8%	2025-01-20
gh	1	1	0	100%	2025-01-20
docker	1	1	0	100%	2025-01-20

View detailed results →
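
The Success Rate column is the share of passing scenarios out of all scenarios run. A short sketch of that arithmetic, using the rows from the table above (the function name success_rate is illustrative):

```python
def success_rate(passing: int, failing: int) -> float:
    """Percentage of passing scenarios, rounded to one decimal place."""
    total = passing + failing
    if total == 0:
        raise ValueError("no scenarios recorded")
    return round(100 * passing / total, 1)


# Rows from the benchmark table: (tool, passing, failing).
rows = [("vercel", 7, 2), ("gh", 1, 0), ("docker", 1, 0)]
for tool, passing, failing in rows:
    print(f"{tool}: {success_rate(passing, failing)}%")
```

For vercel this gives 7 / 9, or 77.8%, matching the table.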

Commands

Test Individual Scenarios

# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With authentication token file
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --oauth-token-file ~/.agentprobe-token

# Test multiple runs for consistency analysis
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy --runs 5

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging (disables progress indicators)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

# ⚠️ DANGEROUS: Run without permission prompts (use only in safe environments)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --yolo
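
If you drive these commands from a script, it can help to assemble the argument list in one place. The sketch below only builds the command from the flags documented above and does not run it; build_test_command is a hypothetical helper, not part of AgentProbe.

```python
import shlex
from typing import List, Optional

# Prefix used by the uvx examples in this README.
UVX_PREFIX = ["uvx", "--from", "git+https://github.com/nibzard/agentprobe.git", "agentprobe"]


def build_test_command(
    tool: str,
    scenario: str,
    runs: Optional[int] = None,
    work_dir: Optional[str] = None,
    oauth_token_file: Optional[str] = None,
    verbose: bool = False,
    yolo: bool = False,
) -> List[str]:
    """Assemble an `agentprobe test` argument list from the documented flags."""
    cmd = UVX_PREFIX + ["test", tool, "--scenario", scenario]
    if runs is not None:
        cmd += ["--runs", str(runs)]
    if work_dir is not None:
        cmd += ["--work-dir", work_dir]
    if oauth_token_file is not None:
        cmd += ["--oauth-token-file", oauth_token_file]
    if verbose:
        cmd.append("--verbose")
    if yolo:
        cmd.append("--yolo")  # dangerous: skips permission prompts
    return cmd


# Print the command as a shell-safe string for inspection.
print(shlex.join(build_test_command("vercel", "deploy", runs=5)))
```

The returned list can be passed to subprocess.run without shell=True, which avoids quoting problems with paths that contain spaces.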

Benchmark Tools

# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all scenarios with authentication
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel --oauth-token-file ~/.agentprobe-token

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

# ⚠️ DANGEROUS: Run all benchmarks without permission prompts
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all --yolo

Reports

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md

Debugging and Verbose Output

The --verbose flag provides detailed insights into how Claude Code interacts with your CLI:

# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Verbose output includes:

Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
Message content and tool usage
SDK object attributes and debugging information
Full conversation trace between Claude and your CLI

⚠️ YOLO Mode (Use with Extreme Caution)

The --yolo flag enables autonomous execution without permission prompts, allowing Claude to run ANY command without user approval:

# WARNING: Only use in isolated, safe environments
agentprobe test docker --scenario build-app --yolo

Security Considerations:

ONLY use in containerized or sandboxed environments
Claude can execute arbitrary commands including rm -rf, network calls, system modifications
No safety guardrails - Claude has full system access
Intended for CI/CD pipelines, testing environments, or research purposes
NEVER use on production systems or with sensitive data

This mode is equivalent to running Claude Code with --dangerously-skip-permissions.
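
A wrapper script can refuse to pass --yolo unless it appears to be running somewhere disposable. The sketch below uses two common heuristics, the /.dockerenv file that Docker creates inside containers and the CI environment variable that many CI services set. Both are assumptions about your environment, neither is a security guarantee, and the function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def looks_sandboxed(
    env: Optional[Mapping[str, str]] = None,
    docker_marker: Path = Path("/.dockerenv"),
) -> bool:
    """Heuristic only: True inside a Docker container or a CI job."""
    env = os.environ if env is None else env
    in_container = docker_marker.exists()  # file Docker creates in containers
    in_ci = env.get("CI", "").lower() in {"1", "true"}  # set by many CI services
    return in_container or in_ci


def guard_yolo(yolo: bool, sandboxed: bool) -> bool:
    """Return the flag unchanged, but refuse --yolo outside a sandbox."""
    if yolo and not sandboxed:
        raise RuntimeError("--yolo requested outside a container or CI job")
    return yolo
```

A check like this reduces accidental use on a workstation; it does not replace running inside a real container or sandbox.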

Example Output

Single Run (Default)

⠋ Agent running... (Turn 3, 12s)
╭───────────────────────────── AgentProbe Results ─────────────────────────────╮
│ Tool: vercel | Scenario: deploy                                               │
│ AX Score: B (12 turns, 80% success rate)                                      │
│                                                                               │
│ Agent Experience Summary:                                                     │
│ Agent completed deployment but needed extra turns due to unclear progress     │
│ feedback and ambiguous success indicators.                                    │
│                                                                               │
│ CLI Friction Points:                                                          │
│ • No progress feedback during build process                                   │
│ • Deployment URL returned before actual completion                            │
│ • Success status ambiguous ("building" vs "deployed")                        │
│                                                                               │
│ Top Improvements for CLI:                                                     │
│ 1. Add --status flag to check deployment progress                             │
│ 2. Include completion status in deployment output                             │
│ 3. Provide structured --json output for programmatic usage                    │
│                                                                               │
│ Expected turns: 5-8 | Duration: 23.4s | Cost: $0.012                         │
│                                                                               │
│ Use --verbose for full trace analysis                                         │
╰───────────────────────────────────────────────────────────────────────────────╯

Multiple Runs (Aggregate)

╭──────────────────────── AgentProbe Aggregate Results ────────────────────────╮
│ Tool: vercel | Scenario: deploy                                               │
│ AX Score: C (14.2 avg turns, 60% success rate) | Runs: 5                      │
│                                                                               │
│ Consistency Analysis:                                                         │
│ • Turn variance: 8-22 turns                                                   │
│ • Success consistency: 60% of runs succeeded                                  │
│ • Agent confusion points: 18 total friction events                            │
│                                                                               │
│ Consistent CLI Friction Points:                                               │
│ • Permission errors lack clear remediation steps                              │
│ • No progress feedback during deployment                                      │
│ • Build failures don't suggest next steps                                     │
│                                                                               │
│ Priority Improvements for CLI:                                                │
│ 1. Add deployment status polling with vercel status                           │
│ 2. Include troubleshooting hints in error messages                            │
│ 3. Provide progress indicators during long operations                          │
│                                                                               │
│ Avg duration: 45.2s | Total cost: $0.156                                      │
╰───────────────────────────────────────────────────────────────────────────────╯
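
The aggregate figures above can be derived from per-run results. The following is a sketch under stated assumptions: the keys success, duration_seconds, and cost_usd come from the Programmatic Usage section, while the turns key and the sample values are invented here to mirror the sample output. It is not AgentProbe's own aggregation code.

```python
from statistics import mean
from typing import Dict, List


def aggregate_runs(results: List[Dict]) -> Dict:
    """Summarize turn counts, success rate, duration, and cost across runs."""
    if not results:
        raise ValueError("no runs to aggregate")
    turns = [r["turns"] for r in results]  # "turns" is an assumed key
    successes = sum(1 for r in results if r["success"])
    return {
        "runs": len(results),
        "avg_turns": round(mean(turns), 1),
        "turn_range": (min(turns), max(turns)),
        "success_rate": round(100 * successes / len(results)),
        "avg_duration_seconds": round(mean(r["duration_seconds"] for r in results), 1),
        "total_cost_usd": round(sum(r["cost_usd"] for r in results), 3),
    }


# Invented sample data for five runs.
sample = [
    {"turns": 8, "success": True, "duration_seconds": 40.0, "cost_usd": 0.030},
    {"turns": 22, "success": False, "duration_seconds": 50.0, "cost_usd": 0.032},
    {"turns": 12, "success": True, "duration_seconds": 45.0, "cost_usd": 0.031},
    {"turns": 15, "success": False, "duration_seconds": 46.0, "cost_usd": 0.033},
    {"turns": 14, "success": True, "duration_seconds": 45.0, "cost_usd": 0.030},
]
print(aggregate_runs(sample))
```

A wide turn range alongside a middling success rate, as in the sample box, suggests that the CLI's output leads the agent down different paths from run to run.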

Contributing Scenarios

We welcome scenario contributions! Help us test more CLI tools:

1. Fork this repository
2. Add your scenarios under scenarios/<tool-name>/
3. Run the tests and update the benchmark table
4. Submit a PR with your results

Scenario Format

Simple Text Format

Create simple text files with clear prompts:

# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.

Enhanced YAML Format (Recommended)

Use YAML frontmatter for better control and metadata:

# scenarios/vercel/deploy-complex.txt
---
version: 2
created: 2025-01-22
tool: vercel
permission_mode: acceptEdits
allowed_tools: [Read, Write, Bash]
model: opus
max_turns: 15
complexity: complex
expected_turns: 8-12
description: "Production deployment with environment setup"
---
Deploy this Next.js application to production using Vercel CLI.
Configure production environment variables and ensure the deployment
is successful with proper domain configuration.

YAML Frontmatter Options:

model: Override default model (sonnet, opus)
max_turns: Limit agent interactions
permission_mode: Set permissions (acceptEdits, default, plan, bypassPermissions)
allowed_tools: Specify tools ([Read, Write, Bash])
expected_turns: Range for AX scoring comparison
complexity: Scenario difficulty (simple, medium, complex)
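
To illustrate how a scenario file divides into metadata and prompt, here is a simplified sketch that uses only the standard library. It handles flat key: value lines like those in the example above; a real YAML parser would be needed for anything nested. Both function names are hypothetical, and AgentProbe's own parsing may differ.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a scenario file into (metadata, prompt)."""
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    # No frontmatter: the whole file is the prompt.
    if not stripped or stripped[0] != "---" or "---" not in stripped[1:]:
        return {}, text.strip()
    end = stripped.index("---", 1)
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    prompt = "\n".join(lines[end + 1:]).strip()
    return meta, prompt


def turns_within_expected(actual: int, expected: str) -> bool:
    """Check a turn count against a range such as "8-12" (or a single "8")."""
    low, _, high = expected.partition("-")
    return int(low) <= actual <= int(high or low)
```

For the deploy-complex example, a run that finishes in 10 turns falls inside the expected 8-12 range, while one that takes 14 turns does not.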

Running Benchmark Tests

# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate report (placeholder)
uv run agentprobe report --format markdown

Architecture

AgentProbe follows a simple 4-component architecture:

CLI Layer (cli.py) - Typer-based command interface with progress indicators
Runner (runner.py) - Executes scenarios via Claude Code SDK with YAML frontmatter support
Analyzer (analyzer.py) - AI-powered analysis using Claude to identify friction points
Reporter (reporter.py) - AX-focused output for CLI developers

Agent Experience (AX) Analysis

AgentProbe uses Claude itself to analyze agent interactions, providing:

Intelligent Analysis: Claude analyzes execution traces to identify specific friction points
AX Scoring: Automatic scoring based on turn efficiency and success patterns
Contextual Recommendations: Actionable improvements tailored to each CLI tool
Consistency Tracking: Multi-run analysis to identify systematic issues

This approach avoids hardcoded patterns and provides nuanced, tool-specific insights that help CLI developers understand exactly where their tools create friction for AI agents.

Prompt Management & Versioning

AgentProbe uses externalized Jinja2 templates for analysis prompts:

Template-based Prompts: Analysis prompts are stored in prompts/analysis.jinja2 for easy editing and iteration
Version Tracking: Each analysis includes prompt version metadata for reproducible results
Dynamic Variables: Templates support contextual variables (scenario, tool, trace data)
Historical Comparison: Version tracking enables comparing results across prompt iterations

# Prompt templates are automatically loaded from prompts/ directory
# Version information is tracked in prompts/metadata.json
# Analysis results include prompt_version field for tracking
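
The idea of stamping each analysis with a prompt version can be sketched as follows. To stay dependency-free, this example swaps in the standard library's string.Template in place of Jinja2. The template text and the metadata layout are hypothetical stand-ins for prompts/analysis.jinja2 and prompts/metadata.json, whose real contents are not shown in this README.

```python
import json
from string import Template

# Hypothetical stand-ins for the real template and metadata files.
TEMPLATE_TEXT = "Analyze how the agent used $tool in scenario $scenario.\nTrace:\n$trace"
METADATA_JSON = '{"analysis": {"version": "1.0.0"}}'


def build_analysis_request(tool: str, scenario: str, trace: str) -> dict:
    """Render the prompt and attach the template version to the request."""
    version = json.loads(METADATA_JSON)["analysis"]["version"]
    prompt = Template(TEMPLATE_TEXT).substitute(tool=tool, scenario=scenario, trace=trace)
    return {"prompt": prompt, "prompt_version": version}


request = build_analysis_request("gh", "create-pr", "turn 1: gh --help")
print(request["prompt_version"])
```

Storing prompt_version next to each result makes it possible to tell whether two analyses differ because the CLI changed or because the prompt did.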

Requirements

Python 3.10+
uv package manager
Claude Code SDK (automatically installed)

Key Features

🎯 Agent Experience (AX) Focus

AX Scores (A-F) based on turn efficiency and success rate
Friction Point Analysis identifies specific CLI pain points
Actionable Recommendations for CLI developers

📊 Progress & Feedback

Real-time Progress with live turn count and elapsed time
Consistency Analysis across multiple runs
Expected vs Actual turn comparison using YAML metadata

🔧 Advanced Scenario Control

YAML Frontmatter for model selection, permissions, turn limits
Multiple Authentication methods with process isolation
Flexible Tool Configuration per scenario

Available Scenarios

Current test scenarios included:

GitHub CLI (gh/)
- create-pr.txt - Create pull requests
Vercel (vercel/)
- deploy.txt - Deploy applications to production
- preview-deploy.txt - Deploy to preview environment
- init-project.txt - Initialize new project with template
- env-setup.txt - Configure environment variables
- list-deployments.txt - List recent deployments
- domain-setup.txt - Add custom domain configuration
- rollback.txt - Rollback to previous deployment
- logs.txt - View deployment logs
- build-local.txt - Build project locally
- ax-test.txt - Simple version check (AX demo)
- yaml-options-test.txt - YAML frontmatter demo
Docker (docker/)
- run-nginx.txt - Run nginx containers
Wrangler (Cloudflare) (wrangler/)
- Multiple deployment and development scenarios

Browse all scenarios →
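
Because scenarios live in one directory per tool, a few lines of Python can enumerate them. This sketch assumes the scenarios/<tool-name>/<scenario>.txt layout described in this README; list_scenarios is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List


def list_scenarios(root: Path) -> Dict[str, List[str]]:
    """Map each tool directory under `root` to its sorted scenario names."""
    scenarios: Dict[str, List[str]] = {}
    for tool_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        names = sorted(f.stem for f in tool_dir.glob("*.txt"))
        if names:
            scenarios[tool_dir.name] = names
    return scenarios


# Example, run from a clone of the repository:
# for tool, names in list_scenarios(Path("scenarios")).items():
#     print(tool, names)
```

The scenario names returned here are the values you would pass to --scenario.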

Development

# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest

See TASKS.md for the development roadmap and task tracking.

Programmatic Usage

import asyncio
from agentprobe import test_cli

async def main():
    # Run the create-pr scenario against the GitHub CLI (gh)
    result = await test_cli("gh", "create-pr")
    # The result is a dict; print the outcome, duration, and cost
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())

License

MIT

Release history

Version	Release date
0.3.13 (this version)	Aug 17, 2025
0.3.12	Aug 17, 2025
0.3.11	Aug 6, 2025
0.3.10	Aug 6, 2025
0.3.8	Aug 6, 2025
0.3.7	Aug 6, 2025
0.3.6	Aug 6, 2025
0.3.5	Aug 6, 2025
0.3.3	Aug 6, 2025
0.3.2	Aug 6, 2025
0.3.1	Aug 5, 2025
0.3.0	Aug 5, 2025
0.2.1	Aug 4, 2025
0.2.0	Aug 4, 2025
0.1.4	Aug 4, 2025
0.1.3	Jul 31, 2025
0.1.2	Jul 31, 2025
0.1.1	Jul 31, 2025
0.1.0	Jul 31, 2025

Download files

Source Distribution

agentprobe-0.3.13.tar.gz (370.6 kB)

Uploaded Aug 17, 2025 Source

Built Distribution

agentprobe-0.3.13-py3-none-any.whl (61.6 kB)

Uploaded Aug 17, 2025 Python 3

File details

Details for the file agentprobe-0.3.13.tar.gz.

File metadata

Download URL: agentprobe-0.3.13.tar.gz
Upload date: Aug 17, 2025
Size: 370.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.20

File hashes

Hashes for agentprobe-0.3.13.tar.gz
Algorithm	Hash digest
SHA256	`7a340bc460fd26e233ca96e461820f7c02e9e1272342113770f27bc5c44c6955`
MD5	`c40468640bb5d50dc4dc2d9743404ddf`
BLAKE2b-256	`961c91e530de96528bd7637514aa4906db5d8d372a971e7819a98ab964396dfc`

File details

Details for the file agentprobe-0.3.13-py3-none-any.whl.

File metadata

Download URL: agentprobe-0.3.13-py3-none-any.whl
Upload date: Aug 17, 2025
Size: 61.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.20

File hashes

Hashes for agentprobe-0.3.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1607956e739f50c7013465530996261dc5f50926481f9e1baac4682b1a067bd4`
MD5	`a0452697869958a2852bfda4125e8696`
BLAKE2b-256	`5c7074c9e653ec3b7af9becf0d021a8e5d70b316e0d0c03e99a52876cf1688a9`
