Benchmark and optimize AI agent skill descriptions - the SEO for agent skills

Skills Arena

Companies are competing for context. Is your skill winning?


Why Skills Arena?

The battleground has moved. Developers don't Google for tools — they ask their AI assistant. And in that moment, your skill is either chosen or invisible.

Every day, thousands of decisions happen inside AI context windows. Your skill vs. competitors. Your description vs. theirs. And you have no idea who's winning.

┌─────────────────────────────────────────────────────────────────────────┐
│  Developer: "Find the latest AI news and summarize the key points"     │
│                                                                         │
│  Agent's Context Window:                                                │
│    • Your Search Skill                                                  │
│    • Competitor's Web Scraper                                           │
│    • Built-in WebSearch                                                 │
│                                                                         │
│  ⚡ One satisfies the request. The rest are forgotten.                  │
│  📊 Skills Arena shows you who wins — and why.                          │
└─────────────────────────────────────────────────────────────────────────┘

Skills Arena lets you benchmark the context layer — see how your skill performs against the competition before your users do.

How It Works

                            ┌──────────────────────────────────────────────────────────┐
                            │              S C E N A R I O   G E N E R A T I O N       │
 ┌─────────────────┐        │                                                          │
 │   YOUR SKILL    │───────▶│   "Store embeddings"       → should pick: Your Skill    │
 │  vector-db.md   │        │   "Semantic search docs"   → should pick: Your Skill    │
 └─────────────────┘        │   "Scale to 1B vectors"    → should pick: Your Skill    │
                            │                                                          │
 ┌─────────────────┐        │   "Hybrid keyword+vector"  → should pick: Competitor    │
 │   COMPETITOR    │───────▶│   "Filter by metadata"     → should pick: Competitor    │
 │  rival-db.md    │        │                                                          │
 └─────────────────┘        └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │              A G E N T   S I M U L A T I O N             │
                            │                                                          │
                            │   Agent sees ALL skills in context, picks ONE per task  │
                            │                                                          │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Store embeddings"                              │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Your Skill ✅ WIN                │    │
                            │   └─────────────────────────────────────────────────┘    │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Semantic search docs"                          │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Competitor 🔴 STOLEN!            │    │
                            │   └─────────────────────────────────────────────────┘    │
                            └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │                    R E S U L T S                         │
                            │                                                          │
                            │   Your Skill        ████████████░░░░░░   60% selected    │
                            │   Competitor        ████████░░░░░░░░░░   40% selected    │
                            │                                                          │
                            │   🔴 STEALS: Competitor won 2 of your scenarios          │
                            │   🏆 WINNER: Your Skill (but watch those steals!)        │
                            └──────────────────────────────────────────────────────────┘

The flow:

  1. Input skills — yours and the competition
  2. Generate scenarios — prompts where each skill should be chosen
  3. Simulate — a real agent sees all skills and picks one per task
  4. Track — wins, losses, and steals (when competitors take your scenarios)
  5. Report — selection rates, reasoning, and actionable insights
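The five steps can be walked through end to end with a toy stand-in. The sketch below is not how Skills Arena or a real agent selects skills: `toy_pick` is a hypothetical word-overlap picker, and the skill descriptions and prompts are made up for illustration. It only shows how selections, steals, and rates fall out of the loop.

```python
from collections import Counter

# Step 1: hypothetical skill descriptions (real inputs are skill files).
skills = {
    "Your Skill": "store embeddings and run semantic vector search at scale",
    "Competitor": "hybrid keyword and vector search with metadata filter",
}

# Step 2: scenarios as (prompt, expected skill) pairs.
scenarios = [
    ("store embeddings for my documents", "Your Skill"),
    ("filter results by metadata", "Competitor"),
    ("hybrid keyword search over documents", "Your Skill"),
]


def toy_pick(prompt: str) -> str:
    """Stand-in for the agent: pick the skill sharing the most words with the prompt."""
    words = set(prompt.lower().split())
    return max(skills, key=lambda name: len(words & set(skills[name].split())))


# Steps 3 and 4: simulate one pick per scenario and track steals.
selected = Counter()
steals = []
for prompt, expected in scenarios:
    choice = toy_pick(prompt)
    selected[choice] += 1
    if choice != expected:
        steals.append((prompt, expected, choice))

# Step 5: report selection rates and any stolen scenarios.
rates = {name: selected[name] / len(scenarios) for name in skills}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} selected")
for prompt, expected, choice in steals:
    print(f"STOLEN: {prompt!r} was meant for {expected}, went to {choice}")
```

In this toy run the third prompt was written for Your Skill, but its wording overlaps more with the competitor's description, so it is recorded as a steal.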

Quick Start

Installation

pip install skills-arena

Compare Two Skills

from skills_arena import Arena

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search and content extraction",
)

print(f"Winner: {results.winner}")
print(f"Selection rates: {results.selection_rates}")

Output:

======================================================================
RESULTS
======================================================================

🏆 Winner: Competitor Skill

📊 Selection Rates:
  My Skill             ██████               30%
  Competitor Skill     ██████████████       70%

📋 Scenarios run: 10

----------------------------------------------------------------------
🔴 STEAL DETECTION
----------------------------------------------------------------------
  My Skill: Lost 2 scenario(s) to competitors

Features

🎯 Realistic Skill Discovery

Skills Arena tests real skill discovery — skills are loaded naturally into the agent's context, exactly how your users experience it. No prompt injection, no artificial setup.

📊 Detailed Results with Reasoning

See exactly why the agent chose each skill:

[Scenario 1]
  Prompt: Find the latest AI news and summarize findings
  Designed for: My Skill
  Selected: Competitor Skill
  Agent's reasoning: I'll help you research AI news. Let me use the
                      competitor skill which handles web research...

🔴 Steal Detection

Know when competitors win scenarios designed for your skill:

🔴 STEAL DETECTION
  My Skill: Lost 2 scenario(s) to competitors
    - scenario-abc123
    - scenario-def456

🎮 Custom Scenarios (Power Users)

Define your own test cases for regression testing, edge cases, or real production prompts:

from skills_arena import Arena, CustomScenario

arena = Arena()

results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find AI news"),  # Blind test
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",  # Enables steal detection
        ),
    ],
)

🔀 Mix Custom + Generated Scenarios

from skills_arena import Arena, CustomScenario, GenerateScenarios

arena = Arena()

results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My edge case"),
        GenerateScenarios(count=5),  # Generate 5 more with LLM
    ],
)

Configuration

from skills_arena import Arena, Config

config = Config(
    # Scenario generation
    scenarios=10,                       # Number of test scenarios
    scenario_strategy="per_skill",      # "per_skill" or "balanced"
    temperature=0.7,                    # Generation diversity

    # Agent framework
    agents=["claude-code"],             # Uses Claude Agent SDK

    # Execution
    timeout_seconds=60,                 # Per-scenario timeout
)

arena = Arena(config)

Scenario Strategies

Strategy     Description
balanced     Generate scenarios for all skills together (default)
per_skill    Generate from each skill alone — reveals "steal rates"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...   # Required
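Since the key is required, it may help to check for it before starting a run. This is a minimal sketch using only the standard library; `api_key_configured` is a hypothetical helper, not part of Skills Arena.

```python
import os


def api_key_configured(env=None) -> bool:
    """Return True if ANTHROPIC_API_KEY is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


if not api_key_configured():
    print("ANTHROPIC_API_KEY is not set; export it before running a comparison.")
```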

API Reference

Arena Methods

Method                               Description
arena.evaluate(skill, task)          Evaluate a single skill
arena.compare(skills, task)          Compare multiple skills head-to-head
arena.battle_royale(skills, task)    Full tournament with Elo rankings

Result Objects

# ComparisonResult
results.winner              # Name of winning skill
results.selection_rates     # {skill_name: rate}
results.scenario_details    # List of ScenarioDetail
results.steals              # {skill_name: [stolen_scenario_ids]}
results.insights            # List of Insight

# ScenarioDetail
detail.prompt               # The test prompt
detail.expected_skill       # Which skill it was designed for
detail.selected_skill       # Which skill the agent chose
detail.reasoning            # Agent's text before selection
detail.was_stolen           # True if competitor won
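One way to use these fields is to group lost scenarios by the skill that took them. The sketch below runs without the library: `DetailStub` is a hypothetical stand-in that only mirrors the ScenarioDetail attribute names listed above, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DetailStub:
    """Hypothetical stand-in mirroring the ScenarioDetail attribute names."""
    prompt: str
    expected_skill: Optional[str]
    selected_skill: str
    reasoning: str
    was_stolen: bool


def stolen_by(details, my_skill):
    """Group prompts lost by `my_skill` under the skill that won them."""
    grouped = {}
    for d in details:
        if d.was_stolen and d.expected_skill == my_skill:
            grouped.setdefault(d.selected_skill, []).append(d.prompt)
    return grouped


# Invented sample data; reasoning text is left as a placeholder.
details = [
    DetailStub("Find AI news", "My Skill", "Competitor Skill", "...", True),
    DetailStub("Scrape pricing", "My Skill", "My Skill", "...", False),
    DetailStub("Blind prompt", None, "Competitor Skill", "...", False),
]

for thief, prompts in stolen_by(details, "My Skill").items():
    print(f"{thief} took: {prompts}")
```

With real results, the same loop would presumably run over `results.scenario_details`, and `detail.reasoning` is where to look for why the competitor was preferred.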

Custom Scenarios

from skills_arena import CustomScenario, GenerateScenarios

# Blind test (no expected skill)
CustomScenario(prompt="Find AI news")

# With expected skill (enables steal detection)
CustomScenario(
    prompt="Scrape the pricing table",
    expected_skill="Web Scraper",
    tags=["scraping", "pricing"],
)

# Generate N scenarios with LLM
GenerateScenarios(count=5)

Key Metrics

Metric            Description                               What It Means
Selection Rate    % of times your skill is chosen           Your share of the context layer
Steal Rate        % of your scenarios won by competitors    Opportunities lost to alternatives
Defense Rate      % of your scenarios you kept              How well you hold your ground
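The three metrics can be worked through on a small example. The denominators here are an assumption read off the table: selection rate is taken over all scenarios, while steal and defense rates are taken only over scenarios designed for the skill. `key_metrics` is a hypothetical helper, and the library may compute these differently.

```python
def key_metrics(outcomes, skill):
    """Compute the three rates from (expected_skill, selected_skill) pairs."""
    total = len(outcomes)
    chosen = sum(1 for _, sel in outcomes if sel == skill)
    # Scenarios that were designed for this skill.
    own = [(exp, sel) for exp, sel in outcomes if exp == skill]
    kept = sum(1 for _, sel in own if sel == skill)
    return {
        "selection_rate": chosen / total if total else 0.0,
        "defense_rate": kept / len(own) if own else 0.0,
        "steal_rate": (len(own) - kept) / len(own) if own else 0.0,
    }


# Invented outcomes: four scenarios designed for My Skill, one for Competitor.
outcomes = [
    ("My Skill", "My Skill"),
    ("My Skill", "Competitor"),
    ("My Skill", "My Skill"),
    ("My Skill", "My Skill"),
    ("Competitor", "Competitor"),
]
metrics = key_metrics(outcomes, "My Skill")
print(metrics)
```

Here My Skill is chosen in 3 of 5 scenarios (60% selection rate) and keeps 3 of its own 4 (75% defense rate, 25% steal rate). Under these definitions, steal rate and defense rate always sum to 100%.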

Supported Agents

Agent          Status          Notes
Claude Code    ✅ Supported    Primary agent, uses Claude Agent SDK
Codex CLI      🔜 Coming       OpenAI's coding agent
Gemini CLI     🔜 Coming       Google's coding agent
Cursor         🔜 Planned      IDE-integrated agent
Windsurf       🔜 Planned      Codeium's coding agent

Supported Skill Formats

  • Claude Code: .md skill files with YAML frontmatter
  • OpenAI: Function calling schemas (JSON)
  • MCP: Tool definitions
  • Generic: Plain text descriptions
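For the first format, a skill file is a Markdown document that opens with a YAML frontmatter block. The sketch below shows what such a file might look like and reads its flat `key: value` fields with the standard library. The `name` and `description` fields and the `read_frontmatter` helper are assumptions for illustration. Which fields Skills Arena actually reads is not stated here, and real YAML should go through a proper parser.

```python
def read_frontmatter(text: str) -> dict:
    """Parse flat `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing fence found


# Hypothetical skill file content.
skill_md = """---
name: vector-db
description: Store embeddings and run semantic search over documents.
---
# Vector DB

Usage notes for the agent go here.
"""

meta = read_frontmatter(skill_md)
print(meta["name"], "-", meta["description"])
```

The description is the text most likely to influence selection, so it is the field worth iterating on between benchmark runs.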

Roadmap

  • Filesystem-based skill discovery
  • Custom scenarios for power users
  • Agent's reasoning capture
  • Steal detection
  • Web UI dashboard
  • Historical tracking & trends
  • A/B testing for skill descriptions
  • skills.sh integration

Contributing

Contributions welcome! See ARCHITECTURE.md for technical details.

git clone https://github.com/Eyalbenba/skills-arena.git
cd skills-arena
pip install -e ".[dev]"
pytest

License

Apache License 2.0. See LICENSE for details.


Skills Arena — Penetrate the context layer.
