Benchmark and optimize AI agent skill descriptions - the SEO for agent skills

Skills Arena

Companies are competing for context. Is your skill winning?

Why? | How It Works | Quick Start | Features | Custom Scenarios | Configuration


Why Skills Arena?

The battleground has moved. Developers don't Google for tools — they ask their AI assistant. And in that moment, your skill is either chosen or invisible.

Every day, thousands of decisions happen inside AI context windows. Your skill vs. competitors. Your description vs. theirs. And you have no idea who's winning.

┌─────────────────────────────────────────────────────────────────────────┐
│  Developer: "Find the latest AI news and summarize the key points"     │
│                                                                         │
│  Agent's Context Window:                                                │
│    • Your Search Skill                                                  │
│    • Competitor's Web Scraper                                           │
│    • Built-in WebSearch                                                 │
│                                                                         │
│  ⚡ One satisfies the request. The rest are forgotten.                  │
│  📊 Skills Arena shows you who wins — and why.                          │
└─────────────────────────────────────────────────────────────────────────┘

Skills Arena lets you benchmark the context layer — see how your skill performs against the competition before your users do.

How It Works

                            ┌──────────────────────────────────────────────────────────┐
                            │              S C E N A R I O   G E N E R A T I O N       │
 ┌─────────────────┐        │                                                          │
 │   YOUR SKILL    │───────▶│   "Store embeddings"       → should pick: Your Skill    │
 │  vector-db.md   │        │   "Semantic search docs"   → should pick: Your Skill    │
 └─────────────────┘        │   "Scale to 1B vectors"    → should pick: Your Skill    │
                            │                                                          │
 ┌─────────────────┐        │   "Hybrid keyword+vector"  → should pick: Competitor    │
 │   COMPETITOR    │───────▶│   "Filter by metadata"     → should pick: Competitor    │
 │  rival-db.md    │        │                                                          │
 └─────────────────┘        └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │              A G E N T   S I M U L A T I O N             │
                            │                                                          │
                            │   Agent sees ALL skills in context, picks ONE per task  │
                            │                                                          │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Store embeddings"                              │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Your Skill ✅ WIN                │    │
                            │   └─────────────────────────────────────────────────┘    │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Semantic search docs"                          │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Competitor 🔴 STOLEN!            │    │
                            │   └─────────────────────────────────────────────────┘    │
                            └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │                    R E S U L T S                         │
                            │                                                          │
                            │   Your Skill        ████████████░░░░░░   60% selected    │
                            │   Competitor        ████████░░░░░░░░░░   40% selected    │
                            │                                                          │
                            │   🔴 STEALS: Competitor won 2 of your scenarios          │
                            │   🏆 WINNER: Your Skill (but watch those steals!)        │
                            └──────────────────────────────────────────────────────────┘

The flow:

  1. Input skills — yours and the competition
  2. Generate scenarios — prompts where each skill should be chosen
  3. Simulate — a real agent sees all skills and picks one per task
  4. Track — wins, losses, and steals (when competitors take your scenarios)
  5. Report — selection rates, reasoning, and actionable insights
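
The five steps above can be sketched without the library. This is a toy illustration only: `pick_skill` is a hypothetical stand-in that scores keyword overlap, whereas Skills Arena runs a real agent, and the skill descriptions and scenarios below are made up for the example.

```python
from collections import Counter

# Step 1: input skills (hypothetical names and descriptions).
skills = {
    "Your Skill": "store embeddings and run semantic search over vectors",
    "Competitor": "hybrid keyword and vector search with metadata filters",
}

# Step 2: scenarios as (prompt, skill the prompt was designed for).
scenarios = [
    ("store embeddings for my documents", "Your Skill"),
    ("semantic search with metadata filters", "Your Skill"),
    ("hybrid keyword search", "Competitor"),
]


def pick_skill(prompt: str) -> str:
    """Toy stand-in for the agent: pick the description sharing the most words."""
    words = set(prompt.lower().split())
    return max(skills, key=lambda name: len(words & set(skills[name].split())))


# Steps 3-4: simulate each scenario, then track picks and steals.
picks = Counter()
steals = []
for prompt, expected in scenarios:
    selected = pick_skill(prompt)
    picks[selected] += 1
    if selected != expected:
        steals.append((prompt, expected, selected))

# Step 5: report selection rates and stolen scenarios.
for name in skills:
    print(f"{name}: {picks[name] / len(scenarios):.0%} selected")
for prompt, expected, selected in steals:
    print(f"STOLEN from {expected} by {selected}: {prompt!r}")
```

In this toy run the second prompt was written for "Your Skill" but overlaps more with the competitor's description, so it is recorded as a steal. That is the kind of description-level weakness the real benchmark is meant to surface.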

Quick Start

Installation

pip install skills-arena

Compare Two Skills

from skills_arena import Arena, Config

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search and content extraction",
)

print(f"Winner: {results.winner}")
print(f"Selection rates: {results.selection_rates}")

Output:

======================================================================
RESULTS
======================================================================

🏆 Winner: Competitor Skill

📊 Selection Rates:
  My Skill             ██████               30%
  Competitor Skill     ██████████████       70%

📋 Scenarios run: 10

----------------------------------------------------------------------
🔴 STEAL DETECTION
----------------------------------------------------------------------
  My Skill: Lost 2 scenario(s) to competitors

Features

🎯 Realistic Skill Discovery

Skills Arena tests real skill discovery: skills are loaded naturally into the agent's context, exactly as your users experience it. No prompt injection, no artificial setup.

📊 Detailed Results with Reasoning

See exactly why the agent chose each skill:

[Scenario 1]
  Prompt: Find the latest AI news and summarize findings
  Designed for: My Skill
  Selected: Competitor Skill
  Agent's reasoning: I'll help you research AI news. Let me use the
                      competitor skill which handles web research...

🔴 Steal Detection

Know when competitors win scenarios designed for your skill:

🔴 STEAL DETECTION
  My Skill: Lost 2 scenario(s) to competitors
    - scenario-abc123
    - scenario-def456

🎮 Custom Scenarios (Power Users)

Define your own test cases for regression testing, edge cases, or real production prompts:

from skills_arena import Arena, CustomScenario

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find AI news"),  # Blind test
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",  # Enables steal detection
        ),
    ],
)

🔀 Mix Custom + Generated Scenarios

from skills_arena import Arena, CustomScenario, GenerateScenarios

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My edge case"),
        GenerateScenarios(count=5),  # Generate 5 more with LLM
    ],
)

Configuration

from skills_arena import Arena, Config

config = Config(
    # Scenario generation
    scenarios=10,                       # Number of test scenarios
    scenario_strategy="per_skill",      # "per_skill" or "balanced"
    temperature=0.7,                    # Generation diversity

    # Agent framework
    agents=["claude-code"],             # Uses Claude Agent SDK

    # Execution
    timeout_seconds=60,                 # Per-scenario timeout
)

arena = Arena(config)

Scenario Strategies

Strategy   Description
---------  -----------------------------------------------------------
balanced   Generate scenarios for all skills together (default)
per_skill  Generate from each skill alone, which reveals "steal rates"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...   # Required
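
Since the key is required, it can help to fail early with a clear message before starting a run. The helper below is a hypothetical convenience and not part of Skills Arena; it only reads the environment variable named above.

```python
import os


def require_api_key(env=None) -> str:
    """Return ANTHROPIC_API_KEY, or raise a clear error if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it before running Skills Arena."
        )
    return key
```

Calling `require_api_key()` at the top of a benchmark script surfaces a missing key immediately rather than partway through a run.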

API Reference

Arena Methods

Method                             Description
---------------------------------  ------------------------------------
arena.evaluate(skill, task)        Evaluate a single skill
arena.compare(skills, task)        Compare multiple skills head-to-head
arena.battle_royale(skills, task)  Full tournament with ELO rankings
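
This README does not say how `battle_royale` computes its rankings. For readers unfamiliar with Elo ratings, here is a sketch of the standard pairwise Elo update. The K-factor of 32 and the 1500 starting rating are conventional choices assumed for illustration, not values taken from the library.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b); score_a is 1.0 for a win, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two skills start level; the first is selected in one head-to-head scenario.
print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Unlike a raw selection rate, a rating of this kind weights a win by the strength of the opponent, which is useful when more than two skills compete.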

Result Objects

# ComparisonResult
results.winner              # Name of winning skill
results.selection_rates     # {skill_name: rate}
results.scenario_details    # List of ScenarioDetail
results.steals              # {skill_name: [stolen_scenario_ids]}
results.insights            # List of Insight

# ScenarioDetail
detail.prompt               # The test prompt
detail.expected_skill       # Which skill it was designed for
detail.selected_skill       # Which skill the agent chose
detail.reasoning            # Agent's text before selection
detail.was_stolen           # True if a competitor won a scenario designed for expected_skill
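
As a usage sketch, the attributes listed above are enough to see which competitor took which prompts. The `SimpleNamespace` objects below are stand-ins with made-up data so the example runs without the library or an API key; with a real run you would presumably pass `results.scenario_details` instead. `steals_by_winner` is a hypothetical helper, not a library function.

```python
from collections import defaultdict
from types import SimpleNamespace


def steals_by_winner(details):
    """Group stolen prompts by the skill that was selected instead."""
    grouped = defaultdict(list)
    for d in details:
        if d.was_stolen:
            grouped[d.selected_skill].append(d.prompt)
    return dict(grouped)


# Stand-ins carrying the documented ScenarioDetail attributes (made-up data).
details = [
    SimpleNamespace(prompt="Find AI news", expected_skill="My Skill",
                    selected_skill="Competitor Skill", reasoning="...",
                    was_stolen=True),
    SimpleNamespace(prompt="Scrape pricing", expected_skill="My Skill",
                    selected_skill="My Skill", reasoning="...",
                    was_stolen=False),
]
print(steals_by_winner(details))  # {'Competitor Skill': ['Find AI news']}
```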

Custom Scenarios

from skills_arena import CustomScenario, GenerateScenarios

# Blind test (no expected skill)
CustomScenario(prompt="Find AI news")

# With expected skill (enables steal detection)
CustomScenario(
    prompt="Scrape the pricing table",
    expected_skill="Web Scraper",
    tags=["scraping", "pricing"],
)

# Generate N scenarios with LLM
GenerateScenarios(count=5)

Key Metrics

Metric          Description                             What It Means
--------------  --------------------------------------  ----------------------------------
Selection Rate  % of times your skill is chosen         Your share of the context layer
Steal Rate      % of your scenarios won by competitors  Opportunities lost to alternatives
Defense Rate    % of your scenarios you kept            How well you hold your ground
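
A worked example of the three metrics, computed from (expected, selected) pairs. This follows the table's definitions as read here; how the library treats blind scenarios or picks of a built-in tool is not stated, so this sketch assumes any selection other than the expected skill counts as a steal. `skill_metrics` is a hypothetical helper.

```python
def skill_metrics(records, skill):
    """records: list of (expected_skill, selected_skill) pairs for one run."""
    total = len(records)
    selected = sum(1 for _, sel in records if sel == skill)
    own = [(exp, sel) for exp, sel in records if exp == skill]
    kept = sum(1 for _, sel in own if sel == skill)
    return {
        "selection_rate": selected / total if total else 0.0,
        "defense_rate": kept / len(own) if own else 0.0,
        "steal_rate": (len(own) - kept) / len(own) if own else 0.0,
    }


# Made-up run of 10 scenarios: 5 designed for each skill, 2 of ours stolen.
records = (
    [("My Skill", "My Skill")] * 3
    + [("My Skill", "Competitor Skill")] * 2
    + [("Competitor Skill", "Competitor Skill")] * 5
)
print(skill_metrics(records, "My Skill"))
# {'selection_rate': 0.3, 'defense_rate': 0.6, 'steal_rate': 0.4}
```

Under these definitions the steal rate and defense rate always sum to 1 for a given skill, while the selection rate is measured over all scenarios, including those designed for competitors.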

Supported Agents

Agent        Status        Notes
-----------  ------------  ------------------------------------
Claude Code  ✅ Supported  Primary agent, uses Claude Agent SDK
Codex CLI    🔜 Coming     OpenAI's coding agent
Gemini CLI   🔜 Coming     Google's coding agent
Cursor       🔜 Planned    IDE-integrated agent
Windsurf     🔜 Planned    Codeium's coding agent

Supported Skill Formats

  • Claude Code: .md skill files with YAML frontmatter
  • OpenAI: Function calling schemas (JSON)
  • MCP: Tool definitions
  • Generic: Plain text descriptions
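
To make the first format concrete, here is a minimal sketch of reading the frontmatter of a `.md` skill file. The `name` and `description` fields and the sample file content are assumptions about a typical skill file, and the parser only handles flat `key: value` lines. It is not how Skills Arena loads skills, and real YAML should be parsed with a proper YAML library.

```python
def read_frontmatter(text: str) -> dict:
    """Parse flat `key: value` lines between leading '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        # Skip comments and indented (nested) lines; this sketch is flat-only.
        if ":" in line and not line.startswith((" ", "#")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter: treat as having no frontmatter


# Hypothetical skill file content.
skill_md = """---
name: vector-db
description: Store embeddings and run semantic search.
---
# Usage notes go here
"""
print(read_frontmatter(skill_md))
```

The description field is the text the agent weighs when choosing between skills, so it is the part worth iterating on between benchmark runs.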

Roadmap

  • Filesystem-based skill discovery
  • Custom scenarios for power users
  • Agent's reasoning capture
  • Steal detection
  • Web UI dashboard
  • Historical tracking & trends
  • A/B testing for skill descriptions
  • skills.sh integration

Contributing

Contributions welcome! See ARCHITECTURE.md for technical details.

git clone https://github.com/Eyalbenba/skills-arena.git
cd skills-arena
pip install -e ".[dev]"
pytest

License

MIT License. See LICENSE for details.


Skills Arena — Penetrate the context layer.
