
Benchmark and optimize AI agent skill descriptions - the SEO for agent skills


Skills Arena

Is your skill winning the context window?



Why Skills Arena?

Your skill's description is the most important copy you'll ever write. It's read by coding agents thousands of times a day, and it determines whether your product gets used or ignored.

Skills are the new SEO. Just like you used to optimize for Google's algorithm, you now need to optimize for the coding agent's decision-making process. Every day, thousands of decisions happen inside context windows — your skill vs. competitors, your description vs. theirs. And you have no idea who's winning.

┌─────────────────────────────────────────────────────────────────────────┐
│  Developer: "Find the latest AI news and summarize the key points"     │
│                                                                         │
│  Coding Agent's Context Window:                                         │
│    • Your Search Skill                                                  │
│    • Competitor's Web Scraper                                           │
│    • Built-in WebSearch                                                 │
│                                                                         │
│  ⚡ One satisfies the request. The rest are forgotten.                  │
│  📊 Skills Arena shows you who wins — and why.                          │
└─────────────────────────────────────────────────────────────────────────┘

Skills Arena is a skill discovery optimization framework — benchmark and improve how coding agents find and choose your skill.

You spent months building a great product. You wrote a skill so agents can use it. But right now, inside thousands of terminals, a coding agent is reading your skill description next to your competitor's — and choosing theirs. You never even knew it happened.

How It Works

                            ┌──────────────────────────────────────────────────────────┐
                            │              S C E N A R I O   G E N E R A T I O N       │
 ┌─────────────────┐        │                                                          │
 │   YOUR SKILL    │───────▶│   "Store embeddings"       → should pick: Your Skill    │
 │  vector-db.md   │        │   "Semantic search docs"   → should pick: Your Skill    │
 └─────────────────┘        │   "Scale to 1B vectors"    → should pick: Your Skill    │
                            │                                                          │
 ┌─────────────────┐        │   "Hybrid keyword+vector"  → should pick: Competitor    │
 │   COMPETITOR    │───────▶│   "Filter by metadata"     → should pick: Competitor    │
 │  rival-db.md    │        │                                                          │
 └─────────────────┘        └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │              A G E N T   S I M U L A T I O N             │
                            │                                                          │
                            │   Agent sees ALL skills in context, picks ONE per task  │
                            │                                                          │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Store embeddings"                              │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Your Skill ✅ WIN                │    │
                            │   └─────────────────────────────────────────────────┘    │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Semantic search docs"                          │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Competitor 🔴 STOLEN!            │    │
                            │   └─────────────────────────────────────────────────┘    │
                            └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │                    R E S U L T S                         │
                            │                                                          │
                            │   Your Skill        ████████████░░░░░░   60% selected    │
                            │   Competitor        ████████░░░░░░░░░░   40% selected    │
                            │                                                          │
                            │   🔴 STEALS: Competitor won 2 of your scenarios          │
                            │   🏆 WINNER: Your Skill (but watch those steals!)        │
                            └──────────────────────────────────────────────────────────┘

The flow:

  1. Input skills — yours and the competition
  2. Generate scenarios — prompts where each skill should be chosen
  3. Simulate — a real agent sees all skills and picks one per task
  4. Track — wins, losses, and steals (when competitors take your scenarios)
  5. Report — selection rates, reasoning, and actionable insights
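The tracking step can be sketched in plain Python. This is an illustration only, not the library's implementation: the `tally` helper and its tuple input are hypothetical, and a "steal" is assumed to mean a scenario whose expected skill differs from the skill the agent selected.

```python
from collections import Counter, defaultdict

def tally(outcomes):
    """Summarize (scenario_id, expected_skill, selected_skill) tuples.

    Hypothetical helper: returns the selection rate per skill and, for
    each skill, the IDs of its scenarios that another skill won.
    """
    total = len(outcomes)
    if total == 0:
        return {}, {}
    picks = Counter(selected for _, _, selected in outcomes)
    rates = {skill: count / total for skill, count in picks.items()}

    steals = defaultdict(list)
    for scenario_id, expected, selected in outcomes:
        # A blind scenario (expected is None) can never be stolen.
        if expected is not None and selected != expected:
            steals[expected].append(scenario_id)
    return rates, dict(steals)

outcomes = [
    ("s1", "Your Skill", "Your Skill"),   # win
    ("s2", "Your Skill", "Competitor"),   # stolen from you
    ("s3", "Your Skill", "Your Skill"),   # win
    ("s4", "Competitor", "Competitor"),   # competitor keeps its own
    ("s5", "Competitor", "Your Skill"),   # you take one of theirs
]
rates, steals = tally(outcomes)
print(rates)   # {'Your Skill': 0.6, 'Competitor': 0.4}
print(steals)  # {'Your Skill': ['s2'], 'Competitor': ['s5']}
```

A skill can lead on selection rate and still lose scenarios that were written for it, which is why the two numbers are tracked separately.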

Quick Start

Installation

pip install skills-arena

Compare Two Skills

from skills_arena import Arena, Config

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search and content extraction",
)

print(f"Winner: {results.winner}")
print(f"Selection rates: {results.selection_rates}")

Output:

======================================================================
RESULTS
======================================================================

🏆 Winner: Competitor Skill

📊 Selection Rates:
  My Skill             ██████               30%
  Competitor Skill     ██████████████       70%

📋 Scenarios run: 10

----------------------------------------------------------------------
🔴 STEAL DETECTION
----------------------------------------------------------------------
  My Skill: Lost 2 scenario(s) to competitors

Optimize a Skill

Lost the comparison? Let the optimizer fix it:

result = arena.optimize(
    skill="./my-skill.md",
    competitors=["./competitor.md"],
    task="web search and content extraction",
    max_iterations=2,
)

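# print_results is a display helper whose import is not shown in this snippet.
# The same figures are exposed as fields, e.g. result.selection_rate_before
# and result.selection_rate_after.
print_results(result)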

Output:

======================================================================
OPTIMIZATION RESULTS
======================================================================
Skill:       My Skill
Competitors: Competitor Skill
Scenarios:   6  |  Iterations: 2

Before -> After:
  Selection Rate:  ███████░░░░░░░░░░░░░ 33%  ->  █████████████░░░░░░░ 67%  (+34%)
  Grade:             F  ->  D
  Tokens:           43  ->  40  (-3)

----------------------------------------------------------------------
Iteration 1:  33% -> 67%  (+34%)  [improved]

  Added concrete usage examples, specified output format,
  and differentiated from scraping tools.

  Scenarios:  3 won  |  0 stolen

The optimizer runs a compare → rewrite → verify loop:

  1. Baseline comparison to measure current performance
  2. LLM rewrites the description using competition data and stolen scenario reasoning
  3. Verifies improvement using the same frozen scenarios
  4. Repeats if max_iterations > 1 (stops on regression)
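The loop above can be sketched with two pluggable callables. Everything here is a stand-in rather than the optimizer's actual code: `evaluate` and `rewrite` are hypothetical replacements for the agent comparison and the LLM rewrite, and treating a tie as "not a regression" is an assumption.

```python
def optimize(description, evaluate, rewrite, max_iterations=2):
    """Compare -> rewrite -> verify loop over a frozen scenario set.

    evaluate(description) returns a selection rate in [0, 1];
    rewrite(description, rate) returns a candidate description.
    Both callables are hypothetical stand-ins for the LLM-backed steps.
    """
    best, best_rate = description, evaluate(description)  # 1. baseline
    history = [best_rate]
    for _ in range(max_iterations):
        candidate = rewrite(best, best_rate)               # 2. rewrite
        rate = evaluate(candidate)                         # 3. verify
        history.append(rate)
        if rate < best_rate:
            break                                          # stop on regression
        best, best_rate = candidate, rate                  # 4. keep, then repeat
    return best, best_rate, history

# Toy stand-ins: the "selection rate" is the share of keywords present.
KEYWORDS = ("search", "extract", "summarize")

def toy_evaluate(text):
    return sum(word in text for word in KEYWORDS) / len(KEYWORDS)

def toy_rewrite(text, rate):
    missing = [word for word in KEYWORDS if word not in text]
    return text + " " + missing[0] if missing else text

best, rate, history = optimize("search the web", toy_evaluate, toy_rewrite)
print(best)  # search the web extract summarize
print(rate)  # 1.0
```

Because `evaluate` scores every candidate against the same scenarios, a higher rate reflects the rewrite and not a luckier draw of prompts.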

Features

🎯 Realistic Skill Discovery

Skills Arena tests real skill discovery — skills are loaded naturally into the agent's context, exactly how your users experience it. No prompt injection, no artificial setup.

📊 Detailed Results with Reasoning

See exactly why the agent chose each skill:

[Scenario 1]
  Prompt: Find the latest AI news and summarize findings
  Designed for: My Skill
  Selected: Competitor Skill
  Agent's reasoning: I'll help you research AI news. Let me use the
                      competitor skill which handles web research...

🔴 Steal Detection

Know when competitors win scenarios designed for your skill:

🔴 STEAL DETECTION
  My Skill: Lost 2 scenario(s) to competitors
    - scenario-abc123
    - scenario-def456

🎮 Custom Scenarios (Power Users)

Define your own test cases for regression testing, edge cases, or real production prompts:

from skills_arena import Arena, CustomScenario

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find AI news"),  # Blind test
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",  # Enables steal detection
        ),
    ],
)

🔀 Mix Custom + Generated Scenarios

from skills_arena import Arena, CustomScenario, GenerateScenarios

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My edge case"),
        GenerateScenarios(count=5),  # Generate 5 more with LLM
    ],
)

Configuration

from skills_arena import Arena, Config

config = Config(
    # Scenario generation
    scenarios=10,                       # Number of test scenarios
    scenario_strategy="per_skill",      # "per_skill" or "balanced"
    temperature=0.7,                    # Generation diversity

    # Agent framework
    agents=["claude-code"],             # Uses Claude Agent SDK

    # Execution
    timeout_seconds=60,                 # Per-scenario timeout
)

arena = Arena(config)

Scenario Strategies

Strategy    Description
balanced    Generate scenarios for all skills together (default)
per_skill   Generate from each skill alone; reveals "steal rates"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...   # Required
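Since the key is required, a script can check for it before starting a run. The helper below is a hypothetical convenience, not part of the package; it uses only the standard library.

```python
import os

def require_api_key(env=os.environ):
    """Return the Anthropic API key, or raise with a clear message.

    Hypothetical pre-flight check; `env` is injectable for testing.
    """
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it before running a comparison."
        )
    return key

# Typical use at the top of a script:
#     require_api_key()
```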

API Reference

Arena Methods

Method                                     Description
arena.evaluate(skill, task)                Evaluate a single skill
arena.compare(skills, task)                Compare multiple skills head-to-head
arena.battle_royale(skills, task)          Full tournament with ELO rankings
arena.optimize(skill, competitors, task)   Auto-improve a skill description
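How `battle_royale` derives its rankings is not described here. For readers unfamiliar with Elo, the standard rating update is sketched below as background; the function names, the starting rating of 1500, and the K-factor of 32 are assumptions, not values taken from the library.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return both new ratings after one match.

    score_a is 1 if A was selected, 0 if B was, 0.5 for a draw.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta

print(elo_update(1500, 1500, 1))  # (1516.0, 1484.0)
```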

Result Objects

# ComparisonResult
results.winner              # Name of winning skill
results.selection_rates     # {skill_name: rate}
results.scenario_details    # List of ScenarioDetail
results.steals              # {skill_name: [stolen_scenario_ids]}
results.insights            # List of Insight

# ScenarioDetail
detail.prompt               # The test prompt
detail.expected_skill       # Which skill it was designed for
detail.selected_skill       # Which skill the agent chose
detail.reasoning            # Agent's text before selection
detail.was_stolen           # True if competitor won

# OptimizationResult
result.original_skill       # Skill before optimization
result.optimized_skill      # Best skill found
result.iterations           # List of OptimizationIteration
result.total_improvement    # Delta in selection rate
result.selection_rate_before  # Starting selection rate
result.selection_rate_after   # Final selection rate
result.grade_before         # Grade before (A+ to F)
result.grade_after          # Grade after
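A small sketch of working with these fields: the function below lists stolen scenarios from any objects that carry the `ScenarioDetail` attributes shown above. The `stolen_report` helper is hypothetical, and `SimpleNamespace` objects stand in for real results so the snippet runs without an API key.

```python
from types import SimpleNamespace

def stolen_report(details):
    """Return one line per stolen scenario from ScenarioDetail-like objects."""
    return [
        f"{d.prompt!r}: expected {d.expected_skill}, got {d.selected_skill}"
        for d in details
        if d.was_stolen
    ]

# Stand-in objects carrying the documented fields.
details = [
    SimpleNamespace(
        prompt="Find AI news",
        expected_skill="My Skill",
        selected_skill="Competitor Skill",
        reasoning="I'll help you research AI news.",
        was_stolen=True,
    ),
    SimpleNamespace(
        prompt="Scrape the pricing table",
        expected_skill="My Skill",
        selected_skill="My Skill",
        reasoning="The scraping skill fits this request.",
        was_stolen=False,
    ),
]

for line in stolen_report(details):
    print(line)  # 'Find AI news': expected My Skill, got Competitor Skill
```

With a real comparison, the same function would be called as `stolen_report(results.scenario_details)`.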

Custom Scenarios

from skills_arena import CustomScenario, GenerateScenarios

# Blind test (no expected skill)
CustomScenario(prompt="Find AI news")

# With expected skill (enables steal detection)
CustomScenario(
    prompt="Scrape the pricing table",
    expected_skill="Web Scraper",
    tags=["scraping", "pricing"],
)

# Generate N scenarios with LLM
GenerateScenarios(count=5)

Key Metrics

Metric           Description                               What It Means
Selection Rate   % of times your skill is chosen           Your share of the context layer
Steal Rate       % of your scenarios won by competitors    Opportunities lost to alternatives
Defense Rate     % of your scenarios you kept              How well you hold your ground
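Read literally, steal rate and defense rate are both computed over the scenarios designed for your skill. A minimal sketch, assuming every scenario ends with exactly one skill selected (so the two rates sum to 1); the `steal_and_defense_rate` helper is hypothetical:

```python
def steal_and_defense_rate(skill, outcomes):
    """Compute (steal_rate, defense_rate) for one skill.

    outcomes is an iterable of (expected_skill, selected_skill) pairs.
    Returns (None, None) if no scenario was designed for the skill.
    """
    own = [selected for expected, selected in outcomes if expected == skill]
    if not own:
        return None, None
    kept = sum(selected == skill for selected in own)
    return (len(own) - kept) / len(own), kept / len(own)

outcomes = [
    ("My Skill", "My Skill"),
    ("My Skill", "My Skill"),
    ("My Skill", "Competitor Skill"),   # stolen
    ("My Skill", "My Skill"),
    ("Competitor Skill", "My Skill"),   # not one of "your" scenarios
]
print(steal_and_defense_rate("My Skill", outcomes))  # (0.25, 0.75)
```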

Supported Agents

Agent         Status         Notes
Claude Code   ✅ Supported   Primary agent, uses Claude Agent SDK
Codex CLI     🔜 Coming      OpenAI's coding agent
Gemini CLI    🔜 Coming      Google's coding agent
Cursor        🔜 Planned     IDE-integrated agent
Windsurf      🔜 Planned     Codeium's coding agent

Supported Skill Formats

  • Claude Code: .md skill files with YAML frontmatter
  • OpenAI: Function calling schemas (JSON)
  • MCP: Tool definitions
  • Generic: Plain text descriptions
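For the first format, the description an agent reads lives in the file's frontmatter. The sketch below shows how such a file splits into metadata and body. It is a deliberately minimal parser (flat `key: value` lines only, not full YAML), and the sample file with its `name` and `description` values is made up for illustration.

```python
def read_frontmatter(text):
    """Split a skill file into (metadata dict, body text).

    Minimal sketch: handles only flat `key: value` lines between two
    `---` delimiters. Use a real YAML parser for anything richer.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the file as plain text

sample = """---
name: my-search-skill
description: Search the web and summarize results.
---
Use this skill when the user asks for current information.
"""
meta, body = read_frontmatter(sample)
print(meta["description"])  # Search the web and summarize results.
```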

Roadmap

  • Filesystem-based skill discovery
  • Custom scenarios for power users
  • Agent's reasoning capture
  • Steal detection
  • Auto-optimize skill descriptions
  • Web UI dashboard
  • Historical tracking & trends
  • skills.sh integration

Contributing

Contributions welcome! See ARCHITECTURE.md for technical details.

git clone https://github.com/Eyalbenba/skills-arena.git
cd skills-arena
pip install -e ".[dev]"
pytest

License

Apache License 2.0. See LICENSE for details.


Skills Arena — Skills are the new SEO.

