skills-arena

Benchmark and optimize AI agent skill descriptions - the SEO for agent skills

Maintainers

Eyalbenb

Project description

Skills Arena

Is your skill winning the context window?

Why Skills Arena?

Your skill's description is the most important copy you'll ever write. It's read by coding agents thousands of times a day, and it determines whether your product gets used or ignored.

Skills are the new SEO. Just like you used to optimize for Google's algorithm, you now need to optimize for the coding agent's decision-making process. Every day, thousands of decisions happen inside context windows — your skill vs. competitors, your description vs. theirs. And you have no idea who's winning.

┌─────────────────────────────────────────────────────────────────────────┐
│  Developer: "Find the latest AI news and summarize the key points"     │
│                                                                         │
│  Coding Agent's Context Window:                                         │
│    • Your Search Skill                                                  │
│    • Competitor's Web Scraper                                           │
│    • Built-in WebSearch                                                 │
│                                                                         │
│  ⚡ One satisfies the request. The rest are forgotten.                  │
│  📊 Skills Arena shows you who wins — and why.                          │
└─────────────────────────────────────────────────────────────────────────┘

Skills Arena is a skill discovery optimization framework — benchmark and improve how coding agents find and choose your skill.

You spent months building a great product. You wrote a skill so agents can use it. But right now, inside thousands of terminals, a coding agent is reading your skill description next to your competitor's — and choosing theirs. You never even knew it happened.

How It Works

                            ┌──────────────────────────────────────────────────────────┐
                            │              S C E N A R I O   G E N E R A T I O N       │
 ┌─────────────────┐        │                                                          │
 │   YOUR SKILL    │───────▶│   "Store embeddings"       → should pick: Your Skill    │
 │  vector-db.md   │        │   "Semantic search docs"   → should pick: Your Skill    │
 └─────────────────┘        │   "Scale to 1B vectors"    → should pick: Your Skill    │
                            │                                                          │
 ┌─────────────────┐        │   "Hybrid keyword+vector"  → should pick: Competitor    │
 │   COMPETITOR    │───────▶│   "Filter by metadata"     → should pick: Competitor    │
 │  rival-db.md    │        │                                                          │
 └─────────────────┘        └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │              A G E N T   S I M U L A T I O N             │
                            │                                                          │
                            │   Agent sees ALL skills in context, picks ONE per task  │
                            │                                                          │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Store embeddings"                              │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Your Skill ✅ WIN                │    │
                            │   └─────────────────────────────────────────────────┘    │
                            │   ┌─────────────────────────────────────────────────┐    │
                            │   │ "Semantic search docs"                          │    │
                            │   │  Expected: Your Skill                           │    │
                            │   │  Agent picked: Competitor 🔴 STOLEN!            │    │
                            │   └─────────────────────────────────────────────────┘    │
                            └────────────────────────────┬─────────────────────────────┘
                                                         │
                                                         ▼
                            ┌──────────────────────────────────────────────────────────┐
                            │                    R E S U L T S                         │
                            │                                                          │
                            │   Your Skill        ████████████░░░░░░   60% selected    │
                            │   Competitor        ████████░░░░░░░░░░   40% selected    │
                            │                                                          │
                            │   🔴 STEALS: Competitor won 2 of your scenarios          │
                            │   🏆 WINNER: Your Skill (but watch those steals!)        │
                            └──────────────────────────────────────────────────────────┘

The flow:

1. Input skills: yours and the competition
2. Generate scenarios: prompts where each skill should be chosen
3. Simulate: a real agent sees all skills and picks one per task
4. Track: wins, losses, and steals (when competitors take your scenarios)
5. Report: selection rates, reasoning, and actionable insights

Quick Start

Installation

pip install skills-arena

Compare Two Skills

from skills_arena import Arena, Config

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search and content extraction",
)

print(f"Winner: {results.winner}")
print(f"Selection rates: {results.selection_rates}")

Output:

======================================================================
RESULTS
======================================================================

🏆 Winner: Competitor Skill

📊 Selection Rates:
  My Skill             ██████               30%
  Competitor Skill     ██████████████       70%

📋 Scenarios run: 10

----------------------------------------------------------------------
🔴 STEAL DETECTION
----------------------------------------------------------------------
  My Skill: Lost 2 scenario(s) to competitors

Optimize a Skill

Lost the comparison? Let the optimizer fix it:

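# Reuses the `arena` instance created in the comparison example above.
result = arena.optimize(
    skill="./my-skill.md",
    competitors=["./competitor.md"],
    task="web search and content extraction",
    max_iterations=2,
)

# print_results is a formatting helper whose import is not shown in this README.
# Without it, the documented fields can be printed directly:
#   print(result.selection_rate_before, "->", result.selection_rate_after)
#   print(result.grade_before, "->", result.grade_after)
print_results(result)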

Output:

======================================================================
OPTIMIZATION RESULTS
======================================================================
Skill:       My Skill
Competitors: Competitor Skill
Scenarios:   6  |  Iterations: 2

Before -> After:
  Selection Rate:  ███████░░░░░░░░░░░░░ 33%  ->  █████████████░░░░░░░ 67%  (+34%)
  Grade:             F  ->  D
  Tokens:           43  ->  40  (-3)

----------------------------------------------------------------------
Iteration 1:  33% -> 67%  (+34%)  [improved]

  Added concrete usage examples, specified output format,
  and differentiated from scraping tools.

  Scenarios:  3 won  |  0 stolen

The optimizer runs a compare → rewrite → verify loop:

1. Run a baseline comparison to measure current performance.
2. An LLM rewrites the description using competition data and the agent's reasoning from stolen scenarios.
3. Verify the improvement against the same frozen scenarios.
4. Repeat if max_iterations > 1, stopping on regression.
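
The loop above can be sketched in plain Python. This is an illustration only, not the library's implementation: `optimize_description`, `score`, and `rewrite` are hypothetical names, and `score` stands in for a full comparison run over the frozen scenarios.

```python
from typing import Callable, List, Tuple


def optimize_description(
    description: str,
    score: Callable[[str], float],    # hypothetical: selection rate on frozen scenarios
    rewrite: Callable[[str], str],    # hypothetical: the LLM rewrite step
    max_iterations: int = 2,
) -> Tuple[str, List[float]]:
    """Compare -> rewrite -> verify, stopping as soon as a rewrite regresses."""
    best = description
    best_rate = score(best)           # baseline comparison
    history = [best_rate]
    for _ in range(max_iterations):
        candidate = rewrite(best)     # rewrite starting from the current best
        rate = score(candidate)       # verify on the same scenarios
        history.append(rate)
        if rate < best_rate:          # regression: keep the previous best and stop
            break
        best, best_rate = candidate, rate
    return best, history


# Toy stand-ins so the sketch runs without an LLM or an agent.
def toy_score(text: str) -> float:
    return min(1.0, text.count("example") / 3)


def toy_rewrite(text: str) -> str:
    return text + " example"


best, history = optimize_description("Search the web.", toy_score, toy_rewrite)
print(best, history)
```

Scoring every candidate against the same scenarios is what makes the before and after rates comparable; regenerating scenarios between iterations would mix description changes with scenario noise.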

Features

🎯 Realistic Skill Discovery

Skills Arena tests real skill discovery — skills are loaded naturally into the agent's context, exactly how your users experience it. No prompt injection, no artificial setup.

📊 Detailed Results with Reasoning

See exactly why the agent chose each skill:

[Scenario 1]
  Prompt: Find the latest AI news and summarize findings
  Designed for: My Skill
  Selected: Competitor Skill
  Agent's reasoning: I'll help you research AI news. Let me use the
                      competitor skill which handles web research...

🔴 Steal Detection

Know when competitors win scenarios designed for your skill:

🔴 STEAL DETECTION
  My Skill: Lost 2 scenario(s) to competitors
    - scenario-abc123
    - scenario-def456
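
A minimal sketch of how steals could be grouped from per-scenario outcomes. The `Outcome` dataclass is a stand-in that borrows the `expected_skill` and `selected_skill` field names from the Result Objects section; `scenario_id` and `find_steals` are assumed names for illustration, not library code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Outcome:
    scenario_id: str                 # assumed identifier, e.g. "scenario-abc123"
    expected_skill: Optional[str]    # None for a blind test
    selected_skill: str


def find_steals(outcomes: List[Outcome]) -> Dict[str, List[str]]:
    """Map each skill to the scenarios designed for it that another skill won."""
    steals: Dict[str, List[str]] = defaultdict(list)
    for o in outcomes:
        if o.expected_skill is not None and o.selected_skill != o.expected_skill:
            steals[o.expected_skill].append(o.scenario_id)
    return dict(steals)


outcomes = [
    Outcome("scenario-abc123", "My Skill", "Competitor Skill"),
    Outcome("scenario-def456", "My Skill", "Competitor Skill"),
    Outcome("scenario-ghi789", "My Skill", "My Skill"),
    Outcome("scenario-jkl012", None, "Competitor Skill"),  # blind test, never a steal
]
print(find_steals(outcomes))
```

Blind tests carry no expected skill, so they count toward selection rates but can never register as a steal.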

🎮 Custom Scenarios (Power Users)

Define your own test cases for regression testing, edge cases, or real production prompts:

from skills_arena import Arena, CustomScenario

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find AI news"),  # Blind test
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",  # Enables steal detection
        ),
    ],
)

🔀 Mix Custom + Generated Scenarios

from skills_arena import Arena, CustomScenario, GenerateScenarios

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My edge case"),
        GenerateScenarios(count=5),  # Generate 5 more with LLM
    ],
)

Configuration

from skills_arena import Arena, Config

config = Config(
    # Scenario generation
    scenarios=10,                       # Number of test scenarios
    scenario_strategy="per_skill",      # "per_skill" or "balanced"
    temperature=0.7,                    # Generation diversity

    # Agent framework
    agents=["claude-code"],             # Uses Claude Agent SDK

    # Execution
    timeout_seconds=60,                 # Per-scenario timeout
)

arena = Arena(config)

Scenario Strategies

Strategy	Description
`balanced`	Generate scenarios for all skills together (default)
`per_skill`	Generate from each skill alone — reveals "steal rates"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...   # Required
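
Since the key is required, it can help to fail early with a clear message rather than partway through a run. A small sketch using only the standard library; `require_api_key` is a hypothetical helper, not part of skills-arena.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return ANTHROPIC_API_KEY, or raise a clear error if it is missing or empty."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; export it before running.")
    return key
```

Calling `require_api_key()` before constructing `Arena` surfaces a missing key before any scenarios are generated.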

API Reference

Arena Methods

Method	Description
`arena.evaluate(skill, task)`	Evaluate a single skill
`arena.compare(skills, task)`	Compare multiple skills head-to-head
`arena.battle_royale(skills, task)`	Full tournament with ELO rankings
`arena.optimize(skill, competitors, task)`	Auto-improve a skill description
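
This README does not say how `battle_royale` computes its rankings. For orientation, the sketch below shows the standard Elo update for one head-to-head result; the K-factor of 32, the 400-point scale, and the function names are assumptions, not the library's documented behavior.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> Tuple[float, float]:
    """Return the new (winner, loser) ratings after a single decided match."""
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


a, b = update_elo(1500.0, 1500.0)
print(a, b)  # 1516.0 1484.0
```

The update is zero-sum: whatever the winner gains, the loser gives up, and an upset against a higher-rated skill moves both ratings further than an expected win.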

Result Objects

# ComparisonResult
results.winner              # Name of winning skill
results.selection_rates     # {skill_name: rate}
results.scenario_details    # List of ScenarioDetail
results.steals              # {skill_name: [stolen_scenario_ids]}
results.insights            # List of Insight

# ScenarioDetail
detail.prompt               # The test prompt
detail.expected_skill       # Which skill it was designed for
detail.selected_skill       # Which skill the agent chose
detail.reasoning            # Agent's text before selection
detail.was_stolen           # True if competitor won

# OptimizationResult
result.original_skill       # Skill before optimization
result.optimized_skill      # Best skill found
result.iterations           # List of OptimizationIteration
result.total_improvement    # Delta in selection rate
result.selection_rate_before  # Starting selection rate
result.selection_rate_after   # Final selection rate
result.grade_before         # Grade before (A+ to F)
result.grade_after          # Grade after

Custom Scenarios

from skills_arena import CustomScenario, GenerateScenarios

# Blind test (no expected skill)
CustomScenario(prompt="Find AI news")

# With expected skill (enables steal detection)
CustomScenario(
    prompt="Scrape the pricing table",
    expected_skill="Web Scraper",
    tags=["scraping", "pricing"],
)

# Generate N scenarios with LLM
GenerateScenarios(count=5)

Key Metrics

Metric	Description	What It Means
Selection Rate	% of times your skill is chosen	Your share of the context layer
Steal Rate	% of your scenarios won by competitors	Opportunities lost to alternatives
Defense Rate	% of your scenarios you kept	How well you hold your ground
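
The three metrics can be computed from a list of (expected skill, selected skill) pairs. A minimal sketch under that assumption; the record shape and `skill_metrics` are illustrative and not part of the library's API.

```python
from typing import Dict, List, Optional, Tuple

# Each record is (expected_skill or None for a blind test, selected_skill).
Record = Tuple[Optional[str], str]


def skill_metrics(records: List[Record], skill: str) -> Dict[str, float]:
    total = len(records)
    selected = sum(1 for _, chosen in records if chosen == skill)
    # Scenarios designed for this skill, and how many of them it kept.
    own = [chosen for expected, chosen in records if expected == skill]
    kept = sum(1 for chosen in own if chosen == skill)
    return {
        "selection_rate": selected / total if total else 0.0,
        "defense_rate": kept / len(own) if own else 0.0,
        "steal_rate": (len(own) - kept) / len(own) if own else 0.0,
    }


records = [
    ("My Skill", "My Skill"),
    ("My Skill", "Competitor Skill"),
    ("My Skill", "My Skill"),
    ("Competitor Skill", "Competitor Skill"),
    ("Competitor Skill", "My Skill"),
]
print(skill_metrics(records, "My Skill"))
```

Selection rate is measured over all scenarios, while steal rate and defense rate only look at the scenarios designed for the skill, so the last two always sum to 1.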

Supported Agents

Agent	Status	Notes
Claude Code	✅ Supported	Primary agent, uses Claude Agent SDK
Codex CLI	🔜 Coming	OpenAI's coding agent
Gemini CLI	🔜 Coming	Google's coding agent
Cursor	🔜 Planned	IDE-integrated agent
Windsurf	🔜 Planned	Codeium's coding agent

Supported Skill Formats

Claude Code — .md skill files with YAML frontmatter
OpenAI — Function calling schemas (JSON)
MCP — Tool definitions
Generic — Plain text descriptions
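
For the `.md` format, the description an agent reads sits in the frontmatter block at the top of the file. The sketch below separates that block from the body using only the standard library. It assumes flat `key: value` lines, and the `name` and `description` fields in the sample are illustrative; nested YAML would need a real YAML parser, and this is not how skills-arena itself loads skills.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into flat frontmatter fields and the Markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text                      # no frontmatter block
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text                      # unterminated block: treat as plain text
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


sample = """---
name: web-search
description: Search the web and return summarized results with sources.
---
# Web Search

Use this skill for current events and fact lookups.
"""
fields, body = split_frontmatter(sample)
print(fields["description"])
```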

Roadmap

Filesystem-based skill discovery
Custom scenarios for power users
Agent's reasoning capture
Steal detection
Auto-optimize skill descriptions
Web UI dashboard
Historical tracking & trends
skills.sh integration

Contributing

Contributions welcome! See ARCHITECTURE.md for technical details.

git clone https://github.com/Eyalbenba/skills-arena.git
cd skills-arena
pip install -e ".[dev]"
pytest

License

Apache License 2.0. See LICENSE for details.

Skills Arena — Skills are the new SEO.

Release history

0.3.0 (this version): Mar 21, 2026
0.2.1: Feb 22, 2026
0.2.0: Feb 22, 2026
0.1.1: Feb 8, 2026
0.1.0: Feb 7, 2026

Download files

Source Distribution

skills_arena-0.3.0.tar.gz (86.2 kB view details)

Uploaded Mar 21, 2026 Source

Built Distribution

skills_arena-0.3.0-py3-none-any.whl (77.8 kB view details)

Uploaded Mar 21, 2026 Python 3

File details

Details for the file skills_arena-0.3.0.tar.gz.

File metadata

Download URL: skills_arena-0.3.0.tar.gz
Upload date: Mar 21, 2026
Size: 86.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for skills_arena-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`0985741b59565e2be1f2737d93f0d857425df4e4b3ed482f283ca47994762de2`
MD5	`67af1e5fe11f4234ab4914f3a50d1a18`
BLAKE2b-256	`ad9f88e8e85a4908e68935ab36ee03238ead9e2150b1efff993be58f0d7a4382`
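
A downloaded file can be checked against the SHA256 digest listed above with the standard library. A minimal sketch; `sha256_of` and `matches_published_hash` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash(Path("skills_arena-0.3.0.tar.gz"), "<SHA256 from the table>")` should return True for an intact download.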

Provenance

The following attestation bundles were made for skills_arena-0.3.0.tar.gz:

Publisher: release.yml on Eyalbenba/skills-arena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skills_arena-0.3.0.tar.gz
- Subject digest: 0985741b59565e2be1f2737d93f0d857425df4e4b3ed482f283ca47994762de2
- Sigstore transparency entry: 1154537968
- Sigstore integration time: Mar 21, 2026
Source repository:
- Permalink: Eyalbenba/skills-arena@36894d72085906c33e636fb6800f6f6b6c261900
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Eyalbenba
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@36894d72085906c33e636fb6800f6f6b6c261900
- Trigger Event: push

File details

Details for the file skills_arena-0.3.0-py3-none-any.whl.

File metadata

Download URL: skills_arena-0.3.0-py3-none-any.whl
Upload date: Mar 21, 2026
Size: 77.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for skills_arena-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`00d7a12b9ba6b4d73df30227f33c4b0619065c798d5200fa4d200c9a2f707f52`
MD5	`9a5ea70eb840679f2e3fa413a30a8053`
BLAKE2b-256	`e9229cb4f3a7b30cd81021a3734e5d2230ebe2164bfe663d210f15993914ffe8`

Provenance

The following attestation bundles were made for skills_arena-0.3.0-py3-none-any.whl:

Publisher: release.yml on Eyalbenba/skills-arena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skills_arena-0.3.0-py3-none-any.whl
- Subject digest: 00d7a12b9ba6b4d73df30227f33c4b0619065c798d5200fa4d200c9a2f707f52
- Sigstore transparency entry: 1154537970
- Sigstore integration time: Mar 21, 2026
Source repository:
- Permalink: Eyalbenba/skills-arena@36894d72085906c33e636fb6800f6f6b6c261900
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Eyalbenba
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@36894d72085906c33e636fb6800f6f6b6c261900
- Trigger Event: push
