
CLI for Claude Code marketplace health — routing evals, coverage checks, and semantic collision detection


claude-marketplace-evaluator

cme is a CLI for Claude Code marketplace health. It validates that skills route correctly and detects semantic collisions between skills — without running expensive LLM-based eval suites.

Marketplace health is not about optimizing skill descriptions. It is about catching structural problems early: missing evals, broken routing, and overlapping skills that confuse Claude's router. cme runs fast, fits in CI, and fails loud.

Installation

Zero-install with uvx (recommended for CI):

uvx --from claude-marketplace-evaluator cme --help

Or install globally:

pip install claude-marketplace-evaluator

Commands

cme routing

Three-step pipeline that generates routing tests, checks coverage, and runs evals:

  1. Generate — reads evals/evals.json from each skill directory, produces routing test YAML
  2. Coverage check — verifies every skill has an evals file (fails if below threshold)
  3. Routing eval runner — sends each prompt through the Claude Agent SDK, checks Claude routes to the expected skill
cme routing --plugins-dir plugins/
| Flag | Default | Description |
| --- | --- | --- |
| `--plugins-dir` | `plugins/` | Path to the plugins directory |
| `--coverage-threshold` | `100` | Minimum eval coverage percentage; fails if any skill lacks evals |
| `--threshold` | `95` | Minimum routing pass rate percentage; set to `0` to skip the eval runner |
| `-j` / `--workers` | `4` | Parallel workers for the eval runner |
| `--timeout` | `30` | Per-test timeout in seconds |
| `--max-retries` | `1` | Max retries on rate-limit errors (exponential backoff) |

Exit codes: 0 = all checks pass, 1 = coverage or routing threshold not met.

cme overlap

Detects semantic collisions between skills across a marketplace. Two skills collide when their descriptions or trigger queries are similar enough to confuse Claude's routing. cme uses an LLM to analyze every skill pair and produces a JSON report of collision pairs, each tagged with a severity of high, medium, or low.

cme overlap --plugins-dir plugins/ --output overlap-report.json
| Flag | Default | Description |
| --- | --- | --- |
| `--plugins-dir` | `plugins/` | Path to the plugins directory |
| `--output` | `overlap-report.json` | Output path for the JSON collision report |
| `--model` | `claude-sonnet-4-5` | Model for analysis (overrides the `ANTHROPIC_MODEL` env var) |

Exit codes: 0 = no collisions, 1 = collisions detected.
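Conceptually, overlap detection enumerates every pair of skills and asks a judge for a verdict on each pair. The sketch below shows only that shape; `judge` is a hypothetical placeholder for the LLM analysis cme actually performs, not its real API:

```python
from itertools import combinations


def find_collision_pairs(skills, judge):
    """Enumerate all skill pairs and collect the judge's verdicts.

    skills: dict mapping skill path -> description text.
    judge:  callable returning "high", "medium", "low", or None (no collision).
            Stands in for the LLM call cme makes per pair.
    """
    collisions = []
    for (path_a, desc_a), (path_b, desc_b) in combinations(sorted(skills.items()), 2):
        severity = judge(desc_a, desc_b)
        if severity is not None:
            collisions.append({"skill_a": path_a, "skill_b": path_b, "severity": severity})
    return collisions
```

Note that n skills yield n*(n-1)/2 pairs, which is why overlap runs cost more than coverage checks as a marketplace grows.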

The output report structure:

{
  "timestamp": "2026-04-17T00:00:00+00:00",
  "model_used": "claude-sonnet-4-5",
  "total_skills_analyzed": 6,
  "total_collisions": 1,
  "collisions": [
    {
      "skill_a": "plugins/my-plugin/skills/create-pr",
      "skill_b": "plugins/my-plugin/skills/submit-pr",
      "overlapping_triggers": ["open a pull request"],
      "description_excerpts": ["Both skills handle PR creation workflows"],
      "severity": "high"
    }
  ]
}
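Because the report is plain JSON, downstream tooling can consume it directly. For example, a filter that surfaces only high-severity pairs might look like this (the field names follow the structure above; the filtering policy is purely illustrative, since cme itself exits non-zero on any collision):

```python
import json


def high_severity_collisions(report_path):
    """Load an overlap report and return only its high-severity collision pairs."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    return [c for c in report.get("collisions", []) if c.get("severity") == "high"]
```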

Plugin Layout

cme expects this directory structure:

plugins/
  <plugin-name>/
    skills/
      <skill-name>/
        SKILL.md
        evals/
          evals.json
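The coverage side of `cme routing` boils down to walking this tree and flagging any skill directory without an evals file. A minimal sketch of that walk (not cme's actual implementation):

```python
from pathlib import Path


def skills_missing_evals(plugins_dir):
    """Return skill dirs under <plugins_dir>/<plugin>/skills/ lacking evals/evals.json."""
    return sorted(
        skill_dir
        for skill_dir in Path(plugins_dir).glob("*/skills/*")
        if skill_dir.is_dir() and not (skill_dir / "evals" / "evals.json").is_file()
    )
```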

Each evals.json is a JSON array of routing test entries:

[
  { "query": "Run the test suite for this project", "should_trigger": true },
  { "query": "Can you execute the unit tests?", "should_trigger": true },
  { "query": "Open a pull request for this branch", "should_trigger": false }
]
| Field | Type | Description |
| --- | --- | --- |
| `query` | string | A user prompt to test routing against |
| `should_trigger` | boolean | `true` = this prompt should route to this skill; `false` = it should not |

Only should_trigger: true entries are used to generate routing test cases. Include should_trigger: false entries to document negative cases (used by overlap detection for trigger context).
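A loader for this file can check both fields and split entries by `should_trigger`. The sketch below is illustrative, not cme's own parser:

```python
import json


def load_evals(path):
    """Parse an evals.json array and split queries into positive and negative cases."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError(f"{path}: expected a JSON array")
    positive, negative = [], []
    for i, entry in enumerate(entries):
        if (
            not isinstance(entry, dict)
            or not isinstance(entry.get("query"), str)
            or not isinstance(entry.get("should_trigger"), bool)
        ):
            raise ValueError(f"{path}[{i}]: need string 'query' and boolean 'should_trigger'")
        (positive if entry["should_trigger"] else negative).append(entry["query"])
    return positive, negative
```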

Authentication

cme does not manage credentials. It passes through environment variables to the Claude Agent SDK. Configure one of these auth modes:

Claude subscription (OAuth)

For users with a Claude Pro/Team/Enterprise subscription:

claude setup-token              # generates the token
export CLAUDE_CODE_OAUTH_TOKEN="your-token"
cme routing --plugins-dir plugins/

Direct API key

For direct Anthropic API access:

export ANTHROPIC_API_KEY="sk-ant-..."
cme routing --plugins-dir plugins/

Databricks AI Gateway

For routing through Databricks AI Gateway, map your workspace secrets to the standard Anthropic SDK env vars:

export ANTHROPIC_AUTH_TOKEN="<DATABRICKS_SP_TOKEN>"           # service principal PAT
export ANTHROPIC_BASE_URL="<DATABRICKS_AI_GATEWAY_URL>"       # AI Gateway endpoint URL
export ANTHROPIC_MODEL="<DATABRICKS_AI_GATEWAY_MODEL>"        # endpoint model name
export ANTHROPIC_CUSTOM_HEADERS="x-databricks-use-coding-agent-mode: true"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS="1"
export CLAUDE_CODE_ENABLE_FINE_GRAINED_TOOL_STREAMING=""
cme routing --plugins-dir plugins/

In GitHub Actions, these map directly from repository secrets:

env:
  ANTHROPIC_AUTH_TOKEN: ${{ secrets.DATABRICKS_SP_TOKEN }}
  ANTHROPIC_BASE_URL: ${{ secrets.DATABRICKS_AI_GATEWAY_URL }}
  ANTHROPIC_MODEL: ${{ secrets.DATABRICKS_AI_GATEWAY_MODEL }}
  ANTHROPIC_CUSTOM_HEADERS: "x-databricks-use-coding-agent-mode: true"
  CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1"
  CLAUDE_CODE_ENABLE_FINE_GRAINED_TOOL_STREAMING: ""

CI/CD Integration

GitHub Actions workflow

This is a production workflow from claude-marketplace-builder that runs both cme routing and cme overlap on every PR that touches plugin files:

name: CME Checks

on:
  pull_request:
    paths:
      - "plugins/**"
      - "evals/**"
  workflow_dispatch:

jobs:
  coverage:
    runs-on: ubuntu-latest
    environment: cicd
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - name: Check eval coverage and routing
        env:
          ANTHROPIC_AUTH_TOKEN: ${{ secrets.DATABRICKS_SP_TOKEN }}
          ANTHROPIC_BASE_URL: ${{ secrets.DATABRICKS_AI_GATEWAY_URL }}
          ANTHROPIC_MODEL: ${{ secrets.DATABRICKS_AI_GATEWAY_MODEL }}
          ANTHROPIC_CUSTOM_HEADERS: "x-databricks-use-coding-agent-mode: true"
          CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1"
        run: uvx --from claude-marketplace-evaluator cme routing --plugins-dir plugins/ --coverage-threshold 100 --threshold 95 --timeout 180

  overlap:
    runs-on: ubuntu-latest
    environment: cicd
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - name: Check skill overlap
        env:
          ANTHROPIC_AUTH_TOKEN: ${{ secrets.DATABRICKS_SP_TOKEN }}
          ANTHROPIC_BASE_URL: ${{ secrets.DATABRICKS_AI_GATEWAY_URL }}
          ANTHROPIC_MODEL: ${{ secrets.DATABRICKS_AI_GATEWAY_MODEL }}
          ANTHROPIC_CUSTOM_HEADERS: "x-databricks-use-coding-agent-mode: true"
          CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1"
        run: |
          set +e
          uvx --from claude-marketplace-evaluator cme overlap --plugins-dir plugins/ --output overlap-report.json
          EXIT_CODE=$?
          if [ -f overlap-report.json ]; then
            echo "## Overlap Report" >> "$GITHUB_STEP_SUMMARY"
            echo '```json' >> "$GITHUB_STEP_SUMMARY"
            cat overlap-report.json >> "$GITHUB_STEP_SUMMARY"
            echo '```' >> "$GITHUB_STEP_SUMMARY"
            cat overlap-report.json
          fi
          exit $EXIT_CODE

Posting overlap results as a PR comment

Extend the overlap job to post a formatted collision table as a PR comment using actions/github-script:

  overlap:
    runs-on: ubuntu-latest
    environment: cicd
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - name: Check skill overlap
        id: overlap
        env:
          ANTHROPIC_AUTH_TOKEN: ${{ secrets.DATABRICKS_SP_TOKEN }}
          ANTHROPIC_BASE_URL: ${{ secrets.DATABRICKS_AI_GATEWAY_URL }}
          ANTHROPIC_MODEL: ${{ secrets.DATABRICKS_AI_GATEWAY_MODEL }}
          ANTHROPIC_CUSTOM_HEADERS: "x-databricks-use-coding-agent-mode: true"
          CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1"
        run: |
          set +e
          uvx --from claude-marketplace-evaluator cme overlap --plugins-dir plugins/ --output overlap-report.json
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
          if [ -f overlap-report.json ]; then
            echo "## Overlap Report" >> "$GITHUB_STEP_SUMMARY"
            echo '```json' >> "$GITHUB_STEP_SUMMARY"
            cat overlap-report.json >> "$GITHUB_STEP_SUMMARY"
            echo '```' >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Comment on PR with overlap results
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const path = 'overlap-report.json';
            if (!fs.existsSync(path)) return;

            const report = JSON.parse(fs.readFileSync(path, 'utf8'));
            const collisions = report.collisions || [];

            let body = '## Skill Overlap Report\n\n';
            body += `**Skills analyzed:** ${report.total_skills_analyzed}\n`;
            body += `**Collisions found:** ${report.total_collisions}\n\n`;

            if (collisions.length === 0) {
              body += '✅ No semantic collisions detected.\n';
            } else {
              body += '| Severity | Skill A | Skill B | Overlapping Triggers |\n';
              body += '|----------|---------|---------|---------------------|\n';
              for (const c of collisions) {
                const triggers = c.overlapping_triggers.join(', ');
                body += `| ${c.severity.toUpperCase()} | \`${c.skill_a}\` | \`${c.skill_b}\` | ${triggers} |\n`;
              }
              body += '\nResolve collisions before merging. Rename skills, narrow descriptions, or deduplicate functionality.\n';
            }

            // Delete previous cme comments to avoid spam
            const { data: comments } = await github.rest.issues.listComments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
            });
            for (const comment of comments) {
              if (comment.body.startsWith('## Skill Overlap Report')) {
                await github.rest.issues.deleteComment({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  comment_id: comment.id,
                });
              }
            }

            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body,
            });

      - name: Fail on collisions
        if: steps.overlap.outputs.exit_code != '0'
        run: exit 1

Local Usage

Run against a local plugins directory:

# Coverage check only (skip routing evals)
cme routing --plugins-dir ./plugins --threshold 0

# Full routing pipeline
cme routing --plugins-dir ./plugins --timeout 60

# Overlap detection
cme overlap --plugins-dir ./plugins

# Increase parallelism for large marketplaces
cme routing --plugins-dir ./plugins -j 8 --timeout 120

# Debug mode (verbose Agent SDK logging)
CME_DEBUG=1 cme routing --plugins-dir ./plugins

Two-Tier Eval Strategy

cme is designed as the fast, structural first tier of a two-tier evaluation approach:

Tier 1: cme (fast, free, structural)

  • Runs in seconds to minutes
  • Coverage checks require zero LLM calls
  • Routing evals use one short Agent SDK call per test case
  • Catches missing evals, broken routing, and skill collisions
  • Runs on every PR in CI

Tier 2: Full LLM eval runners (deep, expensive)

  • Runs multi-turn conversations testing skill behavior end-to-end
  • Validates output quality, not just routing correctness
  • Costs significantly more in tokens and time
  • Runs on release branches or nightly schedules

cme answers "did Claude pick the right skill?" — it does not answer "did the skill produce a good result?" Use tier 1 to gate PRs cheaply, then run tier 2 for deeper validation on release candidates.

Development

uv sync
pre-commit install
make check   # lint + format + typecheck
make test    # pytest
