
🔍 agent-skill-evaluator

Evaluate agent skills before you install them.
Think npm audit + eslint for SKILL.md files.


Why This Exists

There are 1,000+ agent skills on GitHub. Most have no quality signal beyond stars and recency. Before installing a skill into your agent:

  • Is it structurally sound? (valid SKILL.md, proper metadata)
  • Is it safe? (no prompt injection, credential harvesting, data exfiltration)
  • Is it high quality? (decision trees, guardrails, edge cases)
  • Is it domain-correct? (does it follow best practices for its domain: statistics, marketing, experiment design, etc.?)
  • Is it maintained? (recent updates, tests, documentation)

skill-evaluator answers all five questions with a single command.


Quick Start

# Install
cd skill-evaluator
pip install -e .

# Evaluate a local skill
skill-eval ../skills/experiment-designer/

# Evaluate with Markdown output (for CI/READMEs)
skill-eval ../skills/stats-reviewer/ --format md

# Evaluate with JSON output (for pipelines)
skill-eval ../skills/causal-inference-advisor/ --format json

# Force a specific domain for correctness checking
skill-eval ./some-skill/ --domain statistics
skill-eval ./some-skill/ --domain digital-marketing

# CI mode: fail if score is below threshold
skill-eval ./some-skill/ --fail-below 70

What You Get

Terminal Output

╭──────────────────────────────────────────────────────────╮
│ Skill Evaluation Report: experiment-designer — A (91/100)│
╰──────────────────────────────────────────────────────────╯

  📋 Excellent — high quality, well-maintained.

  ┌────────────────────┬─────────┬───────┬────────┐
  │ Dimension          │ Score   │ Grade │ Weight │
  ├────────────────────┼─────────┼───────┼────────┤
  │ Structure          │ 95/100  │ A+    │ 15%    │
  │ Security           │ 100/100 │ A+    │ 20%    │
  │ Quality            │ 90/100  │ A     │ 15%    │
  │ Domain Correctness │ 85/100  │ A-    │ 25%    │
  │ Maintenance        │ 80/100  │ B+    │ 15%    │
  └────────────────────┴─────────┴───────┴────────┘

Scoring Dimensions

Each dimension produces a 0–100 score. These are combined into a weighted composite:

| Dimension | Weight | What It Checks |
| --- | --- | --- |
| Structure | 15% | YAML frontmatter, required fields (name, description, triggers), section organization |
| Security | 20% | Shell injection, credential exfiltration, prompt injection, obfuscation |
| Quality | 15% | Decision trees, guardrails, edge cases, escape hatches, code templates |
| Domain Correctness | 25% | Rule-based verification of domain-specific methodology and best practices |
| Maintenance | 15% | File freshness, documentation, tests, CI config, auxiliary files |

Note: Weights are normalized at runtime so they don't need to sum to exactly 100%. If you override weights via --weights, any missing dimensions default to 10%.
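
As a rough sketch of that normalization (illustrative names, not the tool's actual internals; the 10% fallback mirrors the documented --weights behavior):

# Minimal sketch of the weighted composite described above.
DEFAULT_WEIGHT = 0.10  # documented fallback for dimensions missing from --weights

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension 0-100 scores, normalizing by total weight."""
    total = sum(weights.get(dim, DEFAULT_WEIGHT) for dim in scores)
    weighted = sum(s * weights.get(dim, DEFAULT_WEIGHT) for dim, s in scores.items())
    return weighted / total

# These weights sum to 0.90, not 1.0; normalization makes that harmless.
print(composite_score(
    {"structure": 95, "security": 100, "quality": 90, "domain": 85, "maintenance": 80},
    {"structure": 0.15, "security": 0.20, "quality": 0.15, "domain": 0.25, "maintenance": 0.15},
))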

Grading Scale

| Grade | Score Range | Meaning |
| --- | --- | --- |
| A+ | 95–100 | Exceptional: install with confidence |
| A | 90–94 | Excellent: high quality, well-maintained |
| A- | 85–89 | Very good: minor improvements possible |
| B+ | 80–84 | Good: solid skill with some gaps |
| B | 75–79 | Above average: usable but review findings |
| B- | 70–74 | Decent: has notable weaknesses |
| C+ | 65–69 | Fair: significant gaps, use with caution |
| C | 60–64 | Below average: consider alternatives |
| C- | 50–59 | Poor: major issues present |
| D | 40–49 | Very poor: not recommended |
| F | 0–39 | Failing: critical issues, do not install |
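
The scale maps to code directly; a hedged sketch (thresholds copied from the table, rounding behavior assumed):

# Grade bands from the table above; the shipped grader may round differently.
GRADE_BANDS = [
    (95, "A+"), (90, "A"), (85, "A-"),
    (80, "B+"), (75, "B"), (70, "B-"),
    (65, "C+"), (60, "C"), (50, "C-"),
    (40, "D"), (0, "F"),
]

def grade(score: float) -> str:
    for floor, letter in GRADE_BANDS:
        if score >= floor:
            return letter
    return "F"  # negative scores, if they ever occur

assert grade(91) == "A"   # "Excellent"
assert grade(38) == "F"   # "Failing"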

Domain Correctness Rules

This is the novel differentiator. Unlike the security and structure checks, which any linter can do, domain correctness verifies that the guidance itself is correct for its stated domain.

Built-in Domains

| Domain | Rules | Checks |
| --- | --- | --- |
| Statistics | 10 | Normality assumptions, effect sizes, multiple comparisons, power analysis, seed sensitivity, regression assumptions, CI interpretation, Simpson's Paradox |
| Causal Inference | 9 | Identification strategies, parallel trends (incl. staggered DiD), IV assumptions, RDD bandwidth, matching balance, HTE, SUTVA/interference |
| Experiment Design | 9 | Power analysis, randomization, variance reduction, pre-registration, SRM checks, sequential testing, interference awareness |
| Data Science | 7 | Data leakage, missing data mechanisms, cross-validation, metric selection, outlier handling, interpretability, class imbalance |
| Digital Marketing | 28 | Attribution modeling, marketing mix modeling, CLV/churn methodology, SEO/SEM, email deliverability, ad tech, privacy compliance |
| Finance | 28 | VaR/risk management, portfolio optimization, backtesting biases, DCF valuation, regulatory compliance, time series, market efficiency |

The digital marketing domain is organized into 7 sub-domain files:

| Sub-domain | Rules | Focus |
| --- | --- | --- |
| Attribution & Measurement | 4 | Model awareness, incrementality testing, view-through caveats, cross-device |
| Marketing Mix Modeling | 4 | Adstock/carryover, diminishing returns, channel interactions, MMM validation |
| Customer Analytics | 4 | CLV methodology, churn definition, segmentation, cohort analysis |
| SEO / SEM | 5 | Technical SEO, bidding strategy, keyword research, E-E-A-T, AI search impact |
| Email / CRM | 4 | Deliverability (SPF/DKIM/DMARC), list hygiene, consent compliance, personalization |
| Ad Tech / Programmatic | 3 | Auction mechanics, frequency capping, viewability and fraud |
| General Marketing | 4 | Funnel understanding, privacy compliance, tracking infrastructure, KPI alignment |

The finance domain is organized into 7 sub-domain files:

| Sub-domain | Rules | Focus |
| --- | --- | --- |
| Risk Management | 3 | VaR tail-risk assumptions, stress testing, risk-adjusted metrics (Sharpe/Sortino) |
| Portfolio & Allocation | 4 | MPT limitations, benchmark selection, diversification depth, rebalancing methodology |
| Backtesting | 5 | Look-ahead bias, survivorship bias, transaction costs, overfitting, multiple testing bias |
| Valuation & Pricing | 3 | DCF sensitivity analysis, relative valuation pitfalls, option pricing model selection |
| Regulatory & Compliance | 3 | Regulatory awareness (SEC/FCA/ESMA), KYC/AML, financial data privacy |
| Time Series & Forecasting | 3 | Stationarity requirements, regime changes, return distribution assumptions |
| General Finance | 4 | Return calculation methodology, inflation adjustment, tax implications, market efficiency |

Domain Auto-Detection

When you run skill-eval without --domain, the analyzer auto-detects the most likely domain by counting keyword signals in the skill content. For example, a skill mentioning "attribution," "ROAS," and "landing page" is detected as digital-marketing, while one mentioning "p-value," "effect size," and "t-test" is detected as statistics.
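
In sketch form (the keyword lists and function names below are invented for illustration; the shipped detector's signals differ):

# Toy keyword-signal detector: count domain keyword hits, pick the max.
DOMAIN_KEYWORDS = {
    "digital-marketing": ["attribution", "roas", "landing page"],
    "statistics": ["p-value", "effect size", "t-test"],
}

def detect_domain(skill_text: str) -> str:
    text = skill_text.lower()
    counts = {domain: sum(text.count(kw) for kw in kws)
              for domain, kws in DOMAIN_KEYWORDS.items()}
    return max(counts, key=counts.get)

print(detect_domain("Run a t-test and report the effect size and p-value."))
# statistics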

Adding Custom Domains

Create a YAML file in skill_evaluator/domains/:

domain: my-domain
version: "1.0.0"
rules:
  - name: my-rule
    description: "What this rule checks"
    applicability_patterns:
      - "pattern that makes this rule relevant"
    required_patterns:
      - "pattern that SHOULD be present"
    antipatterns:
      - "pattern that SHOULD NOT be present"
    failure_severity: suspicious  # or "incorrect"
    failure_message: "What went wrong"
    success_message: "What went right"
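
To make the schema concrete, here is a hedged sketch of how such a rule could be applied (illustrative matching logic; the shipped analyzer may differ, for example in how multiple required patterns combine):

# Assumed semantics: a rule is skipped unless an applicability pattern
# matches; it passes when all required patterns appear and no antipattern does.
import re

def evaluate_rule(rule: dict, skill_text: str) -> str | None:
    """Return None if not applicable, else "pass", "suspicious", or "incorrect"."""
    def hit(pattern: str) -> bool:
        return re.search(pattern, skill_text, re.IGNORECASE) is not None

    if not any(hit(p) for p in rule.get("applicability_patterns", [])):
        return None  # rule not relevant to this skill
    ok = all(hit(p) for p in rule.get("required_patterns", [])) and \
         not any(hit(p) for p in rule.get("antipatterns", []))
    return "pass" if ok else rule.get("failure_severity", "suspicious")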

For domains with many rules, you can organize them into a directory instead of a single file. Create skill_evaluator/domains/my-domain/ and place multiple .yaml files inside โ€” all rules are automatically merged at load time.
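
A loader for that layout can be as simple as the following sketch (function name invented; requires PyYAML):

# Merge the `rules` lists from every .yaml file in a domain directory.
from pathlib import Path
import yaml

def load_domain_rules(domain_dir: Path) -> list[dict]:
    rules: list[dict] = []
    for path in sorted(domain_dir.glob("*.yaml")):
        doc = yaml.safe_load(path.read_text()) or {}
        rules.extend(doc.get("rules", []))
    return rules

rules = load_domain_rules(Path("skill_evaluator/domains/my-domain"))
print(f"loaded {len(rules)} rules")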

We currently ship with six domains: Statistics, Causal Inference, Experiment Design, Data Science, Digital Marketing, and Finance, totaling 91 rules. This is meant to be extensible. If you have domain expertise in another area (e.g., healthcare, product management, cybersecurity, NLP evaluation) and want to contribute a rule set, open a PR and we'll review it. The more domains covered, the more useful this tool becomes for everyone.


Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run against your own skills
skill-eval ../skills/experiment-designer/
skill-eval ../skills/stats-reviewer/

How Scoring Works

Each dimension (Structure, Security, Quality, Domain Correctness, Maintenance) produces a 0–100 score. These are combined into a weighted composite, normalized by the total weight of applicable dimensions.

A few things worth knowing:

Security is a gate, not just a weight. If the security analyzer finds critical risks (prompt injection, credential harvesting, destructive shell commands), the overall score is hard-capped regardless of how well the skill scores on other dimensions. A skill with a prompt injection vulnerability gets an F even if the content is otherwise excellent.
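
In sketch form (the cap value is an assumption chosen to land in the F band; the real gate may work differently):

# Hard cap: critical security findings force the composite into the F range.
F_CAP = 39  # top of the 0-39 F band in the grading scale

def apply_security_gate(composite: float, critical_findings: int) -> float:
    return min(composite, F_CAP) if critical_findings > 0 else composite

assert apply_security_gate(92.0, critical_findings=1) == 39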

Domain correctness uses a two-tier severity model. Rules flagged as incorrect (e.g., recommending last-click attribution as the only model, or computing CLV without a proper probabilistic framework) carry heavier penalties than those flagged as suspicious (best-practice recommendations that may vary by context). This lets the evaluator distinguish hard errors from soft guidance.

Scores are normalized by applicable checks. Quality and maintenance scores are computed as the ratio of passed checks to applicable checks. A concise, well-written skill won't score lower than a verbose one just because it has fewer regex matches. Each check that fires gets an equal vote.
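
Sketched below (the Check record is hypothetical, not the tool's actual data model):

# Dimension score = passed / applicable, so check count doesn't penalize brevity.
from dataclasses import dataclass

@dataclass
class Check:
    applicable: bool
    passed: bool

def dimension_score(checks: list[Check]) -> float:
    relevant = [c for c in checks if c.applicable]
    if not relevant:
        return 100.0  # nothing applicable, nothing to penalize
    return 100.0 * sum(c.passed for c in relevant) / len(relevant)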

Structural penalties are weighted by severity. Missing your SKILL.md entirely costs more than missing a recommended field. The penalty for each finding reflects how much it actually impacts usability.

What the scores don't tell you. The current system is a static linter: it checks patterns in text. It can't tell you whether the skill actually makes an agent perform better, or whether the domain guidance is semantically correct beyond keyword matching. Those are harder problems (see below).


Future Research

These are directions we'd like to explore. Contributions welcome.

Empirical weight calibration. The dimension weights (Security 20%, Domain 25%, etc.) are based on informed judgment, not data. The right approach is to score a labeled corpus of known-good vs. known-bad skills and use the results to find weights that best predict the label. If you have a labeled skill corpus or want to help build one, open an issue.
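
One plausible shape for that calibration, sketched with toy data (nothing like this ships today; the rows below are fabricated purely to show the mechanics):

# Fit a logistic model from per-dimension scores to a good/bad label and
# read candidate weights off the normalized coefficient magnitudes.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: structure, security, quality, domain, maintenance (toy rows).
X = np.array([[95, 100, 90, 85, 80],   # labeled good
              [60,  40, 55, 30, 20],   # labeled bad
              [88,  95, 70, 80, 75],   # labeled good
              [70,  20, 65, 50, 40]])  # labeled bad
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
w = np.abs(model.coef_[0])
print(w / w.sum())  # candidate dimension weights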

Semantic domain checks. The domain correctness analyzer currently uses regex patterns: it checks whether certain keywords appear, not whether the guidance is actually correct. Replacing this with embedding-based or lightweight LLM checks (e.g., "does this skill enforce power analysis, or just mention it?") would substantially improve accuracy.

Behavioral evaluation. The ultimate test of a skill is: does an agent using it produce better outputs? A test harness with known-correct answers (e.g., "given this dataset, should the agent refuse to run the test?") would let us score skills by downstream impact rather than surface patterns. This is the direction taken by SkillsBench-style evaluations and would make this tool a performance predictor, not just a linter.

More domains. We're actively expanding domain coverage. Next candidates include healthcare, product management, and cybersecurity. Contributions welcome: each new domain just requires a YAML rule set and test fixtures.


CI / CD Integration

GitHub Action

Use agent-skill-evaluator as a reusable GitHub Action to evaluate skills in your CI pipeline:

# .github/workflows/skill-eval.yml
name: Evaluate Skills
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Evaluate skill
        uses: WilliamWJHuang/agent-skill-evaluator@master
        id: eval
        with:
          path: './my-skill/'       # Path to skill directory
          fail-below: '60'          # Fail CI if score < 60
          # domain: 'statistics'    # Optional: force a domain

      - name: Print results
        run: |
          echo "Score: ${{ steps.eval.outputs.score }}"
          echo "Grade: ${{ steps.eval.outputs.grade }}"

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| path | ✅ | . | Path to skill directory or SKILL.md |
| domain | | auto | Force a specific domain |
| fail-below | | 50 | Fail if score is below threshold |
| format | | terminal | Output format: terminal, md, json |

Outputs

| Output | Description |
| --- | --- |
| score | Overall score (0–100) |
| grade | Letter grade (A+ through F) |
| report | Full markdown evaluation report |

The action also writes a Job Summary with the full report, visible directly in the GitHub Actions UI.

CLI in CI (without the Action)

pip install git+https://github.com/WilliamWJHuang/agent-skill-evaluator.git
skill-eval ./my-skill/ --fail-below 60

The --fail-below flag exits with code 1 if the score is below the threshold.


License

MIT. Use freely, attribute kindly.
