agent-skill-evaluator
Evaluate agent skills before you install them.
Think npm audit + eslint for SKILL.md files.
Why This Exists
There are 1,000+ agent skills on GitHub. Most have no quality signal beyond stars and recency. Before installing a skill into your agent:
- Is it structurally sound? (valid SKILL.md, proper metadata)
- Is it safe? (no prompt injection, credential harvesting, data exfiltration)
- Is it high quality? (decision trees, guardrails, edge cases)
- Is it domain-correct? (does it follow best practices for its stated domain: statistics, marketing, experiment design, etc.?)
- Is it maintained? (recent updates, tests, documentation)
skill-evaluator answers all five questions with a single command.
Quick Start
```bash
# Install
cd skill-evaluator
pip install -e .

# Evaluate a local skill
skill-eval ../skills/experiment-designer/

# Evaluate with Markdown output (for CI/READMEs)
skill-eval ../skills/stats-reviewer/ --format md

# Evaluate with JSON output (for pipelines)
skill-eval ../skills/causal-inference-advisor/ --format json

# Force a specific domain for correctness checking
skill-eval ./some-skill/ --domain statistics
skill-eval ./some-skill/ --domain digital-marketing

# CI mode: fail if score is below threshold
skill-eval ./some-skill/ --fail-below 70
```
What You Get
Terminal Output
```
╭───────────────────────────────────────────────────────────╮
│ Skill Evaluation Report: experiment-designer – A (91/100) │
╰───────────────────────────────────────────────────────────╯
Excellent – high quality, well-maintained.

┌────────────────────┬─────────┬───────┬────────┐
│ Dimension          │ Score   │ Grade │ Weight │
├────────────────────┼─────────┼───────┼────────┤
│ Structure          │ 95/100  │ A+    │ 15%    │
│ Security           │ 100/100 │ A+    │ 20%    │
│ Quality            │ 90/100  │ A     │ 15%    │
│ Domain Correctness │ 85/100  │ A-    │ 25%    │
│ Maintenance        │ 80/100  │ B+    │ 15%    │
└────────────────────┴─────────┴───────┴────────┘
```
Scoring Dimensions
Each dimension produces a 0–100 score. These are combined into a weighted composite:
| Dimension | Weight | What It Checks |
|---|---|---|
| Structure | 15% | YAML frontmatter, required fields (name, description, triggers), section organization |
| Security | 20% | Shell injection, credential exfiltration, prompt injection, obfuscation |
| Quality | 15% | Decision trees, guardrails, edge cases, escape hatches, code templates |
| Domain Correctness | 25% | Rule-based verification of domain-specific methodology and best practices |
| Maintenance | 15% | File freshness, documentation, tests, CI config, auxiliary files |
Note: Weights are normalized at runtime, so they don't need to sum to exactly 100%. If you override weights via `--weights`, any missing dimensions default to 10%.
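As a rough illustration, the normalization could look like the following Python sketch (the function and constant names here are illustrative, not the package's actual internals):

```python
# Illustrative sketch of weighted-composite scoring; names are
# hypothetical, not the package's actual internals.
DEFAULT_WEIGHTS = {
    "structure": 0.15,
    "security": 0.20,
    "quality": 0.15,
    "domain_correctness": 0.25,
    "maintenance": 0.15,
}

def composite_score(scores, weights=None):
    """Combine per-dimension scores (0-100) into one weighted score.

    Weights are divided by their sum, so they need not total 1.0;
    dimensions missing from an override default to 0.10.
    """
    weights = weights or DEFAULT_WEIGHTS
    weights = {dim: weights.get(dim, 0.10) for dim in scores}
    total = sum(weights.values())
    return sum(scores[dim] * weights[dim] for dim in scores) / total
```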
Grading Scale
| Grade | Score Range | Meaning |
|---|---|---|
| A+ | 95–100 | Exceptional – install with confidence |
| A | 90–94 | Excellent – high quality, well-maintained |
| A- | 85–89 | Very good – minor improvements possible |
| B+ | 80–84 | Good – solid skill with some gaps |
| B | 75–79 | Above average – usable but review findings |
| B- | 70–74 | Decent – has notable weaknesses |
| C+ | 65–69 | Fair – significant gaps, use with caution |
| C | 60–64 | Below average – consider alternatives |
| C- | 50–59 | Poor – major issues present |
| D | 40–49 | Very poor – not recommended |
| F | 0–39 | Failing – critical issues, do not install |
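Mechanically, the scale is a simple threshold lookup. A minimal sketch matching the table above (not the package's actual code):

```python
# Score-to-grade lookup matching the grading table (sketch only).
GRADE_BANDS = [
    (95, "A+"), (90, "A"), (85, "A-"),
    (80, "B+"), (75, "B"), (70, "B-"),
    (65, "C+"), (60, "C"), (50, "C-"),
    (40, "D"),
]

def grade(score):
    for cutoff, letter in GRADE_BANDS:
        if score >= cutoff:
            return letter
    return "F"
```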
Domain Correctness Rules
The novel differentiator. Unlike the security and structure checks that any tool can run, domain correctness verifies that the guidance itself is correct for its stated domain.
Built-in Domains
| Domain | Rules | Checks |
|---|---|---|
| Statistics | 10 | Normality assumptions, effect sizes, multiple comparisons, power analysis, seed sensitivity, regression assumptions, CI interpretation, Simpson's Paradox |
| Causal Inference | 9 | Identification strategies, parallel trends (incl. staggered DiD), IV assumptions, RDD bandwidth, matching balance, HTE, SUTVA/interference |
| Experiment Design | 9 | Power analysis, randomization, variance reduction, pre-registration, SRM checks, sequential testing, interference awareness |
| Data Science | 7 | Data leakage, missing data mechanisms, cross-validation, metric selection, outlier handling, interpretability, class imbalance |
| Digital Marketing | 28 | Attribution modeling, marketing mix modeling, CLV/churn methodology, SEO/SEM, email deliverability, ad tech, privacy compliance |
| Finance | 28 | VaR/risk management, portfolio optimization, backtesting biases, DCF valuation, regulatory compliance, time series, market efficiency |
The digital marketing domain is organized into 7 sub-domain files:
| Sub-domain | Rules | Focus |
|---|---|---|
| Attribution & Measurement | 4 | Model awareness, incrementality testing, view-through caveats, cross-device |
| Marketing Mix Modeling | 4 | Adstock/carryover, diminishing returns, channel interactions, MMM validation |
| Customer Analytics | 4 | CLV methodology, churn definition, segmentation, cohort analysis |
| SEO / SEM | 5 | Technical SEO, bidding strategy, keyword research, E-E-A-T, AI search impact |
| Email / CRM | 4 | Deliverability (SPF/DKIM/DMARC), list hygiene, consent compliance, personalization |
| Ad Tech / Programmatic | 3 | Auction mechanics, frequency capping, viewability and fraud |
| General Marketing | 4 | Funnel understanding, privacy compliance, tracking infrastructure, KPI alignment |
The finance domain is organized into 7 sub-domain files:
| Sub-domain | Rules | Focus |
|---|---|---|
| Risk Management | 3 | VaR tail-risk assumptions, stress testing, risk-adjusted metrics (Sharpe/Sortino) |
| Portfolio & Allocation | 4 | MPT limitations, benchmark selection, diversification depth, rebalancing methodology |
| Backtesting | 5 | Look-ahead bias, survivorship bias, transaction costs, overfitting, multiple testing bias |
| Valuation & Pricing | 3 | DCF sensitivity analysis, relative valuation pitfalls, option pricing model selection |
| Regulatory & Compliance | 3 | Regulatory awareness (SEC/FCA/ESMA), KYC/AML, financial data privacy |
| Time Series & Forecasting | 3 | Stationarity requirements, regime changes, return distribution assumptions |
| General Finance | 4 | Return calculation methodology, inflation adjustment, tax implications, market efficiency |
Domain Auto-Detection
When you run skill-eval without --domain, the analyzer auto-detects the most likely domain by counting keyword signals in the skill content. For example, a skill mentioning "attribution," "ROAS," and "landing page" detects as digital-marketing, while one mentioning "p-value," "effect size," and "t-test" detects as statistics.
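Conceptually, the detection is a keyword tally. A simplified sketch with toy signal lists (the shipped analyzer's lists are larger and curated):

```python
import re

# Toy keyword signals per domain; illustrative only.
DOMAIN_SIGNALS = {
    "digital-marketing": ["attribution", "roas", "landing page", "ctr"],
    "statistics": ["p-value", "effect size", "t-test", "confidence interval"],
}

def detect_domain(skill_text):
    """Return the domain with the most keyword hits, or None if no hits."""
    text = skill_text.lower()
    counts = {
        domain: sum(len(re.findall(re.escape(kw), text)) for kw in keywords)
        for domain, keywords in DOMAIN_SIGNALS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```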
Adding Custom Domains
Create a YAML file in skill_evaluator/domains/:
```yaml
domain: my-domain
version: "1.0.0"
rules:
  - name: my-rule
    description: "What this rule checks"
    applicability_patterns:
      - "pattern that makes this rule relevant"
    required_patterns:
      - "pattern that SHOULD be present"
    antipatterns:
      - "pattern that SHOULD NOT be present"
    failure_severity: suspicious  # or "incorrect"
    failure_message: "What went wrong"
    success_message: "What went right"
```
For domains with many rules, you can organize them into a directory instead of a single file. Create skill_evaluator/domains/my-domain/ and place multiple .yaml files inside; all rules are automatically merged at load time.
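Loading such a directory might look like this sketch (using PyYAML; the actual loader may differ):

```python
from pathlib import Path

import yaml  # PyYAML

def load_domain_rules(domain_dir):
    """Merge the rules from every .yaml file in a domain directory."""
    rules = []
    for path in sorted(Path(domain_dir).glob("*.yaml")):
        doc = yaml.safe_load(path.read_text()) or {}
        rules.extend(doc.get("rules", []))
    return rules
```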
We currently ship with six domains: Statistics, Causal Inference, Experiment Design, Data Science, Digital Marketing, and Finance, totaling 91 rules. This is meant to be extensible. If you have domain expertise in another area (e.g., healthcare, product management, cybersecurity, NLP evaluation) and want to contribute a rule set, open a PR and we'll review it. The more domains covered, the more useful this tool becomes for everyone.
Development
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run against your own skills
skill-eval ../skills/experiment-designer/
skill-eval ../skills/stats-reviewer/
```
How Scoring Works
Each dimension (Structure, Security, Quality, Domain Correctness, Maintenance) produces a 0–100 score. These are combined into a weighted composite, normalized by the total weight of applicable dimensions.
A few things worth knowing:
Security is a gate, not just a weight. If the security analyzer finds critical risks (prompt injection, credential harvesting, destructive shell commands), the overall score is hard-capped regardless of how well the skill scores on other dimensions. A skill with a prompt injection vulnerability gets an F even if the content is otherwise excellent.
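In sketch form, the gate might behave like this (the actual cap value is internal to the package; 39 here is just the top of the F band):

```python
def apply_security_gate(composite, critical_findings):
    """Hard-cap the overall score when critical security risks exist."""
    if critical_findings:          # e.g. prompt injection detected
        return min(composite, 39)  # assumed cap: top of the F band (0-39)
    return composite
```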
Domain correctness uses a two-tier severity model. Rules flagged as incorrect (e.g., recommending last-click attribution as the only model, or computing CLV without a proper probabilistic framework) carry heavier penalties than those flagged as suspicious (best-practice recommendations that may vary by context). This lets the evaluator distinguish hard errors from soft guidance.
Scores are normalized by applicable checks. Quality and maintenance scores are computed as the ratio of passed checks to applicable checks. A concise, well-written skill won't score lower than a verbose one just because it has fewer regex matches. Each check that fires gets an equal vote.
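A minimal sketch of that ratio, assuming (hypothetically) that a dimension with no applicable checks scores full marks:

```python
def dimension_score(passed, applicable):
    """Score a dimension as passed/applicable checks, scaled to 0-100."""
    if applicable == 0:
        return 100.0  # assumption: nothing applicable means nothing to fail
    return 100.0 * passed / applicable
```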
Structural penalties are weighted by severity. Missing your SKILL.md entirely costs more than missing a recommended field. The penalty for each finding reflects how much it actually impacts usability.
What the scores don't tell you. The current system is a static linter โ it checks patterns in text. It can't tell you whether the skill actually makes an agent perform better, or whether the domain guidance is semantically correct beyond keyword matching. Those are harder problems (see below).
Future Research
These are directions we'd like to explore. Contributions welcome.
Empirical weight calibration. The dimension weights (Security 20%, Domain 25%, etc.) are based on informed judgment, not data. The right approach is to score a labeled corpus of known-good vs. known-bad skills and use the results to find weights that best predict the label. If you have a labeled skill corpus or want to help build one, open an issue.
Semantic domain checks. The domain correctness analyzer currently uses regex patterns โ it checks whether certain keywords appear, not whether the guidance is actually correct. Replacing this with embedding-based or lightweight LLM checks (e.g., "does this skill enforce power analysis, or just mention it?") would substantially improve accuracy.
Behavioral evaluation. The ultimate test of a skill is: does an agent using it produce better outputs? A test harness with known-correct answers (e.g., "given this dataset, should the agent refuse to run the test?") would let us score skills by downstream impact rather than surface patterns. This is the direction taken by SkillsBench-style evaluations and would make this tool a performance predictor, not just a linter.
More domains. We're actively expanding domain coverage. Next candidates include healthcare, product management, and cybersecurity. Contributions welcome โ each new domain just requires a YAML rule set and test fixtures.
CI / CD Integration
GitHub Action
Use agent-skill-evaluator as a reusable GitHub Action to evaluate skills in your CI pipeline:
```yaml
# .github/workflows/skill-eval.yml
name: Evaluate Skills
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Evaluate skill
        uses: WilliamWJHuang/agent-skill-evaluator@master
        id: eval
        with:
          path: './my-skill/'     # Path to skill directory
          fail-below: '60'        # Fail CI if score < 60
          # domain: 'statistics'  # Optional: force a domain
      - name: Print results
        run: |
          echo "Score: ${{ steps.eval.outputs.score }}"
          echo "Grade: ${{ steps.eval.outputs.grade }}"
```
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| `path` | ✓ | `.` | Path to skill directory or `SKILL.md` |
| `domain` | | `auto` | Force a specific domain |
| `fail-below` | | `50` | Fail if score is below threshold |
| `format` | | `terminal` | Output format: `terminal`, `md`, `json` |
Outputs
| Output | Description |
|---|---|
| `score` | Overall score (0–100) |
| `grade` | Letter grade (A+ through F) |
| `report` | Full markdown evaluation report |
The action also writes a Job Summary with the full report, visible directly in the GitHub Actions UI.
CLI in CI (without the Action)
```bash
pip install git+https://github.com/WilliamWJHuang/agent-skill-evaluator.git
skill-eval ./my-skill/ --fail-below 60
```
The --fail-below flag exits with code 1 if the score is below the threshold.
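If you'd rather consume the results programmatically, something like the sketch below works; the `score` and `grade` keys are assumed from the Action's outputs, so verify them against the JSON your installed version actually emits:

```python
import json
import subprocess

# Run the evaluator and parse its JSON report.
result = subprocess.run(
    ["skill-eval", "./my-skill/", "--format", "json"],
    capture_output=True, text=True,
)
report = json.loads(result.stdout)
print(report["score"], report["grade"])  # assumed key names
```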
License
MIT – use freely, attribute kindly.
File details
Details for the file agent_skill_evaluator-0.1.0.tar.gz.

File metadata
- Download URL: agent_skill_evaluator-0.1.0.tar.gz
- Size: 70.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ae221b88b731bf7736d1ef888cb0d9b4ccdf9e13b4f2b7576a1763df046ab436` |
| MD5 | `2e8fcdd5e6c679984f66b0dcafb68af3` |
| BLAKE2b-256 | `d9b018025dc7b06e1244029e18d767123b0ba420ada8bf14671a32c07dd66c21` |
File details
Details for the file agent_skill_evaluator-0.1.0-py3-none-any.whl.

File metadata
- Download URL: agent_skill_evaluator-0.1.0-py3-none-any.whl
- Size: 70.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `3f09a9636ad4b24f6c1b58e83511b1b639300db93f8cb0718b10cb8a4960d8cb` |
| MD5 | `ed3d855c492e823bcfce24d98edc391b` |
| BLAKE2b-256 | `37edac81ed240efe81b825ce48cf31d02ad855bef8274c2449aa661455388372` |