Skip to main content

通用 AI Skill 评测引擎 — 自解析、自生成测试、自执行、自评估

Project description

Skill-Cert: AI Skill Evaluation Engine

Automated evaluation engine for AI agent skills (SKILL.md files).

Skill-Cert takes any SKILL.md file — the instruction format used by Claude Code, Codex, OpenCode, Cursor, and other AI coding agents — and evaluates it through a rigorous automated pipeline. It parses skill structure, generates test cases, executes with-skill vs without-skill comparisons, computes L1-L8 metrics, detects cross-model drift, and produces standardized PASS / PASS_WITH_CAVEATS / FAIL verdicts.

In one sentence:

Skill-Cert turns "does this skill actually work?" from a subjective feeling into repeatable, quantifiable, comparable evaluation results.

English | 简体中文


Table of Contents


1. Why Skill-Cert?

Teams write Skills for AI coding agents all the time — code review skills, security audit skills, documentation skills, debugging skills, PR workflows, browser QA, project-specific conventions.

But after writing a Skill, you face several problems:

1.1 You don't know if the Skill actually works

The Skill looks thorough on paper. But does the model trigger it in the right scenarios? Does it follow the workflow? Is the output actually better than without the Skill?

Without evaluation, you're relying on a few manual trials. Conclusions are easily skewed by sample selection, model state, and reviewer subjectivity.

1.2 You don't know if the Skill is stable

Does the same Skill perform consistently across Claude, GPT, Qwen, DeepSeek, and Gemini? Do multiple runs produce stable results? Is the Skill only effective on one model and useless on others?

These require systematic cross-model, cross-run evaluation.

1.3 You don't know if the Skill is safe

A Skill is essentially high-priority operational guidance for the model. If it contains dangerous commands, credential access, prompt injection, or data exfiltration instructions, it poses a security risk.

Skill-Cert runs security scanning before evaluation to catch risks early.

1.4 You don't know the cost and latency impact

Skills typically increase context length, tool calls, and reasoning steps. This may improve output quality but also increase cost and response time.

Skill-Cert tracks tokens, cost, and latency, and evaluates whether the benefit justifies the overhead.


2. What It Does

Skill-Cert takes a SKILL.md file and runs a complete evaluation pipeline:

SKILL.md
  ↓
Parse skill structure
  ↓
Security scan
  ↓
Auto-generate eval tests
  ↓
Self-review + gap-fill
  ↓
with-skill / without-skill execution
  ↓
Assertion grading + LLM-as-judge
  ↓
L1-L8 metrics calculation
  ↓
Cross-model drift detection
  ↓
Markdown + JSON report

Output:

  • {skill}-report.md — human-readable evaluation report
  • {skill}-result.json — machine-readable structured results
  • {skill}-evals-cache.json — eval cases and execution cache

3. Core Philosophy

Skill-Cert's core assumption:

A good Skill shouldn't just "look reasonable" — it must demonstrably improve model performance on real tasks.

Evaluation isn't about checking the SKILL.md text. It's about answering four questions:

Question Metric
Does the model know when to use this Skill? L1 Trigger Accuracy
Does the Skill actually improve results? L2 Output Delta
Does the model follow the Skill's workflow? L3 Step Adherence
Are results stable across runs and models? L4 Stability / Drift

Beyond these, we extend to efficiency, security, cost, latency, and multi-turn dialogue quality.


4. Evaluation Pipeline

Phase 0: Skill Parsing

Implementation: engine/analyzer.pySkillSpec, WorkflowStep, parse_skill_md().

Skill-Cert reads SKILL.md and extracts a structured semantic model (SkillSpec):

  • name, description, triggers
  • workflow steps, anti-patterns
  • output format, examples
  • content length, parse method, parse confidence

Parsing methods:

  1. YAML frontmatter extraction
  2. Markdown AST parsing (via markdown-it-py)
  3. Regex-based section extraction
  4. LLM-assisted fallback when needed

An 8-dimension confidence score is computed: frontmatter(0.30) + workflow(0.25) + headings(0.15) + anti-patterns(0.10) + output-format(0.08) + triggers(0.07) + examples(0.05) + bonus(0.05). Low confidence flags the results as unreliable.

Phase 0.5: Security Scanning

Implementation: engine/security_probes.pySecurityScanner.

Security scanning runs before test generation. It checks 5 categories:

Category Meaning
INJ Prompt Injection
EXF Data Exfiltration
DCMD Dangerous Commands
CRD Credential Access
OBF Obfuscation

52 built-in probe patterns across 6 categories (INJ/EXF/DCMD/CRD/OBF/PRIV_ESC). Results: PASS / WARN / BLOCK. A BLOCK verdict causes immediate evaluation failure.

Phase 1: Auto-Generate Eval Tests

Implementation: engine/testgen.pyEvalGenerator, fallback: templates/minimum-evals.json.

Skill-Cert auto-generates evaluation test cases from SkillSpec. Generation is not one-shot — it's a self-review loop:

Generate initial tests → Review coverage → Identify gaps → Fill gaps → Re-review → until coverage >= 90%

Coverage includes: trigger cases (should/should-not trigger), workflow step cases, anti-pattern cases, output format cases, security/robustness cases.

Key thresholds:

Threshold Meaning
coverage target = 90% Ideal coverage
coverage degrade = 70% Below this, degrade
coverage block = 70% Too low, block evaluation

If generation fails, minimum-evals.json is used as a fallback.

Phase 2: With-Skill / Without-Skill Execution

Implementation: engine/runner.pyEvalRunner.

The core insight: don't just look at Skill output — run a controlled experiment:

Same eval suite
  ├── without-skill: model without Skill loaded
  └── with-skill: model with Skill loaded

Then compare the two sets of results. If the model can already do the task well, the Skill adds no value. Only when with-skill significantly outperforms without-skill does the Skill demonstrate real improvement.

The Runner handles: concurrent execution, rate limiting, timeouts, token tracking, security scanning, operating envelope checks, partial failure preservation.

Default limits:

Item Default
max steps 20
max tool calls 15
token budget 50,000
timeout 300s
max concurrency 5
rate limit 60 RPM

Phase 3: Grading

Implementation: engine/grader.pyGrader, JudgeResult, EvalAssertion.

Two grading approaches:

Deterministic Assertions

Supports contains, not_contains, regex, starts_with, json_valid. Weighted: Normal(1), Important(2), Critical(3). Deterministic assertions are stable, cheap, and repeatable.

LLM-as-Judge

For complex behaviors (e.g., "did the model make a reasonable architectural trade-off?"), deterministic assertions may be insufficient. Skill-Cert can enable LLM-as-judge. Constraints: temperature must be 0, only used when deterministic checks are insufficient, L4 stability calculation excludes LLM judge results to avoid randomness.

Phase 4: L1-L8 Metrics

Implementation: engine/metrics.pyMetricsCalculator.

8-tier metric system:

Tier Name Measures Threshold
L1 Trigger Accuracy Does the model know when to use the Skill? >= 90%
L2 Output Delta Does with-skill outperform without-skill? >= 20%
L3 Step Adherence Does the model follow the workflow? >= 85%
L4 Stability Are results consistent across runs? std <= 10%
L5 Step Efficiency Within step/token/tool call limits? All pass
L6 Trajectory Quality Is multi-turn dialogue coherent? dialogue mode
L7 Cost Efficiency Is the cost justified? Under budget
L8 Latency Is latency acceptable? No slow requests

L1/L2 measure effectiveness. L3/L4 measure reliability. L5/L7/L8 measure efficiency. L6 measures multi-turn interaction quality.

Phase 5: Cross-Model Drift Detection

Implementation: engine/drift.pyDriftDetector.

Skill-Cert runs the same eval suite across multiple models and compares pass rate variance.

Drift severity:

Level Variance Meaning
none <= 0.10 Consistent
low <= 0.20 Minor, acceptable
moderate <= 0.35 Significant, needs attention
high > 0.35 Unstable, cannot release

Verdict impact: none/low → no effect on PASS, moderate → downgrade to PASS_WITH_CAVEATS, high → FAIL.

Phase 6: Report Generation

Implementation: engine/reporter.pyReporter.

Two output formats:

Markdown report: human-readable, includes executive summary, verdict, overall score, L1-L8 metrics, drift analysis, security scan, cost analysis, latency analysis, improvement suggestions, config summary.

JSON report: machine-readable structured data:

{
  "verdict": "PASS",
  "overall_score": 0.82,
  "metrics": {
    "l1_trigger_accuracy": 0.90,
    "l2_with_without_skill_delta": 0.25,
    "l3_step_adherence": 0.88,
    "l4_execution_stability": 0.93
  },
  "drift_analysis": {
    "highest_severity": "none",
    "average_variance": 0.0,
    "overall_verdict": "PASS"
  },
  "evaluation_coverage": {
    "total_evaluations": 207,
    "avg_pass_rate": 1.0
  }
}

Extended Capabilities

Capability Module Description
Multi-Skill Conflict Detection multi_skill.py Trigger overlap, prompt contamination, token overflow
Stress Testing stress_test.py Concurrency fairness, memory tracking, scalability scoring
Reliability Tracking reliability.py Error classification, retry stats, graceful degradation
Maintainability Scoring maintainability.py SKILL.md readability, completeness, freshness
External Integrations integrations.py SkillLab / DeepEval providers (graceful degradation)
Operating Envelope envelope.py Steps/tokens/timeout/tool_calls limit enforcement
Cost Analysis adapters/pricing.py 17 models across 6 provider families
OTel Telemetry engine/observability.py SessionTelemetry, record_trace, session summary
Token Ledger engine/token_ledger.py Real-time token usage tracking (not approximations)

5. Architecture

Skill-Cert follows Clean Architecture with explicit layer boundaries:

skill_cert/       Presentation layer: CLI entry
    ↓
engine/           Domain layer: core evaluation logic
    ↓
adapters/         Infrastructure layer: LLM provider adapters
    ↓
prompts/
schemas/
templates/        Support layer: prompts, schemas, templates

5.1 Presentation: CLI Layer

Location: skill_cert/cli/. Responsibilities: parse CLI arguments, load configuration, invoke core pipeline, emit exit codes, generate report files. Entry point: main.py, config wizard: setup.py.

5.2 Domain: Core Evaluation Layer

Location: engine/.

File Responsibility
analyzer.py Parse SKILL.md into SkillSpec
testgen.py Auto-generate eval tests
runner.py Execute with-skill / without-skill
grader.py Grade model outputs
metrics.py Calculate L1-L8 metrics
drift.py Cross-model drift detection
reporter.py Generate Markdown / JSON reports
security_probes.py Security scanning
envelope.py Operating envelope checks
config.py Configuration loading and validation
dialogue_evaluator.py Multi-turn dialogue evaluation
dialogue_runner.py Dialogue execution with OTel trace recording
replay.py Historical session replay
simulator.py LLM behavior simulation for testing
multi_skill.py Multi-skill conflict detection
stress_test.py Stress testing
reliability.py Reliability tracking
maintainability.py SKILL.md maintainability scoring
skills_bench.py Multi-skill cognitive overload detection
calibration.py Golden eval set calibration (Cohen's Kappa)
stability.py Execution stability analysis
integrations.py SkillLab / DeepEval external integrations
observability.py OTel GenAI session telemetry
token_ledger.py Real-time token usage tracking
trigger_accuracy_eval.py L1 trigger accuracy evaluation
trajectory_evaluator.py L6 trajectory quality evaluation
adversarial.py Adversarial testing support
gotchas_flywheel.py Gotcha patterns accumulation
progressive_disclosure.py Progressive disclosure evaluation
deadline.py Global deadline enforcement
constants.py Shared constants and defaults
report_models.py Report data models
trace_models.py Telemetry trace data models

5.3 Infrastructure: Model Adapter Layer

Location: adapters/base.py, adapters/anthropic_compat.py, adapters/openai_compat.py, adapters/pricing.py.

Responsibilities: define unified LLM calling protocol, adapt Anthropic / OpenAI-compatible APIs, track real token usage, compute cost from pricing table.

Pricing supports multiple model families: Anthropic, OpenAI, Qwen, DeepSeek, Gemini.

5.4 Support: Templates and Schemas

Location: prompts/, schemas/, templates/.

Responsibilities: LLM judge prompt, testgen prompt, test-review prompt, test-gap prompt, eval JSON schema, SkillSpec schema, minimum eval fallback template.


6. Usage

6.1 Installation

pip install -e .

Development mode:

pip install -e ".[dev]"

6.2 Single-Model Evaluation

skill-cert --skill path/to/SKILL.md \
  --models "m1=https://api.example.com/v1,$API_KEY" \
  --output ./results/

Best for: quick check if a Skill is functional, local debugging, generating a preliminary report.

6.3 Multi-Model Drift Detection

skill-cert --skill path/to/SKILL.md \
  --models "m1=url,key|m2=url,key" \
  --output ./results/

Best for: pre-release cross-model stability verification, comparing provider performance, discovering model dependencies.

6.4 Dialogue Mode

skill-cert --skill path/to/SKILL.md \
  --mode dialogue \
  --max-turns 10

Best for: Orchestration Skills, Debug Skills, QA Skills, Code Review Skills — any Skill requiring multi-turn decision-making.

6.5 Replay Regression Testing

skill-cert --skill path/to/SKILL.md \
  --mode replay \
  --session session.jsonl

Best for: historical session replay, before/after Skill change comparison, regression prevention.

6.6 Multi-Run Stability Testing

skill-cert --skill path/to/SKILL.md \
  --models "m1=url,key|m2=url,key" \
  --runs 5

Best for: computing L4 stability, detecting random variance, verifying Skill repeatability.

6.7 Stress Testing

skill-cert --skill path/to/SKILL.md \
  --stress \
  --stress-concurrency 50 \
  --stress-evals 100

Best for: validating high-concurrency behavior, detecting resource leaks, assessing scalability.

6.8 Verdict Logic

Verdict Conditions
PASS L1 >= 90%, L2 >= 20%, L3 >= 85%, L4 std <= 10%, drift none/low
PASS_WITH_CAVEATS Core metrics pass, but drift moderate
FAIL Any core metric fails, or drift high, or coverage < 70%

7. Configuration

7.1 Environment Variables

Variable Description
SKILL_CERT_MODELS Model config: name=url,key[,fallback]|name2=url,key
SKILL_CERT_MAX_CONCURRENCY Max concurrency (default: 5)
SKILL_CERT_RATE_LIMIT_RPM Rate limit RPM (default: 60)
SKILL_CERT_TIMEOUT Timeout in seconds (default: 300)
ANTHROPIC_API_KEY Anthropic API Key
OPENAI_API_KEY OpenAI-compatible API Key
OPENAI_BASE_URL OpenAI-compatible Base URL

7.2 Config File

~/.skill-cert/models.yaml:

models:
  - model_name: "qwen3.6-plus"
    base_url: "https://api.example.com/v1"
    api_key: "$API_KEY"
    fallback_model: "qwen3-coder-plus"

Priority: CLI args > environment variables > config file > defaults.


8. Development

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=engine --cov=skill_cert --cov=adapters --cov-report=term-missing

# Format and lint
ruff check . && ruff format .

Conventions:

  • Pydantic v2 for all data models
  • Type annotations on all function signatures
  • ruff for linting and formatting
  • pytest for testing (test files mirror engine/ module structure 1:1)
  • Prompt templates are .md files, not Python strings
  • No hardcoded secrets — API keys via environment variables or config file

Project structure:

skill-cert/
├── engine/          # Core pipeline: 33 modules — parser, testgen, runner, grader, metrics, reporter, drift,
│                    # dialogue_evaluator, dialogue_runner, replay, simulator, security_probes, envelope,
│                    # integrations, reliability, maintainability, multi_skill, stress_test, stability, config,
│                    # skills_bench, calibration, observability, token_ledger, trigger_accuracy_eval,
│                    # trajectory_evaluator, adversarial, gotchas_flywheel, progressive_disclosure,
│                    # deadline, constants, report_models, trace_models
├── skill_cert/cli/  # CLI entry (main.py, setup.py)
├── adapters/        # LLM provider adapters (Anthropic, OpenAI-compatible) + pricing table
├── prompts/         # LLM prompt templates (judge, dialogue, drift, testgen, test-review, test-gap)
├── schemas/         # JSON schemas (eval cases, SkillSpec)
├── templates/       # Fallback eval template (minimum-evals.json)
├── tests/           # pytest suite — 1134 tests, mirrors engine/ modules 1:1
└── results/         # Output: {skill}-report.md, {skill}-result.json, {skill}-evals-cache.json

Note: skill_cert/cli.py was deleted (shadowed by cli/ package directory).


9. Limitations & Caveats

9.1 Known Limitations

L3 Step Adherence granularity

L3 only checks "are steps covered", not intermediate decision quality (tool call correctness, turn-level relevance). A Skill can pass L3 while producing poor intermediate decisions.

L4 Stability needs more samples

Single-run --runs N computes std dev. Industry standard typically requires 5-10 independent trials for reliable confidence intervals.

LLM-as-judge lacks calibration

Current LLM-as-judge lacks:

  • Position bias handling (option order may affect judgment)
  • Human-annotated calibration (golden eval set)
  • Specific failure reasons (binary judgment only)

Dialogue evaluation relies on word overlap

Multi-turn dialogue evaluation currently over-relies on word overlap rather than semantic understanding, potentially missing or misclassifying quality issues.

Security scan coverage is limited

52 probe patterns across 6 categories (INJ/EXF/DCMD/CRD/OBF/PRIV_ESC). Industry recommendation is 100+ (e.g., SpecWeave). Some attack vectors may still be uncovered.

Single-model evaluation is insufficient

While multi-model is supported, single-model evaluation cannot detect model dependencies. A Skill may only work on one model.

9.2 Usage Notes

  • Requires API keys: Skill-Cert depends on LLM API calls. At least one model's API key must be configured. Evaluation incurs API costs.
  • Evaluation takes time: Full evaluation (multi-model, dialogue) may take tens of minutes to hours, depending on eval count and model response speed.
  • Results are model-dependent: The same Skill may produce different results on different models. Use at least 2 models from different providers.
  • Not 100% accurate: Automated evaluation cannot fully replace human review, especially for complex behavior judgment. Use Skill-Cert as a supplement to human review.
  • Coverage < 70% blocks evaluation: If the Skill structure is too simple or vague, sufficient tests may not be generated, blocking evaluation.
  • Do not modify eval cases after execution: Eval cases are locked after Phase 2 execution. Modification breaks evaluation integrity.

9.3 Industry Comparison Gaps

Dimension Current State Industry Reference
L1 trigger granularity Binary trigger judgment CodeIF's 50 sub-dimensions
L3 trajectory quality Missing turn-level quality metrics Turn-level evaluation needed
L4 statistical method Single-run std 5-10 trial confidence intervals
Uncertainty detection No CMP/CME Cross-model perplexity/entropy
Calibration dataset No human-annotated golden set Human-annotated calibration needed

10. License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

skill_cert-0.5.4.tar.gz (272.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

skill_cert-0.5.4-py3-none-any.whl (153.8 kB view details)

Uploaded Python 3

File details

Details for the file skill_cert-0.5.4.tar.gz.

File metadata

  • Download URL: skill_cert-0.5.4.tar.gz
  • Upload date:
  • Size: 272.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for skill_cert-0.5.4.tar.gz
Algorithm Hash digest
SHA256 3906f640d4039609aa66f966fb22993c3f9c0e28f9d4df192b110ae6b8d2f1cb
MD5 c8f59ec5958e8a720fc72aa9a9efc5b9
BLAKE2b-256 80ea5cee42cdbd8ae0659fa46f5ff476a63787ca3339c797b8e8942a6a8fecc6

See more details on using hashes here.

File details

Details for the file skill_cert-0.5.4-py3-none-any.whl.

File metadata

  • Download URL: skill_cert-0.5.4-py3-none-any.whl
  • Upload date:
  • Size: 153.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for skill_cert-0.5.4-py3-none-any.whl
Algorithm Hash digest
SHA256 5e850b1e9d1c2ae0f2f97c5fd0cec095cdd7377018ce182306a6e74274fcdc84
MD5 de96a592d1e2172de9ff24240110a4a6
BLAKE2b-256 5039811c724c507363d729dd947231a07dcb192d7c9f7391a8bf302e903dcf4d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page