Evaluate and compare AI agent setups through experiments, inspections, and rubric scoring.
setup-eval
Evaluate AI agent setups for best practices, redundancy, security, and cross-component issues.
What it does
Most agent evaluation tools test whether a skill completes a task correctly. This tool evaluates the entire setup that surrounds the agent: CLAUDE.md, skills, commands, hooks, MCP configs, and sub-agents.
It checks whether each component follows best practices, whether components work well together, and whether anything is redundant, conflicting, or insecure.
Supported tools: Claude Code and Cursor. The tool auto-detects which tool(s) a project uses and evaluates all discovered components.
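As a rough illustration of what auto-detection could look like, the sketch below checks a project directory for marker files. The marker paths (`CLAUDE.md`, `.claude`, `.cursor`), the tool labels, and the function name `detect_tools` are assumptions made for this example, not setup-eval's actual detection logic.

```python
from pathlib import Path

# Hypothetical marker paths per tool; the real detection logic may differ.
TOOL_MARKERS = {
    "claude-code": ("CLAUDE.md", ".claude"),
    "cursor": (".cursor",),
}


def detect_tools(project_root):
    """Return the sorted names of tools whose marker files or directories exist."""
    root = Path(project_root)
    found = []
    for tool, markers in TOOL_MARKERS.items():
        if any((root / marker).exists() for marker in markers):
            found.append(tool)
    return sorted(found)
```

A project containing both `CLAUDE.md` and a `.cursor/` directory would report both tools, which matches the "evaluates all discovered components" behavior described above.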
Overview
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ setup-eval-lint │ │ setup-eval-     │ │ setup-eval-     │ │ setup-eval-     │
│                 │ │ review          │ │ security        │ │ skill           │
│ 43 rules        │ │ per-component   │ │ all security    │ │ deep-dive on    │
│ system analysis │ │ rubrics         │ │ rules           │ │ one skill       │
│ token budget    │ │ 21 cross-type   │ │ AST + taint     │ │ lint + rubric   │
│ trigger overlap │ │ checks          │ │ YARA + CVE      │ │ + contextual    │
│ dependencies    │ │ instruction     │ │ 4-check         │ │ analysis        │
│ context util    │ │ clarity         │ │ semantic review │ │                 │
│                 │ │ KEEP / REVIEW   │ │ SAFE / CAUTION  │ │ KEEP / REVIEW   │
│ no LLM, fast    │ │ / REMOVE        │ │ / UNSAFE        │ │ / REMOVE        │
└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
  "does it pass?"   "is it effective?"     "is it safe?"    "how is this skill?"
Install
From PyPI
pip install setup-eval
From source
git clone https://github.com/redhat-community-ai-tools/harness-eval-lab.git
cd harness-eval-lab
uv sync
Optional extras:
uv sync --extra llm # LLM support (for review CLI and eval-skill --rubric)
uv sync --extra security # YARA signature scanning (for security)
As a Claude Code plugin
Install directly from within Claude Code:
/plugin marketplace add redhat-community-ai-tools/harness-eval-lab
/plugin install setup-eval@setup-eval
/reload-plugins
Updating: Re-run the install command periodically to get the latest rules and improvements. Follow the repository for release announcements.
For Cursor users
Install the CLI:
pip install setup-eval
setup-eval setup-eval-lint /path/to/your/project
To use the commands inside Cursor, copy the .cursor/commands/ directory from this repo into your project's .cursor/commands/. The 4 eval commands will appear in Cursor's command palette:
- setup-eval-lint - fast static analysis (no LLM)
- setup-eval-review - full LLM review
- setup-eval-security - deep security audit
- setup-eval-skill - deep-evaluate one skill
To test the Claude Code plugin locally during development:
claude --plugin-dir /path/to/setup-eval
After installing the Claude Code plugin, these commands become available in / autocomplete:
- /setup-eval:setup-eval-lint - fast static analysis, no LLM, CI-suitable
- /setup-eval:setup-eval-review - full qualitative review with KEEP/REVIEW/REMOVE verdicts
- /setup-eval:setup-eval-security - deep security audit with deterministic scan + semantic review
- /setup-eval:eval-skill <skill-name> - deep-evaluate one skill in context
Usage
CLI
setup-eval setup-eval-lint /path/to/project
setup-eval setup-eval-lint /path/to/project --preset strict --format json
setup-eval setup-eval-lint /path/to/project --fail-on-error
export GEMINI_API_KEY=your-key # or ANTHROPIC_API_KEY
setup-eval setup-eval-review /path/to/project
setup-eval setup-eval-review /path/to/project --provider anthropic --model claude-sonnet-4-20250514
setup-eval setup-eval-security /path/to/project
setup-eval setup-eval-security /path/to/project --review --provider gemini
setup-eval eval-skill /path/to/skills/my-skill --context /path/to/project
setup-eval eval-skill /path/to/skills/my-skill --context /path/to/project --rubric
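For CI scripts written in Python, the lint invocations above can be assembled programmatically. This is a minimal sketch: `build_lint_command` and `run_lint` are hypothetical helpers, and they only use flags that appear in the commands above (`--preset`, `--format`, `--fail-on-error`). It assumes `--fail-on-error` makes the CLI exit with a non-zero status when errors are found, which the flag name suggests but this README does not spell out.

```python
import subprocess


def build_lint_command(project, preset=None, fmt=None, fail_on_error=False):
    """Assemble the argument list for a setup-eval-lint run."""
    cmd = ["setup-eval", "setup-eval-lint", str(project)]
    if preset:
        cmd += ["--preset", preset]
    if fmt:
        cmd += ["--format", fmt]
    if fail_on_error:
        cmd.append("--fail-on-error")
    return cmd


def run_lint(project, **options):
    """Run the linter and return its exit code (requires setup-eval on PATH)."""
    return subprocess.run(build_lint_command(project, **options)).returncode
```

Building the argument list separately from running it keeps the command easy to log and to unit-test without the CLI installed.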
Note on /setup-eval-security: The YARA signature scanning check requires yara-python. If not installed, the YARA check is skipped automatically and the report notes it. All other security checks run without extra dependencies. To enable YARA scanning:
pip install yara-python
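The skip-when-missing behavior described in the note is a common optional-dependency pattern. The sketch below shows one way to detect whether `yara-python` (imported as `yara`) is available before scheduling the check. The function names, the `"deterministic"` placeholder, and the wording of the skip note are hypothetical and not taken from setup-eval's code.

```python
import importlib.util


def is_available(module_name):
    """Return True if the top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def plan_security_checks(optional_module="yara"):
    """Decide which checks run; 'deterministic' stands in for the other checks."""
    checks = ["deterministic"]
    notes = []
    if is_available(optional_module):
        checks.append("yara")
    else:
        # Record the skip so the report can mention it instead of failing.
        notes.append("YARA check skipped: yara-python is not installed")
    return checks, notes
```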
CLI Commands
| Command | Description | Needs LLM? |
|---|---|---|
| setup-eval-lint | 39 deterministic rules + system analysis (budget, triggers, deps, context utilization). | No |
| setup-eval-review | Per-component rubric review, 21 cross-type checks, KEEP/REVIEW/REMOVE verdicts. | Yes (API key) |
| setup-eval-security | All security rules + YARA + CVE lookups + optional LLM semantic review. | Optional (--review) |
| setup-eval-skill | Deep-evaluate a single skill individually and in context of the setup. | Optional (--rubric) |
Plugin Skills
| Skill | Description | Needs LLM? |
|---|---|---|
| /setup-eval-lint | 43 rules, system analysis. Fast, CI-suitable. | No |
| /setup-eval-review | Per-component rubrics, 21 cross-type checks, KEEP/REVIEW/REMOVE verdicts. | Yes (Claude in-session) |
| /setup-eval-security | Deterministic security scan + semantic security review with 4-check checklist. | Yes (Claude in-session) |
| /setup-eval-skill | Deep-evaluate one skill against rubric + contextual analysis. | Yes (Claude in-session) |
Inspection Rules (43)
| Category | Rules | What they check |
|---|---|---|
| Structural | 1 | SKILL.md exists |
| Frontmatter | 3 | Description required/quality (POV, use-case, length), format valid |
| Content | 4 | Duplicate detection (TF-IDF), broken references, circular references, token budget |
| Security | 9 | Credential access, prompt injection (17 patterns), data exfiltration, obfuscation, reverse shells, AST behavioral analysis, taint tracking, MCP least-privilege, MCP tool poisoning |
| Security (opt-in) | 2 | YARA signature scanning, CVE lookups via OSV.dev (only in setup-eval-security) |
| Commands | 8 | Description, script exists, duplicates, credentials, injection, skill overlap, shadows built-in, references nonexistent skill |
| CLAUDE.md | 3 | Exists, skill duplication, generic advice detection |
| Hooks | 1 | Structure validation, dangerous patterns |
| Agents | 9 | Description, skills exist, tool format, constraint matching, credentials, injection, exfiltration, obfuscation, reverse shells |
Four presets: recommended (default), strict, security, pre-workflow.
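To make the TF-IDF duplicate detection listed in the Content row concrete, here is a minimal sketch using only the standard library. The tokenizer, the smoothed IDF formula, and the 0.8 similarity threshold are assumptions chosen for illustration; setup-eval's actual implementation and threshold may differ.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term weights) per document."""
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms in every document.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(docs, threshold=0.8):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]
```

Passing the text of each skill or command as one entry in `docs` would flag near-identical pairs while ignoring pairs that only share common words.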
Future Plans
The future-plans/ directory contains planned improvements, each in its own subfolder. Each doc explores a problem, presents approaches with trade-offs, and describes how to build it.
Every plan doc has a Status at the top:
| Status | Meaning |
|---|---|
| future | Idea documented, not yet planned for implementation |
| in design | Actively being designed, approaches being evaluated |
| in progress | Implementation underway |
| built | Implemented and merged |
| Plan | What it addresses |
|---|---|
| adjusting-to-dynamic-workflows | Adapting to Claude Code's dynamic workflows (pre-flight checks, workflow evaluation, quality gates) |
| test-coverage | Expanding tests to cover all rules with edge cases |
| runner-abstraction | Evaluating setups for other agent tools (Cursor, Copilot, Windsurf) |
| impact-dimension | Measuring whether a setup actually helps the agent (A/B testing) |
| scoring-calibration | Validating review accuracy against human judgment |
| sarif-output | SARIF output format for GitHub code scanning (inline PR annotations, Security tab alerts) |
| security-benchmarks | Benchmarking security rules against known-malicious and benign setups (TPR/FPR measurement) |
| setup-recommend | Recommending missing components based on project stack profiling |
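The TPR/FPR measurement mentioned for security-benchmarks reduces to counting outcomes over labeled setups. A minimal sketch follows; the input shape (pairs of ground-truth label and scanner verdict) and the function name `tpr_fpr` are assumptions for illustration, not the format the plan doc specifies.

```python
def tpr_fpr(results):
    """Compute (TPR, FPR) from (is_malicious, was_flagged) boolean pairs."""
    tp = sum(1 for malicious, flagged in results if malicious and flagged)
    fn = sum(1 for malicious, flagged in results if malicious and not flagged)
    fp = sum(1 for malicious, flagged in results if not malicious and flagged)
    tn = sum(1 for malicious, flagged in results if not malicious and not flagged)
    # Guard against empty classes so the ratios stay defined.
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Here TPR is the share of known-malicious setups that were flagged, and FPR is the share of benign setups that were flagged by mistake.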
Contributing
See how-to-contribute.md for guidelines on adding rules, future plans, and submitting PRs.
Changelog
See CHANGELOG.md for release history and notable changes.