Evaluate and compare AI agent setups through experiments, inspections, and rubric scoring.

setup-eval

Requires Python 3.11+. License: Apache 2.0.

Evaluate AI code agent setups for best practices, redundancy, security, and cross-component issues.

Available as a CLI tool, a Claude Code plugin, and Cursor commands.

Supports Claude Code and Cursor projects. Auto-detects which tool(s) a project uses.

What it does

Most tools test whether a skill produces correct output. This tool checks the setup itself: CLAUDE.md, skills, commands, hooks, MCP configs, agents, .cursor/rules/*.mdc, .cursorrules.
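To make the auto-detection idea concrete, here is a minimal sketch of how a project could be classified from the files listed above. It is not the tool's actual implementation: the function name `detect_tools`, the tool labels, and the choice of which file signals which tool are assumptions made for illustration.

```python
from pathlib import Path


def detect_tools(project_root: str) -> set[str]:
    """Guess which agent tools a project is set up for (hypothetical helper).

    Assumption: CLAUDE.md signals Claude Code; .cursorrules or any
    .cursor/rules/*.mdc file signals Cursor.
    """
    root = Path(project_root)
    tools: set[str] = set()

    if (root / "CLAUDE.md").is_file():
        tools.add("claude-code")

    rules_dir = root / ".cursor" / "rules"
    has_mdc_rules = rules_dir.is_dir() and any(rules_dir.glob("*.mdc"))
    if (root / ".cursorrules").is_file() or has_mdc_rules:
        tools.add("cursor")

    return tools
```

Calling `detect_tools(".")` on a project that contains both kinds of files would return both labels, which matches the "tool(s)" wording above.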

Four commands, same engine:

| Command | What it does | LLM in CLI | LLM in Claude Code / Cursor |
|---|---|---|---|
| setup-eval-lint | 43 deterministic rules + system analysis (token budget, trigger overlaps, dependencies). Fast, CI-suitable. | No | No |
| setup-eval-review | Per-component rubric review with 0-3 scoring per dimension, 21 cross-type checks. KEEP/REVIEW/REMOVE verdicts. | Yes (API key) | Yes (in-session) |
| setup-eval-security | All security rules + YARA + CVE lookups + semantic review. SAFE/CAUTION/UNSAFE. | Scan: no. Semantic review: --review flag | Yes (in-session) |
| eval-skill | Deep-evaluate one skill individually and in context of the full setup. | Lint: no. Rubric: --rubric flag | Yes (in-session) |
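As an illustration of how 0-3 scores per dimension could roll up into a KEEP/REVIEW/REMOVE verdict, the sketch below averages the scores and applies two cut-offs. The thresholds, the averaging rule, and the dimension names in the usage line are all hypothetical; the source does not state how setup-eval-review actually derives its verdicts.

```python
def verdict(scores: dict[str, int],
            keep_at: float = 2.0,
            remove_below: float = 1.0) -> str:
    """Map per-dimension scores (0-3) to a verdict.

    Assumed rule, not the tool's: mean >= keep_at -> KEEP,
    mean < remove_below -> REMOVE, anything in between -> REVIEW.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    for dimension, score in scores.items():
        if not 0 <= score <= 3:
            raise ValueError(f"{dimension}: score {score} is outside 0-3")

    mean = sum(scores.values()) / len(scores)
    if mean >= keep_at:
        return "KEEP"
    if mean < remove_below:
        return "REMOVE"
    return "REVIEW"


# Hypothetical dimension names, for illustration only.
print(verdict({"clarity": 3, "specificity": 2}))
```

A real rubric might weight dimensions differently or let a single failing dimension force a REVIEW; the averaging here is only the simplest option.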

Install

CLI tool

Install from PyPI and run from the terminal:

pip install setup-eval

setup-eval setup-eval-lint .
setup-eval setup-eval-lint . --watch     # re-run lint automatically on file changes
setup-eval setup-eval-review . --provider gemini
setup-eval setup-eval-security . --review
setup-eval eval-skill ./skills/my-skill --context . --rubric

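The LLM-backed steps (setup-eval-review, setup-eval-security with --review, and eval-skill with --rubric) require GEMINI_API_KEY or ANTHROPIC_API_KEY. setup-eval-lint and the plain security scan need no API key.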

setup-eval-security supports optional YARA malware signature scanning. To enable it, install the yara extra (quoted so that shells such as zsh do not expand the brackets): pip install "setup-eval[yara]"

Claude Code plugin

No pip install needed. Install directly from within Claude Code:

/plugin marketplace add redhat-community-ai-tools/harness-eval-lab
/plugin install setup-eval@setup-eval
/reload-plugins

The 4 commands appear in the / menu:

  • /setup-eval:setup-eval-lint
  • /setup-eval:setup-eval-review
  • /setup-eval:setup-eval-security
  • /setup-eval:eval-skill

No API key needed. Claude evaluates in-session.

Updating: Re-run the install command to get the latest rules.

Cursor commands

Requires the CLI tool to be installed first (the Cursor commands call it for the deterministic scan):

pip install setup-eval

Then copy .cursor/commands/ from this repo into your project. The 4 commands appear in Cursor's command palette:

  • /setup-eval-lint
  • /setup-eval-review
  • /setup-eval-security
  • /eval-skill

No API key needed for review/security/skill. Cursor evaluates in-session.

Inspection Rules (43)

| Category | Rules | What they check |
|---|---|---|
| Structural | 1 | SKILL.md exists |
| Frontmatter | 3 | Description required/quality, format valid |
| Content | 4 | Duplicate detection (TF-IDF), broken references, circular references, token budget |
| Security | 9 | Credential access, prompt injection (17 patterns), data exfiltration, obfuscation, reverse shells, AST analysis, taint tracking, MCP least-privilege, tool poisoning |
| Security (opt-in) | 2 | YARA signatures, CVE lookups via OSV.dev |
| Commands | 8 | Description, script exists, duplicates, credentials, injection, skill overlap, shadows built-in, references nonexistent skill |
| CLAUDE.md | 3 | Exists, skill duplication, generic advice detection |
| Hooks | 1 | Structure validation, dangerous patterns, network access |
| Agents | 9 | Description, skills exist, tool format, constraint matching, credentials, injection, exfiltration, obfuscation, reverse shells |
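The Content category mentions TF-IDF duplicate detection. The sketch below shows the general technique in plain Python: weight each term by TF-IDF, then flag pairs of documents whose cosine similarity crosses a threshold. It is not the tool's code; the tokenizer, the smoothed IDF formula, the 0.8 threshold, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build one sparse TF-IDF vector per document."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    vectors: dict[str, dict[str, float]] = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1
        vectors[name] = {
            # Smoothed IDF keeps weights positive for terms found in every document.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(docs: dict[str, str],
                    threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Return (name, name, similarity) for every pair at or above the threshold."""
    vectors = tfidf_vectors(docs)
    names = sorted(vectors)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = cosine(vectors[left], vectors[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs
```

Given a mapping of skill names to their text, `find_duplicates` returns the pairs worth a manual look; the threshold trades missed near-duplicates against false alarms.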

Four presets: recommended (default), strict, security, pre-workflow.

Contributing

See CONTRIBUTING.md for adding rules and submitting PRs.

Changelog

See CHANGELOG.md for release history.

Future Plans

See future-plans/ for planned improvements (SARIF output, security benchmarks, runner abstraction, dynamic workflows, impact measurement).
