Deterministic evaluation tools for AI coding agents, exposed as an MCP server.
🛡️ agent-eval-mcp
Deterministic Evaluation and Guardrails for AI Coding Agents.
Building autonomous coding agents is easy. Figuring out how to evaluate whether what they've done is actually good is incredibly hard.
agent-eval-mcp is a stateless, deterministic Model Context Protocol (MCP) server that stops AI agents from writing lazy, unverified, or hallucinated code. It provides language-agnostic rulesets and hybrid scoring to grade AI-generated revisions before they get merged.
⚠️ The Problem
When you ask an LLM to evaluate its own code, it suffers from sycophancy. It will confidently tell you its fix is perfect, even when it has:
- Generated dummy patterns like `new HashMap<>()` or `pass`.
- Left `// TODO: implement this` in the production patch.
- Hallucinated the surrounding `SEARCH/REPLACE` context, breaking the Git patch.
💡 The Solution
This package exposes objective evaluation tools to your agentic workflows via the Model Context Protocol (MCP). It evaluates AI-generated code edits using fuzzy-matching and language-specific Abstract Syntax Tree (AST) rules (Java, Python, TypeScript, Go) to catch hallucinations deterministically.
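To make the AST idea concrete, here is a minimal sketch of one such rule for Python only, built on the standard `ast` module. It is an illustration under stated assumptions, not the package's actual ruleset: the helper names (`find_stub_functions`, `_is_placeholder`, `_is_docstring`) are hypothetical, and a function whose body is only a docstring is treated as a stub here.

```python
import ast


def _is_placeholder(stmt: ast.stmt) -> bool:
    """True for a bare `pass` or a bare `...` statement."""
    if isinstance(stmt, ast.Pass):
        return True
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and stmt.value.value is Ellipsis
    )


def _is_docstring(stmt: ast.stmt) -> bool:
    """True when the statement is a bare string literal (a docstring)."""
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and isinstance(stmt.value.value, str)
    )


def find_stub_functions(source: str) -> list[str]:
    """Return the names of functions whose body is only `pass` or `...`.

    Hypothetical rule for illustration; a docstring-only body also counts.
    """
    stubs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Ignore a leading docstring before judging the body.
            body = node.body[1:] if _is_docstring(node.body[0]) else node.body
            if all(_is_placeholder(stmt) for stmt in body):
                stubs.append(node.name)
    return stubs
```

Because the check walks the syntax tree instead of matching text, a `pass` inside a string or comment does not trigger it, which is the main advantage over a pure regex rule.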
It supports two edit formats:
- Cursor SEARCH/REPLACE blocks: for interactive agentic coding sessions.
- Standard unified diffs: for CI/CD pipelines and GitHub Action workflows, where diffs come from Pull Requests or `git diff` output.
It completely decouples the heavy lifting of code validation from your LLM orchestration layer.
🔧 Available Tools
| Tool | Input Format | Use Case |
|---|---|---|
| `verify_fix` | Cursor `<<<< SEARCH >>>> REPLACE` blocks | Interactive agentic coding sessions in Cursor |
| `verify_unified_diff` | Standard unified diff (`git diff` / GitHub PR) | CI/CD pipelines and GitHub Action workflows |
| `evaluate_revision_quality` | Behavioral signals (reflexion count, fetch count) | Scoring AI revision quality objectively |
verify_fix
Validates a Cursor-style <<<< SEARCH ==== >>>> REPLACE edit through four sequential checks: format, fuzzy grounding against the original file, dummy-pattern detection (AST + regex), and no-op detection.
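The grounding check can be pictured with a small sketch based on the standard library's `difflib`. This is an assumption about how fuzzy grounding might work, not the package's implementation: the function names and the `0.9` threshold are hypothetical.

```python
from difflib import SequenceMatcher


def best_match_ratio(search_block: str, file_content: str) -> float:
    """Slide a window the size of the SEARCH block over the file and
    return the best similarity ratio found (0.0 to 1.0)."""
    # Strip each line so indentation drift alone does not break grounding.
    search_lines = [line.strip() for line in search_block.strip().splitlines()]
    file_lines = [line.strip() for line in file_content.splitlines()]
    n = len(search_lines)
    if n == 0 or len(file_lines) < n:
        return 0.0
    target = "\n".join(search_lines)
    best = 0.0
    for start in range(len(file_lines) - n + 1):
        window = "\n".join(file_lines[start:start + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def is_grounded(search_block: str, file_content: str, threshold: float = 0.9) -> bool:
    """Hypothetical acceptance rule: the SEARCH text must closely match
    some region of the original file."""
    return best_match_ratio(search_block, file_content) >= threshold
```

A SEARCH block copied from the file with slightly different indentation still passes, while a block the model invented scores well below the threshold and is rejected.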
verify_unified_diff
Validates a standard unified diff — the patch format produced by git diff or visible in a GitHub Pull Request's "Files changed" view.
- The diff must target exactly one file. To obtain a single-file diff from git, run: `git diff HEAD -- path/to/file.py`
- Runs the same four-stage pipeline as `verify_fix`: format → grounding → dummy-pattern detection → no-op check.
- Supports the same `language` (`java`, `python`, `typescript`, `go`) and `custom_patterns` arguments.
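If you want to pre-check the single-file requirement before calling the tool, a sketch along these lines may help. It is not part of the package: `target_files` and `is_single_file_diff` are hypothetical helpers that read `+++` headers while skipping hunk bodies, so an added line whose content starts with `++` is not mistaken for a file header.

```python
import re

# Matches hunk headers such as "@@ -1,2 +1,3 @@"; a missing count means 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def target_files(diff_text: str) -> list[str]:
    """Return the paths named in the '+++' headers of a unified diff."""
    files = []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: consume lines according to their prefix.
            if line.startswith("+"):
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif line.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:
                old_left -= 1
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1)) if match.group(1) is not None else 1
            new_left = int(match.group(2)) if match.group(2) is not None else 1
        elif line.startswith("+++ "):
            # Drop an optional tab-separated timestamp after the path.
            files.append(line[4:].split("\t")[0])
    return files


def is_single_file_diff(diff_text: str) -> bool:
    return len(set(target_files(diff_text))) == 1
```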
evaluate_revision_quality
Derives a deterministic quality score (1–10) from observable facts about what the agent actually did during diagnosis — facts that cannot be fabricated by the LLM. Use it as a ceiling on the LLM's own self-reported confidence: final_score = min(llm_self_score, objective_score).
Signal table:
| Signal | Condition | Penalty |
|---|---|---|
| `reflexion_count` | == 1 | −2 |
| `reflexion_count` | ≥ 2 | −5 |
| `fetch_call_count` | 3–4 | −1 |
| `fetch_call_count` | ≥ 5 | −2 |
| `has_related_files` | `False` | −1 |
| `has_file_content` | `False` | −2 |
How to populate each signal (framework-agnostic):
- `reflexion_count`: your retry counter, i.e. the number of times `verify_fix` or `verify_unified_diff` returned `accepted=False` before the current passing call. Pass `0` on the first accepted attempt.
- `fetch_call_count`: count every file-read tool call your agent made during diagnosis (e.g. `read_file`, `get_file_contents`, `fetch_github_file`, or the equivalent in your framework).
- `has_related_files`: `True` if your agent retrieved at least one file other than the primary failing file (an import, a test file, a schema, a dependency).
- `has_file_content`: `True` if your agent retrieved the full content of the primary file being fixed before generating the patch.
Local mode vs. cloud mode:
In local mode (default), has_related_files and has_file_content reflect what the agent actually fetched — penalties apply when it skipped context retrieval. In cloud mode, where file access is always pre-loaded or guaranteed, pass both as True unconditionally to avoid penalising the agent for something outside its control.
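The signal table and the ceiling rule can be sketched in a few lines. The penalties come straight from the table; the starting value of 10 and the clamp at 1 are assumptions inferred from the stated 1–10 range, and the function names are hypothetical rather than the server's API.

```python
def objective_score(
    reflexion_count: int,
    fetch_call_count: int,
    has_related_files: bool,
    has_file_content: bool,
    base: int = 10,  # assumed starting score, inferred from the 1-10 range
) -> int:
    """Apply the penalties from the signal table and clamp to at least 1."""
    penalty = 0
    if reflexion_count == 1:
        penalty += 2
    elif reflexion_count >= 2:
        penalty += 5
    if 3 <= fetch_call_count <= 4:
        penalty += 1
    elif fetch_call_count >= 5:
        penalty += 2
    if not has_related_files:
        penalty += 1
    if not has_file_content:
        penalty += 2
    return max(1, base - penalty)


def final_score(llm_self_score: int, objective: int) -> int:
    """The objective score acts as a ceiling on the self-reported score."""
    return min(llm_self_score, objective)
```

For example, an agent that needed one retry, made three fetch calls, and never read the primary file scores `10 - 2 - 1 - 2 = 5`, so a self-reported 9 is capped at 5. In cloud mode you would pass `has_related_files=True` and `has_file_content=True` so that only the retry and fetch penalties apply.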
🚀 Quickstart
1. Install the Package
Install globally via pip so your MCP clients can execute it:
pip install agent-eval-mcp
2. Configure your MCP client
Add to ~/.cursor/mcp.json or your Claude Desktop config:
{
  "mcpServers": {
    "agent-eval": {
      "command": "agent-eval-mcp"
    }
  }
}
3. Use in your agentic workflow
For Cursor sessions — call verify_fix after every LLM-generated patch:
verify_fix(diagnosis="...", file_content="...", language="python")
For CI/CD / GitHub Actions — call verify_unified_diff with the PR diff:
verify_unified_diff(diff_text="...", file_content="...", language="java")
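One way to wire the verifier into an agent loop is sketched below. It is framework-agnostic and makes several assumptions: `generate_patch` and `verify` are hypothetical callables you supply (the latter wrapping a `verify_fix` or `verify_unified_diff` call), and the result is assumed to be a dict with `accepted` and `rejection_reason` keys, mirroring the CLI's JSON output.

```python
from typing import Callable, Optional, Tuple


def run_with_verification(
    generate_patch: Callable[[Optional[str]], str],  # hypothetical: your LLM call
    verify: Callable[[str], dict],                   # hypothetical: wraps the MCP tool
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Retry until a patch is accepted; return (patch, reflexion_count).

    reflexion_count is the number of rejections seen before the accepted
    attempt, so it is 0 when the first patch passes.
    """
    reflexion_count = 0
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        patch = generate_patch(feedback)
        result = verify(patch)
        if result.get("accepted"):
            return patch, reflexion_count
        reflexion_count += 1
        # Feed the rejection reason back into the next generation attempt.
        feedback = result.get("rejection_reason")
    return None, reflexion_count
```

The returned counter can be passed directly as `reflexion_count` to `evaluate_revision_quality`.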
🖥️ CLI Usage (GitHub Actions / CI/CD)
The agent-eval command validates a unified diff directly from the terminal, making it a drop-in step for any CI/CD pipeline. It prints a JSON result to stdout and exits with 0 (pass) or 1 (fail) so GitHub Actions can block a Pull Request automatically.
Install
pip install agent-eval-mcp
Basic usage
# Produce a single-file diff, then validate it
git diff HEAD -- src/mymodule.py > pr.patch
agent-eval --diff-file pr.patch --source-file src/mymodule.py --language python
With custom rejection patterns
Pass --pattern once per regex. Commas inside quantifiers (e.g. {1,5}) are safe because patterns are never comma-split.
agent-eval --diff-file pr.patch --source-file src/mymodule.py --language python \
--pattern "print\(.*\)" \
--pattern "TODO|FIXME"
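To preview what a set of patterns would catch, the matching can be approximated as follows. This is a simplified sketch, not the CLI's code: `rejected_added_lines` is a hypothetical helper, and it treats every line starting with `+` (except `+++` headers) as an added line, so an added line whose content itself begins with `++` would be skipped.

```python
import re


def rejected_added_lines(diff_text: str, patterns: list[str]) -> list[tuple[str, str]]:
    """Return (pattern, line) pairs for added lines matching any pattern."""
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for line in diff_text.splitlines():
        # Only added lines are inspected; removed and context lines are ignored.
        if line.startswith("+") and not line.startswith("+++"):
            added = line[1:]
            for regex in compiled:
                if regex.search(added):
                    hits.append((regex.pattern, added))
    return hits
```

Removing an old `print(...)` call is therefore never penalised; only introducing a new one is.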
Example output
{
  "accepted": false,
  "rejection_reason": "This diff introduces dummy or non-production patterns...",
  "stage_failed": "dummy_patterns"
}
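If you drive the CLI from a Python script instead of a shell step, a thin wrapper such as the one below can turn the JSON and exit code into a one-line summary. The command-line flags are the ones documented here; the helper names are hypothetical, and the sketch assumes a passing run also prints an `accepted` field.

```python
import json
import subprocess


def summarize(stdout: str, returncode: int) -> str:
    """Turn the CLI's JSON output and exit code into a short status line."""
    result = json.loads(stdout)
    if returncode == 0 and result.get("accepted"):
        return "PASS"
    stage = result.get("stage_failed", "unknown")
    reason = result.get("rejection_reason", "")
    return f"FAIL [{stage}]: {reason}"


def run_agent_eval(diff_file: str, source_file: str, language: str = "python") -> str:
    """Invoke the agent-eval CLI (must be on PATH) and summarize its result."""
    proc = subprocess.run(
        ["agent-eval", "--diff-file", diff_file,
         "--source-file", source_file, "--language", language],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 is an expected outcome, not an error
    )
    return summarize(proc.stdout, proc.returncode)
```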
GitHub Actions example
- name: Validate AI-generated diff
  run: |
    git diff HEAD -- ${{ env.CHANGED_FILE }} > pr.patch
    agent-eval \
      --diff-file pr.patch \
      --source-file ${{ env.CHANGED_FILE }} \
      --language java
Arguments
| Argument | Required | Default | Description |
|---|---|---|---|
| `--diff-file` | Yes | - | Path to the unified diff file (output of `git diff`) |
| `--source-file` | Yes | - | Path to the original source file before the diff |
| `--language` | No | `python` | Language ruleset: `java`, `python`, `typescript`, `go` |
| `--pattern` | No | - | Regex to reject in added lines (repeatable) |
GitHub Action (CI/CD)
agent-eval-mcp ships as a reusable composite GitHub Action that you can drop into any repository workflow to automatically block Pull Requests that introduce AI-generated dummy patterns, grounding failures, or no-op diffs.
Usage
Reference the action in your workflow. Use uses: ./ if the action lives in the same repository, or uses: your-org/agent-eval-mcp@v1 when consuming it from a separate repo.
The patterns input is a newline-separated list — one regex per line. Commas inside quantifiers (e.g. {1,5}) are safe because patterns are never comma-split.
Production tip: The example below validates a single hardcoded file. For real PRs that touch multiple files, combine this action with `tj-actions/changed-files` and a matrix strategy to iterate over every modified file automatically.
name: AI Diff Quality Check
on:
  pull_request:
    branches: [main]
jobs:
  validate-ai-diff:
    name: Validate AI-generated changes
    runs-on: ubuntu-latest
    steps:
      # fetch-depth: 0 is required for git show and git diff against origin/main.
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      # Extract the pre-PR version of the file from the base branch.
      - name: Export original source file from base branch
        run: git show origin/main:src/main.py > original_main.py
      # Produce a unified diff scoped to the single file under review.
      - name: Generate single-file unified diff
        run: git diff origin/main...HEAD -- src/main.py > patch.diff
      # Run agent-eval. Exit code 1 automatically fails the PR check.
      - name: Validate diff with agent-eval
        uses: ./ # or: uses: your-org/agent-eval-mcp@v1
        with:
          diff_file: patch.diff
          source_file: original_main.py
          language: python
          patterns: |
            print\(.*\)
            TODO|FIXME
Action Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| `diff_file` | Yes | - | Path to the unified diff file (single-file `git diff` output) |
| `source_file` | Yes | - | Path to the original source file before the diff |
| `language` | No | `python` | Language ruleset: `java`, `python`, `typescript`, `go` |
| `patterns` | No | - | Newline-separated regex patterns to reject in added lines |
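The newline-separated convention for `patterns` is easy to reproduce when testing patterns locally. The helper below is a hypothetical sketch of that parsing, not the action's code; it shows why a quantifier such as `{1,5}` survives intact when lines, not commas, are the delimiter.

```python
import re


def parse_patterns(raw: str) -> list[re.Pattern]:
    """Compile one regex per non-blank line of a multi-line input."""
    return [re.compile(line.strip()) for line in raw.splitlines() if line.strip()]
```

Feeding it the block scalar from a workflow file, for instance `"print\\(.*\\)\nTODO|FIXME\na{1,5}\n"`, yields three compiled patterns, with `a{1,5}` kept as a single regex.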
Telemetry & Dashboards
agent-eval can optionally stream every evaluation result to a Supabase PostgreSQL database, enabling a central dashboard for tracking AI code-quality trends across all repositories in your organization.
The feature is completely opt-in and non-blocking:
- If any of the required environment variables are absent, no data is sent and the CLI behaves exactly as it does without telemetry.
- Telemetry is sent from a detached background process so it never adds latency to the CLI exit. Network failures and other errors are silently discarded — a broken telemetry path will never fail the pipeline or appear in CI output.
Activation
Set the following environment variables in your CI/CD runner or GitHub Actions secret store:
| Variable | Required | Description |
|---|---|---|
| `SUPABASE_URL` | Yes | Base URL of your Supabase project (e.g. `https://xxxx.supabase.co`) |
| `SUPABASE_KEY` | Yes | Anon/public API key for your Supabase project |
| `SENTINEL_ORG_ID` | Yes | UUID of the organization that owns this pipeline |
The following variable is read automatically from the GitHub Actions environment and does not need to be configured manually:
| Variable | Default | Description |
|---|---|---|
| `GITHUB_REPOSITORY` | `local-dev` | Repository name (`owner/repo`), set automatically by GitHub Actions |
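The opt-in rule described by these tables can be sketched as a small configuration reader. The variable names and the `local-dev` default come from the tables; the function name and the shape of the returned dict are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY", "SENTINEL_ORG_ID")


def telemetry_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return telemetry settings, or None when telemetry is not activated.

    Telemetry is enabled only if all three required variables are set
    to non-empty values.
    """
    if any(not env.get(name) for name in REQUIRED_VARS):
        return None
    return {
        "url": env["SUPABASE_URL"],
        "key": env["SUPABASE_KEY"],
        "org_id": env["SENTINEL_ORG_ID"],
        # Falls back to "local-dev" outside GitHub Actions.
        "repository": env.get("GITHUB_REPOSITORY", "local-dev"),
    }
```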
Example — GitHub Actions
- name: Validate diff with agent-eval
  env:
    SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
    SUPABASE_KEY: ${{ secrets.SUPABASE_KEY }}
    SENTINEL_ORG_ID: ${{ secrets.SENTINEL_ORG_ID }}
  run: |
    agent-eval --diff-file patch.diff --source-file original.py --language python
Each evaluation writes one row to the evaluations table with the fields: org_id, repository, file_path, language, accepted, stage_failed, rejection_reason, and timestamp (ISO-8601 UTC).
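For dashboard queries it helps to know what a row looks like. The sketch below assembles one from the listed fields; the field names are taken from the sentence above, while `build_row` and its parameters are hypothetical and the exact timestamp formatting used by the CLI is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def build_row(
    org_id: str,
    repository: str,
    file_path: str,
    language: str,
    result: dict,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble one `evaluations` row from an evaluation result dict."""
    now = now or datetime.now(timezone.utc)
    return {
        "org_id": org_id,
        "repository": repository,
        "file_path": file_path,
        "language": language,
        "accepted": bool(result.get("accepted")),
        "stage_failed": result.get("stage_failed"),
        "rejection_reason": result.get("rejection_reason"),
        "timestamp": now.isoformat(),  # ISO-8601 with a UTC offset
    }
```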