Deterministic evaluation tools for AI coding agents, exposed as an MCP server.

🛡️ agent-eval-mcp

Deterministic Evaluation and Guardrails for AI Coding Agents.

MCP Compatible · Python 3.10+ · License: MIT

Building autonomous coding agents is easy. Evaluating whether their output is actually good is hard.

agent-eval-mcp is a stateless, deterministic Model Context Protocol (MCP) server that stops AI agents from writing lazy, unverified, or hallucinated code. It provides language-agnostic rulesets and hybrid scoring to grade AI-generated revisions before they get merged.

⚠️ The Problem

When you ask an LLM to evaluate its own code, it suffers from sycophancy. It will confidently tell you its fix is perfect, even when it has:

  • Generated dummy patterns like new HashMap<>() or pass.
  • Left // TODO: implement this in the production patch.
  • Hallucinated the surrounding SEARCH/REPLACE context, breaking the Git patch.

💡 The Solution

This package exposes objective evaluation tools to your agentic workflows via the Model Context Protocol (MCP). It evaluates AI-generated code edits using fuzzy matching against the original file and language-specific Abstract Syntax Tree (AST) rules (Java, Python, TypeScript, Go) to catch hallucinations deterministically.

It supports two edit formats:

  • Cursor SEARCH/REPLACE blocks — for interactive agentic coding sessions.
  • Standard unified diffs — for CI/CD pipelines and GitHub Action workflows, where diffs come from Pull Requests or git diff output.

It completely decouples the heavy lifting of code validation from your LLM orchestration layer.

🔧 Available Tools

| Tool | Input Format | Use Case |
|---|---|---|
| verify_fix | Cursor <<<< SEARCH >>>> REPLACE blocks | Interactive agentic coding sessions in Cursor |
| verify_unified_diff | Standard unified diff (git diff / GitHub PR) | CI/CD pipelines and GitHub Action workflows |
| evaluate_revision_quality | Behavioral signals (reflexion count, fetch count) | Scoring AI revision quality objectively |

verify_fix

Validates a Cursor-style <<<< SEARCH ==== >>>> REPLACE edit through four sequential checks: format, fuzzy grounding against the original file, dummy-pattern detection (AST + regex), and no-op detection.
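The four checks can be pictured with a small sketch. This is not the package's implementation: it takes the SEARCH and REPLACE texts as already-parsed strings, uses `difflib` for the fuzzy grounding, covers only the regex half of dummy-pattern detection (no AST rules), and the pattern list, the 0.9 threshold, and every stage name other than `dummy_patterns` are assumptions made for illustration.

```python
import difflib
import re

# Illustrative regexes only; the package's real rulesets are not shown in this README.
DUMMY_PATTERNS = [r"\bTODO\b", r"^\s*pass\s*$"]


def _normalize(text: str) -> list[str]:
    """Split into lines and drop indentation so whitespace drift does not hurt the match."""
    return [line.strip() for line in text.strip().splitlines()]


def best_match_ratio(search: str, original: str) -> float:
    """Best similarity between `search` and any same-sized window of lines in `original`."""
    search_lines = _normalize(search)
    original_lines = _normalize(original)
    window = len(search_lines)
    if window == 0 or window > len(original_lines):
        return 0.0
    target = "\n".join(search_lines)
    best = 0.0
    for start in range(len(original_lines) - window + 1):
        candidate = "\n".join(original_lines[start:start + window])
        best = max(best, difflib.SequenceMatcher(None, target, candidate).ratio())
    return best


def check_edit(search: str, replace: str, original: str, threshold: float = 0.9) -> dict:
    """Run the four checks in order and stop at the first failure."""
    if not search.strip():                                   # 1. format
        return {"accepted": False, "stage_failed": "format"}
    if best_match_ratio(search, original) < threshold:       # 2. grounding
        return {"accepted": False, "stage_failed": "grounding"}
    for line in replace.splitlines():                        # 3. dummy patterns (regex only)
        if any(re.search(pattern, line) for pattern in DUMMY_PATTERNS):
            return {"accepted": False, "stage_failed": "dummy_patterns"}
    if search.strip() == replace.strip():                    # 4. no-op
        return {"accepted": False, "stage_failed": "no_op"}
    return {"accepted": True, "stage_failed": None}
```

Ordering the checks from cheapest to most specific means a hallucinated SEARCH context is rejected before any pattern scanning happens.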

verify_unified_diff

Validates a standard unified diff — the patch format produced by git diff or visible in a GitHub Pull Request's "Files changed" view.

  • The diff must target exactly one file. To obtain a single-file diff from git, run:
    git diff HEAD -- path/to/file.py
    
  • Runs the same four-stage pipeline as verify_fix: format → grounding → dummy-pattern detection → no-op check.
  • Supports the same language (java, python, typescript, go) and custom_patterns arguments.
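Because the tool accepts exactly one file per diff, it can help to check a patch before submitting it. The sketch below is a hypothetical pre-check, not part of the package: it counts the distinct paths named by `+++` file headers, and the helper names `target_files` and `is_single_file_diff` are invented here.

```python
def target_files(diff_text: str) -> list[str]:
    """Return the distinct new-side paths named by file headers in a unified diff."""
    paths: list[str] = []
    lines = diff_text.splitlines()
    for prev, line in zip(lines, lines[1:]):
        # A file header is a '--- ' line immediately followed by a '+++ ' line.
        # Heuristic: hunk content that happens to look like this pair would be miscounted.
        if prev.startswith("--- ") and line.startswith("+++ "):
            path = line[4:].split("\t")[0].strip()
            if path.startswith("b/"):  # git prefixes the new side with 'b/'
                path = path[2:]
            if path not in paths:
                paths.append(path)
    return paths


def is_single_file_diff(diff_text: str) -> bool:
    """True when the diff touches exactly one file."""
    return len(target_files(diff_text)) == 1
```

A diff that fails this check can be regenerated per file with the `git diff HEAD -- path/to/file.py` form shown above.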

evaluate_revision_quality

Derives a deterministic quality score (1–10) from observable facts about what the agent actually did during diagnosis — facts that cannot be fabricated by the LLM. Use it as a ceiling on the LLM's own self-reported confidence: final_score = min(llm_self_score, objective_score).

Signal table:

| Signal | Condition | Penalty |
|---|---|---|
| reflexion_count | == 1 | −2 |
| reflexion_count | ≥ 2 | −5 |
| fetch_call_count | 3–4 | −1 |
| fetch_call_count | ≥ 5 | −2 |
| has_related_files | False | −1 |
| has_file_content | False | −2 |
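As a worked example of the table, the sketch below applies the penalties and then caps the LLM's self-score. It assumes the score starts at 10 and is clamped to a floor of 1; the README states only the 1–10 range and the penalties, so the starting value and the clamping are assumptions, and the function names are illustrative rather than the package's API.

```python
def objective_score(reflexion_count: int, fetch_call_count: int,
                    has_related_files: bool, has_file_content: bool) -> int:
    """Apply the penalty table to an assumed starting score of 10."""
    score = 10  # assumption: the README gives the range (1-10) but not the base value
    if reflexion_count >= 2:
        score -= 5
    elif reflexion_count == 1:
        score -= 2
    if fetch_call_count >= 5:
        score -= 2
    elif fetch_call_count >= 3:
        score -= 1
    if not has_related_files:
        score -= 1
    if not has_file_content:
        score -= 2
    return max(1, score)  # assumption: clamp to the documented minimum of 1


def final_score(llm_self_score: int, **signals) -> int:
    """Use the objective score as a ceiling on the LLM's self-reported score."""
    return min(llm_self_score, objective_score(**signals))
```

For instance, one rejected attempt, three file reads, and no full read of the primary file give 10 − 2 − 1 − 2 = 5, so a self-reported 9 is capped at 5.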

How to populate each signal (framework-agnostic):

  • reflexion_count: your retry counter, that is, the number of times verify_fix or verify_unified_diff returned accepted=False before the current passing call. Pass 0 on the first accepted attempt.
  • fetch_call_count: count every file-read tool call your agent made during diagnosis (e.g. read_file, get_file_contents, fetch_github_file, or equivalent in your framework).
  • has_related_files: True if your agent retrieved at least one file other than the primary failing file (an import, a test file, a schema, a dependency).
  • has_file_content: True if your agent retrieved the full content of the primary file being fixed before generating the patch.

Local mode vs. cloud mode:

In local mode (default), has_related_files and has_file_content reflect what the agent actually fetched — penalties apply when it skipped context retrieval. In cloud mode, where file access is always pre-loaded or guaranteed, pass both as True unconditionally to avoid penalising the agent for something outside its control.
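One way to collect these signals is a small tracker that your orchestration code updates as the agent works. The class below is a hypothetical helper, not part of the package; it assumes every recorded fetch returns the full file, and its `cloud_mode` flag simply applies the rule described above.

```python
from dataclasses import dataclass, field


@dataclass
class SignalTracker:
    """Hypothetical bookkeeping for the four behavioral signals."""
    primary_file: str
    cloud_mode: bool = False
    reflexion_count: int = 0
    fetch_call_count: int = 0
    fetched_paths: set[str] = field(default_factory=set)

    def record_fetch(self, path: str) -> None:
        """Call once per file-read tool call (assumed to return the full file)."""
        self.fetch_call_count += 1
        self.fetched_paths.add(path)

    def record_rejection(self) -> None:
        """Call each time a verify call returns accepted=False."""
        self.reflexion_count += 1

    def signals(self) -> dict:
        if self.cloud_mode:
            # Context is guaranteed by the platform, so do not penalise the agent.
            has_related = has_content = True
        else:
            has_content = self.primary_file in self.fetched_paths
            has_related = any(p != self.primary_file for p in self.fetched_paths)
        return {
            "reflexion_count": self.reflexion_count,
            "fetch_call_count": self.fetch_call_count,
            "has_related_files": has_related,
            "has_file_content": has_content,
        }
```

The dictionary returned by `signals()` uses the same four names as the signal list, so it can be passed on as keyword arguments.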

🚀 Quickstart

1. Install the Package

Install globally via pip so your MCP clients can execute it:

pip install agent-eval-mcp

2. Configure your MCP client

Add to ~/.cursor/mcp.json or your Claude Desktop config:

{
  "mcpServers": {
    "agent-eval": {
      "command": "agent-eval-mcp"
    }
  }
}
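If the config file already lists other servers, the entry should be merged in rather than pasted over the file. A minimal sketch using only the standard `json` module follows; the function name is hypothetical, and reading and writing the file itself is left to the caller.

```python
import json


def add_agent_eval_server(config_text: str) -> str:
    """Return the config JSON with the agent-eval entry added, keeping other servers."""
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["agent-eval"] = {"command": "agent-eval-mcp"}
    return json.dumps(config, indent=2)
```

Passing an empty string produces exactly the structure shown above.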

3. Use in your agentic workflow

For Cursor sessions — call verify_fix after every LLM-generated patch:

verify_fix(diagnosis="...", file_content="...", language="python")

For CI/CD / GitHub Actions — call verify_unified_diff with the PR diff:

verify_unified_diff(diff_text="...", file_content="...", language="java")
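In practice the verify call usually sits inside a retry loop that feeds the rejection reason back to the model. The sketch below shows one such loop under stated assumptions: `generate_patch` and `verify` are hypothetical callables supplied by your framework, with `verify` wrapping the MCP tool call and returning a dict that carries the `accepted` and `rejection_reason` fields.

```python
from typing import Callable, Optional


def fix_with_retries(generate_patch: Callable[[Optional[str]], str],
                     verify: Callable[[str], dict],
                     max_attempts: int = 3) -> dict:
    """Regenerate a patch until it is accepted, counting rejected attempts."""
    feedback: Optional[str] = None
    for attempt in range(max_attempts):
        patch = generate_patch(feedback)
        result = verify(patch)
        if result["accepted"]:
            # `attempt` equals the number of rejections before this passing call.
            return {"patch": patch, "reflexion_count": attempt}
        feedback = result.get("rejection_reason")
    return {"patch": None, "reflexion_count": max_attempts}
```

The `reflexion_count` returned here matches the definition used by evaluate_revision_quality: 0 when the first attempt is accepted.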

🖥️ CLI Usage (GitHub Actions / CI/CD)

The agent-eval command validates a unified diff directly from the terminal, making it a drop-in step for any CI/CD pipeline. It prints a JSON result to stdout and exits with 0 (pass) or 1 (fail) so GitHub Actions can block a Pull Request automatically.

Install

pip install agent-eval-mcp

Basic usage

# Produce a single-file diff, then validate it
git diff HEAD -- src/mymodule.py > pr.patch
agent-eval --diff-file pr.patch --source-file src/mymodule.py --language python

With custom rejection patterns

Pass --pattern once per regex. Commas inside quantifiers (e.g. {1,5}) are safe because patterns are never comma-split.

agent-eval --diff-file pr.patch --source-file src/mymodule.py --language python \
  --pattern "print\(.*\)" \
  --pattern "TODO|FIXME"
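To preview what a pattern would catch before wiring it into CI, the regexes can be tried against the added lines of a diff locally. This is a rough approximation of "reject in added lines", not the package's matching code; the function name is hypothetical.

```python
import re


def rejected_added_lines(diff_text: str, patterns: list[str]) -> list[tuple[str, str]]:
    """Return (pattern, line) pairs for every added line that matches a pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    hits: list[tuple[str, str]] = []
    for line in diff_text.splitlines():
        # Added lines start with '+'. Lines starting with '+++' are skipped as file
        # headers, which also skips the rare added line whose content begins with '++'.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        for regex in compiled:
            if regex.search(added):
                hits.append((regex.pattern, added))
    return hits
```

Each pattern is compiled on its own, which mirrors the one-regex-per-flag design: a quantifier such as `{1,5}` is never split on its comma.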

Example output

{
  "accepted": false,
  "rejection_reason": "This diff introduces dummy or non-production patterns...",
  "stage_failed": "dummy_patterns"
}
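A script that calls the CLI can combine the exit code with the JSON payload. The wrapper below is a sketch that uses only the flags and result fields documented in this section; the helper names and the extra `passed` key are additions made for illustration.

```python
import json
import subprocess


def build_command(diff_file: str, source_file: str,
                  language: str = "python", patterns: tuple = ()) -> list[str]:
    """Assemble the agent-eval command line, one --pattern flag per regex."""
    cmd = ["agent-eval", "--diff-file", diff_file,
           "--source-file", source_file, "--language", language]
    for pattern in patterns:
        cmd += ["--pattern", pattern]
    return cmd


def interpret(stdout: str, returncode: int) -> dict:
    """Parse the JSON result and add a convenience 'passed' flag."""
    result = json.loads(stdout)
    result["passed"] = returncode == 0 and result.get("accepted") is True
    return result


def run_agent_eval(diff_file: str, source_file: str,
                   language: str = "python", patterns: tuple = ()) -> dict:
    """Run the CLI (requires agent-eval on PATH) and return the parsed result."""
    proc = subprocess.run(build_command(diff_file, source_file, language, patterns),
                          capture_output=True, text=True)
    return interpret(proc.stdout, proc.returncode)
```

Passing the command as a list avoids shell quoting problems with regexes that contain `|` or parentheses.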

GitHub Actions example

- name: Validate AI-generated diff
  run: |
    git diff HEAD -- ${{ env.CHANGED_FILE }} > pr.patch
    agent-eval \
      --diff-file pr.patch \
      --source-file ${{ env.CHANGED_FILE }} \
      --language java

Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| --diff-file | Yes | | Path to the unified diff file (output of git diff) |
| --source-file | Yes | | Path to the original source file before the diff |
| --language | No | python | Language ruleset: java, python, typescript, go |
| --pattern | No | | Regex to reject in added lines (repeatable) |

GitHub Action (CI/CD)

agent-eval-mcp ships as a reusable composite GitHub Action that you can drop into any repository workflow to automatically block Pull Requests that introduce AI-generated dummy patterns, grounding failures, or no-op diffs.

Usage

Reference the action in your workflow. Use uses: ./ if the action lives in the same repository, or uses: your-org/agent-eval-mcp@v1 when consuming it from a separate repo.

The patterns input is a newline-separated list — one regex per line. Commas inside quantifiers (e.g. {1,5}) are safe because patterns are never comma-split.

Production tip: The example below validates a single hardcoded file. For real PRs that touch multiple files, combine this action with tj-actions/changed-files and a matrix strategy to iterate over every modified file automatically.

name: AI Diff Quality Check

on:
  pull_request:
    branches: [main]

jobs:
  validate-ai-diff:
    name: Validate AI-generated changes
    runs-on: ubuntu-latest

    steps:
      # fetch-depth: 0 is required for git show and git diff against origin/main.
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Extract the pre-PR version of the file from the base branch.
      - name: Export original source file from base branch
        run: git show origin/main:src/main.py > original_main.py

      # Produce a unified diff scoped to the single file under review.
      - name: Generate single-file unified diff
        run: git diff origin/main...HEAD -- src/main.py > patch.diff

      # Run agent-eval. Exit code 1 automatically fails the PR check.
      - name: Validate diff with agent-eval
        uses: ./   # or: uses: your-org/agent-eval-mcp@v1
        with:
          diff_file: patch.diff
          source_file: original_main.py
          language: python
          patterns: |
            print\(.*\)
            TODO|FIXME
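For PRs that touch several files, an alternative to generating one diff per file is to split a full `git diff` into single-file diffs. The helper below is a hypothetical sketch that relies on the `diff --git` header git writes before each file; it is not part of the action.

```python
def split_git_diff(diff_text: str) -> dict[str, str]:
    """Map each changed path to its own single-file diff text."""
    sections: list[list[str]] = []
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            sections.append([line])       # start of a new file section
        elif sections:
            sections[-1].append(line)     # lines before the first header are ignored

    per_file: dict[str, str] = {}
    for lines in sections:
        for line in lines:
            # The first '+++ ' line of a section is its new-side file header.
            # Deleted files appear under '/dev/null'; binary files have no
            # '+++ ' line and are skipped.
            if line.startswith("+++ "):
                path = line[4:].strip()
                if path.startswith("b/"):
                    path = path[2:]
                per_file[path] = "".join(lines)
                break
    return per_file
```

Each value can then be written to its own patch file and validated against the matching original exported with `git show`.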

Action Inputs

| Input | Required | Default | Description |
|---|---|---|---|
| diff_file | Yes | | Path to the unified diff file (single-file git diff output) |
| source_file | Yes | | Path to the original source file before the diff |
| language | No | python | Language ruleset: java, python, typescript, go |
| patterns | No | | Newline-separated regex patterns to reject in added lines |

Telemetry & Dashboards

agent-eval can optionally stream every evaluation result to a Supabase PostgreSQL database, enabling a central dashboard for tracking AI code-quality trends across all repositories in your organization.

The feature is completely opt-in and non-blocking:

  • If any of the required environment variables are absent, no data is sent and the CLI behaves identically.
  • Telemetry is sent from a detached background process so it never adds latency to the CLI exit. Network failures and other errors are silently discarded — a broken telemetry path will never fail the pipeline or appear in CI output.

Activation

Set the following environment variables in your CI/CD runner or GitHub Actions secret store:

| Variable | Required | Description |
|---|---|---|
| SUPABASE_URL | Yes | Base URL of your Supabase project (e.g. https://xxxx.supabase.co) |
| SUPABASE_KEY | Yes | Anon/public API key for your Supabase project |
| SENTINEL_ORG_ID | Yes | UUID of the organization that owns this pipeline |

The following variable is read automatically from the GitHub Actions environment and does not need to be configured manually:

| Variable | Default | Description |
|---|---|---|
| GITHUB_REPOSITORY | local-dev | Repository name (owner/repo), set automatically by GitHub Actions |
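The opt-in rule can be summarised in a few lines: telemetry is configured only when all three required variables are present. The sketch below illustrates that gating and is not the package's code; the function name is hypothetical, and treating an empty value as absent is an assumption.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY", "SENTINEL_ORG_ID")


def telemetry_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return telemetry settings, or None when any required variable is missing."""
    if any(not env.get(name) for name in REQUIRED_VARS):
        return None  # opt-in: nothing is sent unless everything is configured
    return {
        "url": env["SUPABASE_URL"],
        "key": env["SUPABASE_KEY"],
        "org_id": env["SENTINEL_ORG_ID"],
        # GitHub Actions sets GITHUB_REPOSITORY; elsewhere the documented default applies.
        "repository": env.get("GITHUB_REPOSITORY", "local-dev"),
    }
```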

Example — GitHub Actions

- name: Validate diff with agent-eval
  env:
    SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
    SUPABASE_KEY: ${{ secrets.SUPABASE_KEY }}
    SENTINEL_ORG_ID: ${{ secrets.SENTINEL_ORG_ID }}
  run: |
    agent-eval --diff-file patch.diff --source-file original.py --language python

Each evaluation writes one row to the evaluations table with the fields: org_id, repository, file_path, language, accepted, stage_failed, rejection_reason, and timestamp (ISO-8601 UTC).
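For anyone building a dashboard on top of this table, the row shape might look like the sketch below. The field names come from the list above; the function itself is hypothetical, and it assumes the `accepted`, `stage_failed`, and `rejection_reason` values are taken straight from the CLI's JSON result.

```python
from datetime import datetime, timezone


def build_row(org_id: str, repository: str, file_path: str,
              language: str, result: dict) -> dict:
    """Assemble one evaluations row from a CLI result dict."""
    return {
        "org_id": org_id,
        "repository": repository,
        "file_path": file_path,
        "language": language,
        "accepted": result["accepted"],
        "stage_failed": result.get("stage_failed"),
        "rejection_reason": result.get("rejection_reason"),
        # Timezone-aware UTC timestamp in ISO-8601 form, e.g. 2025-01-01T12:00:00+00:00
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```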

Download files

Source Distribution: agent_eval_mcp-0.2.0.tar.gz (42.4 kB)

Built Distribution: agent_eval_mcp-0.2.0-py3-none-any.whl (31.7 kB)

Provenance

The following attestation bundles were made for agent_eval_mcp-0.2.0.tar.gz:

Publisher: publish.yml on nicolaemorcov/agent-eval-mcp

