Deterministic evaluation tools for AI coding agents, exposed as an MCP server.

Maintainers

nikomo750

🛡️ agent-eval-mcp

Deterministic Evaluation and Guardrails for AI Coding Agents.

Building autonomous coding agents is easy. Evaluating whether what they produce is actually good is hard.

agent-eval-mcp is a stateless, deterministic Model Context Protocol (MCP) server that stops AI agents from writing lazy, unverified, or hallucinated code. It provides language-agnostic rulesets and hybrid scoring to grade AI-generated revisions before they get merged.

⚠️ The Problem

When you ask an LLM to evaluate its own code, it suffers from sycophancy. It will confidently tell you its fix is perfect, even when it has:

- Generated dummy patterns like new HashMap<>() or pass.
- Left // TODO: implement this in the production patch.
- Hallucinated the surrounding SEARCH/REPLACE context, breaking the Git patch.

💡 The Solution

This package exposes objective evaluation tools to your agentic workflows via the Model Context Protocol (MCP). It evaluates AI-generated code edits using fuzzy matching and language-specific Abstract Syntax Tree (AST) rules for Java, Python, TypeScript, and Go, so hallucinations are caught deterministically.
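To make the idea of a deterministic AST rule concrete, here is a minimal sketch using Python's standard `ast` module. It flags functions whose body is nothing but `pass` or `...` (optionally after a docstring). The function name `find_stub_functions` is hypothetical, and this is an illustration of the technique, not the package's actual ruleset.

```python
import ast


def find_stub_functions(source: str) -> list[str]:
    """Return names of functions whose body does nothing but `pass` or `...`."""
    stubs = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body = list(node.body)
        # Skip a leading docstring so a documented stub is still caught.
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            body = body[1:]
        # A body that is empty after the docstring also counts as a stub.
        is_stub = all(
            isinstance(stmt, ast.Pass)
            or (
                isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Constant)
                and stmt.value.value is Ellipsis
            )
            for stmt in body
        )
        if is_stub:
            stubs.append(node.name)
    return stubs
```

Because the check works on the parsed tree rather than on text, the same input always produces the same verdict, regardless of formatting or comments.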

It supports two edit formats:

- Cursor SEARCH/REPLACE blocks: for interactive agentic coding sessions.
- Standard unified diffs: for CI/CD pipelines and GitHub Action workflows, where diffs come from Pull Requests or git diff output.

It completely decouples the heavy lifting of code validation from your LLM orchestration layer.

🔧 Available Tools

Tool	Input Format	Use Case
`verify_fix`	Cursor `<<<< SEARCH >>>> REPLACE` blocks	Interactive agentic coding sessions in Cursor
`verify_unified_diff`	Standard unified diff (`git diff` / GitHub PR)	CI/CD pipelines and GitHub Action workflows
`evaluate_revision_quality`	Behavioral signals (reflexion count, fetch count)	Scoring AI revision quality objectively

`verify_fix`

Validates a Cursor-style <<<< SEARCH ==== >>>> REPLACE edit through four sequential checks: format, fuzzy grounding against the original file, dummy-pattern detection (AST + regex), and no-op detection.
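The grounding step can be pictured as asking whether the SEARCH text appears, at least approximately, somewhere in the original file. The sketch below uses `difflib` from the standard library and slides a window over the file. The function names and the 0.9 threshold are assumptions made for illustration; the package's own matcher and threshold may differ.

```python
import difflib


def best_grounding_ratio(search_block: str, file_content: str) -> float:
    """Best similarity between the SEARCH block and any same-length window of the file."""
    search_lines = [line.strip() for line in search_block.strip().splitlines()]
    file_lines = [line.strip() for line in file_content.splitlines()]
    if not search_lines or len(search_lines) > len(file_lines):
        return 0.0
    target = "\n".join(search_lines)
    window = len(search_lines)
    best = 0.0
    for start in range(len(file_lines) - window + 1):
        candidate = "\n".join(file_lines[start:start + window])
        ratio = difflib.SequenceMatcher(None, target, candidate).ratio()
        best = max(best, ratio)
    return best


def is_grounded(search_block: str, file_content: str, threshold: float = 0.9) -> bool:
    # The 0.9 threshold is an illustrative assumption, not the package's value.
    return best_grounding_ratio(search_block, file_content) >= threshold
```

Stripping each line before comparison makes the check tolerant of indentation drift, which is a common way for a generated SEARCH block to differ from the real file.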

`verify_unified_diff`

Validates a standard unified diff — the patch format produced by git diff or visible in a GitHub Pull Request's "Files changed" view.

The diff must target exactly one file. To obtain a single-file diff from git, run:
```
git diff HEAD -- path/to/file.py
```

It runs the same four-stage pipeline as verify_fix (format → grounding → dummy-pattern detection → no-op check) and supports the same language (java, python, typescript, go) and custom_patterns arguments.
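As a rough illustration of what "targets exactly one file" and "patterns applied to added lines" can mean in practice, the sketch below pulls the added lines out of a unified diff and tests them against a list of regexes. Both function names are hypothetical, and the header detection is a simple heuristic; this is not the package's parser.

```python
import re


def added_lines(diff_text: str) -> list[str]:
    """Return the added lines of a single-file unified diff, without the '+' prefix."""
    lines = diff_text.splitlines()
    # Heuristic: a file header is a '--- ' line immediately followed by a '+++ ' line.
    targets = [
        cur for prev, cur in zip(lines, lines[1:])
        if prev.startswith("--- ") and cur.startswith("+++ ")
    ]
    if len(targets) != 1:
        raise ValueError(f"expected exactly one target file, found {len(targets)}")
    added = []
    in_hunk = False
    for line in lines:
        if line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added.append(line[1:])
    return added


def rejected_by_patterns(diff_text: str, patterns: list[str]) -> list[str]:
    """Return added lines that match any of the given regexes."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in added_lines(diff_text) if any(c.search(line) for c in compiled)]
```

Only lines inside a hunk are inspected, so the `+++ b/...` header itself is never mistaken for added code.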

`evaluate_revision_quality`

Derives a deterministic quality score (1–10) from observable facts about what the agent actually did during diagnosis — facts that cannot be fabricated by the LLM. Use it as a ceiling on the LLM's own self-reported confidence: final_score = min(llm_self_score, objective_score).

Signal table:

Signal	Condition	Penalty
`reflexion_count`	== 1	−2
`reflexion_count`	≥ 2	−5
`fetch_call_count`	3–4	−1
`fetch_call_count`	≥ 5	−2
`has_related_files`	`False`	−1
`has_file_content`	`False`	−2
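
One way to read the table as arithmetic is sketched below. The source states only the 1–10 range and the penalties, so the starting value of 10, the floor of 1, and the absence of a penalty for 0–2 fetch calls are assumptions; the function names are hypothetical.

```python
def objective_score(
    reflexion_count: int,
    fetch_call_count: int,
    has_related_files: bool,
    has_file_content: bool,
) -> int:
    score = 10  # assumed starting point
    if reflexion_count == 1:
        score -= 2
    elif reflexion_count >= 2:
        score -= 5
    if 3 <= fetch_call_count <= 4:
        score -= 1
    elif fetch_call_count >= 5:
        score -= 2
    if not has_related_files:
        score -= 1
    if not has_file_content:
        score -= 2
    return max(1, score)  # assumed floor, matching the documented 1-10 range


def final_score(llm_self_score: int, objective: int) -> int:
    # The objective score acts as a ceiling on the self-reported score.
    return min(llm_self_score, objective)
```

For example, one rejected attempt, three file reads, and no related files gives 10 − 2 − 1 − 1 = 6, so a self-reported 9 would be capped at 6.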

How to populate each signal (framework-agnostic):

- reflexion_count: your retry counter, i.e. the number of times verify_fix or verify_unified_diff returned accepted=False before the current passing call. Pass 0 on the first accepted attempt.
- fetch_call_count: count every file-read tool call your agent made during diagnosis (e.g. read_file, get_file_contents, fetch_github_file, or equivalent in your framework).
- has_related_files: True if your agent retrieved at least one file other than the primary failing file (an import, a test file, a schema, a dependency).
- has_file_content: True if your agent retrieved the full content of the primary file being fixed before generating the patch.

Local mode vs. cloud mode:

In local mode (default), has_related_files and has_file_content reflect what the agent actually fetched — penalties apply when it skipped context retrieval. In cloud mode, where file access is always pre-loaded or guaranteed, pass both as True unconditionally to avoid penalising the agent for something outside its control.
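
If your framework keeps a log of tool calls, the four signals can be derived from it mechanically. The sketch below assumes a hypothetical `ToolCall` record and a `cloud_mode` flag of your own; it also treats any read of the primary file as a full read, which may not hold in your setup.

```python
from dataclasses import dataclass
from typing import Optional

# Tool names taken from the examples above; extend for your framework.
READ_TOOLS = {"read_file", "get_file_contents", "fetch_github_file"}
VERIFY_TOOLS = {"verify_fix", "verify_unified_diff"}


@dataclass
class ToolCall:
    """Hypothetical log record; adapt the fields to your framework."""
    tool: str
    path: str = ""
    accepted: Optional[bool] = None  # set only for verify_* calls


def collect_signals(calls: list[ToolCall], primary_file: str, cloud_mode: bool = False) -> dict:
    reads = [c for c in calls if c.tool in READ_TOOLS]
    rejections = sum(1 for c in calls if c.tool in VERIFY_TOOLS and c.accepted is False)
    return {
        "reflexion_count": rejections,
        "fetch_call_count": len(reads),
        # In cloud mode both flags are passed as True unconditionally.
        "has_related_files": cloud_mode or any(c.path != primary_file for c in reads),
        "has_file_content": cloud_mode or any(c.path == primary_file for c in reads),
    }
```

Deriving the values from the log, rather than asking the model to report them, is what keeps the resulting score independent of the LLM's self-assessment.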

🚀 Quickstart

1. Install the Package

Install globally via pip so your MCP clients can execute it:

```
pip install agent-eval-mcp
```

2. Configure your MCP client

Add to ~/.cursor/mcp.json or your Claude Desktop config:

```
{
  "mcpServers": {
    "agent-eval": {
      "command": "agent-eval-mcp"
    }
  }
}
```
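
If the config file already lists other servers, the entry should be merged in rather than pasted over the file. The helper below is a small sketch of that merge using only the standard library; `add_server` is a hypothetical name, and the path you pass in depends on your MCP client.

```python
import json
from pathlib import Path


def add_server(config_path: Path, name: str = "agent-eval", command: str = "agent-eval-mcp") -> dict:
    """Add or update one entry under "mcpServers", keeping any existing entries."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For Cursor this could be called as `add_server(Path.home() / ".cursor" / "mcp.json")`.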

3. Use in your agentic workflow

For Cursor sessions, call verify_fix after every LLM-generated patch:

```
verify_fix(diagnosis="...", file_content="...", language="python")
```

For CI/CD and GitHub Actions, call verify_unified_diff with the PR diff:

```
verify_unified_diff(diff_text="...", file_content="...", language="java")
```

🖥️ CLI Usage (GitHub Actions / CI/CD)

The agent-eval command validates a unified diff directly from the terminal, making it a drop-in step for any CI/CD pipeline. It prints a JSON result to stdout and exits with 0 (pass) or 1 (fail) so GitHub Actions can block a Pull Request automatically.

Install

```
pip install agent-eval-mcp
```

Basic usage

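```
# Save the pre-change version of the file, produce a single-file diff, then validate it.
# --source-file expects the original file as it was before the diff.
git show HEAD:src/mymodule.py > original_mymodule.py
git diff HEAD -- src/mymodule.py > pr.patch
agent-eval --diff-file pr.patch --source-file original_mymodule.py --language python
```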

With custom rejection patterns

Pass --pattern once per regex. Commas inside quantifiers (e.g. {1,5}) are safe because patterns are never comma-split.

```
agent-eval --diff-file pr.patch --source-file src/mymodule.py --language python \
  --pattern "print\(.*\)" \
  --pattern "TODO|FIXME"
```
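
The reason a repeatable flag is safer than a comma-separated list can be shown with `argparse`. The parser below is a stand-in written for this illustration; whether the real CLI is built on `argparse` is an assumption.

```python
import argparse
import re


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser: each --pattern occurrence is appended to a list unchanged.
    parser = argparse.ArgumentParser(prog="pattern-demo")
    parser.add_argument("--pattern", action="append", default=[])
    return parser


args = build_parser().parse_args(["--pattern", r"x{1,5}", "--pattern", "TODO|FIXME"])
compiled = [re.compile(p) for p in args.pattern]

# What a comma-separated design would do to the same input:
naive_split = ",".join(args.pattern).split(",")
print(args.pattern)
print(naive_split)
```

With one flag per regex, the quantifier `{1,5}` arrives intact; joining and splitting on commas would cut it into two fragments that no longer mean the same thing.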

Example output

```
{
  "accepted": false,
  "rejection_reason": "This diff introduces dummy or non-production patterns...",
  "stage_failed": "dummy_patterns"
}
```
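
A wrapper script that captures the CLI's stdout could turn this JSON into a short message, for instance for a PR comment. The sketch below relies only on the three keys shown in the example output; whether `stage_failed` and `rejection_reason` are present on accepted results is not documented here, so they are read defensively. `summarize_result` is a hypothetical helper.

```python
import json


def summarize_result(stdout_text: str) -> tuple[int, str]:
    """Map the CLI's JSON output to (exit_code, one-line message)."""
    result = json.loads(stdout_text)
    if result.get("accepted"):
        return 0, "accepted"
    stage = result.get("stage_failed") or "unknown stage"
    reason = result.get("rejection_reason") or "no reason given"
    return 1, f"rejected at {stage}: {reason}"
```

The returned code mirrors the documented convention of 0 for pass and 1 for fail.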

GitHub Actions example

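```
- name: Validate AI-generated diff
  run: |
    # Assumes the checkout used fetch-depth: 0 so origin/main is available.
    # A plain "git diff HEAD" is empty on a fresh CI checkout, so diff against the base branch.
    git show origin/main:${{ env.CHANGED_FILE }} > original_source
    git diff origin/main...HEAD -- ${{ env.CHANGED_FILE }} > pr.patch
    agent-eval \
      --diff-file pr.patch \
      --source-file original_source \
      --language java
```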

Arguments

Argument	Required	Default	Description
`--diff-file`	Yes	—	Path to the unified diff file (output of `git diff`)
`--source-file`	Yes	—	Path to the original source file before the diff
`--language`	No	`python`	Language ruleset: `java`, `python`, `typescript`, `go`
`--pattern`	No	—	Regex to reject in added lines (repeatable)

GitHub Action (CI/CD)

agent-eval-mcp ships as a reusable composite GitHub Action that you can drop into any repository workflow to automatically block Pull Requests that introduce AI-generated dummy patterns, grounding failures, or no-op diffs.

Usage

Reference the action in your workflow. Use uses: ./ if the action lives in the same repository, or uses: your-org/agent-eval-mcp@v1 when consuming it from a separate repo.

The patterns input is a newline-separated list — one regex per line. Commas inside quantifiers (e.g. {1,5}) are safe because patterns are never comma-split.

Production tip: The example below validates a single hardcoded file. For real PRs that touch multiple files, combine this action with tj-actions/changed-files and a matrix strategy to iterate over every modified file automatically.

```
name: AI Diff Quality Check

on:
  pull_request:
    branches: [main]

jobs:
  validate-ai-diff:
    name: Validate AI-generated changes
    runs-on: ubuntu-latest

    steps:
      # fetch-depth: 0 is required for git show and git diff against origin/main.
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Extract the pre-PR version of the file from the base branch.
      - name: Export original source file from base branch
        run: git show origin/main:src/main.py > original_main.py

      # Produce a unified diff scoped to the single file under review.
      - name: Generate single-file unified diff
        run: git diff origin/main...HEAD -- src/main.py > patch.diff

      # Run agent-eval. Exit code 1 automatically fails the PR check.
      - name: Validate diff with agent-eval
        uses: ./   # or: uses: your-org/agent-eval-mcp@v1
        with:
          diff_file: patch.diff
          source_file: original_main.py
          language: python
          patterns: |
            print\(.*\)
            TODO|FIXME
```

Action Inputs

Input	Required	Default	Description
`diff_file`	Yes	—	Path to the unified diff file (single-file `git diff` output)
`source_file`	Yes	—	Path to the original source file before the diff
`language`	No	`python`	Language ruleset: `java`, `python`, `typescript`, `go`
`patterns`	No	—	Newline-separated regex patterns to reject in added lines

Telemetry & Dashboards

agent-eval can optionally stream every evaluation result to a Supabase PostgreSQL database, enabling a central dashboard for tracking AI code-quality trends across all repositories in your organization.

The feature is completely opt-in and non-blocking:

- If any of the required environment variables are absent, no data is sent and the CLI behaves identically.
- Telemetry is sent from a detached background process, so it never adds latency to the CLI exit. Network failures and other errors are silently discarded: a broken telemetry path will never fail the pipeline or appear in CI output.

Activation

Set the following environment variables in your CI/CD runner or GitHub Actions secret store:

Variable	Required	Description
`SUPABASE_URL`	Yes	Base URL of your Supabase project (e.g. `https://xxxx.supabase.co`)
`SUPABASE_KEY`	Yes	Anon/public API key for your Supabase project
`SENTINEL_ORG_ID`	Yes	UUID of the organization that owns this pipeline

The following variable is read automatically from the GitHub Actions environment and does not need to be configured manually:

Variable	Default	Description
`GITHUB_REPOSITORY`	`local-dev`	Repository name (`owner/repo`), set automatically by GitHub Actions

Example — GitHub Actions

```
- name: Validate diff with agent-eval
  env:
    SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
    SUPABASE_KEY: ${{ secrets.SUPABASE_KEY }}
    SENTINEL_ORG_ID: ${{ secrets.SENTINEL_ORG_ID }}
  run: |
    agent-eval --diff-file patch.diff --source-file original.py --language python
```

Each evaluation writes one row to the evaluations table with the fields: org_id, repository, file_path, language, accepted, stage_failed, rejection_reason, and timestamp (ISO-8601 UTC).
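
The opt-in rule and the row shape can be sketched together. The function below only builds the row (it sends nothing) and returns `None` when any required variable is missing. The field names come from the list above; `build_telemetry_row` and its parameters are hypothetical, and the package's actual sender is not shown here.

```python
from datetime import datetime, timezone

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY", "SENTINEL_ORG_ID")


def build_telemetry_row(env: dict, result: dict, file_path: str, language: str):
    """Return the row to record, or None when telemetry is not fully configured."""
    if not all(env.get(name) for name in REQUIRED_VARS):
        return None  # opt-in: any missing variable disables telemetry
    return {
        "org_id": env["SENTINEL_ORG_ID"],
        # Falls back to the documented default outside GitHub Actions.
        "repository": env.get("GITHUB_REPOSITORY", "local-dev"),
        "file_path": file_path,
        "language": language,
        "accepted": bool(result.get("accepted")),
        "stage_failed": result.get("stage_failed"),
        "rejection_reason": result.get("rejection_reason"),
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC
    }
```

In real use `env` would be `os.environ`; passing it in as a plain mapping keeps the gating logic easy to test.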

Release history

- 0.2.0 (this version): Mar 26, 2026
- 0.1.0: Mar 24, 2026

Download files

Source Distribution

agent_eval_mcp-0.2.0.tar.gz (42.4 kB), uploaded Mar 26, 2026, Source

Built Distribution

agent_eval_mcp-0.2.0-py3-none-any.whl (31.7 kB), uploaded Mar 26, 2026, Python 3

File details

Details for the file agent_eval_mcp-0.2.0.tar.gz.

File metadata

Download URL: agent_eval_mcp-0.2.0.tar.gz
Upload date: Mar 26, 2026
Size: 42.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agent_eval_mcp-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`f9e0184106ee7eb94d1300912a759521b0b6ee4678385a10907ad1fdafefec75`
MD5	`14d3d1be776b686d85b9ea167cd8adb0`
BLAKE2b-256	`c514a56704b148f41b16d893be8926b0108795257e5fcf5552eaf32f7c62b91b`

Provenance

The following attestation bundles were made for agent_eval_mcp-0.2.0.tar.gz:

Publisher: publish.yml on nicolaemorcov/agent-eval-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent_eval_mcp-0.2.0.tar.gz
- Subject digest: f9e0184106ee7eb94d1300912a759521b0b6ee4678385a10907ad1fdafefec75
- Sigstore transparency entry: 1186486980
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: nicolaemorcov/agent-eval-mcp@a62d0d8db411c1415bfa4726226ea6f38b102d97
- Branch / Tag: refs/heads/main
- Owner: https://github.com/nicolaemorcov
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a62d0d8db411c1415bfa4726226ea6f38b102d97
- Trigger Event: workflow_dispatch

File details

Details for the file agent_eval_mcp-0.2.0-py3-none-any.whl.

File metadata

Download URL: agent_eval_mcp-0.2.0-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 31.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agent_eval_mcp-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`95a3183bcb2708d0cebf63f194fa25b44071c42ecd70430e3c74572699d36c67`
MD5	`82fb7bb7c4562f9c83aaf239ab073999`
BLAKE2b-256	`dfe031e67dcc598f609b0e0f7d52f4aadc7fa37edcb1a9b9030ca90e0941f720`

Provenance

The following attestation bundles were made for agent_eval_mcp-0.2.0-py3-none-any.whl:

Publisher: publish.yml on nicolaemorcov/agent-eval-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent_eval_mcp-0.2.0-py3-none-any.whl
- Subject digest: 95a3183bcb2708d0cebf63f194fa25b44071c42ecd70430e3c74572699d36c67
- Sigstore transparency entry: 1186486981
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: nicolaemorcov/agent-eval-mcp@a62d0d8db411c1415bfa4726226ea6f38b102d97
- Branch / Tag: refs/heads/main
- Owner: https://github.com/nicolaemorcov
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a62d0d8db411c1415bfa4726226ea6f38b102d97
- Trigger Event: workflow_dispatch
