Enterprise-grade reliability framework for AI agents.

These details have not been verified by PyPI

Intended Audience
- Developers
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

🤖 MultiAgentEval - The Enterprise-Grade Reliability Framework for AI Agents

MultiAgentEval bridges the "Agentic Reliability Gap" through rigorous evaluation, deep-trace replay debugging, and a modular 20-Shim Enterprise Suite for high-fidelity environment simulation.

Attribute	Specification
Architect	Najeed Khan
License	Apache License 2.0
Status	Stable Framework v1.1
Core Goal	Eliminating the "Agentic Reliability Gap"
Quick Links	Quickstart • Advanced Update • Architecture • Security • Editions

🛡️ Add the Badge to Your Agent

Showcase your agent's rigorous reliability by adding the official Works with MultiAgentEval badge to your repository to show that it has been evaluated by the MultiAgentEval framework.

Option 1: Using img.shields.io

You can use the Shields.io service to generate a consistent badge for your project:

[![Works with MultiAgentEval](https://img.shields.io/badge/Works%20with-MultiAgentEval-2c62c7)](https://github.com/najeed/ai-agent-eval-harness)

Option 2: Using GitHub Asset

Alternatively, link directly to our high-fidelity SVG asset:

[![Works with MultiAgentEval](https://raw.githubusercontent.com/najeed/ai-agent-eval-harness/main/docs/assets/badges/works-with-multiagenteval.svg)](https://github.com/najeed/ai-agent-eval-harness)

Mission
Getting Started
At a Glance (v1.0 RC)
The Advanced Update (v1.1)
Zero-Touch Core Architecture
- Advanced Utilities
Integrated Visual Suite (Native GUI)
Security and Governance
Troubleshooting
How to Contribute
Licensing and Editions

TL;DR: Impact in 60s

Get from zero to evaluated in seconds:

pip install -e .
multiagent-eval quickstart

Result: Launches mock agent, executes a telecom scenario, and builds a report.
Next Step: multiagent-eval console for the visual dashboard.

Mission

Our goal is to create a standardized, community-driven benchmark for AI agent performance. By providing a rich set of industry-specific scenarios and a flexible evaluation runner, we aim to help developers, researchers, and businesses measure and improve their agent-based systems.

The harness is organized into the following key components:

/dataproc_engine: High-fidelity industrial data extraction engine (8 Sectors, Gold Standards).
/industries: Evaluation assets (5,000+ scenarios) categorized by 45+ industries.
/eval_runner: Modular Core Engine (Multi-turn loop, Sandbox, Metrics, Simulators, Mutator).
/eval_runner/console: Flask-based REST API for the Integrated Visual Suite.
/ui/visual-debugger: Premium React-based Visual Debugger & Dashboard.
/examples: Sample drift traces and triage scenarios for rapid onboarding.
/reports: Generated artifacts (JSONL, trajectories, HTML heatmaps).
/runs: Local execution history (Flight Recorder logs).
/spec/aes: Agent Eval Specification (Foundational) - Benchmark standard.
/schemas: JSON Schema definitions for cross-platform scenario validation.
/docs: Deep-dive guides, architecture, and API specifications.
/tests: Comprehensive test suite (Unit, Integration, and Red-Teaming).
/sample_agent: Reference implementation for benchmark testing.

Getting Started

Prerequisites

Python 3.8+
pip
Docker & Docker Compose (optional, for Lab Mode):
- Windows/Mac: Install Docker Desktop.
- Linux: Install Docker Engine and Docker Compose.

[!IMPORTANT]

60-Second Quickstart (Get Running Now)

The fastest way to see the harness in action:
# 1. Clone the repository
git clone https://github.com/najeed/ai-agent-eval-harness.git
cd ai-agent-eval-harness

# 2. Set up a virtual environment (Recommended)
python -m venv venv
venv\Scripts\activate  # On Windows
# source venv/bin/activate  # On macOS/Linux

# 3. Install the package in editable mode
pip install -e .

# 4. Run the Quickstart Demo (CLI)
multiagent-eval quickstart
What it does: Spawns a mock sample agent, runs a troubleshooting evaluation, and generates a rich legacy HTML report in reports/.

[!TIP] Prefer a visual experience? After running the quickstart, launch the Integrated Visual Suite to replay the trace interactively: multiagent-eval console.

📂 The Global Scenario Corpus (v1.1)

The harness now ships with a massive, validated corpus of 5,000+ scenarios designed to stress-test agents across every dimension:

🏛️ Industry-Specific (4,000+ Scenarios)

Comprehensive coverage for 50+ sectors including:

Finance & Banking: Loan processing, fraud detection, and regulatory audits.
Healthcare: PII handling, insurance reconciliation, and diagnostic workflows.
Telecom & Energy: Network troubleshooting, grid optimization, and billing.

🧠 Advanced Categories (1,000+ Scenarios)

Cross-Industry Negotiation: Scenarios where agents must bridge data and policy gaps between two distinct sectors (e.g., Legal & Healthcare).
Ethical & Safety Guardrails: Hardened tests for PII leakage, prompt injection, and bias.
Interactive Complexity: Multi-turn flows involving conflicting human-in-the-loop (HITL) requirements.
Simulations: High-fidelity sector labs (e.g., Bank, EHR/HL7, CRM) for testing agents in realistic, isolated environments.

All scenarios are 100% compliant with the AES Specification.

(Optional Full Lab Mode): For the complete dashboard and database experience, you can use docker compose up --build. If you don't have Docker, you can run services manually (see Troubleshooting).

Manual Evaluation (Running the Sample Agent)

Start your Agent: The framework includes a reference agent for testing.
```
python sample_agent/agent_app.py
```

Set Endpoint: Point the harness to your agent's webhook.

set AGENT_API_URL=http://localhost:5001/execute_task   # Windows
export AGENT_API_URL=http://localhost:5001/execute_task # Mac/Linux

Run Evaluation:

# Standard HTTP (default)
multiagent-eval evaluate --path industries/telecom

# Local Subprocess (stdin/stdout)
multiagent-eval evaluate --path my_scenarios/ --protocol local --agent-cmd "python my_agent.py"

# Socket (TCP/Unix)
multiagent-eval evaluate --path tests/scenarios --protocol socket --agent-socket "localhost:9000"

[!NOTE] Path Decoupling: The harness now supports ad-hoc evaluations anywhere on your filesystem. Metadata like industry is inferred from the file content or folder structure, defaulting to local and unclassified for external files.

Agent Communication Protocols

The harness supports multiple ways to talk to your agent, enabling seamless integration with local scripts, legacy binaries, or remote services.

Protocol	Description	Configuration Flag	Env Variable
HTTP	Standard REST API (POST)	`(default)`	`AGENT_API_URL`
Local	Local process via stdin/stdout	`--agent-cmd`	`AGENT_LOCAL_CMD`
Socket	TCP or Unix Domain Socket	`--agent-socket`	`AGENT_SOCKET_ADDR`

At a Glance (v1.0 RC)

Evaluation Specification (AES): Standardized YAML/Markdown benchmarks for agents.
20-Shim Enterprise Suite: High-fidelity simulators for Git, API, Database, Knowledge Base, Support Desk, Social Media, Vector DB, CI/CD, IoT, Security, and more.
Zero-Touch Hot-Swap Architecture: Dynamically register and swap simulators via plugins without core code modifications.
Benchmark Ecosystem: Native loaders for GAIA (HuggingFace Integration) and AssistantBench. Supports benchmark URI schemes (e.g., gaia://2023, assistantbench://v1) for zero-config execution.
High-Fidelity Industry Metrics: Modular, pluggable evaluators for Defense (ROE, C2, Intelligence Fusion), Healthcare, and Finance. Features high-precision numerical extraction and domain-specific LLM rubrics.
Tool Sandbox: Governance-controlled execution with full VFS-aware state parity verification.
Integrated Visual Suite: Unified React dashboard for live trace replay and visual debugging.
Semantic Bridge: Ingest production traces (import-drift) and analyze failures (triage).
Judge Guarding: Model-based scoring with support for OpenAI, Gemini, Claude, and Ollama.

The Advanced Update (v1.1)

The latest release introduces a new suite of high-level automation and visual tools designed for 10x developer productivity.

Advanced CLI Suite

list: Faceted search filtering across 5,000+ industry scenarios.
lint: Automated quality scoring and AES compliance verification via --path.
install <pack>: Rapid deployment of curated, industry-specific scenario bundles (e.g., telecom-pack, rag-agent-pack).
analyze <url>: Proactive agent repo scanning; auto-generates AES stubs by identifying tools and endpoints.
ci generate: One-click scaffolding of GitHub Actions workflows for evaluation-on-PR.
failures search: Intelligence-driven retrieval of edge cases from the global failure corpus.
explain: AI-powered trace diagnostics (loops, timeouts, PII leaks) via --path <run.jsonl>.
auto-translate: Leverage local LLMs (via Ollama) to convert raw documents into executable AES scenarios.

Premium UX Tools

Scenario Editor: A visual interface for constructing real-world AES logic; saves production-ready JSON directly to the catalog.
VS Code Extension: Run evaluations and visualize timelines directly within your IDE.
Visual Debugger: Real-time trajectory playback with interactive state inspection (Live Engine Hook).

The harness is built on a decoupled, event-driven architecture that allows Enterprise integrations to be hot-swapped without core modifications.

EventEmitter Bus: Passive observation of every turn, tool call, and state change.
🧩 Pluggable Judge Layer: Configurable model-based scoring with support for OpenAI, Gemini, Claude, and Ollama.
🏥 High-Fidelity Metrics Framework: Decoupled, category-based evaluators (Accuracy, Planning, Defense, Technical) with extensible registration.
Industry-Standard Rubrics: Specialized evaluators for Clinical Safety, Fiduciary Accuracy, Strategic Planning, and Causal Inference.
Native HITL Support: built-in pausing for human intervention via the human adapter.
Advanced Discovery: Plugin-driven registry for third-party agent adapters (LangGraph, CrewAI, AutoGen, Grok).
Pluggable World Shims: Register custom environment simulators through the on_register_simulators hook.

🛠️ dataproc-engine: Industrial Extraction Core

The framework now features a standalone extraction engine designed for high-fidelity data acquisition:

8-Sector Coverage: Finance, Healthcare, Energy, Telecom, Ecommerce, Agriculture, Transportation, and Unstructured.
Zero-Mock Integrity: Automated fallback to high-fidelity simulations when live APIs are unavailable, maintaining 100% data availability.

Beyond the advanced suite, the harness provides a robust toolkit for professional evaluation:

doctor: Environment health checker.
report: Rich HTML reporting with interactive Mermaid trajectories via --path.
record & playground: Interaction capture and REPL experimentation for rapid prototyping.
spec-to-eval: Convert Markdown PRDs/Specs into executable JSON scenarios. Supports --fill-defaults to rapidly generate lint-compliant stubs.
scenario generate: Interactive scaffolding for manual test authoring.
mutate: Adversarial scenario generator (typos, injections, ambiguity).
import-drift: Convert production logs into regression test cases.

Integrated Visual Suite (Native GUI)

The harness includes a unified React-powered SPA that simplifies management of scenarios, runs, and visual debugging across all industries.

Key Feature Hubs:

Scenario Explorer: Browse the catalog with faceted filters, global search, and real-time Quality Badges (Lint scores).
Visual AES Builder: Construction of complex agentic evaluation sequences using a drag-and-drop node logic—outputs production-ready JSON.
Reports & Traces Hub: Historical execution timeline with detailed analysis and instant "View Report" navigation.
Interactive Visual Debugger: Real-time trajectory playback, state inspection, and trace export (JSON) for regression testing.
Documentation Hub: Categorized access to all Markdown guides, architectural diagrams, and API references.

Quick Launch:

multiagent-eval console

Access via browser at http://localhost:5000. The console features an adaptive, premium dark-mode UI with high-density data visualizations.

Running Tests

python -m pytest

Centralized Configuration

All configurable parameters are centralized in eval_runner/config.py. You can override any setting via environment variables.

Variable	Default	Description
`AGENT_API_URL`	`http://localhost:5001/execute_task`	Agent entry point URL (HTTP)
`EVAL_MAX_TURNS`	`5`	Max conversation turns per task
`MAX_ENGINE_ATTEMPTS`	`50`	Security cap on evaluation attempts
`JUDGE_PROVIDER`	`ollama`	LLM Judge provider (`openai`, `anthropic`, `gemini`, `ollama`, `grok`)
`JUDGE_MODEL`	(None)	Specific model for the judge
`LUNA_JUDGE_TEMPERATURE`	`0.0`	Temperature for judge generation
`OLLAMA_HOST`	`http://localhost:11434`	Local Ollama host URL
`OLLAMA_MODEL`	`llama3`	Default Ollama model
`OPENAI_API_KEY`	(None)	API key for OpenAI provider
`OPENAI_BASE_URL`	`https://api.openai.com/v1`	Base URL for OpenAI-compatible APIs
`ANTHROPIC_API_KEY`	(None)	API key for Anthropic/Claude provider
`GOOGLE_API_KEY`	(None)	API key for Google/Gemini provider
`XAI_API_KEY`	(None)	API key for xAI/Grok provider
`DEFAULT_ADAPTER_TIMEOUT`	`30`	Network timeout for agent adapters
`PLUGIN_TIMEOUT`	`5.0`	Execution timeout for plugin hooks
`REPORTS_DIR`	`reports`	Base directory for generated reports

... and more.

Security and Governance (Audit-Ready)

The platform is built with a Secure-by-Design philosophy, complying with enterprise-grade audit standards.

PII/Secret Redaction: Automatic, recursive scanning and redaction of JWTs, AWS keys, and PII from all event logs.
Secure Handoff Architecture: JWT-based authentication for between the core console and enterprise plugins.
Tool Sandboxing: Path traversal protection and shell-character neutralization for all tool executions.
WORM Logs: Write-Once-Read-Many immutable flight recorder traces (run.jsonl).
Audit Points: 100% compliance with the 8-point Enterprise Security Audit (DoS caps, Fork Bomb prevention, RCE guards).

Run Trace Warning

All evaluation execution logs are appended to runs/run.jsonl. Because this acts as an immutable flight recorder, the file will grow continuously. It is recommended to use the built-in trace rotation or periodically clean up this directory via multiagent-eval cleanup-runs --days 7.

Troubleshooting

ConnectionRefusedError: The harness cannot reach the agent. Ensure AGENT_API_URL is set correctly and the agent API is running.
PluginTimeoutError: A registered plugin took too long to execute a hook. Check your plugin logic or increase the timeout.
Invalid JSON Error (LLM): The auto-translate command expects strict JSON. Ensure your local Ollama model (e.g., llama3) is running and capable of JSON mode.
docker: command not found: You need to install Docker. Follow the Official Installation Guide.
Running Lab Mode without Docker: If you cannot install Docker, run these 3 commands in separate terminals:
1. python sample_agent/agent_app.py
2. multiagent-eval console
3. streamlit run dashboard/app.py (requires pip install streamlit)

How to Contribute

This is a community-driven project, and we welcome contributions! Please see our CONTRIBUTING.md file for detailed guidelines on how to add new industries, scenarios, or improve the evaluation engine.

Here are ways to get involved:

🌟 Quick Contributions

⭐ Star this repository
🐛 Report bugs
💡 Suggest features
📖 Improve documentation

🔨 Code Contributions

🆕 Good first issues
🧪 Add test scenarios
⚖️ Zero-CLA: We use the DCO (Developer Certificate of Origin). Just git commit -s.
🏭 Contribute new industries
🎯 Create evaluation scenarios

Contributors

Thanks to all our contributors! 🙌

Licensing and Editions

This project follows an Open Core model. The open-source capabilities provide a robust evaluation foundation, while the Enterprise Edition delivers the necessary security, governance, and audit-grade intelligence required for regulated deployments.

Feature Module	Community Edition (OSS)	Enterprise Edition
Core Architecture	✅ Eval Engine, Hooks, JSON Schemas	✅ Enterprise Service Bus Integration
Industry Benchmark Set	✅ 5,000+ Scenarios	✅ Prioritized Scenario Updates
Reliability Metrics	✅ `pass@k` multi-attempt scoring	✅ Persistent Leaderboards & Consensus
Scenario Mutations	🔶 Basic (Typos & Ambiguity)	✅ Adversarial Fuzzing & Prompt Injections
Execution Security	🔶 Basic Path/Shell Gating	✅ Context Payload Caps & Overflow Guards
Privacy Protections	❌ No	✅ Automatic PII Scanning & Redaction
Simulation	🔶 Real API required	✅ High-Fidelity Labs (Bank, EHR/HL7, CRM)
Compliance Suites	❌ No	✅ Production-Ready (HIPAA, FINRA, GDPR, PCI)
Observability	🔶 Terminal output	✅ OTEL Drift Gauges & Dashboard Feed
Defensibility Governance	❌ No	✅ WORM Audit Logs & Cryptographic Sealing
Integrity Checks	❌ No	✅ AES Scenario Merkle Sync (Root Verify)
Visual Debugger & GUI	✅ Local React Native App	✅ Enterprise Dashboard & Secure Handoff
Reproduction Workflow	🔶 JSONL Only	✅ Interactive Flight Recorder & Jupyter Repro
Parallel Engine	🔶 Sequential only	✅ Ray/Local JobQueue Distributed Runs
Interactive Triage	🔶 Heuristic only	✅ Multi-user Sync & Human Annotation
Advanced Sandbox	🔶 Path/Shell Gating	✅ Hardened Docker Isolation & Red-Team Probes
Auth & Governance	❌ None	✅ OIDC SSO, RBAC, Managed Leaderboards

Legend: ✅ Full Capability • 🔶 Basic/OSS Only • ❌ Enterprise Only

Looking for Production-Grade Reliability? The Enterprise Edition guarantees that you can safely evaluate agents over sensitive datasets without exposing credentials or executing dangerous code, backed by mathematical proof of non-repudiation. Contact ai.eval.harness.contact+enterprise@gmail.com.

License

The core of this project is licensed under the Apache License 2.0. See the LICENSE file for details.

Project details

These details have not been verified by PyPI

Intended Audience
- Developers
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Release history Release notifications | RSS feed

This version

1.1.0

Apr 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

multiagent_verify-1.1.0.tar.gz (163.9 kB view details)

Uploaded Apr 2, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

multiagent_verify-1.1.0-py3-none-any.whl (107.3 kB view details)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file multiagent_verify-1.1.0.tar.gz.

File metadata

Download URL: multiagent_verify-1.1.0.tar.gz
Upload date: Apr 2, 2026
Size: 163.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for multiagent_verify-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3225cd24e9d3eea171d93f9fb25e16a87f5685c7ee4ad46b465ca0c94ac0e162`
MD5	`2ab0a6f6f052e114919b65ecef0f75ea`
BLAKE2b-256	`1776d4851c53683134cf2c071ff847acbf8bb59d80d3fc217219d3e57e99f050`

See more details on using hashes here.

File details

Details for the file multiagent_verify-1.1.0-py3-none-any.whl.

File metadata

Download URL: multiagent_verify-1.1.0-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 107.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for multiagent_verify-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dba8331e790249c5ea0b6cd5eb1803512a2e670805e9ac53afdcf357d7174d65`
MD5	`45c6c3848913758c138ec36b7b8593ee`
BLAKE2b-256	`12106f922ffe2c7f145df2a6b5493c1f0ce8323287ead6a4be79f24c370358b8`

See more details on using hashes here.

multiagent-verify 1.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

🤖 MultiAgentEval - The Enterprise-Grade Reliability Framework for AI Agents

🛡️ Add the Badge to Your Agent

Option 1: Using img.shields.io

Option 2: Using GitHub Asset

Table of Contents

TL;DR: Impact in 60s

Mission

Getting Started

Prerequisites

60-Second Quickstart (Get Running Now)

📂 The Global Scenario Corpus (v1.1)

🏛️ Industry-Specific (4,000+ Scenarios)

🧠 Advanced Categories (1,000+ Scenarios)

Manual Evaluation (Running the Sample Agent)

Agent Communication Protocols

At a Glance (v1.0 RC)

The Advanced Update (v1.1)

Advanced CLI Suite

Premium UX Tools

🛠️ dataproc-engine: Industrial Extraction Core

Integrated Visual Suite (Native GUI)

Running Tests

Centralized Configuration

Security and Governance (Audit-Ready)

Run Trace Warning

Troubleshooting

How to Contribute

🌟 Quick Contributions

🔨 Code Contributions

Contributors

Licensing and Editions

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes