Open Source Multi-Agent System Evaluation Framework
multiagent-eval
Propagation-aware evaluation for multi-agent AI systems.
Single-LLM eval tools (RAGAS, DeepEval) miss what actually breaks in production: errors that start in Agent 1, silently propagate through Agent 3, and surface in the final output with no trace of origin.
multiagent-eval finds where the fault began.
The Problem Nobody Is Solving
Agent 1 ──► Agent 2 ──► Agent 3 ──► Agent 4 (Writer)
   │           │           │           │
   ✓           ✗           ✗           ✗   ← Error propagates; eval sees only final failure
You run eval. Score looks good. You ship.
Three days later: a hallucination in production. You check your eval results. Everything passed.
What happened?
Your eval checked the final output. It didn't check whether Agent 2 silently corrupted the information Agent 1 found. It didn't check whether Agent 3's hallucination was its own fault, or the result of broken input from upstream.
That's the gap multiagent-eval closes.
Quickstart
git clone https://github.com/iremsusavas/multiagent-eval.git
cd multiagent-eval
pip install -e .
# Zero-dependency demo — no API key needed
python examples/quickstart_mock.py
LLM-based evaluation requires a running LLM. Supports OpenAI, Anthropic, or local models via Ollama (no API key needed):
ollama pull llama3.2
Then in eval_config.yaml:
judge:
  primary_model: "ollama/llama3.2"
  api_base: "http://localhost:11434"
For a fully zero-dependency demo (no LLM needed):
python examples/quickstart_mock.py
What Makes This Different
Propagation Judge
Detects where information corruption begins — not just that it happened. Builds a directed graph where each edge carries a fidelity score. Red edges show exactly where data was lost or distorted between agents.
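The idea of a fidelity-scored graph can be illustrated with a minimal sketch. This is not the library's PropagationJudge; the function name `find_fault_origin`, the edge dictionary layout, and the 0.8 threshold are all assumptions made for illustration. It walks the agent graph from the entry agent and reports the first hand-off whose fidelity falls below the threshold.

```python
from collections import deque


def find_fault_origin(edges, root, threshold=0.8):
    """Return (source, target, fidelity) for the first low-fidelity edge, or None.

    `edges` maps (source, target) pairs to a fidelity score in [0, 1].
    The graph is walked breadth-first from `root`, so the edge closest
    to the start of the pipeline is reported first.
    """
    # Build an adjacency list: source -> [(target, fidelity), ...]
    children = {}
    for (src, dst), fidelity in edges.items():
        children.setdefault(src, []).append((dst, fidelity))

    queue, seen = deque([root]), {root}
    while queue:
        node = queue.popleft()
        for dst, fidelity in children.get(node, []):
            if fidelity < threshold:
                return (node, dst, fidelity)  # a "red edge"
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return None


# Hypothetical fidelity scores for a four-agent pipeline.
edges = {
    ("agent_1", "agent_2"): 0.97,
    ("agent_2", "agent_3"): 0.41,
    ("agent_3", "agent_4"): 0.95,
}
print(find_fault_origin(edges, "agent_1"))
```

In this toy pipeline the hand-off from `agent_2` to `agent_3` is flagged, even though the visible failure would only show up in the output of `agent_4`.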
Built-in Bias Detection
Every LLM judge call automatically runs:
- Primacy bias (A/B swap permutation tests)
- Verbosity bias (length vs. correctness)
- Tone bias (neutral vs. apologetic framing)
- Cascade bias (upstream error penalizing innocent agents)
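As a rough illustration of the A/B swap idea behind the primacy check, the sketch below measures how often a judge picks the same position after the two candidates are swapped. The `judge` callable and its "first"/"second" return convention are assumptions for this example, not the library's interface, and the judges here are deterministic stand-ins rather than LLM calls.

```python
def primacy_bias_rate(judge, pairs):
    """Fraction of pairs where the judge's choice follows position, not content.

    `judge(first, second)` must return "first" or "second".
    A position-independent judge prefers the same answer in both orders,
    so its verdict flips from "first" to "second" (or back) after the swap.
    If the verdict stays the same, the judge followed the position.
    """
    position_bound = 0
    for a, b in pairs:
        original = judge(a, b)
        swapped = judge(b, a)
        if original == swapped:
            position_bound += 1
    return position_bound / len(pairs)


def always_first(first, second):
    """A maximally biased stand-in judge: always prefers the first slot."""
    return "first"


def longer_wins(first, second):
    """A content-based stand-in judge (it has verbosity bias, but no primacy bias)."""
    return "first" if len(first) >= len(second) else "second"


pairs = [("short", "a much longer answer"), ("abc", "abcdef")]
print(primacy_bias_rate(always_first, pairs))
print(primacy_bias_rate(longer_wins, pairs))
```

The same swap harness can be reused for the other checks by changing what is varied between the two calls, for example padding one answer to test verbosity.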
CI/CD Native
Eval isn't a report. It's a gate.
# .github/workflows/multiagent-eval.yml
- name: Run evaluation
run: multiagent-eval run --config eval_config.yaml
# eval_score < threshold → fail the PR
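A gate comes down to an exit code: a non-zero exit fails the CI step. The sketch below shows that logic in isolation. The `eval_score` field name and the results layout are assumptions for illustration; the actual schema written by `multiagent-eval run` may differ.

```python
def gate_exit_code(results: dict, threshold: float) -> int:
    """Return 0 when the run passes the gate, 1 when it should fail the PR."""
    score = float(results["eval_score"])  # hypothetical field name
    return 0 if score >= threshold else 1


# Hypothetical parsed results from two runs.
passing = {"eval_score": 0.91}
failing = {"eval_score": 0.62}

print(gate_exit_code(passing, threshold=0.8))
print(gate_exit_code(failing, threshold=0.8))
```

In a workflow step, a script like this would end with `sys.exit(gate_exit_code(...))` so the CI runner sees the result.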
Statistical Rigor
Bootstrap confidence intervals and permutation p-values on every run. "Did we improve?" becomes answerable.
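For readers unfamiliar with the two techniques, here is a minimal standard-library sketch of a percentile bootstrap interval for a mean score and a two-sided permutation test on the difference between two runs. The function names, resample counts, and sample scores are assumptions for illustration; the library's own implementation may use different estimators.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-example scores from two versions of a pipeline.
v1 = [0.61, 0.58, 0.70, 0.65, 0.59, 0.63, 0.66, 0.60]
v2 = [0.72, 0.69, 0.75, 0.71, 0.68, 0.74, 0.73, 0.70]

print(bootstrap_ci(v2))
print(permutation_p_value(v1, v2))
```

A small p-value together with a v2 interval that sits above the v1 mean is the kind of evidence that lets "did we improve?" be answered with something other than a single point estimate.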
Failure Mode Taxonomy
Not just a score. A category:
PROPAGATION_ERROR | HALLUCINATION | CONTEXT_LOSS |
ORCHESTRATION_BREAK | CASCADE_FAILURE | PII_LEAKAGE
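One way to work with these categories downstream is to model them as an enum and count failures per category. The sketch below assumes results carry the category as a plain string label; the `FailureMode` class and `tally` helper are illustrative names, not part of the documented API.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """The six categories listed above, as an enum (illustrative)."""
    PROPAGATION_ERROR = "PROPAGATION_ERROR"
    HALLUCINATION = "HALLUCINATION"
    CONTEXT_LOSS = "CONTEXT_LOSS"
    ORCHESTRATION_BREAK = "ORCHESTRATION_BREAK"
    CASCADE_FAILURE = "CASCADE_FAILURE"
    PII_LEAKAGE = "PII_LEAKAGE"


def tally(labels):
    """Count failures per category; an unknown label raises ValueError."""
    return Counter(FailureMode(label) for label in labels)


# Hypothetical labels from a batch of failed examples.
counts = tally(["HALLUCINATION", "PROPAGATION_ERROR", "HALLUCINATION"])
for mode, n in counts.most_common():
    print(mode.name, n)
```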
Architecture
┌────────────────────────────────────────────────────────────────────────────┐
│ multiagent-eval                                                            │
├────────────────────────────────────────────────────────────────────────────┤
│ core/             trace, metrics, runner, state_machine, LLMGateway        │
│ judges/           LLMJudge (CoT, bias), ConsistencyJudge, PropagationJudge │
│ bias_detection/   primacy, verbosity, tone, cascade                        │
│ golden_datasets/  schema, manager, annotator, inter-rater agreement        │
│ reports/          JSON, HTML (D3.js), Streamlit dashboard                  │
│ integrations/     LangGraph, CrewAI, AutoGen, Custom adapters              │
│ telemetry/        OpenTelemetry spans → Datadog, Grafana, Jaeger           │
└────────────────────────────────────────────────────────────────────────────┘
Integrations
| LangGraph | CrewAI | AutoGen | Custom |
Production Features
- OpenTelemetry: Real-time span emission to Datadog/Grafana/Jaeger
- PII Detection: Email, SSN, credit card — zero-tolerance config
- Prompt Injection Detection: Pattern-based, extensible
- Cost Estimation: Know your budget before you run (estimate-cost --dataset ...)
- Regression Testing: Which examples degraded between v1.1 and v1.2? (regression-diff)
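To make the pattern-based approach concrete, here is a small sketch of regex-driven PII detection for the three types named above. The patterns, the Luhn checksum filter for card numbers, and the `find_pii` name are assumptions for illustration. They are deliberately simple and are not the library's detectors.

```python
import re

# Illustrative patterns only; real-world PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to filter out random digit runs."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text):
    """Return a list of (kind, matched_text) tuples found in `text`."""
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            if kind == "credit_card" and not luhn_valid(match.group()):
                continue
            findings.append((kind, match.group()))
    return findings


sample = "Contact jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
print(find_pii(sample))
```

Under a zero-tolerance policy, any non-empty result from a check like this would mark the example as a PII_LEAKAGE failure.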
CLI
multiagent-eval run --config eval_config.yaml
multiagent-eval run --all # All examples in golden dataset
multiagent-eval estimate-cost -d datasets/research_qa.json
multiagent-eval regression-diff -a result_v1.json -b result_v2.json
multiagent-eval report --input results.json --format html
multiagent-eval dataset add --name my_dataset
multiagent-eval dashboard
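Conceptually, a regression diff compares per-example scores between two result files and lists the examples that got worse. The sketch below assumes each result has already been reduced to a mapping from example ID to score; that layout, the `degraded_examples` name, and the `tolerance` parameter are hypothetical and do not describe the actual file format used by `regression-diff`.

```python
def degraded_examples(before, after, tolerance=0.0):
    """List (example_id, old_score, new_score) for examples that got worse.

    Only examples present in both runs are compared. Results are sorted
    so that the largest drop comes first.
    """
    common = before.keys() & after.keys()
    drops = [
        (example_id, before[example_id], after[example_id])
        for example_id in common
        if after[example_id] < before[example_id] - tolerance
    ]
    return sorted(drops, key=lambda row: row[2] - row[1])


# Hypothetical per-example scores from two releases.
result_v1 = {"q1": 0.9, "q2": 0.8, "q3": 0.7}
result_v2 = {"q1": 0.9, "q2": 0.5, "q3": 0.65}

for example_id, old, new in degraded_examples(result_v1, result_v2):
    print(example_id, old, new)
```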
Background
Built by an ML engineer who spent months improving LLM-as-Judge agreement from 63% to 84% in production at Pipedrive — and discovered that most eval problems aren't scoring problems. They're architectural ones.
JudgeGuard (primacy bias detection) came first. multiagent-eval is what came after asking: "What happens to these biases when you have five agents?"
Roadmap
- Leaderboard / MAE-Bench public benchmark
- Multi-turn stateful session evaluation
- Visual diff UI for agent output comparison
- Automated rubric improvement suggestions
- Native LangSmith integration
Contributing
Issues, PRs, and dataset contributions welcome. If you're building multi-agent systems and hitting eval problems — open an issue. That's how this gets better.
License
MIT