Open Source Multi-Agent System Evaluation Framework

multiagent-eval

Propagation-aware evaluation for multi-agent AI systems.

Single-LLM eval tools (RAGAS, DeepEval) miss what actually breaks in production: errors that start in Agent 1, silently propagate through Agent 3, and surface in the final output with no trace of origin.

multiagent-eval finds where the fault began.

The Problem Nobody Is Solving

Agent 1 ──► Agent 2 ──► Agent 3 ──► Agent 4 (Writer)
   │           │           │            │
   ✓           ✗           ✗            ✗  ← Error propagates; eval sees only final failure

You run eval. Score looks good. You ship.

Three days later: a hallucination in production. You check your eval results. Everything passed.

What happened?

Your eval checked the final output. It didn't check whether Agent 2 silently corrupted the information Agent 1 found. It didn't check whether Agent 3's hallucination was its own fault, or the result of broken input from upstream.

That's the gap multiagent-eval closes.

Quickstart

git clone https://github.com/iremsusavas/multiagent-eval.git
cd multiagent-eval
pip install -e .

# Zero-dependency demo — no API key needed
python examples/quickstart_mock.py

LLM-based evaluation requires access to an LLM. OpenAI, Anthropic, and local models via Ollama (no API key needed) are supported. To use Ollama, pull a model first:
ollama pull llama3.2
Then in eval_config.yaml:
judge:
  primary_model: "ollama/llama3.2"
  api_base: "http://localhost:11434"
For a fully zero-dependency demo (no LLM needed):
python examples/quickstart_mock.py

What Makes This Different

Propagation Judge

Detects where information corruption begins — not just that it happened. Builds a directed graph where each edge carries a fidelity score. Red edges show exactly where data was lost or distorted between agents.
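
The idea of scoring each hand-off and locating the first bad edge can be sketched as follows. This is a minimal illustration, not the library's PropagationJudge: the `Edge` class, `token_fidelity`, `build_edges`, and `first_fault` are hypothetical names, and crude token overlap stands in for whatever fidelity measure the real judge uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    source: str
    target: str
    fidelity: float  # 1.0 means every upstream token survived the hand-off


def token_fidelity(upstream: str, downstream: str) -> float:
    """Fraction of upstream tokens that still appear in the downstream text."""
    up = set(upstream.lower().split())
    down = set(downstream.lower().split())
    return len(up & down) / len(up) if up else 1.0


def build_edges(outputs: list) -> list:
    """Score each hand-off in a linear pipeline of (agent_name, output) pairs."""
    return [
        Edge(a_name, b_name, token_fidelity(a_out, b_out))
        for (a_name, a_out), (b_name, b_out) in zip(outputs, outputs[1:])
    ]


def first_fault(edges: list, threshold: float = 0.8) -> Optional[Edge]:
    """Return the earliest edge whose fidelity falls below the threshold."""
    return next((e for e in edges if e.fidelity < threshold), None)


# Toy trace: the summarizer corrupts two facts, the writer copies them faithfully.
trace = [
    ("researcher", "revenue grew 12 percent in 2023"),
    ("summarizer", "revenue grew 21 percent in 2022"),
    ("writer", "revenue grew 21 percent in 2022 according to the report"),
]
edges = build_edges(trace)
fault = first_fault(edges)
```

In this toy trace the writer's output is wrong, but the only low-fidelity edge is researcher to summarizer, which is the "red edge" the section describes.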

Built-in Bias Detection

Every LLM judge call automatically runs:

Primacy bias (A/B swap permutation tests)
Verbosity bias (length vs. correctness)
Tone bias (neutral vs. apologetic framing)
Cascade bias (upstream error penalizing innocent agents)
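
As an illustration of the first check in the list, an A/B swap test asks the judge the same question twice with the answer order reversed and counts how often the verdict changes. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; the function names are illustrative and are not taken from the package.

```python
from typing import Callable

# Hypothetical judge signature: takes two answers in presentation order,
# returns "first" or "second" to name the preferred one.
Judge = Callable[[str, str], str]


def position_consistent(judge: Judge, a: str, b: str) -> bool:
    """True when the judge prefers the same answer in both presentation orders."""
    original = judge(a, b)  # a is shown first
    swapped = judge(b, a)   # b is shown first
    winner_original = a if original == "first" else b
    winner_swapped = b if swapped == "first" else a
    return winner_original == winner_swapped


def primacy_flip_rate(judge: Judge, pairs: list) -> float:
    """Share of answer pairs where swapping the order changes the verdict."""
    if not pairs:
        return 0.0
    flips = sum(not position_consistent(judge, a, b) for a, b in pairs)
    return flips / len(pairs)


# Toy judges standing in for an LLM call.
def always_first(x: str, y: str) -> str:
    return "first"


def longer_wins(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"


pairs = [("short", "a much longer answer"), ("alpha beta", "gamma")]
biased_rate = primacy_flip_rate(always_first, pairs)
fair_rate = primacy_flip_rate(longer_wins, pairs)
```

A judge that always favors the first slot flips on every pair, while a judge that decides on content alone never flips.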

CI/CD Native

Eval isn't a report. It's a gate.

# .github/workflows/multiagent-eval.yml
- name: Run evaluation
  run: multiagent-eval run --config eval_config.yaml
# eval_score < threshold → fail the PR
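
The gating rule in the comment above can be sketched as a small function that turns a score into a process exit code. This is an assumption-laden illustration: the results layout (a dict with an `eval_score` key) and the `gate` function are hypothetical, and the real CLI may handle thresholds differently.

```python
import sys


def gate(results: dict, threshold: float) -> int:
    """Return a process exit code: 0 lets the PR pass, 1 fails it."""
    score = results.get("eval_score")  # assumed field name
    if score is None:
        print("eval_score missing from results", file=sys.stderr)
        return 1
    if score < threshold:
        print(f"FAIL: eval_score {score:.3f} < threshold {threshold:.3f}", file=sys.stderr)
        return 1
    print(f"PASS: eval_score {score:.3f} >= threshold {threshold:.3f}")
    return 0


exit_code = gate({"eval_score": 0.72}, threshold=0.8)
```

A CI step would pass the returned code to `sys.exit`, since any non-zero exit status fails a GitHub Actions step.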

Statistical Rigor

Bootstrap confidence intervals and permutation p-values on every run. "Did we improve?" becomes answerable.
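
For readers unfamiliar with the two techniques, here is a minimal standard-library sketch of a percentile bootstrap interval and a two-sided permutation test on per-example scores. The function names, defaults, and sample scores are illustrative assumptions; the package's own implementation may differ.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def permutation_p_value(baseline, candidate, n_permutations=2000, seed=0):
    """Two-sided p-value for the difference in means under label shuffling."""
    rng = random.Random(seed)
    observed = abs(mean(candidate) - mean(baseline))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n:]) - mean(pooled[:n]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Made-up per-example scores for two versions of a pipeline.
v1 = [0.61, 0.58, 0.65, 0.60, 0.63, 0.59, 0.62, 0.64]
v2 = [0.71, 0.69, 0.74, 0.70, 0.73, 0.68, 0.72, 0.75]

ci_low, ci_high = bootstrap_ci(v2)
p_value = permutation_p_value(v1, v2)
```

A narrow interval for the new version together with a small p-value is what makes "did we improve?" a statistical question rather than a judgment call.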

Failure Mode Taxonomy

Not just a score. A category:

Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                              multiagent-eval                               │
├────────────────────────────────────────────────────────────────────────────┤
│  core/            trace, metrics, runner, state_machine, LLMGateway        │
│  judges/          LLMJudge (CoT, bias), ConsistencyJudge, PropagationJudge │
│  bias_detection/  primacy, verbosity, tone, cascade                        │
│  golden_datasets/ schema, manager, annotator, inter-rater agreement        │
│  reports/         JSON, HTML (D3.js), Streamlit dashboard                  │
│  integrations/    LangGraph, CrewAI, AutoGen, Custom adapters              │
│  telemetry/       OpenTelemetry spans → Datadog, Grafana, Jaeger           │
└────────────────────────────────────────────────────────────────────────────┘

Integrations

Production Features

OpenTelemetry: Real-time span emission to Datadog/Grafana/Jaeger
PII Detection: Email, SSN, credit card — zero-tolerance config
Prompt Injection Detection: Pattern-based, extensible
Cost Estimation: Know your budget before you run (estimate-cost --dataset ...)
Regression Testing: Which examples degraded between v1.1 and v1.2? (regression-diff)
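
Pattern-based PII detection of the kind listed above can be sketched with regular expressions plus a Luhn checksum to cut down false positives on card numbers. The patterns, the `find_pii` function, and the zero-tolerance flag below are assumptions made for illustration, not the package's actual rules.

```python
import re

# Illustrative patterns only; production rules would need to be stricter.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list:
    """Return (kind, matched_text) pairs for every PII hit in the text."""
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group().strip()
            if kind == "credit_card" and not luhn_valid(value):
                continue  # digit run that is not a plausible card number
            findings.append((kind, value))
    return findings


sample = "Contact jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
findings = find_pii(sample)
zero_tolerance_ok = not findings  # any hit fails a zero-tolerance policy
```

Under a zero-tolerance setting, a single finding in any agent output would be enough to fail the run.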

CLI

multiagent-eval run --config eval_config.yaml
multiagent-eval run --all                    # All examples in golden dataset
multiagent-eval estimate-cost -d datasets/research_qa.json
multiagent-eval regression-diff -a result_v1.json -b result_v2.json
multiagent-eval report --input results.json --format html
multiagent-eval dataset add --name my_dataset
multiagent-eval dashboard
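
To make the regression-diff idea concrete, the sketch below compares per-example scores from two runs and lists the examples that got worse. It assumes each result file reduces to a mapping from example ID to score; that layout, the `regression_diff` function, and the tolerance value are hypothetical and may not match the CLI's real output.

```python
def regression_diff(before: dict, after: dict, tolerance: float = 0.05) -> list:
    """List examples whose score dropped by more than `tolerance`, worst first."""
    degraded = []
    # Only examples present in both runs can be compared.
    for example_id in before.keys() & after.keys():
        delta = after[example_id] - before[example_id]
        if delta < -tolerance:
            degraded.append((example_id, before[example_id], after[example_id], delta))
    return sorted(degraded, key=lambda row: row[3])


# Made-up scores for two result files.
result_v1 = {"q1": 0.90, "q2": 0.80, "q3": 0.70}
result_v2 = {"q1": 0.60, "q2": 0.82, "q3": 0.50, "q4": 0.90}

degraded = regression_diff(result_v1, result_v2)
degraded_ids = [row[0] for row in degraded]
```

Here `q2` improved slightly and `q4` is new, so only `q1` and `q3` are reported, with the largest drop listed first.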

Background

Built by an ML engineer who spent months improving LLM-as-Judge agreement from 63% to 84% in production at Pipedrive — and discovered that most eval problems aren't scoring problems. They're architectural ones.

JudgeGuard (primacy bias detection) came first. multiagent-eval is what came after asking: "What happens to these biases when you have five agents?"

Roadmap

Leaderboard / MAE-Bench public benchmark
Multi-turn stateful session evaluation
Visual diff UI for agent output comparison
Automated rubric improvement suggestions
Native LangSmith integration

Contributing

Issues, PRs, and dataset contributions welcome. If you're building multi-agent systems and hitting eval problems — open an issue. That's how this gets better.

License

MIT

Project details

This version

0.1.0

Mar 3, 2026

Download files

Source Distribution

multiagent_eval-0.1.0.tar.gz (56.6 kB)

Uploaded Mar 3, 2026 Source

Built Distribution

multiagent_eval-0.1.0-py3-none-any.whl (67.0 kB)

Uploaded Mar 3, 2026 Python 3

File details

Details for the file multiagent_eval-0.1.0.tar.gz.

File metadata

Download URL: multiagent_eval-0.1.0.tar.gz
Upload date: Mar 3, 2026
Size: 56.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for multiagent_eval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`12901d17a942114441ed408ffd6ef99b832446ce356359ca7f4399bd2e4ee559`
MD5	`4d44526f49680469e92146574f74070e`
BLAKE2b-256	`3ed5684b5c80398a6eea2fbd435fb42fa45b70089c51164e786e47b28f2ed360`

File details

Details for the file multiagent_eval-0.1.0-py3-none-any.whl.

File metadata

Download URL: multiagent_eval-0.1.0-py3-none-any.whl
Upload date: Mar 3, 2026
Size: 67.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for multiagent_eval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da34c538648f1652532dbc8d9d21619adeb2344a69bc41ec85aedc51dd161195`
MD5	`8aab5d08224773c28f2ca273dd811f52`
BLAKE2b-256	`dbe4c70cba0fea94a81e2cc45976e133a32619734120c5d7549eae729f7e021b`
