Open Source Multi-Agent System Evaluation Framework

multiagent-eval

Propagation-aware evaluation for multi-agent AI systems.

Single-LLM eval tools (RAGAS, DeepEval) miss what actually breaks in production: errors that start in Agent 1, silently propagate through Agent 3, and surface in the final output with no trace of origin.

multiagent-eval finds where the fault began.


The Problem Nobody Is Solving

Agent 1 ──► Agent 2 ──► Agent 3 ──► Agent 4 (Writer)
   │           │           │            │
   ✓           ✗           ✗            ✗  ← Error propagates; eval sees only final failure

You run eval. Score looks good. You ship.

Three days later: a hallucination in production. You check your eval results. Everything passed.

What happened?

Your eval checked the final output. It didn't check whether Agent 2 silently corrupted the information Agent 1 found. It didn't check whether Agent 3's hallucination was its own fault, or the result of broken input from upstream.

That's the gap multiagent-eval closes.


Quickstart

git clone https://github.com/iremsusavas/multiagent-eval.git
cd multiagent-eval
pip install -e .

# Zero-dependency demo — no API key needed
python examples/quickstart_mock.py

LLM-based evaluation needs access to a model. OpenAI, Anthropic, and local models served through Ollama are supported; Ollama needs no API key. To use a local model, pull it first:

ollama pull llama3.2

Then in eval_config.yaml:

judge:
  primary_model: "ollama/llama3.2"
  api_base: "http://localhost:11434"


What Makes This Different

Propagation Judge

Detects where information corruption begins — not just that it happened. Builds a directed graph where each edge carries a fidelity score. Red edges show exactly where data was lost or distorted between agents.
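
A minimal sketch of the idea, not the library's implementation. It assumes each hand-off between agents is an edge with a fidelity score in [0, 1], and that the first low-scoring edge in pipeline order marks where corruption began. The `Edge` type, the `find_fault_origin` function, the 0.8 threshold, and the scores are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """One hand-off between two agents (hypothetical structure)."""
    source: str
    target: str
    # How faithfully the target preserved what the source produced:
    # 1.0 = passed on intact, 0.0 = fully lost or distorted.
    fidelity: float


def find_fault_origin(edges: list[Edge], threshold: float = 0.8) -> Optional[Edge]:
    """Return the earliest edge whose fidelity is below the threshold.

    Edges are assumed to be listed in pipeline order, so the first
    low-fidelity edge is where the corruption began.
    """
    for edge in edges:
        if edge.fidelity < threshold:
            return edge
    return None


pipeline = [
    Edge("agent_1", "agent_2", 0.41),  # agent_2 distorts what agent_1 found
    Edge("agent_2", "agent_3", 0.93),  # agent_3 faithfully passes on bad data
    Edge("agent_3", "agent_4", 0.95),  # so does the writer
]

origin = find_fault_origin(pipeline)
if origin is not None:
    print(f"Fault begins on edge {origin.source} -> {origin.target}")
```

In this made-up trace the later hand-offs are faithful, so a check on the final output alone would blame the writer even though the damage happened on the first edge.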

Built-in Bias Detection

Every LLM judge call automatically runs:

  • Primacy bias (A/B swap permutation tests)
  • Verbosity bias (length vs. correctness)
  • Tone bias (neutral vs. apologetic framing)
  • Cascade bias (upstream error penalizing innocent agents)
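
The primacy check in the list above can be sketched as an A/B swap: ask the judge the same comparison twice with the answers in opposite positions and see whether it still prefers the same answer. This is an illustrative sketch, not the library's code. The `Judge` signature and both stand-in judges are assumptions, and no real LLM is called.

```python
from typing import Callable

# A judge takes (answer_in_slot_a, answer_in_slot_b) and returns "A" or "B".
Judge = Callable[[str, str], str]


def is_position_consistent(judge: Judge, first: str, second: str) -> bool:
    """Ask the judge twice with the answers swapped.

    A judge without position bias prefers the same underlying answer
    both times, regardless of which slot it appears in.
    """
    original = judge(first, second)
    swapped = judge(second, first)
    winner_original = first if original == "A" else second
    winner_swapped = second if swapped == "A" else first
    return winner_original == winner_swapped


def always_first(a: str, b: str) -> str:
    """Stand-in judge with maximal primacy bias: always prefers slot A."""
    return "A"


def longer_wins(a: str, b: str) -> str:
    """Stand-in judge that ignores position (it has verbosity bias instead)."""
    return "A" if len(a) >= len(b) else "B"


pairs = [("Paris is the capital of France.", "Lyon."), ("4", "2 + 2 = 4")]
for name, judge in [("always_first", always_first), ("longer_wins", longer_wins)]:
    consistent = sum(is_position_consistent(judge, a, b) for a, b in pairs)
    print(f"{name}: {consistent}/{len(pairs)} position-consistent verdicts")
```

A low share of position-consistent verdicts suggests the judge's scores depend on ordering rather than content.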

CI/CD Native

Eval isn't a report. It's a gate.

# .github/workflows/multiagent-eval.yml
- name: Run evaluation
  run: multiagent-eval run --config eval_config.yaml
# eval_score < threshold → fail the PR
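
The gating logic in the comment above amounts to turning a score into a process exit code. The sketch below only illustrates that logic; the CLI's actual exit behaviour and results schema are not documented here, so the `eval_score` key and the `gate_exit_code` helper are assumptions.

```python
import json


def gate_exit_code(results_json: str, threshold: float) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise."""
    results = json.loads(results_json)
    # "eval_score" is an assumed key, not a documented results schema.
    score = float(results["eval_score"])
    return 0 if score >= threshold else 1


# Made-up result: a score of 0.72 against a threshold of 0.80.
exit_code = gate_exit_code('{"eval_score": 0.72}', threshold=0.80)
print("pass" if exit_code == 0 else "fail")
```

In a CI job the returned value would be passed to `sys.exit()`, since a non-zero exit code is what makes a workflow step fail.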

Statistical Rigor

Bootstrap confidence intervals and permutation p-values on every run. "Did we improve?" becomes answerable.
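
To make the two statistics concrete, here is a standard-library sketch of a percentile bootstrap interval and a one-sided permutation test on per-example scores. It is not the library's implementation. The function names, resample counts, and the score lists are invented for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_p_value(baseline, candidate, n_permutations=2000, seed=0):
    """One-sided p-value for 'candidate mean is higher than baseline mean'."""
    rng = random.Random(seed)
    observed = mean(candidate) - mean(baseline)
    pooled = list(baseline) + list(candidate)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null, group membership is arbitrary.
        rng.shuffle(pooled)
        diff = mean(pooled[len(baseline):]) - mean(pooled[:len(baseline)])
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Made-up per-example scores for two versions of a pipeline.
baseline = [0.61, 0.58, 0.70, 0.65, 0.59, 0.66, 0.62, 0.64]
candidate = [0.72, 0.69, 0.75, 0.71, 0.68, 0.77, 0.70, 0.74]

low, high = bootstrap_ci(candidate)
p_value = permutation_p_value(baseline, candidate)
print(f"candidate mean 95% CI: [{low:.3f}, {high:.3f}]")
print(f"p-value (candidate > baseline): {p_value:.4f}")
```

A narrow interval and a small p-value together indicate that the improvement is unlikely to be resampling noise.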

Failure Mode Taxonomy

Not just a score. A category:

PROPAGATION_ERROR | HALLUCINATION | CONTEXT_LOSS | ORCHESTRATION_BREAK | CASCADE_FAILURE | PII_LEAKAGE
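
One way to work with these categories in code is an enum, so that failures can be counted per category. The six names come from the list above; the `FailureMode` class and the sample labels are hypothetical and may not match how the library represents them.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Categories copied from the taxonomy; the class itself is hypothetical."""
    PROPAGATION_ERROR = "PROPAGATION_ERROR"
    HALLUCINATION = "HALLUCINATION"
    CONTEXT_LOSS = "CONTEXT_LOSS"
    ORCHESTRATION_BREAK = "ORCHESTRATION_BREAK"
    CASCADE_FAILURE = "CASCADE_FAILURE"
    PII_LEAKAGE = "PII_LEAKAGE"


# Made-up labels for five failed examples.
labels = ["HALLUCINATION", "CONTEXT_LOSS", "HALLUCINATION", "PII_LEAKAGE", "HALLUCINATION"]
counts = Counter(FailureMode(label) for label in labels)

for mode, count in counts.most_common():
    print(f"{mode.name}: {count}")
```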


Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         multiagent-eval                                  │
├─────────────────────────────────────────────────────────────────────────┤
│  core/           trace, metrics, runner, state_machine, LLMGateway        │
│  judges/         LLMJudge (CoT, bias), ConsistencyJudge, PropagationJudge│
│  bias_detection/ primacy, verbosity, tone, cascade                       │
│  golden_datasets/ schema, manager, annotator, inter-rater agreement       │
│  reports/        JSON, HTML (D3.js), Streamlit dashboard                  │
│  integrations/   LangGraph, CrewAI, AutoGen, Custom adapters               │
│  telemetry/      OpenTelemetry spans → Datadog, Grafana, Jaeger          │
└─────────────────────────────────────────────────────────────────────────┘

Integrations

Adapters are provided for LangGraph, CrewAI, AutoGen, and custom pipelines.


Production Features

  • OpenTelemetry: Real-time span emission to Datadog/Grafana/Jaeger
  • PII Detection: Email, SSN, credit card — zero-tolerance config
  • Prompt Injection Detection: Pattern-based, extensible
  • Cost Estimation: Know your budget before you run (estimate-cost --dataset ...)
  • Regression Testing: Which examples degraded between v1.1 and v1.2? (regression-diff)
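
As an illustration of the pattern-based checks listed above, the following sketch scans text for emails, US Social Security numbers, and card-like digit runs. The regular expressions are simplified assumptions for demonstration, not the patterns the library ships, and would need hardening before real use.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not the library's).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category; an empty dict means clean."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


sample = "Contact jane.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
findings = find_pii(sample)
for category, matches in findings.items():
    print(f"{category}: {matches}")

# Zero tolerance: any finding at all fails the check.
print("PII check:", "fail" if findings else "pass")
```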

CLI

multiagent-eval run --config eval_config.yaml
multiagent-eval run --all                    # All examples in golden dataset
multiagent-eval estimate-cost -d datasets/research_qa.json
multiagent-eval regression-diff -a result_v1.json -b result_v2.json
multiagent-eval report --input results.json --format html
multiagent-eval dataset add --name my_dataset
multiagent-eval dashboard
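
Conceptually, a regression diff compares per-example scores from two runs and lists the examples that got worse. The sketch below shows that comparison on plain dictionaries. The result file format used by `regression-diff` is not described here, so the `example_id -> score` mapping, the `regression_diff` function, and the scores are assumptions.

```python
def regression_diff(
    before: dict[str, float], after: dict[str, float], tolerance: float = 0.0
) -> list[tuple[str, float, float]]:
    """List examples whose score dropped by more than `tolerance`."""
    degraded = []
    # Only examples present in both runs can be compared.
    for example_id in sorted(before.keys() & after.keys()):
        delta = after[example_id] - before[example_id]
        if delta < -tolerance:
            degraded.append((example_id, before[example_id], after[example_id]))
    return degraded


# Made-up per-example scores from two runs.
run_a = {"q1": 0.90, "q2": 0.75, "q3": 0.60}
run_b = {"q1": 0.92, "q2": 0.55, "q3": 0.58, "q4": 0.80}

for example_id, old, new in regression_diff(run_a, run_b, tolerance=0.05):
    print(f"{example_id}: {old:.2f} -> {new:.2f}")
```

The tolerance keeps small fluctuations, such as `q3` here, from being reported as regressions.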

Background

Built by an ML engineer who spent months improving LLM-as-Judge agreement from 63% to 84% in production at Pipedrive — and discovered that most eval problems aren't scoring problems. They're architectural ones.

JudgeGuard (primacy bias detection) came first. multiagent-eval is what came after asking: "What happens to these biases when you have five agents?"


Roadmap

  • Leaderboard / MAE-Bench public benchmark
  • Multi-turn stateful session evaluation
  • Visual diff UI for agent output comparison
  • Automated rubric improvement suggestions
  • Native LangSmith integration

Contributing

Issues, PRs, and dataset contributions welcome. If you're building multi-agent systems and hitting eval problems — open an issue. That's how this gets better.


License

MIT
