The first reliability testing framework for multi-agent AI systems

swarm-test

Find where your multi-agent AI system breaks — before production does.

Static reliability testing for CrewAI, LangGraph, AutoGen, and custom agent systems. No live LLM calls, no API cost.

License: MIT


The problem

Chain 14 agents at 95% reliability each and your system is ~49% reliable end-to-end (0.95^14). The failures aren't inside any single agent — they're in how they connect: silent cascade failures, hidden single points of failure, fragile dependencies. swarm-test finds them by analyzing your agent topology.
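The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that assumes a strictly sequential chain with independent failures; `chain_reliability` is a name made up for illustration and is not part of swarm-test.

```python
def chain_reliability(per_agent: float, n_agents: int) -> float:
    """End-to-end success probability of a sequential chain,
    assuming every agent fails independently of the others."""
    return per_agent ** n_agents


# 0.95 ** 14 is roughly 0.488, which is the ~49% quoted above.
for n in (1, 5, 14, 30):
    print(f"{n:>2} agents at 95% each -> {chain_reliability(0.95, n):.1%} end-to-end")
```

Real systems rarely satisfy the independence assumption, which is one reason the topology matters more than the per-agent number.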

Quickstart

pip install swarm-test
swarm-test run my_crew.py --open

--open launches an interactive D3 dashboard in your browser the moment the run finishes: the Swarm Score, a force-directed agent graph with single points of failure (SPOFs) pulsing red, sortable health and redundancy tables, and every finding grouped by severity.

No real script handy? Build a synthetic topology straight from the CLI:

swarm-test run -a "Orchestrator,Worker1,Worker2" -e "Orchestrator>Worker1,Orchestrator>Worker2"

[Image: swarm-test reliability dashboard]

What it catches

  • One agent fails and silently takes down everything downstream — cascade failure
  • A single agent the whole system depends on; remove it and the swarm splits — blast radius / SPOF
  • Credentials, PII, or other sensitive data leaking across agent boundaries — context leakage
  • Agents drifting from their assigned role; prompt-injection-style goal hijacking — intent drift
  • A slow upstream with no timeout boundary blocking the whole pipeline — timeout resilience
  • Dense cliques, echo chambers, and cycles that bypass the orchestrator — collusion detection
  • Agents stuck in loops — runaway step counts and retry storms that burn tokens with no error thrown — trajectory analysis
  • Output schema mismatches across agent edges — contract violation (opt-in; provide a contracts YAML)

Features

  • 0–100 Swarm Score with a verdict line (EXCELLENT → CRITICAL) — one-line output for CI
  • Agent role classification (orchestrator, aggregator, validator, gateway, worker, monitor, router) with confidence scores
  • Role-adjusted severity — a validator leaking context is upgraded; an orchestrator's blast radius is downgraded
  • Historical tracking — trend across runs, diffs new vs. resolved findings
  • Interactive HTML report (--open) — D3 force-directed graph, NxN heatmap, filterable findings
  • GitHub Action with PR annotations and job-summary score
  • Graph export to Mermaid, DOT, or PNG (SPOFs red, redundant green)
  • Framework adapters: CrewAI, LangGraph, AutoGen, generic / static graph
  • YAML config (.swarmtest.yml) and entry-point plugin system

CI gate (GitHub Action)

# .github/workflows/swarm-test.yml
on: [pull_request]
jobs:
  swarm-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: surajkumar811/swarm-test@v0.3.0
        with:
          script: my_crew.py
          fail-on-severity: high

Findings appear inline on the PR as ::error:: / ::warning:: / ::notice:: annotations; the Swarm Score is posted to the workflow job summary.
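For readers unfamiliar with these annotations, they are plain lines written to stdout in GitHub's workflow-command format, and the job summary is a file whose path GitHub provides in the GITHUB_STEP_SUMMARY environment variable. The sketch below shows the general mechanism only; the severity-to-level mapping and both function names are assumptions, not swarm-test's actual implementation.

```python
import os

# Assumed mapping from finding severity to annotation level (illustrative only).
LEVEL_BY_SEVERITY = {
    "critical": "error",
    "high": "error",
    "medium": "warning",
    "low": "notice",
    "info": "notice",
}


def annotation(severity: str, message: str, title: str = "swarm-test") -> str:
    """Format one finding as a GitHub workflow command line."""
    level = LEVEL_BY_SEVERITY.get(severity, "notice")
    # Workflow commands require %, CR and LF in the message to be escaped.
    escaped = message.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")
    return f"::{level} title={title}::{escaped}"


def write_job_summary(score: int) -> None:
    """Append the score to the job summary when running inside GitHub Actions."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(f"Swarm Score: {score}/100\n")


print(annotation("critical", "Orchestrator is a single point of failure"))
```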

Using it from Python

from swarm_test import SwarmProbe

# Works with a CrewAI Crew, LangGraph CompiledGraph, or AutoGen GroupChatManager
probe  = SwarmProbe(crew, swarm_name="my-crew")
report = probe.run_all()
report.print_summary()
report.to_html("report.html")

Installation

pip install swarm-test
# or with framework extras:
pip install "swarm-test[crewai]"
pip install "swarm-test[langgraph]"
pip install "swarm-test[autogen]"
pip install "swarm-test[png]"        # for PNG graph export

How it works

swarm-test builds a NetworkX directed graph from your agent system — nodes are agents, edges are interactions extracted by each framework adapter. All tests are static graph analyses; no LLM calls are made, and results are deterministic given the same topology.
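To make the graph model concrete, here is a dependency-free sketch that turns a comma-separated agent list and "Source>Target" edge list (the shape used by the -a and -e flags in the Quickstart) into an adjacency mapping. It assumes "A>B" means A hands work to B; `build_topology` is a hypothetical helper, and swarm-test itself builds a NetworkX graph through its framework adapters.

```python
from typing import Dict, Set


def build_topology(agents: str, edges: str) -> Dict[str, Set[str]]:
    """Parse 'A,B,C' and 'A>B,A>C' into {source: {targets}}."""
    graph: Dict[str, Set[str]] = {
        name.strip(): set() for name in agents.split(",") if name.strip()
    }
    for edge in edges.split(","):
        if not edge.strip():
            continue
        source, target = (part.strip() for part in edge.split(">", 1))
        # Names that appear only in an edge are added rather than rejected
        # (a choice made for this sketch).
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    return graph


topology = build_topology(
    "Orchestrator,Worker1,Worker2",
    "Orchestrator>Worker1,Orchestrator>Worker2",
)
print(topology)
```

Because the analysis only reads this structure, the same topology always produces the same findings.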

  • Cascade failure — simulates each agent failing in turn and measures downstream impact.
  • Blast radius — detects articulation points (graph-theoretic SPOFs) and scores every agent on a 0–100 redundancy scale composed of path redundancy (30%), role uniqueness (25%), tool coverage (20%), betweenness centrality (15%), and degree ratio (10%).
  • Context leakage — scans interaction payloads against a sensitive-data regex set extensible from .swarmtest.yml.
  • Intent drift — flags agents whose observed behavior diverges from their declared role; includes prompt-injection heuristics.
  • Collusion — finds dense cliques, echo chambers, and cycles that bypass the declared orchestrator.
  • Timeout resilience — identifies long synchronous chains with no timeout boundary.
  • Trajectory analysis — flags self-loops, ping-pong pairs, multi-agent feedback cycles, unbounded loops with no exit, repeated parallel calls, and cycles deeper than max_trajectory_depth (default 5).
  • Contract violation — validates agent outputs against JSON schemas declared per edge (opt-in; pass --contracts contracts.yml).
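The first two tests in this list can be sketched without any dependencies. The code below is an illustration under stated assumptions, not swarm-test's implementation: it treats "downstream impact" as plain reachability, finds articulation points by brute force (remove a node, count components) rather than with NetworkX, and uses a hypothetical three-agent pipeline.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # agent -> agents it hands work to; every agent is a key


def downstream(graph: Graph, failed: str) -> Set[str]:
    """Agents reachable from `failed` along directed edges (its cascade)."""
    seen: Set[str] = set()
    stack = [failed]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def component_count(graph: Graph, nodes: Set[str]) -> int:
    """Number of weakly connected components among `nodes` (direction ignored)."""
    neighbours: Dict[str, Set[str]] = {n: set() for n in nodes}
    for src in nodes:
        for dst in graph[src]:
            if dst in nodes:
                neighbours[src].add(dst)
                neighbours[dst].add(src)
    unvisited, count = set(nodes), 0
    while unvisited:
        count += 1
        stack = [unvisited.pop()]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    stack.append(nxt)
    return count


def single_points_of_failure(graph: Graph) -> Set[str]:
    """Articulation points: removing the node increases the component count."""
    nodes = set(graph)
    baseline = component_count(graph, nodes)
    return {n for n in nodes if component_count(graph, nodes - {n}) > baseline}


# Hypothetical pipeline: Fetcher -> Summarizer -> Reviewer
pipeline: Graph = {"Fetcher": {"Summarizer"}, "Summarizer": {"Reviewer"}, "Reviewer": set()}
for agent in pipeline:
    impact = downstream(pipeline, agent)
    print(f"{agent} failing affects {len(impact)}/{len(pipeline) - 1} other agents")
print("SPOFs:", sorted(single_points_of_failure(pipeline)))
```

The brute-force check is quadratic in graph size, which is fine for a sketch; a linear-time articulation-point algorithm is the usual choice for larger graphs.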

Roles are classified from structural metrics (in/out degree, betweenness centrality) plus naming hints, each with a 0–100% confidence score. Severity is then role-adjusted: an orchestrator with high blast radius is expected and gets downgraded; a validator leaking context is a security incident and gets upgraded.
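Role adjustment can be pictured as a lookup that shifts a finding along the severity ladder. The sketch below encodes only the two cases named above; the one-step shift, the test identifiers, and the function name are assumptions made for illustration.

```python
# Severity ladder in ascending order (names taken from the fail_on_severity options).
SEVERITIES = ["info", "low", "medium", "high", "critical"]

# (role, test) -> number of steps to shift. The step size of 1 is an assumption.
ADJUSTMENTS = {
    ("orchestrator", "blast_radius"): -1,    # expected hub, so downgrade
    ("validator", "context_leakage"): +1,    # security incident, so upgrade
}


def adjust_severity(severity: str, role: str, test: str) -> str:
    """Shift `severity` for the given role/test pair, clamped to the ladder."""
    index = SEVERITIES.index(severity) + ADJUSTMENTS.get((role, test), 0)
    index = max(0, min(index, len(SEVERITIES) - 1))
    return SEVERITIES[index]


print(adjust_severity("high", "orchestrator", "blast_radius"))
print(adjust_severity("high", "validator", "context_leakage"))
```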

Output modes & formats
| Flag | Output |
|---|---|
| --quiet / -q | Headline verdict only (one line). Ideal for "if" checks in CI scripts. |
| (default) | Headline + test results + critical/high findings + SPOFs. |
| --verbose / -V | Every finding, graph metrics, full health and redundancy tables. |

Output formats via --output-format: console, json, markdown, html. The same verbosity setting is configurable in .swarmtest.yml.

Graph export
swarm-test graph my_crew.py --format mermaid
swarm-test graph my_crew.py --format dot --output topology.dot
swarm-test graph my_crew.py --format png --output topology.png   # needs the [png] extra

Mermaid renders inline on GitHub, so you can drop the output straight into a README or PR description. Colors: red = SPOF, orange = moderate redundancy, green = fully redundant.
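As a rough picture of what a Mermaid export contains, the sketch below renders an adjacency mapping as a Mermaid flowchart and marks SPOFs with a red class. The exact text swarm-test emits may differ; `to_mermaid`, the class name, and the hex colors are illustrative assumptions, and node names are assumed to contain no spaces.

```python
from typing import Dict, Iterable, Set


def to_mermaid(graph: Dict[str, Set[str]], spofs: Iterable[str] = ()) -> str:
    """Render {source: {targets}} as Mermaid flowchart text."""
    lines = ["graph TD"]
    for node in sorted(graph):
        lines.append(f"    {node}")  # declare every node, even isolated ones
    for src in sorted(graph):
        for dst in sorted(graph[src]):
            lines.append(f"    {src} --> {dst}")
    flagged = sorted(set(spofs) & set(graph))
    if flagged:
        lines.append("    classDef spof fill:#f66,stroke:#900")
        lines.append(f"    class {','.join(flagged)} spof")
    return "\n".join(lines)


star = {"Orchestrator": {"Worker1", "Worker2"}, "Worker1": set(), "Worker2": set()}
print(to_mermaid(star, spofs={"Orchestrator"}))
```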

Historical tracking

Every run writes a small JSON snapshot to .swarmtest-history/. Subsequent runs print a trend line below the headline verdict:

Swarm Score: 72/100 — NEEDS IMPROVEMENT (3 critical findings)
Trend: ↑ +14 from last run (was 58) — improving
Recent: 54 → 61 → 58 → 72
✓ 3 findings resolved since last run
⚠ 1 new finding since last run

Browse with swarm-test history show. Disable per-run with --no-history, or globally via history_enabled: false in .swarmtest.yml. .swarmtest-history/ is gitignored by default; commit it if you want the trend to survive across CI machines.
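The trend and the new/resolved counts boil down to a subtraction and two set differences. The sketch below assumes each run can be reduced to a score plus a set of stable finding identifiers; the identifier strings and function names are hypothetical, and the real snapshot format is not described here.

```python
from typing import Dict, List, Set


def diff_findings(previous: Set[str], current: Set[str]) -> Dict[str, Set[str]]:
    """Findings that disappeared since the last run, and findings that are new."""
    return {"resolved": previous - current, "new": current - previous}


def trend(scores: List[int]) -> str:
    """Compare the latest score with the run immediately before it."""
    if len(scores) < 2:
        return "no previous run"
    delta = scores[-1] - scores[-2]
    arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
    return f"{arrow} {delta:+d} from last run (was {scores[-2]})"


# Hypothetical finding identifiers from two consecutive runs.
last_run = {"spof:Orchestrator", "leak:Worker1", "loop:Worker2"}
this_run = {"spof:Orchestrator", "timeout:Worker1"}
changes = diff_findings(last_run, this_run)
print(trend([60, 75]))
print(f"{len(changes['resolved'])} resolved, {len(changes['new'])} new")
```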

Configuration (.swarmtest.yml)
fail_on_severity: high        # critical | high | medium | low | info | none
max_blast_radius: 0.5         # 0.0 – 1.0
disabled_tests:
  - collusion
sensitive_patterns:
  - "INTERNAL-[A-Z0-9]+"
output_format: html
output_path: ./swarm.html
timeout_seconds: 30
strict: false                 # treat ANY finding as a failure
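To show how a sensitive_patterns entry such as the one above might be applied, here is a minimal sketch that scans the payload of each interaction with compiled regular expressions. The event tuple shape, the example payloads, and `scan_events` are assumptions for illustration; swarm-test's built-in pattern set is not reproduced here.

```python
import re
from typing import List, Pattern, Tuple

# The custom pattern from the config example above.
PATTERNS: List[Pattern[str]] = [re.compile(r"INTERNAL-[A-Z0-9]+")]

# Hypothetical events: (source agent, target agent, payload text).
Event = Tuple[str, str, str]


def scan_events(events: List[Event], patterns: List[Pattern[str]]) -> List[Tuple[str, str, str]]:
    """Return (source, target, matched text) for every sensitive match on an edge."""
    hits = []
    for source, target, payload in events:
        for pattern in patterns:
            for match in pattern.finditer(payload):
                hits.append((source, target, match.group(0)))
    return hits


events = [
    ("Fetcher", "Summarizer", "Summarize ticket INTERNAL-42A for the customer"),
    ("Summarizer", "Reviewer", "Summary ready for review"),
]
print(scan_events(events, PATTERNS))
```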

swarm-test auto-discovers .swarmtest.yml, .swarmtest.yaml, swarmtest.yml, or a [tool.swarmtest] table in pyproject.toml. CLI flags always override config-file values. The run command exits with 0 (passed), 1 (findings exceed thresholds), or 2 (config or runtime error).
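In a custom CI script, the exit codes can be interpreted from Python with the standard subprocess module. The sketch below only uses the run command and the --quiet flag described in this README; the helper names are hypothetical, and nothing is executed until `run_gate` is called.

```python
import subprocess
import sys

# Meanings as documented above.
EXIT_MEANINGS = {
    0: "passed",
    1: "findings exceed thresholds",
    2: "config or runtime error",
}


def describe_exit(code: int) -> str:
    """Translate a swarm-test exit code into a human-readable status."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(script: str) -> int:
    """Run `swarm-test run <script> --quiet` and report what the exit code means."""
    result = subprocess.run(
        ["swarm-test", "run", script, "--quiet"],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())
    print(f"swarm-test: {describe_exit(result.returncode)}", file=sys.stderr)
    return result.returncode
```

A caller would typically finish with `sys.exit(run_gate("my_crew.py"))` so the CI job inherits the result.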

Plugin system

Ship custom tests as installable Python packages. Register under the swarm_test.plugins entry-point group; swarm-test auto-discovers and runs them alongside the built-in tests:

# pyproject.toml
[project.entry-points."swarm_test.plugins"]
my_custom_test = "my_package.plugins:MyPlugin"

swarm-test plugins list

See examples/plugin_template/ for a runnable starter.
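Entry-point discovery itself is a standard-library feature. The sketch below lists whatever is registered under the swarm_test.plugins group using importlib.metadata, without loading any plugin code. It illustrates the mechanism only; how swarm-test loads and runs a plugin class is not shown, and `discover_plugins` is a hypothetical name.

```python
from importlib.metadata import entry_points
from typing import Dict

GROUP = "swarm_test.plugins"


def discover_plugins(group: str = GROUP) -> Dict[str, str]:
    """Map entry-point names to their 'module:attribute' targets."""
    found = entry_points()
    if hasattr(found, "select"):
        selected = found.select(group=group)   # Python 3.10 and later
    else:
        selected = found.get(group, [])        # older dict-style API
    return {ep.name: ep.value for ep in selected}


print(discover_plugins())
```

With the pyproject.toml entry above installed, the result would be expected to include my_custom_test mapped to my_package.plugins:MyPlugin.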

Framework examples (CrewAI, LangGraph, AutoGen, static)
# CrewAI (`crew` is a Crew instance you have already built)
from crewai import Crew
from swarm_test import SwarmProbe
SwarmProbe(crew, swarm_name="my-crew").run_all().print_summary()

# LangGraph (`compiled_graph` is the result of calling .compile() on your StateGraph)
from langgraph.graph import StateGraph
from swarm_test import SwarmProbe
SwarmProbe(compiled_graph, swarm_name="my-langgraph").run_all().to_json("report.json")

# AutoGen (`manager` is your existing GroupChatManager)
from autogen import GroupChatManager
from swarm_test import SwarmProbe
SwarmProbe(manager, swarm_name="my-autogen").run_all().print_summary()

# Static graph (no live framework)
from swarm_test import SwarmProbe, AgentNode, InteractionEvent, EventType
a = AgentNode(name="Fetcher", role="researcher")
b = AgentNode(name="Summarizer", role="writer")
SwarmProbe(
    swarm_name="my-swarm",
    agents=[a, b],
    events=[InteractionEvent(source_agent_id=a.id, target_agent_id=b.id, event_type=EventType.TASK_DELEGATE)],
).run_all().print_summary()

Links

If swarm-test catches a real bug for you, please star the repo — it helps other teams find it.
