
Foreman — a supervisor for AI agent crews: watches for drift, catches loops, and keeps the crew on task (relational wellness monitoring)


Foreman


Relational wellness monitoring for AI agent crews. The only monitoring system that watches how agents work together, not just how they perform individually.

Not a coder? Built an agent and just want it improved? See docs/quickstart-novice.md — double-click a launcher, drag in your agent's log, and get pasteable prompt fixes. No code.

pip install foreman-agents
foreman demo                  # 30-second tour, no setup required
wellness improve myrun.txt --goal "Build a REST API"   # no-code: analyze a run + fix prompts

[Screenshot: foreman demo]

[Screenshot: the single-file HTML dashboard (foreman dashboard)]

60-second integration

from foreman import WellnessCallbackHandler

handler = WellnessCallbackHandler(crew_id="my-team")     # passive, observe-only
app = graph.compile().with_config({"callbacks": [handler]})   # compile() takes no callbacks argument; bind them via config

# That's it. Every agent message is now monitored.
print(handler.crew_health)            # 0.85
print(handler.is_degraded)            # False
print(handler.threshold)              # 0.65 (or adaptive after warmup)
print(handler.analysis())             # full crew intelligence report

Optional but recommended — wire the outcome feedback loop so the score actually reflects task success:

handler.set_task_context(task_type="technical", task_description="Implement LRU cache class")
result = app.invoke(state)
handler.record_outcome("task_1", source="Coder", success=True)

When you're ready to evaluate, render a dashboard:

# In the same Python process as the handler, dump the live analysis to JSON:
import json
with open("today.json", "w") as f:
    json.dump(handler.analysis(), f, default=str)
# Then, from a shell, render a self-contained HTML dashboard:
foreman dashboard --analysis today.json --out dashboard.html

Two modes, clearly separated

# PASSIVE (default)  — observe-only.
# Zero mutation of agent inputs, prompts, state, or outputs. Zero LLM calls.
handler = WellnessCallbackHandler(crew_id="my-team")

# INTERVENTION  — opt-in. Adds reframe injection into worker.backstory
# when crew health drops below threshold. Fully reversible.
handler = WellnessCallbackHandler(
    crew_id="my-team",
    intervention_mode=True,
    worker_agents=[agent1, agent2, agent3])

# Switch live, or tear down cleanly:
handler.enable_intervention(worker_agents=[agent1, agent2])
handler.disable_intervention()        # restores backstories
handler.disconnect()                  # restores + flushes telemetry + final report

The observe-only guarantee applies to passive mode and is verified by A/B testing — the handler reads responses after generation and records them; it does not touch agent inputs, prompts, state, or outputs. Intervention mode is opt-in and clearly separated; the only thing it mutates is agent.backstory, and disconnect() restores originals.
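
The reversible part of intervention mode can be pictured with a small sketch. This is not Foreman's implementation: the Agent class, the BackstoryInjector name, and its methods are hypothetical stand-ins, and the only behavior taken from the text above is that backstory is the single mutated field and that the originals are restored on teardown.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for a framework agent with a backstory string."""
    name: str
    backstory: str


class BackstoryInjector:
    """Illustrative only: append a reframe to each backstory, restore it later."""

    def __init__(self, agents):
        self._agents = list(agents)
        self._originals = {}  # id(agent) -> pristine backstory

    def inject(self, reframe: str) -> None:
        for agent in self._agents:
            # Save the pristine text once, so repeated injections never stack.
            original = self._originals.setdefault(id(agent), agent.backstory)
            agent.backstory = f"{original}\n\n{reframe}"

    def restore(self) -> None:
        for agent in self._agents:
            if id(agent) in self._originals:
                agent.backstory = self._originals.pop(id(agent))


coder = Agent("Coder", "You write Python.")
injector = BackstoryInjector([coder])
injector.inject("Pause and restate the goal before retrying.")
injector.inject("Ask the reviewer for one concrete example.")
mutated = coder.backstory
injector.restore()
print(coder.backstory)  # You write Python.
```

Writing each reframe onto the saved original, rather than onto the current value, is what keeps the operation reversible no matter how many times it fires.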

Verified by independent A/B testing

Tested on real LangGraph crews with live GPT-4o-mini API calls across 256 tasks:

| Property | Result |
| --- | --- |
| Task correctness impact (passive mode) | 0% (handler never alters outputs) |
| Extra LLM calls per task (passive mode) | 0 |
| Extra latency (passive mode) | 0ms (within API variance) |
| Extra cost per task (passive mode) | $0.00 |
| Non-coding crew false positive rate | 0% (40 tasks, v7+v8) |

Simulation results

| Environment | No monitoring | + Reactive | + Proactive |
| --- | --- | --- | --- |
| Standard (40 mixed tasks) | 21% | 49% | 73% |
| Enterprise (50 CMU-calibrated) | 23% | 38% | 89% |
| Mid-Market (50 real workflows) | 25% | 52% | 79% |
| Hostile (4 adversarial scenarios) | 16% | 28% | 73% |

Why this matters: AI agents fail 76% of enterprise tasks (Carnegie Mellon, 2025). Unstructured multi-agent networks amplify errors up to 17x (Google DeepMind, 2025).

Install

pip install foreman-agents

No required dependencies. All monitoring runs without API keys or external services.

pip install "foreman-agents[langgraph]"    # LangGraph callback handler (quotes keep zsh from globbing the brackets)
pip install "foreman-agents[otel]"         # OpenTelemetry export
pip install "foreman-agents[all]"          # Everything

If you import WellnessCallbackHandler without langchain-core available, the handler raises LangChainNotInstalledError with the install command. Pass strict=False to use it for direct-API observation outside LangChain.

CLI

foreman demo                                   # 30-second synthetic crew tour
foreman scan -m "won't work" -s Engineer       # one-message signal scan
foreman calibrate --crew my-team --logs interactions.jsonl
foreman health    --crew my-team --bundle fingerprint.json
foreman report    --analysis today.json --out weekly.md
foreman dashboard --analysis today.json --out dashboard.html
foreman info

foreman demo runs a synthetic crew through a degradation-and-recovery scenario in your terminal — useful for verifying install and seeing the value model before wiring anything up.

Scoring profiles

WellnessCallbackHandler(profile="auto")          # default — picks from agent role names
WellnessCallbackHandler(profile="engineering")   # suppresses code-review conflict
WellnessCallbackHandler(profile="support")       # penalizes thin completions / deflection
WellnessCallbackHandler(profile="research")      # tolerates hedging, penalizes groupthink

"auto" inspects the first observed agent role and picks the matching profile. Falls back to engineering if no keywords match.

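The "auto" behavior described above amounts to keyword matching on a role name with a fallback. A minimal sketch, assuming a keyword table: the keywords below and the pick_profile helper are hypothetical, since the actual keywords Foreman matches are not listed here.

```python
# Hypothetical keyword table; the real keywords are not documented in this README.
PROFILE_KEYWORDS = {
    "engineering": ("engineer", "coder", "developer", "reviewer"),
    "support": ("support", "helpdesk", "customer"),
    "research": ("research", "analyst", "scientist"),
}


def pick_profile(role: str, default: str = "engineering") -> str:
    """Return the first profile whose keyword appears in the role name."""
    lowered = role.lower()
    for profile, keywords in PROFILE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return profile
    return default  # no keyword matched


print(pick_profile("Senior Support Agent"))  # support
print(pick_profile("Research Analyst"))      # research
print(pick_profile("Designer"))              # engineering (fallback)
```
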
Privacy

WellnessCallbackHandler(crew_id="my-team", redact=True)

With redact=True, message text is never persisted — only signals plus stable hash prefixes. Use for crews running on customer data or in regulated industries.
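
To make "signals plus stable hash prefixes" concrete, here is a sketch of what a redacted record could look like. The record shape, the redact_record helper, the choice of SHA-256, and the 12-character prefix are all assumptions for illustration; the README does not specify them.

```python
import hashlib


def redact_record(text: str, signals: list[str], prefix_len: int = 12) -> dict:
    """Keep detected signals plus a short, stable hash of the text; drop the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"text_hash": digest[:prefix_len], "signals": list(signals)}


first = redact_record("This won't work, again.", ["code_retry_loop"])
second = redact_record("This won't work, again.", ["code_retry_loop"])
print(first == second)  # True: the same text always maps to the same prefix
```

A stable prefix lets repeated messages be correlated without storing them. For short or guessable messages an unsalted hash can be brute-forced, so a keyed hash (hmac with a secret) would be the natural hardening; whether Foreman does this is not stated.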

Framework connectors — one connect() for any platform

As of v0.4.0 every framework is reachable through a single entry point. connect() returns the right observer or callback handler, already wired to a monitor:

from foreman import connect, available_platforms

# LangGraph — returns a LangChain callback handler
handler = connect("langgraph", crew_id="my-team")
graph.invoke(state, config={"callbacks": [handler]})
print(handler.crew_health)

# Any other framework — returns an observer exposing `.monitor`
obs = connect("crewai", crew_id="my-team")
crew = Crew(..., step_callback=obs.on_step, task_callback=obs.on_task)
print(obs.monitor.crew_health)

# Discover everything available, with verification status
for p in available_platforms():
    print(p["name"], p["status"], "—", p["integration"])

Supported platforms

| Platform | connect(...) | Verification | How it hooks in |
| --- | --- | --- | --- |
| LangGraph | "langgraph" | verified-live | pass as a callback in config={"callbacks": [handler]} |
| CrewAI | "crewai" | built-to-spec | step_callback=obs.on_step, task_callback=obs.on_task |
| AutoGen (v0.2 + v0.4) | "autogen" | built-to-spec | obs.on_message(msg) per message; obs.on_task_done(...) |
| Google A2A | "a2a" | built-to-spec | obs.on_task_event(event) / obs.on_message(msg) |
| OpenAI Agents SDK | "openai-agents" | built-to-spec | pass as hooks=obs to Runner.run(...) |
| Google ADK | "google-adk" | built-to-spec | set after_model_callback / after_agent_callback on the agent |
| LlamaIndex | "llamaindex" | built-to-spec | dispatcher.add_event_handler(obs) |
| Semantic Kernel | "semantic-kernel" | built-to-spec | kernel.add_filter("function_invocation", obs.function_invocation_filter) |
| Pydantic AI | "pydantic-ai" | built-to-spec | obs.on_run_result(result, task_id=..., success=...) |

On "verification" — read this honestly. verified-live means the connector has been exercised against a real run of that framework (LangGraph, with live GPT-4o-mini calls across 256 tasks). built-to-spec means it is implemented against the framework's published callback interface and mock-tested, but not yet run against a live install — the target SDKs aren't all pip-installable in our CI. All connectors are defensive: a malformed event can never crash your host pipeline. If you run one live, a bug report (or a thumbs-up) is very welcome.
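
The "defensive" property can be illustrated with a generic wrapper that swallows and logs any exception raised inside an observer callback. This is a sketch of the general technique, not the connectors' actual code; never_raise and on_message are hypothetical names.

```python
import functools
import logging

logger = logging.getLogger("observer")


def never_raise(callback):
    """Wrap an observer callback so any exception is logged, not propagated."""
    @functools.wraps(callback)
    def wrapper(*args, **kwargs):
        try:
            return callback(*args, **kwargs)
        except Exception:  # deliberately broad: the host pipeline must keep running
            logger.exception("observer callback %s failed", callback.__name__)
            return None
    return wrapper


@never_raise
def on_message(event: dict) -> str:
    # A malformed event (missing keys) raises KeyError inside the observer.
    return f'{event["source"]}: {event["content"]}'


print(on_message({"source": "Coder", "content": "done"}))  # Coder: done
print(on_message({"unexpected": True}))                     # None (error logged)
```

The trade-off is that monitoring bugs become silent unless the log is watched, which is one reason live bug reports matter for the built-to-spec connectors.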

See examples/ for runnable scripts.

Self-improving manager mode (opt-in)

By default this is a passive monitor. Wrap a crew in SelfImprovingManager and it becomes an active, goal-driven manager that runs without a human prompting it — observe → evaluate → intervene → re-observe → learn:

from foreman import SelfImprovingManager

mgr = SelfImprovingManager(crew_id="my-team",
                           goal="Build a REST API in Python with FastAPI",
                           max_messages=200)         # optional cost circuit-breaker

a = mgr.observe(message, source="Coder")
if a.action == "halt":      # crew looping / over budget / off the rails
    stop()
mgr.record_outcome("task-1", success=True)           # feeds autonomous learning

It adds what a regex-on-text monitor structurally can't: a goal anchor + drift detection, loop/resource accounting (non-termination + token/message budgets), autonomous learning (the scoring profile retrains from your outcomes), and an opt-in LLM-judge tier for silent semantic failures (off by default — pass judge_fn=... to enable; preserves the zero-LLM default).
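
As a rough picture of the loop and budget accounting, the sketch below halts on either an exhausted message budget or a message repeated too many times. It is a simplification under stated assumptions: BudgetGuard, max_repeats, and the exact-match repeat rule are hypothetical, and the real manager's detection logic is described in docs/self-improving-manager.md rather than here.

```python
from collections import Counter


class BudgetGuard:
    """Illustrative halt logic: a message budget plus a repeated-message check."""

    def __init__(self, max_messages: int = 200, max_repeats: int = 3):
        self.max_messages = max_messages
        self.max_repeats = max_repeats
        self.count = 0
        self.seen = Counter()

    def observe(self, message: str) -> str:
        self.count += 1
        # Normalize case and whitespace so trivial variations still count as repeats.
        key = " ".join(message.lower().split())
        self.seen[key] += 1
        if self.count > self.max_messages:
            return "halt"   # over budget
        if self.seen[key] >= self.max_repeats:
            return "halt"   # same message repeated: likely a loop
        return "continue"


guard = BudgetGuard(max_messages=5, max_repeats=3)
actions = [guard.observe(m) for m in ["Retrying build", "retrying  build", "Retrying build"]]
print(actions)  # ['continue', 'continue', 'halt']
```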

We built this after stress-testing the passive monitor against the MAST failure taxonomy (1,600+ real traces): on raw text the passive monitor catches ~11–17% of real failures (most multi-agent failures are silent/semantic, not lexical); the manager raises that to ~89% (16/18) with no LLM and 18/18 with the judge tier (validated with a simulated judge), at zero control false positives. The two modes that need the judge — reasoning–action mismatch and incorrect verification — are genuinely semantic; everything else is caught deterministically. Full writeup: docs/self-improving-manager.md.

Two further opt-ins (both off by default) let it improve itself and your agents:

mgr = SelfImprovingManager(
    crew_id="team", goal="...", worker_agents=[coder, reviewer],
    auto_select_interventions=True,   # learn which intervention phrasing works best
    allow_agent_rewrite=True,         # propose durable prompt improvements (gated)
)

a = mgr.observe(message, source="Coder")
if a.needs_review:                    # after steering, it summarizes and asks
    print(a.review_summary)           # "...Apply these improvements? [yes/no]"
    mgr.apply_improvements(approve=user_said_yes)   # ONLY this writes; reversible
    mgr.export_learned("guidance.json")             # carry improvements across runs

allow_agent_rewrite never modifies your agents without explicit approval, writes onto their pristine prompts, and is fully reversible via revert_improvements().

Direct API (no LangChain)

from foreman import WellnessMonitor

monitor = WellnessMonitor(crew_id="my-team", profile="auto")
monitor.observe("message", source="Engineer", target="Designer")
monitor.record_outcome("task_1", "Engineer", success=True)
print(monitor.crew_health)
print(monitor.analysis())

Useful for pipelines that don't run on LangGraph, or for testing.

Webhooks + OpenTelemetry

handler = WellnessCallbackHandler(
    crew_id="ops", enable_otel=True, enable_webhooks=True)

Pipes crew health into your existing observability stack via standard OTel spans (works with LangSmith, Arize, Datadog).

What it monitors

35 signal types across agent messages — 7 hard signals, 7 human soft signals, 21 agent soft signals including 6 code-specific signals. Zero LLM calls.

Code-specific signals: code_retry_loop, code_complexity_creep, code_hallucination, code_abandonment, code_cargo_cult, code_test_avoidance. Detect when coding agents are stuck, over-engineering, or skipping quality checks.

Output quality validation: output_format_mismatch, output_json_when_code_expected, output_task_incoherence, output_too_short, output_repeated_failure_pattern. Catches when output doesn't match what the task asked for. 10/10 detection rate on a known-failure-output test set.

6 relational dynamics between agents — communication patterns, conflict resolution quality, context coherence, influence concentration, delegation effectiveness, creative synergy.

Adaptive thresholds — learns the crew's normal score range during a warmup period (default 8 tasks), then continuously re-calibrates on a 32-message rolling cadence so the threshold tracks how the crew actually behaves over time. Bounded 200-score rolling history.
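
A sketch of how such a threshold might be maintained follows. The warmup length, recalibration cadence, history bound, and 0.65 fallback come from this README; the calibration rule (mean minus k standard deviations), the value of k, and the AdaptiveThreshold class are assumptions. For simplicity it counts one score per update rather than distinguishing tasks from messages.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    def __init__(self, default=0.65, warmup=8, recalibrate_every=32, history=200, k=1.5):
        self.warmup = warmup
        self.recalibrate_every = recalibrate_every
        self.scores = deque(maxlen=history)  # bounded rolling history
        self.k = k
        self.value = default                 # fixed fallback until warmup completes
        self._seen = 0

    def update(self, score: float) -> float:
        self.scores.append(score)
        self._seen += 1
        due = self._seen == self.warmup or (
            self._seen > self.warmup and self._seen % self.recalibrate_every == 0)
        if due:
            # Assumed rule: flag scores more than k standard deviations below the mean.
            self.value = mean(self.scores) - self.k * pstdev(self.scores)
        return self.value


threshold = AdaptiveThreshold()
for score in [0.9, 0.8, 0.9, 0.8, 0.9, 0.8, 0.9]:
    threshold.update(score)
during_warmup = threshold.value           # still the 0.65 fallback
after_warmup = threshold.update(0.8)      # eighth score triggers calibration
print(during_warmup, round(after_warmup, 3))  # 0.65 0.775
```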

Model family support — adjusts hedging detection for GPT, Claude, and Gemini models.

What it does about it

Reactive (intervention mode) — detects degradation and injects targeted reframes into worker backstory strings. Breaks failure spirals. Reversible via disconnect() / disable_intervention().

Proactive (intervention mode) — prevents degradation before it starts. Task routing, readiness checks, smart decomposition with verification loops for technical tasks, failure inoculation.

Consecutive failure tracking — direct score penalty from task failures (via record_outcome), independent of signal detection. Catches degradation even when agents communicate cleanly but produce bad output.
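
The idea of a penalty that is independent of signal detection can be sketched as a streak counter applied on top of the signal-based score. The penalty size, the cap, and the FailureStreak class are hypothetical; only the concept of a direct penalty driven by recorded outcomes comes from the text above.

```python
class FailureStreak:
    """Illustrative: subtract a capped penalty per consecutive failed task."""

    def __init__(self, penalty_per_failure: float = 0.1, max_penalty: float = 0.4):
        self.penalty_per_failure = penalty_per_failure
        self.max_penalty = max_penalty
        self.streak = 0

    def record_outcome(self, success: bool) -> None:
        # Any success resets the streak; each failure extends it.
        self.streak = 0 if success else self.streak + 1

    def apply(self, signal_score: float) -> float:
        penalty = min(self.streak * self.penalty_per_failure, self.max_penalty)
        return max(0.0, signal_score - penalty)


tracker = FailureStreak()
for _ in range(3):
    tracker.record_outcome(success=False)
# Messages look clean (signal score 1.0), but three failed tasks pull the score down.
penalized = tracker.apply(1.0)
tracker.record_outcome(success=True)
recovered = tracker.apply(1.0)
print(round(penalized, 2), recovered)  # 0.7 1.0
```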

Architecture

8 layers, built on a research-backed model of collaborative team dynamics.

| Layer | What it does |
| --- | --- |
| Core Framework | LangGraph StateGraph, signal scanning, memory, scoring |
| Intelligence | Learned classifiers, crew fingerprints, trajectory prediction |
| Agent Identity | Capability confidence, self-models, teammate awareness |
| Interaction Intelligence | 6 modules tracking relational dynamics between agents |
| Proactive Wellness | Pre-task routing, readiness, decomposition, inoculation |
| Integration | OTel spans + metrics, webhook events, fingerprint API |
| Governance | Outcome validation, self-reflection, skill routing, meta-monitor |
| Socratic Observation | Convergence detection, metacognitive process evaluation |

Key design decisions

Crew topology awareness. Odd-node pipelines (draft → review → finalize) are collaborative — the finalizer resolves disagreements. Even-node pipelines (solve → review) are adversarial — the reviewer's job is to critique. The engineering scoring profile suppresses conflict/coherence signals for adversarial pipelines because disagreement is the healthy pattern.

Verification loops for technical tasks. Non-technical tasks use majority-wins decomposition (2/3 subtasks succeed = task succeeds). Technical tasks use staged gates with retry (build → verify → retry if failed). This solves the compound probability problem where 3 parallel subtasks at 65% each = 27% joint success, while verification gives 1−(0.35)² = 88%.
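
The arithmetic above can be checked directly. The sketch assumes independent subtasks and a single retry, which is what the 1−(0.35)² figure implies; the majority-wins number is added for comparison and is not quoted in the text.

```python
from math import comb

p = 0.65  # per-subtask success probability used in the text

# All three parallel subtasks must succeed.
joint = p ** 3

# Majority-wins: at least 2 of 3 subtasks succeed.
majority = sum(comb(3, k) * p**k * (1 - p) ** (3 - k) for k in (2, 3))

# Build, verify, retry once: the stage fails only if both attempts fail.
with_retry = 1 - (1 - p) ** 2

print(f"{joint:.0%} {majority:.0%} {with_retry:.0%}")  # 27% 72% 88%
```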

Starter bundles

New crews don't have data yet, so the adaptive threshold falls back to 0.65. Skip the 8-task warmup with a pre-calibrated baseline matching your crew topology:

handler = WellnessCallbackHandler(
    crew_id="my-team",
    starter_bundle="eng_3agent",   # draft → review → finalize
)

Built-in bundles: eng_2agent (solve→review), eng_3agent (draft→review→finalize), support_2agent, research_2agent, general_2agent. Once your crew accumulates real data the rolling re-calibration takes over and the starter values stop mattering.

Privacy + telemetry

No telemetry. Nothing phones home. The package never makes outbound network calls except through the integrations you explicitly enable (OpenTelemetry export to your collector, webhooks to your endpoint). Verifiable by inspection — there is no requests.post, no analytics, no sign-in.

WellnessCallbackHandler(crew_id="my-team", redact=True)

With redact=True, message text is never persisted; only signals plus stable hash prefixes are stored. Use for crews running on customer data or in regulated industries.

Public benchmark

A reproducible benchmark suite ships with the package. 30 tasks with ground-truth labels across 7 failure modes (retry loop, hostile, silent drift, hedging spiral, format break, ownership loss, context loss). Deterministic — no LLM calls.

wellness benchmark --profile engineering --verbose

v0.4.0 results on the default suite:

| Metric | Value |
| --- | --- |
| Precision | 1.00 |
| Recall | 1.00 |
| F1 | 1.00 |
| Accuracy | 1.00 |
| False positives on healthy code-review tasks | 0 / 12 |

v0.3.1 missed all 3 hostile-input tasks: the hard hostility signals (direct_insult, all_caps_rage, threatening_language, and three more) were escalated by the live handler but carried no weight in the health score, so a hostile session scored 0.66 — just above the 0.65 flag threshold. v0.3.2 assigns those signals real weights; hostile sessions now score ~0.27 and are flagged decisively, with no new false positives (healthy tasks still score 1.00). One thin margin remains and is tracked honestly: context_loss lands at 0.64 vs the 0.65 threshold, because its primary signal (coherence_needs_realignment) is intentionally suppressed in the engineering profile to avoid code-review false positives.
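
For readers who want to see how the four metrics relate, the helper below computes them from confusion-matrix counts. The counts are hypothetical: the README gives 30 tasks and 12 healthy code-review tasks, but the 18/12 split between failing and healthy tasks used here is an assumption, and classification_metrics is not part of the package.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for a 30-task suite: 18 failing tasks, 12 healthy ones.
perfect = classification_metrics(tp=18, fp=0, fn=0, tn=12)
# Same suite with three failing tasks missed (false negatives), no false positives.
three_missed = classification_metrics(tp=15, fp=0, fn=3, tn=12)
print(perfect)
print(three_missed)
```

Missed failures lower recall and accuracy but leave precision at 1.00, which is why a suite can report zero false positives and still have gaps.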

Early access

We're looking for 10 teams running multi-agent crews in production to join early access.

  • What you get: direct setup support, free fingerprint calibration, access to starter bundle library, priority feature requests.
  • What we learn: how relational monitoring performs on real production crews, which signal types matter most for your domain.

Open an issue titled "Early Access" with your crew type and rough agent count.

Contributing

Found a bug? Open an issue. Want a feature? Open an issue describing the use case. PRs welcome for the core framework.

License

MIT. See LICENSE.
