Foreman — a supervisor for AI agent crews: watches for drift, catches loops, and keeps the crew on task (relational wellness monitoring)

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
Programming Language
Topic
- Software Development :: Libraries :: Python Modules

Project description

Foreman

Relational wellness monitoring for AI agent crews. The only monitoring system that watches how agents work together, not just how they perform individually.

Not a coder? Built an agent and just want it improved? See docs/quickstart-novice.md — double-click a launcher, drag in your agent's log, and get pasteable prompt fixes. No code.

pip install foreman-agents
foreman demo                  # 30-second tour, no setup required
wellness improve myrun.txt --goal "Build a REST API"   # no-code: analyze a run + fix prompts

foreman demo

The single-file HTML dashboard (foreman dashboard):

foreman dashboard

60-second integration

from foreman import WellnessCallbackHandler

handler = WellnessCallbackHandler(crew_id="my-team")     # passive, observe-only
app = graph.compile(callbacks=[handler])

# That's it. Every agent message is now monitored.
print(handler.crew_health)            # 0.85
print(handler.is_degraded)            # False
print(handler.threshold)              # 0.65 (or adaptive after warmup)
print(handler.analysis())             # full crew intelligence report

Optional but recommended — wire the outcome feedback loop so the score actually reflects task success:

handler.set_task_context(task_type="technical", task_description="Implement LRU cache class")
result = app.invoke(state)
handler.record_outcome("task_1", source="Coder", success=True)

When you're ready to evaluate, render a dashboard:

# Dump live analysis to JSON, then render a self-contained HTML dashboard
python -c "import json; print(json.dumps(handler.analysis(), default=str))" > today.json
foreman dashboard --analysis today.json --out dashboard.html

Two modes, clearly separated

# PASSIVE (default)  — observe-only.
# Zero mutation of agent inputs, prompts, state, or outputs. Zero LLM calls.
handler = WellnessCallbackHandler(crew_id="my-team")

# INTERVENTION  — opt-in. Adds reframe injection into worker.backstory
# when crew health drops below threshold. Fully reversible.
handler = WellnessCallbackHandler(
    crew_id="my-team",
    intervention_mode=True,
    worker_agents=[agent1, agent2, agent3])

# Switch live, or tear down cleanly:
handler.enable_intervention(worker_agents=[agent1, agent2])
handler.disable_intervention()        # restores backstories
handler.disconnect()                  # restores + flushes telemetry + final report

The observe-only guarantee applies to passive mode and is verified by A/B testing — the handler reads responses after generation and records them; it does not touch agent inputs, prompts, state, or outputs. Intervention mode is opt-in and clearly separated; the only thing it mutates is agent.backstory, and disconnect() restores originals.

Verified by independent A/B testing

Tested on real LangGraph crews with live GPT-4o-mini API calls across 256 tasks:

Property	Result
Task correctness impact (passive mode)	0% — handler never alters outputs
Extra LLM calls per task (passive mode)	0
Extra latency (passive mode)	0ms (within API variance)
Extra cost per task (passive mode)	$0.00
Non-coding crew false positive rate	0% (40 tasks, v7+v8)

Simulation results

Environment	No monitoring	+ Reactive	+ Proactive
Standard (40 mixed tasks)	21%	49%	73%
Enterprise (50 CMU-calibrated)	23%	38%	89%
Mid-Market (50 real workflows)	25%	52%	79%
Hostile (4 adversarial scenarios)	16%	28%	73%

Why this matters: AI agents fail 76% of enterprise tasks (Carnegie Mellon, 2025). Unstructured multi-agent networks amplify errors up to 17x (Google DeepMind, 2025).

Install

pip install foreman-agents

No required dependencies. All monitoring runs without API keys or external services.

pip install foreman-agents[langgraph]    # LangGraph callback handler
pip install foreman-agents[otel]         # OpenTelemetry export
pip install foreman-agents[all]          # Everything

If you import WellnessCallbackHandler without langchain-core available, the handler raises LangChainNotInstalledError with the install command. Pass strict=False to use it for direct-API observation outside LangChain.

CLI

foreman demo                                   # 30-second synthetic crew tour
foreman scan -m "won't work" -s Engineer       # one-message signal scan
foreman calibrate --crew my-team --logs interactions.jsonl
foreman health    --crew my-team --bundle fingerprint.json
foreman report    --analysis today.json --out weekly.md
foreman dashboard --analysis today.json --out dashboard.html
foreman info

foreman demo runs a synthetic crew through a degradation-and-recovery scenario in your terminal — useful for verifying install and seeing the value model before wiring anything up.

Scoring profiles

WellnessCallbackHandler(profile="auto")          # default — picks from agent role names
WellnessCallbackHandler(profile="engineering")   # suppresses code-review conflict
WellnessCallbackHandler(profile="support")       # penalizes thin completions / deflection
WellnessCallbackHandler(profile="research")      # tolerates hedging, penalizes groupthink

"auto" inspects the first observed agent role and picks the matching profile. Falls back to engineering if no keywords match.

Privacy

WellnessCallbackHandler(crew_id="my-team", redact=True)

With redact=True, message text is never persisted — only signals plus stable hash prefixes. Use for crews running on customer data or in regulated industries.

Framework connectors — one `connect()` for any platform

As of v0.4.0 every framework is reachable through a single entry point. connect() returns the right observer or callback handler, already wired to a monitor:

from foreman import connect, available_platforms

# LangGraph — returns a LangChain callback handler
handler = connect("langgraph", crew_id="my-team")
graph.invoke(state, config={"callbacks": [handler]})
print(handler.crew_health)

# Any other framework — returns an observer exposing `.monitor`
obs = connect("crewai", crew_id="my-team")
crew = Crew(..., step_callback=obs.on_step, task_callback=obs.on_task)
print(obs.monitor.crew_health)

# Discover everything available, with verification status
for p in available_platforms():
    print(p["name"], p["status"], "—", p["integration"])

Supported platforms

Platform	`connect(...)`	Verification	How it hooks in
LangGraph	`"langgraph"`	verified-live	pass as a callback in `config={"callbacks": [handler]}`
CrewAI	`"crewai"`	built-to-spec	`step_callback=obs.on_step`, `task_callback=obs.on_task`
AutoGen (v0.2 + v0.4)	`"autogen"`	built-to-spec	`obs.on_message(msg)` per message; `obs.on_task_done(...)`
Google A2A	`"a2a"`	built-to-spec	`obs.on_task_event(event)` / `obs.on_message(msg)`
OpenAI Agents SDK	`"openai-agents"`	built-to-spec	pass as `hooks=obs` to `Runner.run(...)`
Google ADK	`"google-adk"`	built-to-spec	set `after_model_callback` / `after_agent_callback` on the agent
LlamaIndex	`"llamaindex"`	built-to-spec	`dispatcher.add_event_handler(obs)`
Semantic Kernel	`"semantic-kernel"`	built-to-spec	`kernel.add_filter("function_invocation", obs.function_invocation_filter)`
Pydantic AI	`"pydantic-ai"`	built-to-spec	`obs.on_run_result(result, task_id=..., success=...)`

On "verification" — read this honestly. verified-live means the connector has been exercised against a real run of that framework (LangGraph, with live GPT-4o-mini calls across 256 tasks). built-to-spec means it is implemented against the framework's published callback interface and mock-tested, but not yet run against a live install — the target SDKs aren't all pip-installable in our CI. All connectors are defensive: a malformed event can never crash your host pipeline. If you run one live, a bug report (or a thumbs-up) is very welcome.

See examples/ for runnable scripts.

Self-improving manager mode (opt-in)

By default this is a passive monitor. Wrap a crew in SelfImprovingManager and it becomes an active, goal-driven manager that runs without a human prompting it — observe → evaluate → intervene → re-observe → learn:

from foreman import SelfImprovingManager

mgr = SelfImprovingManager(crew_id="my-team",
                           goal="Build a REST API in Python with FastAPI",
                           max_messages=200)         # optional cost circuit-breaker

a = mgr.observe(message, source="Coder")
if a.action == "halt":      # crew looping / over budget / off the rails
    stop()
mgr.record_outcome("task-1", success=True)           # feeds autonomous learning

It adds what a regex-on-text monitor structurally can't: a goal anchor + drift detection, loop/resource accounting (non-termination + token/message budgets), autonomous learning (the scoring profile retrains from your outcomes), and an opt-in LLM-judge tier for silent semantic failures (off by default — pass judge_fn=... to enable; preserves the zero-LLM default).

We built this after stress-testing the passive monitor against the MAST failure taxonomy (1,600+ real traces): on raw text the passive monitor catches ~11–17% of real failures (most multi-agent failures are silent/semantic, not lexical); the manager raises that to ~89% (16/18) with no LLM and 18/18 with the judge tier (validated with a simulated judge), at zero control false positives. The two modes that need the judge — reasoning–action mismatch and incorrect verification — are genuinely semantic; everything else is caught deterministically. Full writeup: docs/self-improving-manager.md.

Two further opt-ins (both off by default) let it improve itself and your agents:

mgr = SelfImprovingManager(
    crew_id="team", goal="...", worker_agents=[coder, reviewer],
    auto_select_interventions=True,   # learn which intervention phrasing works best
    allow_agent_rewrite=True,         # propose durable prompt improvements (gated)
)

a = mgr.observe(message, source="Coder")
if a.needs_review:                    # after steering, it summarizes and asks
    print(a.review_summary)           # "...Apply these improvements? [yes/no]"
    mgr.apply_improvements(approve=user_said_yes)   # ONLY this writes; reversible
    mgr.export_learned("guidance.json")             # carry improvements across runs

allow_agent_rewrite never modifies your agents without explicit approval, writes onto their pristine prompts, and is fully reversible via revert_improvements().

Direct API (no LangChain)

from foreman import WellnessMonitor

monitor = WellnessMonitor(crew_id="my-team", profile="auto")
monitor.observe("message", source="Engineer", target="Designer")
monitor.record_outcome("task_1", "Engineer", success=True)
print(monitor.crew_health)
print(monitor.analysis())

Useful for pipelines that don't run on LangGraph, or for testing.

Webhooks + OpenTelemetry

handler = WellnessCallbackHandler(
    crew_id="ops", enable_otel=True, enable_webhooks=True)

Pipes crew health into your existing observability stack via standard OTel spans (works with LangSmith, Arize, Datadog).

What it monitors

35 signal types across agent messages — 7 hard signals, 7 human soft signals, 21 agent soft signals including 6 code-specific signals. Zero LLM calls.

Code-specific signals — code_retry_loop, code_complexity_creep, code_hallucination, code_abandonment, code_cargo_cult, code_test_avoidance. Detect when coding agents are stuck, over-engineering, or skipping quality checks.

Output quality validation — output_format_mismatch, output_json_when_code_expected, output_task_incoherence, output_too_short, output_repeated_failure_pattern. Catches when output doesn't match what the task asked for. 10/10 detection rate on a known-failure-output test set.

6 relational dynamics between agents — communication patterns, conflict resolution quality, context coherence, influence concentration, delegation effectiveness, creative synergy.

Adaptive thresholds — learns the crew's normal score range during a warmup period (default 8 tasks), then continuously re-calibrates on a 32-message rolling cadence so the threshold tracks how the crew actually behaves over time. Bounded 200-score rolling history.

Model family support — adjusts hedging detection for GPT, Claude, and Gemini models.

What it does about it

Reactive (intervention mode) — detects degradation and injects targeted reframes into worker backstory strings. Breaks failure spirals. Reversible via disconnect() / disable_intervention().

Proactive (intervention mode) — prevents degradation before it starts. Task routing, readiness checks, smart decomposition with verification loops for technical tasks, failure inoculation.

Consecutive failure tracking — direct score penalty from task failures (via record_outcome), independent of signal detection. Catches degradation even when agents communicate cleanly but produce bad output.

Architecture

8 layers, built on a research-backed model of collaborative team dynamics.

Layer	What it does
Core Framework	LangGraph StateGraph, signal scanning, memory, scoring
Intelligence	Learned classifiers, crew fingerprints, trajectory prediction
Agent Identity	Capability confidence, self-models, teammate awareness
Interaction Intelligence	6 modules tracking relational dynamics between agents
Proactive Wellness	Pre-task routing, readiness, decomposition, inoculation
Integration	OTel spans + metrics, webhook events, fingerprint API
Governance	Outcome validation, self-reflection, skill routing, meta-monitor
Socratic Observation	Convergence detection, metacognitive process evaluation

Key design decisions

Crew topology awareness. Odd-node pipelines (draft → review → finalize) are collaborative — the finalizer resolves disagreements. Even-node pipelines (solve → review) are adversarial — the reviewer's job is to critique. The engineering scoring profile suppresses conflict/coherence signals for adversarial pipelines because disagreement is the healthy pattern.

Verification loops for technical tasks. Non-technical tasks use majority-wins decomposition (2/3 subtasks succeed = task succeeds). Technical tasks use staged gates with retry (build → verify → retry if failed). This solves the compound probability problem where 3 parallel subtasks at 65% each = 27% joint success, while verification gives 1−(0.35)² = 88%.

Starter bundles

New crews don't have data yet, so the adaptive threshold falls back to 0.65. Skip the 8-task warmup with a pre-calibrated baseline matching your crew topology:

handler = WellnessCallbackHandler(
    crew_id="my-team",
    starter_bundle="eng_3agent",   # draft → review → finalize
)

Built-in bundles: eng_2agent (solve→review), eng_3agent (draft→review→finalize), support_2agent, research_2agent, general_2agent. Once your crew accumulates real data the rolling re-calibration takes over and the starter values stop mattering.

Privacy + telemetry

No telemetry. Nothing phones home. The package never makes outbound network calls except through the integrations you explicitly enable (OpenTelemetry export to your collector, webhooks to your endpoint). Verifiable by inspection — there is no requests.post, no analytics, no sign-in.

WellnessCallbackHandler(crew_id="my-team", redact=True)

With redact=True, message text is never persisted — only signal hashes. Use for crews running on customer data or in regulated industries.

Public benchmark

A reproducible benchmark suite ships with the package. 30 tasks with ground-truth labels across 7 failure modes (retry loop, hostile, silent drift, hedging spiral, format break, ownership loss, context loss). Deterministic — no LLM calls.

wellness benchmark --profile engineering --verbose

v0.4.0 results on the default suite:

Metric	Value
Precision	1.00
Recall	1.00
F1	1.00
Accuracy	1.00
False positives on healthy code-review tasks	0 / 12

v0.3.1 missed all 3 hostile-input tasks: the hard hostility signals (direct_insult, all_caps_rage, threatening_language, and three more) were escalated by the live handler but carried no weight in the health score, so a hostile session scored 0.66 — just above the 0.65 flag threshold. v0.3.2 assigns those signals real weights; hostile sessions now score ~0.27 and are flagged decisively, with no new false positives (healthy tasks still score 1.00). One thin margin remains and is tracked honestly: context_loss lands at 0.64 vs the 0.65 threshold, because its primary signal (coherence_needs_realignment) is intentionally suppressed in the engineering profile to avoid code-review false positives.

Early access

We're looking for 10 teams running multi-agent crews in production to join early access.

What you get: direct setup support, free fingerprint calibration, access to starter bundle library, priority feature requests.
What we learn: how relational monitoring performs on real production crews, which signal types matter most for your domain.

Open an issue titled "Early Access" with your crew type and rough agent count.

Contributing

Found a bug? Open an issue. Want a feature? Open an issue describing the use case. PRs welcome for the core framework.

License

MIT. See LICENSE.

Project details

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
Programming Language
Topic
- Software Development :: Libraries :: Python Modules

Release history Release notifications | RSS feed

This version

0.9.0

Jun 28, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

foreman_agents-0.9.0.tar.gz (189.2 kB view details)

Uploaded Jun 28, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

foreman_agents-0.9.0-py3-none-any.whl (181.9 kB view details)

Uploaded Jun 28, 2026 Python 3

File details

Details for the file foreman_agents-0.9.0.tar.gz.

File metadata

Download URL: foreman_agents-0.9.0.tar.gz
Upload date: Jun 28, 2026
Size: 189.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for foreman_agents-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`76881302ccf9cee56ba7caceecdc1dba195e1e23f26f3ae6eb2f4b2c3e4edeac`
MD5	`3c522abf9673322f588be277796d5498`
BLAKE2b-256	`af7ae1bb3b47473606cdd23441613ba9bc3cdc2b5fa6fe4aae7b2c455b3aaf17`

See more details on using hashes here.

File details

Details for the file foreman_agents-0.9.0-py3-none-any.whl.

File metadata

Download URL: foreman_agents-0.9.0-py3-none-any.whl
Upload date: Jun 28, 2026
Size: 181.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for foreman_agents-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`56526bea80f54783ee9cbd8700c61d65552e46bf1cd7c093a14ac3663d8eee8f`
MD5	`b5b80515c7c695cbd438541895707cd3`
BLAKE2b-256	`4e667e8353d272775fa9ea97aa5b2c2337cc4fee4dea71587540bbdf510df6ba`

See more details on using hashes here.

foreman-agents 0.9.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Foreman

60-second integration

Two modes, clearly separated

Verified by independent A/B testing

Simulation results

Install

CLI

Scoring profiles

Privacy

Framework connectors — one connect() for any platform

Supported platforms

Self-improving manager mode (opt-in)

Direct API (no LangChain)

Webhooks + OpenTelemetry

What it monitors

What it does about it

Architecture

Key design decisions

Starter bundles

Privacy + telemetry

Public benchmark

Early access

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Framework connectors — one `connect()` for any platform