MCP server for safe AI agent runtime upgrades — user-driven regression detection (catalog + pre/post snapshot diff + rollback) AND provider-side regression detection (passive fingerprint of latency/response-shape/model-version drift across hosted Anthropic/OpenAI/etc. calls). Read-only advisor: never executes the upgrade itself; operator retains full agency.

openclaw-upgrade-orchestrator-mcp

MCP server for safe AI agent runtime upgrades — version-aware regression catalog, pre/post snapshot diffing, step-by-step upgrade + rollback guides. Captures deployment state before upgrade, re-runs detection checks after, surfaces new_failures (caused by the upgrade) separately from unchanged_failures (pre-existing) and recovered (fixed by the upgrade). Read-only by design — never executes the upgrade itself; the operator retains full agency. v1.0 ships with the OpenClaw regression catalog (8 entries grounded in real field reports); the same machinery accepts a custom catalog for any AI runtime via Custom MCP Build adapters. Keywords: AI runtime upgrade, regression detection, safe deployment, version-specific bug catalog, AI agent ops.

Status: v1.2.3 · Tests: 117 passing · License: MIT · MCP · PyPI


What it does

Production AI runtime upgrades — OpenClaw, Claude Code, agent harnesses, runtime servers — carry recurring regressions that you only find after upgrading. Recent examples:

  • The "claw tax." Anthropic announced that standard Claude subscriptions would no longer cover usage through external "claw" harnesses like OpenClaw, forcing those workloads onto metered API billing without a clean upgrade path. The r/AI_Agents thread on the OpenClaw creator suspension (201 pts, April 12 2026) walks through the structural conflict.
  • OpenClaw 2026.4.8 brought a CPU-spike bug. 2026.4.23-26 broke Discord on_message. 2026.4.30+ surfaced OOM under sustained 200k-token contexts. The pattern: upgrade on Friday, hit a new failure mode on Tuesday, spend Wednesday-Thursday excavating release notes + field reports.
  • Claude Code itself has documented compaction failures at 48% context usage (GitHub Issue #23751) and the "1M context window does not work as marketed" complaint (GitHub Issue #35296) — recurring upgrade regressions in the underlying runtime, not just OpenClaw.
  • HERMES.md detection (the billing routing bug) is itself an upgrade regression — Anthropic added detection logic that fires false positives, and operators only discover it when they're already $200 deep in unintended API charges.

This MCP server moves the regression excavation upfront — before the upgrade, not after — and verifies the post-upgrade state by diffing against a snapshot you took beforehand. Read-only by design: never executes the upgrade itself; the operator retains full agency.

> claude: should I upgrade my 2026.4.23 deployment?
[MCP tools: current_version + available_upgrades]

Current: 2026.4.23
Recommended target: 2026.5.2 (no CRITICAL regressions in path)

Available upgrades:
  2026.4.24-.26  HIGH    R-73421 Discord-receive breakage
  2026.4.27       —      clean
  2026.4.30       HIGH    R-OOM-DURING-LARGE-CONTEXT (unfixed)
  2026.5.1-.2     HIGH    R-OOM + R-LOG-ROTATION-DROP (unfixed)

> claude: walk me through upgrading to 2026.4.27.
[MCP tool: upgrade_guide]

2026.4.23 → 2026.4.27 — proceed with mitigations applied.

Applicable known regressions:
  R-41372 (HIGH)  — Cron --session web-search silent fail.
                    Mitigation: silentwatch-mcp covers detection until upgrade.
  R-73421 (HIGH)  — Discord-receive callbacks not firing.
                    Mitigation: `openclaw skill reload discord` after upgrade.

Pre-upgrade steps:
  1. Capture pre-upgrade snapshot (call pre_upgrade_snapshot)
  2. Verify backups: cp -r ~/.openclaw ~/.openclaw.backup-$(date +%Y%m%d)

Upgrade steps:
  1. openclaw gateway stop
  2. openclaw upgrade --to 2026.4.27
  3. openclaw gateway start

Post-upgrade steps:
  1. Run post_upgrade_verify(snapshot_id=<your-pre-upgrade-id>)
  2. openclaw skill reload discord  (R-73421 mitigation)

Rollback steps: stop → openclaw upgrade --to 2026.4.23 → restore backup → start.

Confidence: Path includes 2 HIGH regressions but no CRITICAL.

> claude: I just upgraded. Verify it.
[MCP tool: post_upgrade_verify(pre_snapshot_id="snap-...")]

Upgrade 2026.4.23 → 2026.4.27: SUCCESS.
0 new failures, 1 recovered (skills.discord_receive_registered),
0 unchanged failures.

v1.2: provider-side regression detection. The other half of the upgrade-safety story is when the hosted LLM provider silently changes model behavior with no upgrade event on your side.

Source: Anthropic's April 23 2026 post-mortem, in which Anthropic admitted silently changing Claude Code's default reasoning effort for 5 weeks (Mar 4 → Apr 7) without notification. Verbatim from a Phoenix user asking for the feature in Phoenix's own community discussions (#10442): "Does Phoenix have a way to detect this kind of silent drift where surface metrics look healthy but the model is actually failing?" And from Om Patel @om_patel5 on X (171K views, 1.3K likes, 195 RTs): "OPUS 4.6 JUST ADMITTED ITS REASONING EFFORT IS SET TO 25 OUT OF 100"; the operator extracted the current setting directly from the model.

This server now does the passive part: latency/response-shape fingerprinting. v1.3 backlog idea: add an active-extraction tool that runs scheduled probes ("what is your current reasoning_effort?") and records the model's self-reported settings over time, complementing v1.2's passive fingerprinting.

> claude: has Anthropic regressed something on their end in the last hour?
[MCP tool: detect_provider_regression(provider="anthropic")]

Severity: CRITICAL
provider: anthropic
current_window_hours: 1   sample_count: 50
baseline_window_hours: 168  sample_count: 1000

Alerts:
  [CRITICAL] latency_p95: 3,200ms vs 1,500ms baseline (+113%)
  [HIGH]     latency_median: 1,500ms vs 800ms baseline (+87%)
  [MEDIUM]   response_length_median: 350 vs 800 (-56%)

Summary: anthropic: 3 alerts — worst is CRITICAL on latency_p95:
latency_p95 is 113% higher than baseline (3200 vs 1500) — likely regression

> claude: capture the next 100 calls so I can see the fingerprint over time.
[MCP tool: record_provider_call (called by your LLM-client shim, once per response)]

After enough calls accumulate:
[MCP tool: get_provider_fingerprint(provider="anthropic", window_hours=24)]

provider: anthropic
window_hours: 24   sample_count: 240
fingerprint:
  call_count: 240
  median_latency_ms: 850
  p95_latency_ms: 1620
  median_response_length_tokens: 760
  distinct_models: ["claude-sonnet-4-7"]
  most_common_model_version: "claude-sonnet-4-7-20260301"
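
The fingerprint fields in the output above could be aggregated from individual call records roughly as follows. This is a minimal sketch, not the server's implementation: the `CallRecord` class, its field names, and the nearest-rank percentile method are assumptions made for illustration.

```python
import statistics
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical shape of one recorded provider call (field names are assumptions)."""
    latency_ms: float
    response_length_tokens: int
    model: str
    model_version: str


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def fingerprint(records: list[CallRecord]) -> dict:
    """Aggregate a window of call records into summary statistics."""
    if not records:
        raise ValueError("cannot fingerprint an empty window")
    latencies = [r.latency_ms for r in records]
    lengths = [r.response_length_tokens for r in records]
    versions = Counter(r.model_version for r in records)
    return {
        "call_count": len(records),
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "median_response_length_tokens": statistics.median(lengths),
        "distinct_models": sorted({r.model for r in records}),
        "most_common_model_version": versions.most_common(1)[0][0],
    }


# Twenty synthetic calls with latencies 100, 200, ..., 2000 ms.
records = [
    CallRecord(100.0 * i, 700 + i, "claude-sonnet-4-7", "claude-sonnet-4-7-20260301")
    for i in range(1, 21)
]
fp = fingerprint(records)
print(fp["median_latency_ms"], fp["p95_latency_ms"])  # 1050.0 1900.0
```

Nearest-rank is only one of several percentile definitions; an interpolating method would give slightly different p95 values on small windows.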

Why openclaw-upgrade-orchestrator-mcp

Three things existing tools (vendor changelogs, internal runbooks, generic CI/CD orchestrators) don't do:

  1. Catalog-grounded regression awareness. A generic upgrade tool tells you the version exists. This server tells you which versions have known issues, which fix versions remediate them, and which mitigations apply if you have to use the affected version.

  2. Pre/post snapshot diffing tied to the catalog. The same checks run before + after the upgrade. The diff highlights new_failures (caused by the upgrade) separately from unchanged_failures (pre-existing) and recovered (fixed by the upgrade). No more "did this break in 2026.4.27 or was it already broken?"

  3. Read-only by design. Never runs openclaw upgrade --to ... for you. Never modifies state. Operators retain full agency over the actual upgrade — this server gives them the information to make the decision, then verifies it after they execute.
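
The three-way classification from item 2 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the server's code: snapshots are modeled as plain `check_id -> passed` dictionaries, and `gateway.port_free` is a made-up check id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotDiff:
    new_failures: list[str]        # passed before, fails after: caused by the upgrade
    recovered: list[str]           # failed before, passes after: fixed by the upgrade
    unchanged_failures: list[str]  # failed before and after: pre-existing


def diff_snapshots(pre: dict[str, bool], post: dict[str, bool]) -> SnapshotDiff:
    """Classify each check by how its pass/fail state changed across the upgrade.

    Checks missing from either snapshot are skipped rather than guessed at.
    """
    shared = sorted(pre.keys() & post.keys())
    return SnapshotDiff(
        new_failures=[c for c in shared if pre[c] and not post[c]],
        recovered=[c for c in shared if not pre[c] and post[c]],
        unchanged_failures=[c for c in shared if not pre[c] and not post[c]],
    )


# "gateway.port_free" is a hypothetical check id used only for this example.
pre = {"skills.discord_receive_registered": False, "gateway.port_free": True}
post = {"skills.discord_receive_registered": True, "gateway.port_free": True}
diff = diff_snapshots(pre, post)
print(diff.recovered)  # ['skills.discord_receive_registered']
```

Because the same check ids run on both sides, a failure that appears only in `unchanged_failures` can be ruled out as a consequence of the upgrade.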

Built for the production-AI operator who owns OpenClaw deployments and has been through enough upgrade-day fire drills.


Tool surface

| Tool | What it returns |
| --- | --- |
| current_version | Currently installed version + detection method |
| available_upgrades | Newer versions with regression-count flags + recommended target |
| pre_upgrade_snapshot | Captures every check's pass/fail state, persists with snapshot_id |
| upgrade_guide | Step-by-step plan: pre / upgrade / post / rollback steps + applicable regressions + confidence note |
| post_upgrade_verify | Diff post-upgrade against a stored pre-upgrade snapshot: new_failures / recovered / unchanged |
| rollback_guide | Recovery plan for a given snapshot: downgrade command + state-restore steps + risk note |
| regression_catalog | Full known-regression catalog, optionally filtered to one version |
| list_snapshots | All stored snapshots (id + version + summary) |
| record_provider_call (v1.2) | Append a single provider API call observation to the fingerprint history |
| get_provider_fingerprint (v1.2) | Aggregate fingerprint over a window: call count, latency p50/p95, response-length distribution, distinct models, most-common headers |
| detect_provider_regression (v1.2) | Compare current vs baseline window; flag distribution shifts with severity |
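
To make the current-vs-baseline comparison behind detect_provider_regression concrete, here is a minimal sketch. The severity thresholds, function names, and alert dictionary shape are all assumptions chosen for illustration; the server's actual thresholds are not documented here and may differ per metric.

```python
from typing import Optional

# Hypothetical tiers on absolute percent change, checked from most to least severe.
SEVERITY_THRESHOLDS = [(100.0, "CRITICAL"), (50.0, "HIGH"), (25.0, "MEDIUM")]


def percent_change(current: float, baseline: float) -> float:
    """Signed change of current relative to baseline, in percent."""
    return (current - baseline) / baseline * 100.0


def classify(change_pct: float) -> Optional[str]:
    """Map a percent change to a severity label, or None if below every tier."""
    for threshold, severity in SEVERITY_THRESHOLDS:
        if abs(change_pct) >= threshold:
            return severity
    return None


def detect_shifts(current: dict[str, float], baseline: dict[str, float]) -> list[dict]:
    """Return one alert per metric whose shift reaches a severity tier."""
    alerts = []
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue  # nothing to compare, or the ratio is undefined
        change = percent_change(current[metric], base_value)
        severity = classify(change)
        if severity is not None:
            alerts.append({"metric": metric, "severity": severity, "change_pct": change})
    return alerts


baseline = {"latency_p95": 1500.0, "latency_median": 800.0, "error_rate": 0.02}
current = {"latency_p95": 3200.0, "latency_median": 1500.0, "error_rate": 0.02}
alerts = detect_shifts(current, baseline)
for alert in alerts:
    print(alert["severity"], alert["metric"])
```

A fixed percent-change rule is the simplest possible detector. It ignores sample size, so a short current window with few calls can trigger alerts on noise alone.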

Resources:

  • upgrade://current — current version info
  • upgrade://snapshots — every stored snapshot
  • upgrade://catalog — full regression catalog
  • upgrade://provider-fingerprint (v1.2) — current Anthropic 1-hour fingerprint

Prompts:

  • plan-upgrade(target_version) — walks through the upgrade decision
  • verify-upgrade(pre_snapshot_id) — walks through post-upgrade verification
  • diagnose-provider-regression(provider) (v1.2) — walks through a no-user-upgrade-event regression

Quickstart

Install

pip install openclaw-upgrade-orchestrator-mcp

Configure for Claude Desktop

{
  "mcpServers": {
    "openclaw-upgrade": {
      "command": "python",
      "args": ["-m", "openclaw_upgrade_orchestrator_mcp"],
      "env": {
        "OPENCLAW_UPGRADE_BACKEND": "mock"
      }
    }
  }
}

Backends

| Backend | Status | Description |
| --- | --- | --- |
| mock | ✅ v1.0 | 2026.4.23 deployment with active R-73421 Discord-receive breakage; in-memory snapshots; suitable for protocol verification + bundle demos. v1.2: also pre-populates a synthetic 7d-baseline + last-hour-regression-burst on Anthropic so detect_provider_regression returns CRITICAL out of the box |
| openclaw-system | ✅ v1.0 | Reads ~/.openclaw/version + ~/.openclaw/gateway.yaml; persists snapshots as JSON in ~/.openclaw/upgrades/snapshots/. Override via OPENCLAW_VERSION_FILE, OPENCLAW_GATEWAY_CONFIG, OPENCLAW_UPGRADE_SNAPSHOT_DIR. v1.2: also reads/writes provider-call records as JSONL in ~/.openclaw/upgrades/provider-calls.jsonl. Override via OPENCLAW_PROVIDER_CALLS_FILE |
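
As an illustration of the JSONL storage and the environment override described for openclaw-system, the sketch below times one LLM call and appends one record per response. It is a hedged example only: the record fields, `record_call`, and `provider_calls_path` are hypothetical, and a real shim would normally call the record_provider_call tool instead of writing the file itself. The usage writes to a temporary directory so it never touches a real deployment.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def provider_calls_path() -> Path:
    """Resolve the JSONL path, honoring the OPENCLAW_PROVIDER_CALLS_FILE override."""
    default = Path.home() / ".openclaw" / "upgrades" / "provider-calls.jsonl"
    return Path(os.environ.get("OPENCLAW_PROVIDER_CALLS_FILE", default))


def record_call(provider: str, model: str, call: Callable[[], str],
                path: Optional[Path] = None) -> str:
    """Run `call`, measure its latency, and append one JSON record per line."""
    target = path or provider_calls_path()
    started = time.monotonic()
    response = call()
    latency_ms = (time.monotonic() - started) * 1000
    record = {  # hypothetical schema, not the server's ProviderCallRecord
        "provider": provider,
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "response_length_chars": len(response),
        "recorded_at": time.time(),
    }
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return response


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "provider-calls.jsonl"
    record_call("anthropic", "claude-sonnet-4-7", lambda: "hello", path=log_file)
    record_call("anthropic", "claude-sonnet-4-7", lambda: "hello again", path=log_file)
    lines = log_file.read_text(encoding="utf-8").splitlines()

print(len(lines))  # 2
```

Append-only JSONL keeps each write independent, so a crash mid-run loses at most the record being written.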

Regression catalog (v1.0)

8 hand-curated entries covering documented OpenClaw regressions:

  • R-41372-CRON-WEB-SEARCH-SILENT-FAIL (HIGH, 2026.4.20–2026.5.1)
  • R-63002-POST-UPGRADE-CPU-SPIKE (CRITICAL, 2026.4.8–2026.4.10)
  • R-73421-DISCORD-RECEIVE-BREAKAGE (HIGH, 2026.4.23–2026.4.27)
  • R-GATEWAY-PORT-CONFLICT-2026.4.15 (MEDIUM, 2026.4.15–2026.4.18)
  • R-OOM-DURING-LARGE-CONTEXT-2026.4.30 (HIGH, 2026.4.30–unfixed)
  • R-STATUS-RECONCILIATION-DRIFT-2026.4.5 (LOW, 2026.4.5–2026.4.10)
  • R-CLAWHUB-CACHE-POISONING-2026.3.28 (HIGH, 2026.3.28–2026.4.2)
  • R-LOG-ROTATION-DROP-2026.5.1 (MEDIUM, 2026.5.1–unfixed)

Use regression_catalog for the full, queryable list.


Risk-aware recommendation logic

available_upgrades flags every version reachable from current and computes a recommended_target:

For each available version V > current:
  applicable_regressions = regressions_in_path(current, V)
  has_known_critical = any(r.severity == CRITICAL for r in applicable_regressions)

recommended_target = highest V with has_known_critical == False

regressions_in_path(current, target) includes a regression if:

  • The target version is in the regression's range (post-upgrade deployment will be affected), OR
  • The current version is in the regression's range (current deployment is already affected — the operator should know whether the upgrade fixes it)

OpenClaw upgrades atomically (no execution on intermediate versions), so a regression strictly between current and target without affecting either endpoint is NOT included. This avoids over-conservative recommendations.
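
The endpoint-only rule above can be written out as a small runnable sketch. Several details are assumptions rather than facts about the server: the `Regression` class, the tuple-based version parsing, and in particular the reading of each catalog range as "first affected version up to, but not including, the fix version".

```python
from dataclasses import dataclass
from typing import Optional

Version = tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn '2026.4.23' into (2026, 4, 23) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Regression:
    """Hypothetical catalog entry; fixed_in=None means the regression is unfixed."""
    regression_id: str
    severity: str
    first_affected: Version
    fixed_in: Optional[Version]

    def affects(self, version: Version) -> bool:
        if version < self.first_affected:
            return False
        return self.fixed_in is None or version < self.fixed_in


def regressions_in_path(current: Version, target: Version,
                        catalog: list[Regression]) -> list[Regression]:
    """Include a regression only if it affects the target or the current version."""
    return [r for r in catalog if r.affects(target) or r.affects(current)]


def recommended_target(current: Version, available: list[Version],
                       catalog: list[Regression]) -> Optional[Version]:
    """Highest newer version whose path carries no CRITICAL regression."""
    candidates = [
        v for v in available
        if v > current
        and not any(r.severity == "CRITICAL"
                    for r in regressions_in_path(current, v, catalog))
    ]
    return max(candidates, default=None)


catalog = [
    Regression("R-63002-POST-UPGRADE-CPU-SPIKE", "CRITICAL",
               parse_version("2026.4.8"), parse_version("2026.4.10")),
    Regression("R-73421-DISCORD-RECEIVE-BREAKAGE", "HIGH",
               parse_version("2026.4.23"), parse_version("2026.4.27")),
    Regression("R-OOM-DURING-LARGE-CONTEXT-2026.4.30", "HIGH",
               parse_version("2026.4.30"), None),
]
current = parse_version("2026.4.23")
available = [parse_version(v) for v in ("2026.4.27", "2026.4.30", "2026.5.2")]
print(recommended_target(current, available, catalog))  # (2026, 5, 2)
```

Under these assumptions, upgrading from 2026.4.5 to 2026.4.15 does not list the CPU-spike regression, because its range lies strictly between the two endpoints.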


Roadmap

| Version | Scope | Status |
| --- | --- | --- |
| v1.0 | mock + openclaw-system backends, 8 tools / 3 resources / 2 prompts, 8-entry regression catalog, 6 detection checks, GitHub Actions CI matrix, PyPI Trusted Publishing | |
| v1.2 | Provider-side regression detection: ProviderCallRecord data model, 3 new tools (record_provider_call, get_provider_fingerprint, detect_provider_regression), upgrade://provider-fingerprint resource, diagnose-provider-regression prompt. Detects passive distribution shifts in latency/response-shape/model-version when the provider silently changes things on their end. Folded in from research-pass-3 P08 candidate after incumbent validation against Phoenix + Langfuse + Galileo | |
| v1.3 | Catalog auto-fetch from upstream changelog feed; richer detection checks tied to OpenClaw's /healthz endpoint; multi-step upgrade pathing | |
| v1.4 | Custom catalog packs (operator can ship internal-only regression entries alongside the canonical catalog); rule-overrides | |
| v1.x | Webhook emit on detected regression; integration with CI to gate merges of OpenClaw-version bumps | |

Need this adapted to your stack?

If your AI deployment uses a different runtime (custom agent harness, internal fork of OpenClaw, vendor-locked deployment) and you want the same regression-aware upgrade discipline, that's a Custom MCP Build engagement.

| Tier | Scope | Investment | Timeline |
| --- | --- | --- | --- |
| Simple | Single backend adapter for your existing version-source | $8,000–$12,000 | 1–2 weeks |
| Standard | Custom backend + custom regression catalog (initial 10-15 entries from your incident history) + integration with your alerting | $15,000–$25,000 | 2–4 weeks |
| Complex | Multi-deployment fleet view + auto-catalog ingestion from internal changelog + per-environment recommendation tuning | $30,000–$45,000 | 4–8 weeks |

To engage:

  1. Email temur@pixelette.tech with subject Custom MCP Build inquiry — upgrade orchestration
  2. Include: 1-paragraph description of your runtime + which tier
  3. Expect a reply within 2 business days with a 30-min discovery call slot

This server is part of a production-AI infrastructure MCP suite — companion to silentwatch-mcp, openclaw-health-mcp, openclaw-cost-tracker-mcp, and openclaw-skill-vetter-mcp. Install all five for full operational visibility.


Production AI audits

If you're running production AI and want an outside practitioner to score readiness, find the failure patterns already present (upgrade regression cycles being one of the most damaging), and write the corrective-action plan:

| Tier | Scope | Investment | Timeline |
| --- | --- | --- | --- |
| Audit Lite | One system, top-5 findings, written report | $1,500 | 1 week |
| Audit Standard | Full audit, all 14 patterns, 5 Cs findings, 90-day follow-up | $3,000 | 2–3 weeks |
| Audit + Workshop | Standard audit + 2-day team workshop + first monthly audit included | $7,500 | 3–4 weeks |

Same email channel: temur@pixelette.tech with subject AI audit inquiry.


Contributing

PRs welcome. Detection checks are pluggable — see src/openclaw_upgrade_orchestrator_mcp/checks/__init__.py for the contract.

To add a check:

  1. Define def run(state: DeploymentState) -> CheckResult in the checks module
  2. Register it in CHECKS: dict[str, callable]
  3. Reference its check_id from a regression's detection_check_id in catalog.py
  4. Add tests in tests/test_checks.py
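
A check following the steps above might look like the sketch below. Only the `run(state: DeploymentState) -> CheckResult` signature, the `CHECKS` registry name, and the `skills.discord_receive_registered` check id come from this README; the fields of `DeploymentState` and `CheckResult` and the `discord.on_message` callback name are hypothetical stand-ins, so consult checks/__init__.py for the real contract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DeploymentState:
    """Hypothetical stand-in for the project's DeploymentState."""
    version: str
    registered_skill_callbacks: frozenset[str] = frozenset()


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical stand-in for the project's CheckResult."""
    check_id: str
    passed: bool
    detail: str = ""


def run(state: DeploymentState) -> CheckResult:
    """Pass only if the Discord receive callback is registered."""
    registered = "discord.on_message" in state.registered_skill_callbacks
    detail = "callback registered" if registered else "discord.on_message not registered"
    return CheckResult("skills.discord_receive_registered", registered, detail)


# Registry keyed by check_id; a catalog entry would point at this key
# through its detection_check_id.
CHECKS: dict[str, Callable[[DeploymentState], CheckResult]] = {
    "skills.discord_receive_registered": run,
}

broken = DeploymentState(version="2026.4.23")
healthy = DeploymentState("2026.4.27", frozenset({"discord.on_message"}))
print(CHECKS["skills.discord_receive_registered"](broken).passed)   # False
print(CHECKS["skills.discord_receive_registered"](healthy).passed)  # True
```

Keeping checks as pure functions of the collected state means the same function can run unchanged before and after the upgrade.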

To add a backend:

  1. Subclass UpgradeBackend in backends/<your_backend>.py
  2. Implement collect_state, save_snapshot, load_snapshot, list_snapshots
  3. Register in backends/__init__.py
  4. Add tests in tests/test_backends.py
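
For orientation, a backend implementing the four methods named above could be sketched as follows. The abstract base class here is a hypothetical stand-in for the project's UpgradeBackend, and the method signatures, the dictionary-based state, and the snapshot id format are assumptions, not the actual interface.

```python
from abc import ABC, abstractmethod


class UpgradeBackend(ABC):
    """Hypothetical stand-in for the project's UpgradeBackend base class."""

    @abstractmethod
    def collect_state(self) -> dict: ...

    @abstractmethod
    def save_snapshot(self, snapshot: dict) -> str: ...

    @abstractmethod
    def load_snapshot(self, snapshot_id: str) -> dict: ...

    @abstractmethod
    def list_snapshots(self) -> list[str]: ...


class InMemoryBackend(UpgradeBackend):
    """Keeps snapshots in a dict; useful for tests, lost when the process exits."""

    def __init__(self, version: str) -> None:
        self._version = version
        self._snapshots: dict[str, dict] = {}

    def collect_state(self) -> dict:
        return {"version": self._version}

    def save_snapshot(self, snapshot: dict) -> str:
        snapshot_id = f"snap-{len(self._snapshots) + 1:04d}"  # made-up id format
        self._snapshots[snapshot_id] = dict(snapshot)  # copy so callers cannot mutate it
        return snapshot_id

    def load_snapshot(self, snapshot_id: str) -> dict:
        return dict(self._snapshots[snapshot_id])

    def list_snapshots(self) -> list[str]:
        return sorted(self._snapshots)


backend = InMemoryBackend("2026.4.23")
snapshot_id = backend.save_snapshot(backend.collect_state())
print(snapshot_id, backend.load_snapshot(snapshot_id))
```

A test in tests/test_backends.py would typically assert this save-then-load round trip for each new backend.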

To add a regression entry:

  1. Append to CATALOG in catalog.py with stable regression_id
  2. Reference an existing or new detection_check_id (or set to None for advisory-only)
  3. Add a test confirming version-range membership in tests/test_catalog.py

Bug reports + feature requests: open a GitHub issue.


License

MIT — see LICENSE.


Built by Temur Khan — independent practitioner on production AI systems. Contact: temur@pixelette.tech
