Online Lyapunov-drift monitor for ML retraining loops: alert when the loop trends unstable, before eval metrics show it.

lyapmon

Online Lyapunov-drift monitoring for ML retraining loops.

Your retraining DAG is a closed-loop dynamical system: the model shapes the data that trains the next model (data → train → deploy → data). Closed loops can go unstable — exposure bias, label feedback, recursive training — and when they do, the holdout eval is the last place it shows up.

lyapmon watches the loop the way control engineering watches a plant. Each cycle it builds a small state vector x_k from observables the pipeline already has, evaluates a Lyapunov candidate V(x_k), and runs an online test on the drift

ΔV_k = V(x_{k+1}) − V(x_k)

While the expected drift is negative, the loop is contracting toward its commissioned-good state and may run autonomously. The first sustained positive drift fires an alert and, wired as an Airflow gate, blocks the auto-deploy edge and pulls a human back in. That is bounded delegation packaged as an observability tool: the loop earns its autonomy cycle by cycle, and loses it the moment the stability evidence disappears.

ingest ──▶ train ──▶ evaluate ──▶ lyapunov_gate ──▶ deploy
                                       │
                                       ▼  E[ΔV] > 0, sustained
                                  ✗ fail task: block deploy, page a human

Why drift on V, not a threshold on eval loss?

Eval loss on a fixed holdout grows quadratically in the model's bias — it stays inside its noise band long after the loop has gone divergent. Distribution observables (training-batch PSI, prediction shift, parameter movement) grow linearly, and a trend test on increments fires before a level test on a lagging metric. The bundled simulation measures exactly this lead time against the rule it replaces (eval mean + 3σ, 2 consecutive):

$ lyapmon simulate --feedback-gain 0.65
...
lyapmon UNSTABLE at cycle 34
naive eval-loss alarm (mean+3σ, 2 consecutive) at cycle 42
lead time: 8 cycles

Mean lead over a 10-seed sweep is ~4 cycles with zero false alarms on stable and near-critical loops (asserted in tests/test_sim.py). With delayed outcome labels, the usual production reality, the lead widens (--label-delay 5 gives a 12-cycle lead), because the state vector is built from label-free observables that stay current while the eval waits for labels.

[Figure: demo plot]

Install

pip install lyapmon              # core: numpy only
pip install 'lyapmon[mlflow]'    # + MLflow logging/backfill
pip install 'lyapmon[prometheus]' # + Pushgateway export
pip install 'lyapmon[plot]'      # + simulation plots

Quickstart

from lyapmon import (
    LyapunovMonitor, JSONLStore, WebhookAlerter,
    psi, mean_shift, weight_delta_norm,
)

monitor = LyapunovMonitor(
    features=["eval_auc", "psi_train", "pred_shift", "weight_delta"],
    warmup=10,                                  # cycles assumed healthy; fits V
    store=JSONLStore("/shared/lyapmon/history.jsonl"),
    alerters=[WebhookAlerter("https://hooks.slack.com/services/...")],
)

verdict = monitor.observe(
    {
        "eval_auc": auc,
        "psi_train": psi(reference_features, batch_features),
        "pred_shift": mean_shift(reference_preds, current_preds),
        "weight_delta": weight_delta_norm(prev_weights, new_weights),
    },
    cycle_id=run_id,
)

if verdict.unstable:
    block_deploy()   # verdict.top_contributors says which observable moved

The monitor is stateless across processes — everything (baseline, detector state, previous V) checkpoints into the store, so a fresh instance per DAG run behaves identically to a long-lived one (this is tested).
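The psi observable in the quickstart is a population stability index between a reference sample and the current batch. As a rough illustration of what such a helper computes, here is a minimal sketch assuming quantile bins fitted on the reference sample. The name psi_sketch and the bins and eps parameters are hypothetical; lyapmon's own psi may bin, smooth, or handle multi-column input differently.

```python
import math
from typing import Sequence


def psi_sketch(reference: Sequence[float], current: Sequence[float],
               bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two 1-D samples (illustrative only)."""
    ref = sorted(reference)
    # Bin edges at reference quantiles, so each bin holds roughly 1/bins of the reference.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def fractions(sample: Sequence[float]) -> list:
        counts = [0] * bins
        for v in sample:
            idx = sum(v > e for e in edges)  # number of edges strictly below v
            counts[idx] += 1
        # Floor at eps so an empty bin does not produce log(0).
        return [max(c / len(sample), eps) for c in counts]

    p, q = fractions(reference), fractions(current)
    # Each term (q - p) * ln(q / p) is non-negative, so the index is >= 0.
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical samples give exactly zero, and the index grows as probability mass moves between bins, which is why it works as a label-free coordinate of the state vector.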

Airflow gate

from lyapmon.integrations.airflow import lyapunov_gate_callable
from airflow.operators.python import PythonOperator

gate = PythonOperator(
    task_id="lyapunov_gate",
    python_callable=lyapunov_gate_callable,
    op_kwargs=dict(
        features=["eval_auc", "psi_train", "pred_shift", "weight_delta"],
        history_path="/shared/lyapmon/history.jsonl",
        xcom_task_id="evaluate",        # evaluate task pushes the metrics dict
    ),
)
ingest >> train >> evaluate >> gate >> deploy

On sustained positive drift the gate raises LoopUnstableError: the deploy never runs, the DAG run goes red, and your existing on-call alerting takes it from there. After remediation, monitor.rebaseline() (or deleting the checkpoint) re-commissions the loop with a fresh warmup.

MLflow

from lyapmon.integrations.mlflow import log_verdict, states_from_experiment

log_verdict(verdict)                 # lyapmon.V / .delta_V / .drift next to your run metrics

# Backfill a monitor over an existing retraining history.
# FEATURES is the same list of observable names passed to LyapunovMonitor(features=...).
for run_id, metrics in states_from_experiment("churn-retrain", FEATURES):
    monitor.observe(metrics, cycle_id=run_id)

Prometheus / Grafana

from lyapmon.integrations.prometheus import write_textfile
write_textfile(verdict, "/var/lib/node_exporter/lyapmon.prom", {"pipeline": "churn"})

Alert on lyapmon_status >= 3; graph lyapmon_drift against lyapmon_drift_threshold for the money chart.

Shell / BashOperator

lyapmon check --history /shared/history.jsonl \
  --features eval_auc,psi_train --metrics '{"eval_auc":0.91,"psi_train":0.04}' \
  --fail-on-unstable
lyapmon report --history /shared/history.jsonl

How it works

  1. State vector. You name the observables; helpers (psi, ks_distance, mean_shift, rate_shift, weight_delta_norm) compute the standard ones from raw arrays. Everything is sample-only — no oracle access to truth.
  2. Lyapunov candidate. Default is a squared diagonal Mahalanobis distance to a baseline fitted on the warmup window: V(x) = Σᵢ ((xᵢ − x*ᵢ)/σᵢ)². It is positive definite around the commissioned-good state and unitless across mixed-scale features. A full quadratic form (QuadraticV) or any callable (CallableV, e.g. a learned/certified candidate) drops in unchanged.
  3. Drift test. The conditional drift E[ΔV|x] is estimated by an EWMA of the increments; the alert threshold is calibrated from warmup noise (z · σ_ΔV · √(λ/(2−λ)), where λ is the EWMA weight, σ_ΔV the warmup standard deviation of the increments, and z the sigma multiplier) and must be breached on consecutive cycles. A one-sided Page-Hinkley accumulator runs alongside to catch slow drift that hides under the EWMA threshold. Either detector ⇒ UNSTABLE.
  4. Verdict. STABLE / WARNING / UNSTABLE plus the numbers and the top contributors to V (which observable is pushing the loop out).
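The four steps above can be sketched in a few dozen lines of standard-library Python. This is an illustration of the technique, not lyapmon's implementation: the class name DriftSketch and the parameters lam, z, consecutive, ph_delta and ph_threshold are hypothetical, their defaults are arbitrary, and the Page-Hinkley test is written in its CUSUM-style recursion with the reference mean fixed at zero.

```python
import math
from statistics import mean, pstdev


class DriftSketch:
    """Diagonal-Mahalanobis V plus EWMA and Page-Hinkley drift tests (illustrative)."""

    def __init__(self, warmup_states, lam=0.2, z=3.0, consecutive=3,
                 ph_delta=0.5, ph_threshold=10.0):
        cols = list(zip(*warmup_states))
        self.center = [mean(c) for c in cols]            # baseline x*
        self.scale = [pstdev(c) or 1.0 for c in cols]    # sigma_i, guarded against zero
        v = [self.V(x) for x in warmup_states]
        increments = [b - a for a, b in zip(v, v[1:])]
        sigma_dv = pstdev(increments) or 1.0             # warmup noise of delta-V
        self.lam, self.consecutive = lam, consecutive
        # Asymptotic standard deviation of an EWMA is sigma * sqrt(lam / (2 - lam)).
        self.threshold = z * sigma_dv * math.sqrt(lam / (2 - lam))
        self.ph_delta = ph_delta * sigma_dv              # per-cycle allowance
        self.ph_threshold = ph_threshold * sigma_dv
        self.prev_v = v[-1]
        self.ewma = 0.0
        self.breaches = 0
        self.ph_sum = 0.0

    def V(self, x):
        """Squared diagonal Mahalanobis distance to the warmup baseline."""
        return sum(((xi - c) / s) ** 2
                   for xi, c, s in zip(x, self.center, self.scale))

    def observe(self, x):
        """Feed one cycle's state; return True when either detector fires."""
        v = self.V(x)
        dv, self.prev_v = v - self.prev_v, v
        self.ewma = (1 - self.lam) * self.ewma + self.lam * dv
        self.breaches = self.breaches + 1 if self.ewma > self.threshold else 0
        # One-sided accumulator: sum the excess over the allowance, floored at zero.
        self.ph_sum = max(0.0, self.ph_sum + dv - self.ph_delta)
        return (self.breaches >= self.consecutive
                or self.ph_sum > self.ph_threshold)
```

A state that sits at the baseline never raises an alarm, while a state that walks steadily away from it trips the accumulator or the EWMA test within a few cycles. Persisting prev_v, ewma, breaches and ph_sum between runs is what would let a fresh instance per cycle behave like a long-lived one.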

The theory anchor is the Foster–Lyapunov drift criterion: negative expected one-step drift of a positive-definite V outside a small set implies stochastic stability. lyapmon watches for the empirical failure of that sufficient condition: when the drift estimate turns and stays positive, the contraction evidence is gone, so the autonomy should be too. It is an early-warning instrument, not a certificate; for the certificate-side story (CEGIS-learned, dReal-verified candidates) see the companion project lyacert.
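For reference, one common textbook form of the drift criterion is written out below. The constants ε and b, the set C, and the irreducibility assumption are standard ingredients of the theorem rather than anything lyapmon exposes, and the README does not say which variant the project has in mind.

```latex
\mathbb{E}\bigl[\, V(x_{k+1}) - V(x_k) \,\big|\, x_k = x \,\bigr]
  \;\le\; -\varepsilon + b\,\mathbf{1}_C(x)
  \qquad \text{for all } x,
\quad \text{with } V \ge 0,\ \varepsilon > 0,\ b < \infty,\ C \text{ a small set.}
```

For a ψ-irreducible Markov chain this condition implies positive recurrence. The quantity on the left is the conditional drift E[ΔV|x] that the EWMA in step 3 estimates from samples.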

Demo

lyapmon simulate --feedback-gain 0.3            # below critical gain: stable forever
lyapmon simulate --feedback-gain 0.65           # slow-burn divergence, alarm + lead time
lyapmon simulate --feedback-gain 0.65 --plot demo.png

The simulated loop retrains on data partially generated under its own influence (exposure bias with amplification κ); the closed-loop pole is 1 − lr + lr·g·κ, so instability is a knob, not an anecdote — critical gain g* = 1/κ exactly. See demo/DEMO.md for the full Airflow + MLflow conference demo and talk track.
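The pole formula can be checked by hand with a noise-free scalar recursion. The sketch below is not the bundled simulator: the function names are hypothetical, and lr = 0.5 and κ = 2 are assumed values chosen only so that the critical gain 1/κ = 0.5 falls between the two demo gains.

```python
def closed_loop_pole(lr: float, gain: float, kappa: float) -> float:
    """Pole of the linearised loop, using the formula 1 - lr + lr * g * kappa."""
    return 1 - lr + lr * gain * kappa


def simulate_bias(lr: float, gain: float, kappa: float,
                  cycles: int = 50, b0: float = 0.01) -> list:
    """Iterate b_{k+1} = pole * b_k with no noise (illustration only)."""
    pole = closed_loop_pole(lr, gain, kappa)
    trajectory = [b0]
    for _ in range(cycles):
        trajectory.append(pole * trajectory[-1])
    return trajectory


if __name__ == "__main__":
    LR, KAPPA = 0.5, 2.0  # assumed values, not taken from lyapmon
    for g in (0.3, 0.5, 0.65):
        pole = closed_loop_pole(LR, g, KAPPA)
        final = simulate_bias(LR, g, KAPPA)[-1]
        print(f"gain={g}: pole={pole:.2f}, bias after 50 cycles={final:.3g}")
```

Under these assumptions a gain of 0.3 gives a pole of 0.8 and the bias decays, a gain of 0.5 sits on the boundary with a pole of 1, and a gain of 0.65 gives a pole of 1.15, so the bias grows by 15% per cycle.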

Development

uv venv .venv && uv pip install -e '.[dev,plot]'
.venv/bin/pytest
.venv/bin/ruff check src tests

See CONTRIBUTING.md for what makes a good PR (new state helpers, new orchestrator gates, detector invariants).

Citing

If you use lyapmon in your work, please cite it (CITATION.cff):

@software{lyapmon,
  author  = {Nguyen, Thuy},
  title   = {lyapmon: online Lyapunov-drift monitoring for ML retraining loops},
  url     = {https://github.com/sophie-nguyenthuthuy/lyapmon},
  version = {0.1.0},
  year    = {2026},
  license = {Apache-2.0},
}

License

Apache-2.0.
