Online Lyapunov-drift monitor for ML retraining loops: alert when the loop trends unstable, before eval metrics show it.
lyapmon
Online Lyapunov-drift monitoring for ML retraining loops.
Your retraining DAG is a closed-loop dynamical system: the model shapes the
data that trains the next model (data → train → deploy → data). Closed
loops can go unstable — exposure bias, label feedback, recursive training —
and when they do, the holdout eval is the last place it shows up.
lyapmon watches the loop the way control engineering watches a plant. Each
cycle it builds a small state vector x_k from observables the pipeline
already has, evaluates a Lyapunov candidate V(x_k), and runs an online test
on the drift
ΔV_k = V(x_{k+1}) − V(x_k)
While the expected drift is negative, the loop is contracting toward its commissioned-good state and may run autonomously. The first sustained positive drift fires an alert and, when lyapmon is wired in as an Airflow gate, blocks the auto-deploy edge and pulls a human back in. That is bounded delegation packaged as an observability tool: the loop earns its autonomy cycle by cycle, and loses it the moment the stability evidence disappears.
ingest ──▶ train ──▶ evaluate ──▶ lyapunov_gate ──▶ deploy
                                        │
                                        ▼  E[ΔV] > 0, sustained
                                  ✗ fail task: block deploy, page a human
Why drift on V, not a threshold on eval loss?
Eval loss on a fixed holdout grows quadratically in the model's bias — it stays inside its noise band long after the loop has gone divergent. Distribution observables (training-batch PSI, prediction shift, parameter movement) grow linearly, and a trend test on increments fires before a level test on a lagging metric. The bundled simulation measures exactly this lead time against the rule it replaces (eval mean + 3σ, 2 consecutive):
$ lyapmon simulate --feedback-gain 0.65
...
lyapmon UNSTABLE at cycle 34
naive eval-loss alarm (mean+3σ, 2 consecutive) at cycle 42
lead time: 8 cycles
Mean lead over a 10-seed sweep is ~4 cycles with zero false alarms on
stable and near-critical loops (asserted in tests/test_sim.py). With
delayed outcome labels — the usual production reality — the lead widens
(--label-delay 5 → 12 cycles), because the state vector is built from
label-free observables that stay current while the eval waits for labels.
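For comparison, the baseline rule quoted above (eval mean + 3σ, 2 consecutive) can be sketched in a few lines. This is one plausible reading of that rule, not lyapmon's simulation code: the function name `naive_eval_alarm`, its parameters, and the choice to estimate the mean and σ from a warmup window are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def naive_eval_alarm(
    losses: Sequence[float],
    warmup: int = 10,
    k_sigma: float = 3.0,
    consecutive: int = 2,
) -> Optional[int]:
    """Level test on a lagging metric (hypothetical helper).

    Returns the first cycle index at which the eval loss has exceeded
    mean + k_sigma * std for `consecutive` cycles in a row, where mean
    and std are estimated on the first `warmup` cycles. Returns None if
    the rule never fires.
    """
    baseline = losses[:warmup]
    threshold = mean(baseline) + k_sigma * stdev(baseline)
    run = 0  # length of the current streak of threshold breaches
    for k in range(warmup, len(losses)):
        run = run + 1 if losses[k] > threshold else 0
        if run >= consecutive:
            return k
    return None
```

Because this rule tests the level of the loss rather than the trend of its increments, a single isolated spike resets the streak and a slow ramp has to clear the whole noise band before it counts.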
Install
pip install lyapmon # core: numpy only
pip install 'lyapmon[mlflow]' # + MLflow logging/backfill
pip install 'lyapmon[prometheus]' # + Pushgateway export
pip install 'lyapmon[plot]' # + simulation plots
Quickstart
from lyapmon import (
    JSONLStore,
    LyapunovMonitor,
    WebhookAlerter,
    mean_shift,
    psi,
    weight_delta_norm,
)

monitor = LyapunovMonitor(
    features=["eval_auc", "psi_train", "pred_shift", "weight_delta"],
    warmup=10,  # cycles assumed healthy; fits V
    store=JSONLStore("/shared/lyapmon/history.jsonl"),
    alerters=[WebhookAlerter("https://hooks.slack.com/services/...")],
)

# One call per retraining cycle, with observables the pipeline already has.
verdict = monitor.observe(
    {
        "eval_auc": auc,
        "psi_train": psi(reference_features, batch_features),
        "pred_shift": mean_shift(reference_preds, current_preds),
        "weight_delta": weight_delta_norm(prev_weights, new_weights),
    },
    cycle_id=run_id,
)

if verdict.unstable:
    block_deploy()  # verdict.top_contributors says which observable moved
The monitor is stateless across processes — everything (baseline, detector
state, previous V) checkpoints into the store, so a fresh instance per DAG
run behaves identically to a long-lived one (this is tested).
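The checkpointing idea can be illustrated with a toy detector. The sketch below is not lyapmon's store or monitor: `TinyEwma` and `observe_with_checkpoint` are hypothetical names, and the state is reduced to the previous V and an EWMA of its increments. It shows only why a fresh instance per run can match a long-lived one when all detector state is reloaded from an append-only JSONL file.

```python
import json
import os


class TinyEwma:
    """Toy detector: remembers the previous V and an EWMA of increments."""

    def __init__(self, lam=0.2, state=None):
        self.lam = lam
        # All mutable state lives in one JSON-serialisable dict.
        self.state = state or {"prev": None, "ewma": 0.0}

    def observe(self, v):
        prev = self.state["prev"]
        if prev is not None:
            increment = v - prev
            self.state["ewma"] = (
                (1 - self.lam) * self.state["ewma"] + self.lam * increment
            )
        self.state["prev"] = v
        return self.state["ewma"]


def observe_with_checkpoint(path, v, lam=0.2):
    """Build a fresh detector from the last checkpoint, observe once,
    then append the updated state as a new JSON line."""
    state = None
    if os.path.exists(path):
        with open(path) as fh:
            lines = fh.read().splitlines()
        if lines:
            state = json.loads(lines[-1])
    detector = TinyEwma(lam, state)
    result = detector.observe(v)
    with open(path, "a") as fh:
        fh.write(json.dumps(detector.state) + "\n")
    return result
```

Feeding the same sequence to one long-lived `TinyEwma` and to repeated `observe_with_checkpoint` calls yields identical outputs, because nothing the detector needs is kept outside the serialised state.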
Airflow gate
from lyapmon.integrations.airflow import lyapunov_gate_callable
from airflow.operators.python import PythonOperator

gate = PythonOperator(
    task_id="lyapunov_gate",
    python_callable=lyapunov_gate_callable,
    op_kwargs=dict(
        features=["eval_auc", "psi_train", "pred_shift", "weight_delta"],
        history_path="/shared/lyapmon/history.jsonl",
        xcom_task_id="evaluate",  # evaluate task pushes the metrics dict
    ),
)

ingest >> train >> evaluate >> gate >> deploy
On sustained positive drift the gate raises LoopUnstableError: the deploy
never runs, the DAG run is red, your existing on-call alerting takes it from
there. After remediation, monitor.rebaseline() (or delete the checkpoint)
re-commissions the loop with a fresh warmup.
MLflow
from lyapmon.integrations.mlflow import log_verdict, states_from_experiment

log_verdict(verdict)  # lyapmon.V / .delta_V / .drift next to your run metrics

# Backfill a monitor over an existing retraining history.
# FEATURES is the same feature list passed to LyapunovMonitor in the Quickstart.
FEATURES = ["eval_auc", "psi_train", "pred_shift", "weight_delta"]
for run_id, metrics in states_from_experiment("churn-retrain", FEATURES):
    monitor.observe(metrics, cycle_id=run_id)
Prometheus / Grafana
from lyapmon.integrations.prometheus import write_textfile
write_textfile(verdict, "/var/lib/node_exporter/lyapmon.prom", {"pipeline": "churn"})
Alert on lyapmon_status >= 3; graph lyapmon_drift against
lyapmon_drift_threshold for the money chart.
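For readers new to the node_exporter textfile collector: it scrapes `*.prom` files from a directory, so writers usually render the whole file and move it into place atomically. The sketch below shows that pattern with the metric names mentioned above. It is not lyapmon's `write_textfile`; the helper names `render_textfile` and `write_atomic` and the sample values are assumptions.

```python
import os
import tempfile


def render_textfile(metrics, labels):
    """Render name/value pairs in the Prometheus text exposition format,
    attaching the same label set to every sample (hypothetical helper)."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    lines = [f"{name}{{{label_str}}} {value}" for name, value in metrics.items()]
    return "\n".join(lines) + "\n"


def write_atomic(path, text):
    """Write to a temp file in the target directory, then rename over the
    destination, so a scrape never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem


if __name__ == "__main__":
    # Illustrative values only.
    sample = {"lyapmon_drift": 0.5, "lyapmon_drift_threshold": 0.25}
    print(render_textfile(sample, {"pipeline": "churn"}), end="")
```

The temporary file carries a `.tmp` suffix so the collector, which only picks up `*.prom`, ignores it until the rename.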
Shell / BashOperator
lyapmon check --history /shared/history.jsonl \
--features eval_auc,psi_train --metrics '{"eval_auc":0.91,"psi_train":0.04}' \
--fail-on-unstable
lyapmon report --history /shared/history.jsonl
How it works
- State vector. You name the observables; helpers (psi, ks_distance, mean_shift, rate_shift, weight_delta_norm) compute the standard ones from raw arrays. Everything is sample-only: no oracle access to truth.
- Lyapunov candidate. Default is a squared diagonal Mahalanobis distance to a baseline fitted on the warmup window: V(x) = Σᵢ ((xᵢ − x*ᵢ)/σᵢ)². It is positive definite around the commissioned-good state and unitless across mixed-scale features. A full quadratic form (QuadraticV) or any callable (CallableV, e.g. a learned/certified candidate) drops in unchanged.
- Drift test. The conditional drift E[ΔV|x] is estimated by an EWMA of the increments; the alert threshold is calibrated from warmup noise (z · σ_ΔV · √(λ/(2−λ))) and must be breached on consecutive cycles. A one-sided Page-Hinkley accumulator runs alongside to catch slow drift that hides under the EWMA threshold. Either detector ⇒ UNSTABLE.
- Verdict. STABLE / WARNING / UNSTABLE plus the numbers and the top contributors to V (which observable is pushing the loop out).
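The pieces in the list above can be put together in a short standard-library sketch. It follows the formulas as stated (the diagonal V, the EWMA threshold z · σ_ΔV · √(λ/(2−λ)), a one-sided Page-Hinkley accumulator), but it is not lyapmon's implementation: every function name, the default values for `lam`, `z` and `consecutive`, and the Page-Hinkley parameters `delta` and `h` are assumptions.

```python
import math
from statistics import mean, stdev


def fit_baseline(warmup_states):
    """Per-feature centre x* and scale sigma from the warmup window."""
    columns = list(zip(*warmup_states))
    centre = [mean(c) for c in columns]
    scale = [stdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return centre, scale


def lyapunov_v(x, centre, scale):
    """V(x) = sum_i ((x_i - x*_i) / sigma_i)^2."""
    return sum(((xi - ci) / si) ** 2 for xi, ci, si in zip(x, centre, scale))


def ewma_threshold(sigma_dv, lam=0.2, z=3.0):
    """z * sigma * sqrt(lam / (2 - lam)): z times the steady-state
    standard deviation of an EWMA of i.i.d. noise with std sigma."""
    return z * sigma_dv * math.sqrt(lam / (2.0 - lam))


def drift_alarm(v_values, warmup, lam=0.2, z=3.0, consecutive=3):
    """Index of the first cycle at which the EWMA of the increments has
    exceeded the warmup-calibrated threshold `consecutive` times in a row."""
    increments = [b - a for a, b in zip(v_values, v_values[1:])]
    limit = ewma_threshold(stdev(increments[: warmup - 1]), lam, z)
    ewma, run = 0.0, 0
    for k, dv in enumerate(increments[warmup - 1:], start=warmup):
        ewma = (1 - lam) * ewma + lam * dv
        run = run + 1 if ewma > limit else 0
        if run >= consecutive:
            return k
    return None


def page_hinkley_alarm(increments, delta, h):
    """One-sided Page-Hinkley test for an upward shift, assuming the
    in-control increments have mean zero. `delta` is the tolerated drift
    per cycle and `h` the alarm height."""
    cumulative, lowest = 0.0, 0.0
    for k, dv in enumerate(increments):
        cumulative += dv - delta
        lowest = min(lowest, cumulative)
        if cumulative - lowest > h:
            return k
    return None
```

The EWMA reacts to a clear change in the mean increment within a few cycles, while the Page-Hinkley sum keeps accumulating small positive increments that individually stay under the EWMA limit, which is why running both is a reasonable pairing.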
The theory anchor is the Foster–Lyapunov drift criterion: negative expected
one-step drift of a positive-definite V outside a small set implies
stochastic stability. lyapmon monitors the empirical contrapositive — when
the drift estimate turns and stays positive, the contraction evidence is
gone, so the autonomy should be too. It is an early-warning instrument, not
a certificate; for the certificate-side story (CEGIS-learned, dReal-verified
candidates) see the companion project lyacert.
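For reference, one common textbook form of the drift condition is written out below, next to the empirical quantity the monitor tracks. The constants ε and b, the set C, and the EWMA recursion are notation introduced here for illustration; the regularity conditions on the underlying Markov chain are omitted.

```latex
% Foster-Lyapunov drift condition (one standard form):
% there exist \varepsilon > 0, b < \infty and a small set C such that
\[
  \mathbb{E}\bigl[\,V(x_{k+1}) - V(x_k) \mid x_k = x\,\bigr]
  \;\le\; -\varepsilon + b\,\mathbf{1}_{C}(x)
  \qquad \text{for all } x .
\]
% Outside C the expected one-step drift is at most -\varepsilon < 0.
%
% Empirical counterpart: smooth the observed increments
% \Delta V_k = V(x_{k+1}) - V(x_k) with an EWMA,
\[
  \hat{d}_k = (1-\lambda)\,\hat{d}_{k-1} + \lambda\,\Delta V_k ,
\]
% and read a sustained
% \hat{d}_k > z\,\sigma_{\Delta V}\sqrt{\lambda/(2-\lambda)}
% as the contraction evidence having disappeared.
```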
Demo
lyapmon simulate --feedback-gain 0.3 # below critical gain: stable forever
lyapmon simulate --feedback-gain 0.65 # slow-burn divergence, alarm + lead time
lyapmon simulate --feedback-gain 0.65 --plot demo.png
The simulated loop retrains on data partially generated under its own
influence (exposure bias with amplification κ). The closed-loop pole is
1 − lr + lr·g·κ, which reaches 1 exactly when g·κ = 1, so instability is
a knob, not an anecdote: the critical gain is g* = 1/κ. See demo/DEMO.md
for the full Airflow + MLflow conference demo and talk track.
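The pole formula is easy to check numerically. The sketch below treats the loop as a scalar deviation that is multiplied by the pole each cycle, which is the simplest linear reading of the description above and not the bundled simulator. The values lr = 0.1 and κ = 2.0 are hypothetical, chosen only so that the two demo gains (0.3 and 0.65) fall on either side of g* = 1/κ; the simulator's actual settings are not stated here.

```python
def closed_loop_pole(lr, gain, kappa):
    """Pole of the linearised loop: 1 - lr + lr * g * kappa."""
    return 1.0 - lr + lr * gain * kappa


def critical_gain(kappa):
    """Gain at which the pole equals 1: g* = 1 / kappa."""
    return 1.0 / kappa


def deviation_trajectory(lr, gain, kappa, e0=1.0, cycles=50):
    """Deviation e_k = e0 * pole**k from the commissioned-good state."""
    pole = closed_loop_pole(lr, gain, kappa)
    return [e0 * pole ** k for k in range(cycles)]


if __name__ == "__main__":
    LR, KAPPA = 0.1, 2.0  # hypothetical values, see the note above
    for g in (0.3, 0.65):
        pole = closed_loop_pole(LR, g, KAPPA)
        final = deviation_trajectory(LR, g, KAPPA)[-1]
        state = "contracting" if pole < 1.0 else "diverging"
        print(f"g={g}: pole={pole:.3f} ({state}), e_49={final:.3f}")
```

With a pole only slightly above 1 the deviation grows slowly at first, which matches the "slow-burn divergence" the demo is meant to show.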
Development
uv venv .venv && uv pip install -e '.[dev,plot]'
.venv/bin/pytest
.venv/bin/ruff check src tests
See CONTRIBUTING.md for what makes a good PR (new state helpers, new orchestrator gates, detector invariants).
Citing
If you use lyapmon in your work, please cite it (CITATION.cff):
@software{lyapmon,
  author  = {Nguyen, Thuy},
  title   = {lyapmon: online Lyapunov-drift monitoring for ML retraining loops},
  url     = {https://github.com/sophie-nguyenthuthuy/lyapmon},
  version = {0.1.0},
  year    = {2026},
  license = {Apache-2.0},
}
License
Apache-2.0.