insurance-monitoring
Model drift detection, calibration monitoring, and sequential A/B testing for insurance pricing models: PSI, Gini drift, Murphy decomposition, mSPRT champion/challenger.
Models drift. Regulators notice.
Blog post: Insurance Model Monitoring: Beyond Generic Drift Detection
A pricing model that runs 15% cheap on young drivers and 15% expensive on mature drivers still reads an A/E of 1.00 at portfolio level — and triggers no alarm. The loss ratio deteriorates twelve months later. By then, the PRA expects regulated insurers to have a model risk management framework in place (SS1/23, while primarily aimed at banks, is widely treated as the de facto standard for insurers too) — and that framework should have caught the problem first. insurance-monitoring detects per-feature distribution shifts and model discrimination drift, not just the headline A/E ratio, so you find the problem before it reaches the accounts or the supervisor.
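The masking arithmetic is worth seeing once. A minimal illustration with synthetic numbers (not library output): two equal-exposure segments whose pricing errors offset exactly.
import numpy as np

# Young drivers under-priced ~15%, mature drivers over-priced ~15%
actual = np.array([115.0, 100.0])      # observed claims cost per segment
predicted = np.array([100.0, 115.0])   # model's expected cost per segment

print(actual / predicted)              # [1.15, 0.87]: both segments out of band
print(actual.sum() / predicted.sum())  # 215 / 215 = 1.00: portfolio looks perfect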
Part of the Burning Cost stack
Takes the outputs of any fitted pricing model and a stream of actual experience. Feeds drift signals and calibration verdicts into insurance-governance for model risk committee packs. Pairs with insurance-fairness to monitor per-protected-group A/E ratios under Consumer Duty. → See the full stack
Manual monitoring vs insurance-monitoring
| Task | Manual approach | insurance-monitoring |
|---|---|---|
| Population Stability Index | Excel macro per factor, re-coded each quarter, unweighted | psi() / csi() — exposure-weighted, Polars-native, traffic-light band |
| Feature drift heatmap | Engineer writes one-off script; no standard thresholds | csi() — one call, all rating factors, PRA-aligned thresholds |
| A/E ratio with CI | Custom formula in SQL, no confidence interval | ae_ratio_ci() — Wilson CI, exposure-weighted, RAG status |
| Discrimination drift | Gini computed ad hoc; no test for statistical significance | gini_drift_test() / GiniDriftBootstrapTest — arXiv 2510.04556 bootstrap CI, governance plot |
| RECALIBRATE vs REFIT decision | Actuary judgment call, not documented | MonitoringReport.recommendation — arXiv 2510.04556 decision tree, Murphy decomposition sharpens it |
| Repeated monthly testing inflation | H-L / A/E tested afresh each month; inflated false-positive rate | PITMonitor — anytime-valid; P(ever false alarm) ≤ α over the whole monitoring horizon |
| Champion/challenger A/B test | Wait for pre-specified sample size; cannot stop early | SequentialTest — mSPRT, valid at every interim check, supports frequency/severity/loss ratio |
| Drift attribution (which feature explains the performance change?) | PSI-by-eye; no interaction-aware method | DriftAttributor / InterpretableDriftDetector — TRIPODD, FDR control, exposure weighting |
| Monitoring report for model risk committee | Manual Word document | MonitoringReport.to_polars() — flat DataFrame, writes directly to Delta table |
Quick start
import numpy as np
import polars as pl
from insurance_monitoring import MonitoringReport
from insurance_monitoring.drift import psi
rng = np.random.default_rng(42)
# Training period
actual_ref = rng.poisson(0.08, 10_000).astype(float)
predicted_ref = np.full(10_000, 0.08)
exposure_ref = rng.uniform(0.5, 1.0, 10_000)
# Current period — model has drifted
actual_cur = rng.poisson(0.11, 5_000).astype(float) # loss rate up
predicted_cur = np.full(5_000, 0.08) # model unchanged
exposure_cur = rng.uniform(0.5, 1.0, 5_000)
report = MonitoringReport(
reference_actual=actual_ref, reference_predicted=predicted_ref,
current_actual=actual_cur, current_predicted=predicted_cur,
exposure=exposure_cur, murphy_distribution="poisson",
)
print(report.recommendation) # => 'RECALIBRATE' or 'REFIT'
print(report.to_polars()) # flat DataFrame: metric / value / band
See examples/quickstart.py for a fully self-contained example with feature drift and CSI.
Installation
pip install insurance-monitoring
# or
uv add insurance-monitoring
Dependencies: polars, numpy, scipy, matplotlib
MLflow integration (optional):
pip install insurance-monitoring mlflow
Features
- PSI / CSI — Population Stability Index and Characteristic Stability Index across all rating factors; exposure-weighted; traffic-light bands aligned with PRA SS1/23 model risk expectations
- A/E ratio with confidence interval — Wilson CI, exposure-weighted, RAG status; the primary calibration metric for UK pricing teams
- Gini drift test — two-sample and one-sample bootstrap designs (arXiv 2510.04556); percentile CI; governance plot suitable for model validation packs
- Decision tree (MonitoringReport.recommendation) — NO_ACTION / MONITOR_CLOSELY / RECALIBRATE / REFIT / INVESTIGATE; Murphy decomposition sharpens the RECALIBRATE vs REFIT distinction
- Murphy decomposition — decomposes scoring loss into uncertainty, discrimination (DSC), and miscalibration (MCB); when DSC falls, REFIT; when MCB dominates, RECALIBRATE
- Anytime-valid calibration monitoring (PITMonitor) — probability integral transform e-processes (Henzi, Murph, Ziegel 2025); valid type I error control at every monthly check, forever; solves repeated-testing inflation
- Sequential A/B testing (SequentialTest) — mixture SPRT (Johari et al. 2022); supports Poisson frequency, log-normal severity, and compound loss ratio; no pre-specified sample size
- TRIPODD drift attribution (DriftAttributor, InterpretableDriftDetector) — feature-interaction-aware; identifies which factors explain model performance degradation; Bonferroni or Benjamini-Hochberg FDR control; exposure weighting; Poisson deviance loss
- MLflow integration (MonitoringTracker) — logs all metrics and bands to an MLflow run; requires mlflow (optional dependency)
- Polars-native throughout — no pandas required; flat to_polars() output writes directly to Delta tables on Databricks
Modules
MonitoringReport
The single entry point for a complete monthly or quarterly monitoring run.
from insurance_monitoring import MonitoringReport
report = MonitoringReport(
reference_actual=train_claims,
reference_predicted=train_predicted,
current_actual=current_claims,
current_predicted=current_predicted,
exposure=current_exposure,
feature_df_reference=train_features, # pl.DataFrame
feature_df_current=current_features, # pl.DataFrame
features=["driver_age", "vehicle_age", "ncd_years"],
murphy_distribution="poisson", # Murphy decomposition — always available
gini_bootstrap=True, # percentile CI on Gini (v0.6.0)
)
print(report.recommendation) # 'NO_ACTION' | 'MONITOR_CLOSELY' | 'RECALIBRATE' | 'REFIT' | 'INVESTIGATE'
print(report.to_dict()) # nested dict — JSON serialisable, log to MLflow
print(report.to_polars()) # flat DataFrame — write to Delta table
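On Databricks, the flat frame can go straight into a Delta table with Polars' write_delta (requires the deltalake package; the table path below is a placeholder):
report.to_polars().write_delta(
    "/Volumes/model_risk/monitoring/runs",  # placeholder table path
    mode="append",
)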
Recommendation logic follows the three-stage decision tree from arXiv 2510.04556:
- Gini OK + A/E OK → NO_ACTION
- A/E bad only → RECALIBRATE (update intercept)
- Gini bad → REFIT (rebuild on recent data)
- Murphy decomposition present: overrides the heuristic when the split between miscalibration and discrimination loss is unambiguous (sketched below)
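For intuition, here is roughly the shape of that logic as plain Python. The thresholds and the mcb_share input are assumptions for illustration, not the library's calibrated bands:
def recommend(ae_in_band: bool, gini_drifted: bool,
              mcb_share: float | None = None) -> str:
    """Illustrative three-stage decision tree; all thresholds are assumed."""
    if ae_in_band and not gini_drifted:
        return "NO_ACTION"
    # Gini drift means ranking power has degraded: rebuild on recent data
    verdict = "REFIT" if gini_drifted else "RECALIBRATE"
    # Murphy override: trust the decomposition when it is unambiguous
    if mcb_share is not None:
        if mcb_share > 0.8:    # miscalibration dominates the excess loss
            verdict = "RECALIBRATE"
        elif mcb_share < 0.2:  # discrimination loss dominates
            verdict = "REFIT"
    return verdict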
drift — PSI, CSI, KS, Wasserstein
from insurance_monitoring.drift import psi, csi, ks_test, wasserstein_distance
import polars as pl
# PSI on a single feature — exposure-weighted
drift_val = psi(
reference=driver_age_train,
current=driver_age_now,
n_bins=10,
exposure_weights=exposure_now, # car-years, not policy count
reference_exposure=exposure_train,
)
print(f"PSI: {drift_val:.3f}") # < 0.10 green | 0.10–0.25 amber | > 0.25 red
# CSI heatmap across all rating factors — returns pl.DataFrame with band column
csi_df = csi(
reference_df=train_df,
current_df=current_df,
features=["driver_age", "vehicle_age", "ncd_years", "vehicle_group"],
)
print(csi_df) # feature | csi | band
Use PSI/CSI for operational dashboards. Use ks_test for formal hypothesis testing at quarter-end (note: over-sensitive at n > 500k). Use wasserstein_distance when communicating to non-technical stakeholders — it reports drift in original feature units.
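For reference, here is the textbook unweighted PSI that psi() generalises with exposure weights — a minimal NumPy sketch, assuming a continuous feature with distinct decile edges:
import numpy as np

def psi_unweighted(reference, current, n_bins=10):
    """Textbook PSI: sum over bins of (p_cur - p_ref) * ln(p_cur / p_ref)."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the reference range
    p_ref = np.histogram(reference, bins=edges)[0] / len(reference)
    p_cur = np.histogram(current, bins=edges)[0] / len(current)
    p_ref = np.clip(p_ref, 1e-6, None)      # small floor avoids log(0) on empty bins
    p_cur = np.clip(p_cur, 1e-6, None)
    return float(np.sum((p_cur - p_ref) * np.log(p_cur / p_ref)))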
calibration — A/E, balance, Murphy, rectification
from insurance_monitoring import (
ae_ratio, ae_ratio_ci,
check_balance, check_auto_calibration,
murphy_decomposition,
rectify_balance, isotonic_recalibrate,
CalibrationChecker,
)
# A/E ratio with Wilson CI
result = ae_ratio_ci(actual, predicted, exposure=exposure)
print(f"A/E = {result['ae']:.3f} 95% CI [{result['lower']:.3f}, {result['upper']:.3f}]")
# Murphy decomposition — distinguishes miscalibration from discrimination loss
murphy = murphy_decomposition(y=actual, y_hat=predicted, exposure=exposure, distribution="poisson")
print(f"DSC (discrimination score): {murphy.discrimination:.4f}")
print(f"MCB (miscalibration): {murphy.miscalibration:.4f}")
print(f"Verdict: {murphy.verdict}") # 'RECALIBRATE' or 'REFIT'
# Full calibration audit in one call
checker = CalibrationChecker(distribution="poisson")
report = checker.check(actual, predicted, exposure=exposure)
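For intuition on what murphy_decomposition is doing: score decompositions of this family (the CORP approach of Dimitriadis, Gneiting and Jordan, 2021) recalibrate the predictions with isotonic regression and split the mean score as S = MCB - DSC + UNC. A sketch using scikit-learn's isotonic fit (scikit-learn is not a dependency of the library, and exposure weighting is omitted for brevity):
import numpy as np
from sklearn.isotonic import IsotonicRegression

def poisson_deviance(y, mu):
    """Mean Poisson deviance, with the y = 0 term handled explicitly."""
    y, mu = np.asarray(y, float), np.clip(np.asarray(mu, float), 1e-12, None)
    term = np.where(y > 0, y * np.log(np.clip(y, 1e-12, None) / mu), 0.0)
    return float(np.mean(2.0 * (term - (y - mu))))

def corp_decomposition(y, y_hat):
    """CORP-style split of the mean score: S = MCB - DSC + UNC."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    y_recal = IsotonicRegression(y_min=0.0).fit_transform(y_hat, y)
    s_raw = poisson_deviance(y, y_hat)                       # raw predictions
    s_recal = poisson_deviance(y, y_recal)                   # after recalibration
    s_clim = poisson_deviance(y, np.full_like(y, y.mean()))  # constant forecast
    return {
        "MCB": s_raw - s_recal,   # miscalibration: removable by recalibration
        "DSC": s_clim - s_recal,  # discrimination: gain over a constant forecast
        "UNC": s_clim,            # irreducible uncertainty of the outcome
    }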
discrimination — Gini drift
from insurance_monitoring import (
gini_coefficient,
gini_drift_test_onesample,
GiniDriftBootstrapTest,
)
# One-sample design: tests monitor data against a stored training Gini scalar
# (more natural for deployed model monitoring — reference data may not be available)
result = gini_drift_test_onesample(
training_gini=0.42,
monitor_actual=current_claims,
monitor_predicted=current_predicted,
monitor_exposure=current_exposure,
)
print(f"Gini change: {result.gini_change:+.3f} p={result.p_value:.3f} [{result.significant}]")
# Class-based API with governance plot
bt = GiniDriftBootstrapTest(
training_gini=0.42,
monitor_actual=current_claims,
monitor_predicted=current_predicted,
monitor_exposure=current_exposure,
n_bootstrap=500,
)
bt_result = bt.test()
print(bt.summary()) # governance-ready paragraph (method on the test class, not the result)
bt.plot() # bootstrap histogram with CI shading — IFoA/PRA deliverable
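The underlying statistic is the usual insurance concentration-curve Gini: sort policies by predicted risk, trace cumulative loss share against cumulative exposure share, and take twice the area between that curve and the diagonal. A sketch of the exposure-weighted version (normalisation conventions vary; the library may scale against a perfect-model Gini instead):
import numpy as np

def exposure_weighted_gini(actual, predicted, exposure):
    """Concentration-curve Gini: 2 x area between the curve and the diagonal."""
    order = np.argsort(predicted)  # ascending predicted risk
    x = np.concatenate([[0.0], np.cumsum(exposure[order]) / exposure.sum()])
    y = np.concatenate([[0.0], np.cumsum(actual[order]) / actual.sum()])
    area = np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0  # trapezoid rule
    return float(1.0 - 2.0 * area)  # positive when losses concentrate in the high-risk tail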
calibration.PITMonitor — anytime-valid calibration monitoring (v0.7.0)
The standard pattern — run a Hosmer-Lemeshow test or check A/E each month — inflates the false-positive rate. After twelve months at α=0.05, the chance of a false alarm exceeds 40% even if the model is perfectly calibrated.
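The 40% figure is plain independence arithmetic:
alpha, months = 0.05, 12
print(1 - (1 - alpha) ** months)  # 0.4596: roughly a 46% chance of at least one false alarm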
PITMonitor solves this using probability integral transform e-processes (Henzi, Murph, Ziegel 2025, arXiv:2603.13156). The guarantee is: P(ever raise an alarm | model calibrated) ≤ α, for all t, forever.
PITMonitor accepts pre-computed probability integral transforms (PITs) — values in [0, 1] computed from the model's predictive CDF. You compute the PIT for each observation, then pass it to monitor.update():
from insurance_monitoring import PITMonitor
from scipy.stats import poisson
monitor = PITMonitor(alpha=0.05)
# For each new observation: compute PIT from model's predictive CDF, then update
for row in live_data:
    mu = row.exposure * row.lambda_hat        # model's predicted mean for this policy
    pit = float(poisson.cdf(row.claims, mu))  # probability integral transform
    alarm = monitor.update(pit)
    if alarm.triggered:
        print(f"Calibration alarm: evidence = {alarm.evidence:.2f}")
        break
sequential — anytime-valid A/B testing (v0.8.0)
Champion/challenger pricing experiments are routinely run with a pre-specified end date. Checking early to stop a bad challenger is statistically invalid under classical hypothesis testing. SequentialTest uses mixture SPRT (Johari et al. 2022) — you can check at every interim update and stop early if the challenger is clearly better or worse, with full type I error control.
from insurance_monitoring import SequentialTest
test = SequentialTest(metric="frequency", alpha=0.05)
# Monthly updates — stop whenever the e-value crosses the threshold
for batch in monthly_batches:
    result = test.update(
        champion_claims=batch.champion_claims,
        champion_exposure=batch.champion_exposure,
        challenger_claims=batch.challenger_claims,
        challenger_exposure=batch.challenger_exposure,
    )
    print(f"e-value: {result.lambda_value:.2f} stopped: {result.should_stop}")
    if result.should_stop:
        break
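The core idea is compact in the Gaussian case: mix the likelihood ratio over a N(0, τ²) prior on the effect size and stop when the mixture crosses 1/α, which Ville's inequality makes valid at every look. A sketch of that closed form; it illustrates the principle only, and sigma and tau here are assumed inputs rather than anything the library exposes:
import numpy as np

def msprt_normal(x, sigma, tau=1.0, alpha=0.05):
    """Mixture SPRT for H0: mean = 0, with mixing prior N(0, tau^2).

    Lambda_n = sqrt(sigma^2 / (sigma^2 + n tau^2))
               * exp(n^2 tau^2 xbar^2 / (2 sigma^2 (sigma^2 + n tau^2)))
    Stopping when Lambda_n >= 1 / alpha controls type I error at every n.
    """
    x = np.asarray(x, float)
    n, xbar = len(x), x.mean()
    v = sigma**2 + n * tau**2
    lam = np.sqrt(sigma**2 / v) * np.exp(n**2 * tau**2 * xbar**2 / (2 * sigma**2 * v))
    return float(lam), bool(lam >= 1 / alpha)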
drift_attribution — TRIPODD (v0.4.0+)
PSI tells you that driver_age has drifted. TRIPODD tells you whether that driver_age drift explains why the model's discrimination has fallen — accounting for feature interactions. Use InterpretableDriftDetector (v0.7.0) for high-dimensional feature sets or when exposure weighting is required.
from insurance_monitoring import InterpretableDriftDetector
detector = InterpretableDriftDetector(
error_control="fdr", # Benjamini-Hochberg for d >= 10 factors
loss="poisson_deviance", # canonical for frequency models
)
detector.fit_reference(X_ref, y_ref, weights=exposure_ref)
result = detector.test(X_cur, y_cur, weights=exposure_cur)
print(result.significant_features) # list of features that explain the performance shift
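Independent of TRIPODD's internals, the Benjamini-Hochberg step behind error_control="fdr" is the standard one: sort the per-feature p-values, find the largest k with p_(k) <= k*alpha/m, and flag the k smallest. A sketch:
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of features rejected under BH FDR control at level alpha."""
    p = np.asarray(p_values, float)
    m = len(p)
    order = np.argsort(p)
    passing = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.nonzero(passing)[0].max() + 1 if passing.any() else 0  # largest passing rank
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject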
Regulatory context
PRA SS1/23 (Model Risk Management, March 2023) requires insurers to maintain a model monitoring framework that detects deterioration in model performance. The expectation is documented thresholds, regular testing, and a governance process that escalates to the model risk committee when thresholds are breached. A/E ratio and Gini monitoring are the two metrics most commonly cited in SS1/23 supervisory discussions.
Consumer Duty (PS22/9) requires ongoing monitoring of whether pricing outcomes are fair across customer groups, not just at point of sale. The combination of MonitoringReport and insurance-fairness calibration_by_group() produces a per-protected-group A/E split suitable for Consumer Duty Outcome 4 monitoring.
Expected performance
On a 50,000-policy synthetic UK motor portfolio:
| Task | Time | Notes |
|---|---|---|
| PSI on one feature | < 0.1s | Exposure-weighted |
| CSI across 10 features | < 0.5s | Returns Polars DataFrame |
| ae_ratio_ci() | < 0.1s | Wilson CI |
| MonitoringReport (no Murphy) | < 2s | A/E + Gini + CSI |
| MonitoringReport with Murphy | < 5s | Adds MCB/DSC decomposition |
| GiniDriftBootstrapTest (n_bootstrap=500) | 5–15s | Percentile CI |
| SequentialTest batch update | < 0.1s | Per monthly update |
| InterpretableDriftDetector (10 features) | 30–90s | FDR-controlled bootstrap |
Run the full workflow on Databricks
Related libraries
| Library | What it does |
|---|---|
| insurance-fairness | Per-protected-group A/E calibration, proxy detection, Consumer Duty audit reports |
| insurance-causal | Double ML — establishes whether a rating factor causally drives risk or is a proxy |
| insurance-governance | Model risk committee packs and FCA Consumer Duty documentation |
| insurance-gam | Interpretable GLM-style models whose factors can be monitored directly |
Questions or feedback? Start a Discussion. Found it useful? A star helps others find it.
Need help implementing this in production? Talk to us.