insurance-monitoring
Model drift detection, monitoring, and calibration for insurance pricing models: PSI, CSI, Gini drift, A/E ratios, balance property testing, Murphy decomposition.
Deployed insurance pricing models go stale. The portfolio ages, the claims environment shifts, regulators change the rules. Without systematic monitoring you find out about it when the loss ratio deteriorates — typically 12 to 18 months after the model started misfiring.
This library gives UK pricing teams two things in one install:
- Ongoing model monitoring — exposure-weighted PSI for feature distributions, A/E ratios with Poisson confidence intervals, and the Gini drift z-test from arXiv 2510.04556.
- Deep calibration diagnostics — balance property testing, auto-calibration, Murphy decomposition (UNC/DSC/MCB), and rectification methods for model sign-off and root-cause analysis (Lindholm & Wüthrich, SAJ 2025).
The two layers serve the same person — the pricing actuary — at different points in the model lifecycle. Use the monitoring layer for monthly/quarterly dashboards. Use the calibration suite when a model needs to be signed off or when monitoring flags a problem you need to diagnose.
No scikit-learn. No pandas. Polars-native throughout.
Installation
uv add insurance-monitoring
Quick example
This example uses named rating factors — which is how actuaries actually work with this data.
import polars as pl
import numpy as np
from insurance_monitoring import MonitoringReport
rng = np.random.default_rng(42)
# Reference period: training window
n_ref = 50_000
pred_ref = rng.uniform(0.05, 0.20, n_ref)
act_ref = rng.poisson(pred_ref).astype(float)
# Current monitoring period: 18 months into deployment
n_cur = 15_000
pred_cur = rng.uniform(0.05, 0.20, n_cur)
act_cur = rng.poisson(pred_cur * 1.08).astype(float) # model is 8% optimistic
# Feature DataFrames with named rating factors — pass these to get CSI per feature
feat_ref = pl.DataFrame({
    "driver_age": rng.integers(18, 80, n_ref).tolist(),
    "vehicle_age": rng.integers(0, 15, n_ref).tolist(),
    "ncd_years": rng.integers(0, 9, n_ref).tolist(),
})
feat_cur = pl.DataFrame({
    "driver_age": rng.integers(25, 85, n_cur).tolist(),  # older drivers entering book
    "vehicle_age": rng.integers(0, 15, n_cur).tolist(),
    "ncd_years": rng.integers(0, 9, n_cur).tolist(),
})
report = MonitoringReport(
    reference_actual=act_ref,
    reference_predicted=pred_ref,
    current_actual=act_cur,
    current_predicted=pred_cur,
    feature_df_reference=feat_ref,
    feature_df_current=feat_cur,
    features=["driver_age", "vehicle_age", "ncd_years"],
    murphy_distribution="poisson",
)
print(report.recommendation)
# 'RECALIBRATE' | 'REFIT' | 'NO_ACTION' | 'INVESTIGATE' | 'MONITOR_CLOSELY'
df = report.to_polars()
# metric | value | band
# ae_ratio | 1.08 | amber
# gini_current | 0.39 | amber
# gini_p_value | 0.054 | amber
# csi_driver_age | 0.14 | amber
# murphy_discrimination | 0.041 | RECALIBRATE
# murphy_miscalibration | 0.003 | RECALIBRATE
# recommendation | nan | RECALIBRATE
If you just want to run a quick sanity check without feature data:
import numpy as np
from insurance_monitoring import MonitoringReport
rng = np.random.default_rng(42)
pred_ref = rng.uniform(0.05, 0.20, 50_000)
act_ref = rng.poisson(pred_ref).astype(float)
pred_cur = rng.uniform(0.05, 0.20, 15_000)
act_cur = rng.poisson(pred_cur * 1.08).astype(float)
report = MonitoringReport(
    reference_actual=act_ref,
    reference_predicted=pred_ref,
    current_actual=act_cur,
    current_predicted=pred_cur,
    murphy_distribution="poisson",
)
print(report.recommendation)
Worked Example
model_drift_monitoring.py demonstrates the full monitoring stack on a synthetic motor book with three deliberately induced failure modes: covariate shift (older driver mix), calibration deterioration (segment-level A/E drift), and discriminatory power loss (Gini decay). It covers exposure-weighted PSI and CSI, segment A/E ratios with Poisson confidence intervals, the Gini drift z-test, and structured governance reporting suitable for a PRA SS1/23 model risk log.
A Databricks-importable version is also available: Databricks notebook.
Modules
calibration - A/E ratio, calibration suite, Murphy decomposition
The calibration module has two layers. Use A/E for routine monitoring. Use the calibration suite for model sign-off.
A/E ratio monitoring:
from insurance_monitoring.calibration import ae_ratio, ae_ratio_ci
# Aggregate A/E with Poisson CI (exact Garwood intervals)
result = ae_ratio_ci(actual, predicted, exposure=exposure)
# {'ae': 1.08, 'lower': 1.04, 'upper': 1.12, 'n_claims': 342, 'n_expected': 317}
# Segmented A/E: where is the model misfiring?
seg_ae = ae_ratio(
    actual, predicted, exposure=exposure,
    segments=driver_age_bands,
)
# Returns Polars DataFrame: segment | actual | expected | ae_ratio | n_policies
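For reference, the exact Garwood interval quoted above can be reproduced from chi-squared quantiles. A minimal sketch, assuming scipy is available; `garwood_ae_ci` is an illustrative helper, not the library's API:

```python
import numpy as np
from scipy.stats import chi2

def garwood_ae_ci(n_claims: int, n_expected: float, alpha: float = 0.05) -> dict:
    """Exact (Garwood) Poisson CI for an A/E ratio.

    Treats the observed claim count as Poisson; the exact CI for the
    Poisson mean is divided by the expected count to bound A/E.
    """
    lower_mean = chi2.ppf(alpha / 2, 2 * n_claims) / 2 if n_claims > 0 else 0.0
    upper_mean = chi2.ppf(1 - alpha / 2, 2 * (n_claims + 1)) / 2
    return {
        "ae": n_claims / n_expected,
        "lower": lower_mean / n_expected,
        "upper": upper_mean / n_expected,
    }

# 342 observed vs 317 expected: A/E about 1.08, with a CI that
# quantifies how much of that gap could be Poisson noise.
ci = garwood_ae_ci(n_claims=342, n_expected=317)
```

The exact interval matters at low claim counts, where a normal approximation to the Poisson is unreliable.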
Calibration suite — model sign-off:
from insurance_monitoring.calibration import CalibrationChecker
checker = CalibrationChecker(distribution='poisson', alpha=0.05)
report = checker.check(y_holdout, y_hat_holdout, exposure_holdout)
print(report.verdict()) # 'OK' | 'RECALIBRATE' | 'REFIT'
print(report.summary()) # human-readable diagnostic paragraph
# Individual components
print(report.balance) # BalanceResult: global A/E ratio with bootstrap CI
print(report.auto_calibration) # AutoCalibResult: per-cohort bootstrap MCB test
print(report.murphy) # MurphyResult: UNC/DSC/MCB/GMCB/LMCB decomposition
Murphy decomposition directly:
from insurance_monitoring.calibration import murphy_decomposition
result = murphy_decomposition(y, y_hat, exposure, distribution='poisson')
# result.uncertainty # baseline deviance (data difficulty)
# result.discrimination # DSC: skill from ranking
# result.miscalibration # MCB: excess from wrong price levels
# result.global_mcb # GMCB: portion fixed by multiplying all predictions by A/E
# result.local_mcb # LMCB: portion requiring model refit
# result.verdict # 'OK' | 'RECALIBRATE' | 'REFIT'
Why two calibration layers? The A/E ratio answers "is the model globally right?". The Murphy decomposition answers "if it is wrong, is it wrong in a cheap way (scale factor) or an expensive way (the ranking is broken)?". You need both to make the RECALIBRATE vs REFIT decision correctly.
On the IBNR problem: the A/E ratio and balance test are only reliable on mature accident periods. For motor, at least 12 months of claims development. For liability, 24+ months. Apply chain-ladder factors first when monitoring recent accident months.
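As a sketch of that IBNR adjustment: reported claims on immature accident periods can be grossed up to ultimate before the A/E is computed. The development factors and helper below are illustrative, not part of the library:

```python
import numpy as np

# Hypothetical cumulative development factors (CDFs) for a motor book:
# claims reported at maturity t months are multiplied up to ultimate.
cdf_by_maturity = {3: 1.45, 6: 1.18, 9: 1.07, 12: 1.02, 24: 1.00}

def developed_actuals(reported: np.ndarray, maturity_months: np.ndarray) -> np.ndarray:
    """Gross reported claim counts up to ultimate before computing A/E."""
    factors = np.array([cdf_by_maturity[int(m)] for m in maturity_months])
    return reported * factors

reported = np.array([120.0, 180.0, 240.0])  # claims reported so far, by accident period
maturity = np.array([3, 6, 12])             # months of development for each period
expected = np.array([190.0, 210.0, 250.0])  # model-expected ultimate claims

ae_raw = reported.sum() / expected.sum()    # biased low: immature periods undercount
ae_dev = developed_actuals(reported, maturity).sum() / expected.sum()
```

Without the gross-up, the raw A/E on recent months understates claims and can mask a model that is actually running hot.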
drift - Feature distribution monitoring
from insurance_monitoring.drift import psi, csi, ks_test, wasserstein_distance
import polars as pl
# PSI with exposure weighting (insurance-correct)
score_psi = psi(
    reference=score_train,
    current=score_q1_2025,
    n_bins=10,
    exposure_weights=earned_exposure,  # car-years, not policy count
)
# CSI heatmap across all rating factors
feature_ref = pl.DataFrame({"driver_age": [...], "vehicle_age": [...], "ncd_years": [...]})
feature_cur = pl.DataFrame({"driver_age": [...], "vehicle_age": [...], "ncd_years": [...]})
csi_table = csi(feature_ref, feature_cur, features=["driver_age", "vehicle_age", "ncd_years"])
# Returns: feature | csi | band
# Wasserstein: report drift in original units
d = wasserstein_distance(driver_ages_train, driver_ages_q1_2025)
print(f"Average driver age shifted by {d:.1f} years")
On exposure-weighted PSI: standard PSI treats every policy equally regardless of how long it was on risk. If your book renews quarterly and mixes 1-month and 12-month policies, unweighted PSI is wrong. The exposure_weights parameter weights bin proportions by earned exposure.
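A minimal numpy sketch of what the exposure weighting does: bin proportions are shares of earned exposure rather than policy counts. Illustrative only; bin edges here are plain reference deciles, which may differ from the library's binning:

```python
import numpy as np

def exposure_weighted_psi(ref, cur, ref_exposure, cur_exposure, n_bins=10):
    """PSI where bin proportions are shares of earned exposure.

    Bin edges are reference-distribution deciles; outer edges are opened
    to +/-inf so current-period outliers still land in a bin, and a small
    epsilon guards against log(0) on empty bins.
    """
    edges = np.quantile(ref, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    eps = 1e-6
    ref_w = np.histogram(ref, bins=edges, weights=ref_exposure)[0]
    cur_w = np.histogram(cur, bins=edges, weights=cur_exposure)[0]
    p = np.maximum(ref_w / ref_w.sum(), eps)
    q = np.maximum(cur_w / cur_w.sum(), eps)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
ref = rng.normal(40, 10, 20_000)  # e.g. driver age at training time
cur = rng.normal(44, 10, 5_000)   # the book has aged by ~0.4 sigma
psi_val = exposure_weighted_psi(ref, cur,
                                rng.uniform(0.1, 1.0, 20_000),
                                rng.uniform(0.1, 1.0, 5_000))
```

With a shift of this size the sketch lands in amber territory on the conventional 0.1/0.25 scale.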
discrimination - Gini drift test
from insurance_monitoring.discrimination import gini_coefficient, gini_drift_test
gini_ref = gini_coefficient(act_ref, pred_ref, exposure=exp_ref)
gini_cur = gini_coefficient(act_cur, pred_cur, exposure=exp_cur)
# Statistical test: has Gini degraded significantly?
# Implements arXiv 2510.04556 Theorem 1
result = gini_drift_test(
    reference_gini=gini_ref,
    current_gini=gini_cur,
    n_reference=50_000,
    n_current=15_000,
    reference_actual=act_ref, reference_predicted=pred_ref,
    current_actual=act_cur, current_predicted=pred_cur,
)
# {'z_statistic': -1.93, 'p_value': 0.054, 'gini_change': -0.03, 'significant': False}
The Gini drift test is the distinguishing feature of this library. Most monitoring tools tell you whether A/E has moved. This tells you whether the model's ranking has degraded — the difference between a cheap recalibration and a full refit.
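As a sketch of what the Gini measures here: sort policies by predicted risk, riskiest first, and ask how fast actual claims accumulate relative to the diagonal. This is an illustrative unweighted version; the library's implementation is exposure-weighted:

```python
import numpy as np

def gini_from_ranking(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Concentration-curve Gini: twice the area between the cumulative
    claims curve (policies sorted by predicted risk, descending) and the
    diagonal. Zero means the ranking is no better than random."""
    order = np.argsort(-predicted)
    cum_claims = np.cumsum(actual[order]) / actual.sum()
    # Rectangle-rule area under the concentration curve
    return float(2 * cum_claims.mean() - 1)

rng = np.random.default_rng(1)
freq = rng.uniform(0.05, 0.20, 50_000)
claims = rng.poisson(freq).astype(float)

g_good = gini_from_ranking(claims, freq)                      # model knows true frequency
g_none = gini_from_ranking(claims, rng.uniform(size=50_000))  # random ranking, near zero
```

A falling Gini with a stable A/E is exactly the pattern an aggregate dashboard misses: the price level is right but the ordering of risks is decaying.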
report - Combined monitoring in one call
from insurance_monitoring import MonitoringReport
report = MonitoringReport(
    reference_actual=act_ref,
    reference_predicted=pred_ref,
    current_actual=act_cur,
    current_predicted=pred_cur,
    exposure=exposure_cur,
    reference_exposure=exposure_ref,
    feature_df_reference=feat_ref,  # Polars DataFrame
    feature_df_current=feat_cur,
    features=["driver_age", "vehicle_age", "ncd_years"],
    murphy_distribution="poisson",
)
print(report.recommendation)
# 'REFIT' | 'RECALIBRATE' | 'NO_ACTION' | 'INVESTIGATE' | 'MONITOR_CLOSELY'
df = report.to_polars()
# metric | value | band
# ae_ratio | 1.08 | amber
# gini_current | 0.39 | amber
# gini_p_value | 0.054 | amber
# csi_driver_age | 0.14 | amber
# murphy_discrimination | 0.041 | RECALIBRATE
# murphy_miscalibration | 0.003 | RECALIBRATE
# recommendation | nan | RECALIBRATE
thresholds - Configurable traffic lights
from insurance_monitoring.thresholds import MonitoringThresholds, PSIThresholds
# Tighten PSI thresholds for a large motor book with monthly monitoring
custom = MonitoringThresholds(
    psi=PSIThresholds(green_max=0.05, amber_max=0.15),
)
report = MonitoringReport(..., thresholds=custom)
Default thresholds follow industry convention (PSI: 0.1/0.25 from FICO/credit scoring; A/E: 0.95–1.05 green, 0.90–1.10 amber; Gini: p < 0.32 amber, p < 0.10 red per arXiv 2510.04556 recommendation).
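The banding logic implied by those defaults can be sketched as plain threshold functions. Names and signatures are illustrative, not the library's internals:

```python
def psi_band(value: float, green_max: float = 0.10, amber_max: float = 0.25) -> str:
    """Map a PSI value to a traffic-light band using the conventional
    0.1 / 0.25 cut-offs from credit scoring."""
    if value <= green_max:
        return "green"
    if value <= amber_max:
        return "amber"
    return "red"

def ae_band(value: float) -> str:
    """A/E banding: 0.95-1.05 green, 0.90-1.10 amber, otherwise red.
    Symmetric around 1.0, so over- and under-pricing are flagged alike."""
    deviation = abs(value - 1.0)
    if deviation <= 0.05:
        return "green"
    if deviation <= 0.10:
        return "amber"
    return "red"
```

On these defaults the earlier example's A/E of 1.08 lands in amber, consistent with the report table shown above.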
Decision framework
The recommendation property implements the three-stage decision tree from arXiv 2510.04556, mapped to actuarial practice:
| Signal | Recommendation | Action |
|---|---|---|
| No drift in any test | NO_ACTION | Continue, schedule next review |
| A/E red, Gini stable | RECALIBRATE | Update intercept/offset (hours of work) |
| Gini red | REFIT | Rebuild model on recent data (weeks of work) |
| Both red | INVESTIGATE | Manual review — check data quality first |
| Any amber | MONITOR_CLOSELY | Increase monitoring frequency |
When murphy_distribution is set, the Murphy decomposition sharpens the RECALIBRATE vs REFIT distinction: if GMCB > LMCB (global shift dominates), RECALIBRATE; if LMCB >= GMCB (local structure is broken), REFIT.
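That GMCB/LMCB comparison can be sketched as a one-screen decision rule. The function name and `tol` parameter are illustrative, not the library's API:

```python
def murphy_verdict(gmcb: float, lmcb: float, tol: float = 0.0) -> str:
    """Sketch of the RECALIBRATE-vs-REFIT rule described above.

    If total miscalibration is negligible, do nothing. If the global
    component dominates, a scale factor on all predictions fixes it
    (cheap). If the local component dominates, the model's structure
    is wrong and it needs refitting (expensive).
    """
    if gmcb + lmcb <= tol:
        return "OK"
    return "RECALIBRATE" if gmcb > lmcb else "REFIT"
```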
Calibration plots
The calibration module includes matplotlib visualisations for model documentation:
from insurance_monitoring.calibration import (
    CalibrationChecker,
    plot_auto_calibration,
    plot_murphy,
    plot_calibration_report,
)
checker = CalibrationChecker(distribution='poisson')
report = checker.check(y, y_hat, exposure)
# Three-panel combined figure (auto-calibration + Murphy bar + per-bin heatmap)
fig = plot_calibration_report(report)
fig.savefig("model_calibration_sign_off.pdf")
Databricks integration
The demo notebook at notebooks/demo_monitoring.py shows the full workflow on synthetic motor data and runs on Databricks serverless. Upload it to your workspace and schedule it as a monthly job against your MLflow inference table.
Background
The monitoring framework implements:
"Model Monitoring: A General Framework with an Application to Non-life Insurance Pricing", arXiv 2510.04556 (2025)
The calibration suite implements:
Lindholm & Wüthrich, "Three calibration properties for insurance pricing models" (SAJ 2025)
Brauer et al., arXiv 2510.04556, Section 4 — Murphy decomposition and the MCB bootstrap test
Read more
Your Pricing Model is Drifting (and You Probably Can't Tell) — why PSI alone is insufficient, and what it means when A/E is stable but the Gini is falling.
Capabilities Demo
Demonstrated on synthetic UK motor data with three deliberately induced failure modes: covariate shift (older drivers enter the book), calibration deterioration (claim frequency inflated for a segment), and stale discrimination (model trained on old data, portfolio composition changed). Full notebook: notebooks/benchmark.py.
- PSI/CSI flags the covariate shift — feature distributions in the monitoring period diverge from training, triggering configurable traffic lights (PSI > 0.25 = red)
- A/E ratio with confidence intervals catches calibration drift — segment-level actual-to-expected ratios with statistical significance tests, not just point estimates
- Gini drift z-test (arXiv 2510.04556) detects discrimination loss — the discriminatory power of the model has declined, which a standard A/E dashboard would miss
- MonitoringReport assembles all three checks into a single traffic-light summary with a recommended action: monitor, investigate, or refit
When to use: Any time more than a month has passed since the last model refit. A typical UK motor pricing cycle is 6–12 months between refits; covariate shift and calibration drift accumulate silently in between. Run the monitoring report monthly on the live book.
Databricks Notebook
A ready-to-run Databricks notebook benchmarking this library against standard approaches is available in burning-cost-examples.
Related libraries
| Library | Why it's relevant |
|---|---|
| shap-relativities | Extract rating relativities from GBMs — when monitoring flags REFIT, use SHAP to diagnose which factors have drifted most |
| insurance-interactions | GLM interaction detection — a refit triggered by Gini degradation may need new interactions added |
| insurance-causal-policy | SDID causal evaluation — if monitoring shows deterioration after a rate change, use this to isolate cause |
| insurance-cv | Walk-forward cross-validation — use monitoring outputs to decide when to retrain and validate the retrained model |
| insurance-optimise | Constrained rate change optimisation — monitoring informs when a rate adjustment is needed; rate-optimiser determines the right one |
Performance
Benchmarked against a manual aggregate A/E ratio check on synthetic UK motor insurance data — 50,000 policies, Poisson GLM trained on 2019–2021, monitored on a deliberately shifted 2023 portfolio: young drivers (under 25) oversampled 2x, high-risk area policies (areas E and F) oversampled 50%, conviction points shifted upward for 20% of policies. Dataset has known DGP so the ground truth for which features have shifted is available.
The central finding: aggregate A/E on the shifted portfolio looks acceptable (near 1.0), because the model's errors partially cancel at portfolio level. MonitoringReport raises RED and AMBER PSI flags for the features that have actually shifted.
| Monitoring check | Manual A/E check | MonitoringReport (PSI/CSI) | Notes |
|---|---|---|---|
| Aggregate A/E — shifted data | Computed | Same value computed | Both agree on A/E; neither should be used alone |
| driver_age distributional shift | Not detected | Expected: PSI RED (>0.25) | 2x young driver oversampling doubles the under-25 proportion |
| area distributional shift | Not detected | Expected: PSI AMBER/RED | High-risk area overweighting detected via PSI |
| conviction_points shift | Not detected | Expected: PSI AMBER | 20% of policies shifted +1 conviction point |
| RED PSI flags raised | 0 | Expected: 1–2 features | Depends on shift magnitude at runtime |
| AMBER PSI flags raised | 0 | Expected: 1–3 features | Configurable thresholds |
| Gini drift (ref → shifted) | Not computed | Computed with bootstrap CI | Statistically tests whether ranking has degraded |
| Structured audit trail | No | Yes (traffic-light report) | Required for PRA SS1/23 model risk documentation |
The manual A/E check is blind to who is inside the portfolio. It will report no alarm while the model is systematically mispricing the fastest-growing segment. PSI per feature catches this. The gap between what A/E reports and what is actually happening grows as the portfolio drifts further from the training distribution.
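A toy numpy demonstration of the cancellation effect described above (segment sizes and frequencies are illustrative): the book-level A/E sits near 1.0 while one segment runs 20% hot:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two segments: the model underprices young drivers (true frequency 0.24
# vs priced 0.20) and overprices the rest (0.09 vs 0.10). The mix is
# chosen so the errors roughly cancel at portfolio level.
expected = np.concatenate([np.full(10_000, 0.20),    # young drivers
                           np.full(40_000, 0.10)])   # rest of book
true_freq = np.concatenate([np.full(10_000, 0.24),
                            np.full(40_000, 0.09)])
actual = rng.poisson(true_freq).astype(float)

ae_aggregate = actual.sum() / expected.sum()          # near 1.0: looks fine
ae_young = actual[:10_000].sum() / expected[:10_000].sum()   # around 1.2
ae_rest = actual[10_000:].sum() / expected[10_000:].sum()    # around 0.9
```

The aggregate check passes while the fastest-growing segment is systematically underpriced; only a segment-level or distributional check surfaces it.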
Run notebooks/benchmark.py on Databricks to reproduce.
Related Libraries
| Library | What it does |
|---|---|
| insurance-deploy | Champion/challenger deployment — monitoring informs when to switch challenger to champion |
| insurance-cv | Walk-forward cross-validation — produces the baseline metrics that monitoring tracks prospectively |
| insurance-covariate-shift | Covariate shift detection and correction — use when monitoring flags PSI drift requiring model adaptation |
Licence
MIT