Mixture cure models for insurance non-claimer scoring: covariate-aware logistic incidence, parametric and semiparametric latency, EM estimation
insurance-cure
Mixture cure models for insurance non-claimer scoring.
The problem
Frequency GLMs treat all zero-claim policyholders the same. They do not distinguish between:
- Structural non-claimers: policyholders who would never claim regardless of how long you observed them. For example, a 60-year-old with 9 years NCB driving 5,000 miles a year.
- Lucky susceptibles: policyholders who are genuinely at risk but happened not to claim this year.
These two groups behave differently over multi-year retention horizons. The structural immune cohort will never generate claim cost regardless of tenure. The low-hazard susceptible will eventually claim.
A Poisson GLM cannot tell them apart. A mixture cure model (MCM) can.
What this library does
insurance-cure fits covariate-aware MCMs with a logistic incidence sub-model (who is susceptible?) and a parametric or semiparametric latency sub-model (when do susceptibles claim?). The primary output is a per-policyholder susceptibility score.
The population survival function:
S_pop(t | x, z) = pi(z) * S_u(t | x) + [1 - pi(z)]
- pi(z) = P(susceptible), from the logistic regression incidence sub-model
- S_u(t | x) = survival function for susceptibles, from the Weibull, log-normal or Cox latency sub-model
- [1 - pi(z)] = cure fraction: P(never experiences the event)
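To make the formula concrete, here is a minimal sketch that evaluates S_pop by hand. It is not the library's code: the coefficient values, the helper names and the Weibull parameterisation S_u(t) = exp(-(t / scale)^shape) are assumptions for illustration, and the latency covariates x are dropped for brevity.

```python
import math

def weibull_survival(t, shape, scale):
    """Assumed latency form: S_u(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def susceptibility(z, gamma):
    """pi(z) = logistic(gamma[0] + gamma[1:] . z)."""
    eta = gamma[0] + sum(g * v for g, v in zip(gamma[1:], z))
    return 1.0 / (1.0 + math.exp(-eta))

def population_survival(t, pi, shape, scale):
    """S_pop(t) = pi * S_u(t) + (1 - pi)."""
    return pi * weibull_survival(t, shape, scale) + (1.0 - pi)

# Hypothetical coefficients: intercept and one incidence covariate (NCB years).
gamma = [1.5, -0.25]
pi_new = susceptibility([0], gamma)   # first-year policyholder
pi_ncb9 = susceptibility([9], gamma)  # 9 years NCB

# Population survival at 12, 24, 36 and 60 months for the 9-year NCB profile.
curve = [population_survival(t, pi_ncb9, shape=1.2, scale=36.0)
         for t in (12, 24, 36, 60)]
print(round(pi_new, 3), round(pi_ncb9, 3), [round(s, 3) for s in curve])
```

The curve decreases with t but never drops below the cure fraction 1 - pi(z), which is the plateau the model is built to capture.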
Estimation via EM algorithm (Peng & Dear 2000; Sy & Taylor 2000). Multiple restarts to handle multimodality. Bootstrap standard errors available.
To our knowledge, no other pip-installable Python package provides covariate-aware MCMs with actuarial output. R users have smcure, flexsurvcure and cuRe; this library fills the gap on the Python side.
Installation
pip install insurance-cure
Dependencies: numpy, scipy, pandas, scikit-learn, lifelines, joblib.
Quick start
import pandas as pd
from insurance_cure import WeibullMixtureCure
from insurance_cure.diagnostics import sufficient_followup_test, CureScorecard
from insurance_cure.simulate import simulate_motor_panel
# Generate synthetic motor panel with known cure fraction 40%
df = simulate_motor_panel(n_policies=3000, cure_fraction=0.40, seed=42)
# ALWAYS check sufficient follow-up before fitting
qn = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(qn.summary())
# Fit Weibull MCM
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
print(model.result_.summary())
# Outputs
cure_scores = model.predict_cure_fraction(df) # P(immune) per policy
suscept = model.predict_susceptibility(df) # 1 - cure_fraction
pop_surv = model.predict_population_survival(df, times=[12, 24, 36, 60])
# Validate with scorecard
scorecard = CureScorecard(model, bins=10).fit(df, duration_col="tenure_months", event_col="claimed")
print(scorecard.summary())
Models
WeibullMixtureCure (recommended)
Weibull AFT latency. Clean parametric extrapolation. Best default choice.
from insurance_cure import WeibullMixtureCure
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,      # EM restarts; use >=5 for production
    bootstrap_se=True,  # Bootstrap SEs; slow but rigorous
    n_bootstrap=200,
    n_jobs=-1,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
LogNormalMixtureCure
Log-normal latency. Better when the conditional hazard peaks then falls — sometimes fits pet or travel data better than Weibull.
from insurance_cure import LogNormalMixtureCure
model = LogNormalMixtureCure(
    incidence_formula="pet_age + breed_risk + indoor",
    latency_formula="pet_age + breed_risk",
)
model.fit(df)
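The "peaks then falls" behaviour can be checked directly from the hazard functions. The sketch below is standalone and does not use the library; the parameter values (mu = log 12, sigma = 1, Weibull shape 1.2 and scale 36) are arbitrary choices for illustration.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_hazard(t, mu, sigma):
    """h(t) = f(t) / S(t) for a log-normal time-to-event distribution."""
    z = (math.log(t) - mu) / sigma
    density = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))
    return density / (1.0 - normal_cdf(z))

def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

months = [1, 3, 6, 12, 24, 60, 120]
ln_haz = [lognormal_hazard(t, mu=math.log(12.0), sigma=1.0) for t in months]
wb_haz = [weibull_hazard(t, shape=1.2, scale=36.0) for t in months]

# Month on the grid at which the log-normal hazard is largest.
peak_month = months[ln_haz.index(max(ln_haz))]
print(peak_month, [round(h, 4) for h in ln_haz], [round(h, 4) for h in wb_haz])
```

The log-normal hazard rises to an interior peak and then declines, whereas a Weibull hazard is always monotone (here increasing, because the shape exceeds 1).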
CoxMixtureCure
Semiparametric Cox PH latency. Nonparametric baseline hazard — most flexible. Cannot extrapolate beyond the observation window. Use for exploration, not production pricing projection.
from insurance_cure import CoxMixtureCure
model = CoxMixtureCure(
    incidence_formula="ncb_years + age",
    latency_formula="ncb_years",
)
model.fit(df)
PromotionTimeCure
Non-mixture (promotion time) cure model with a population-level proportional hazards structure. Included as a comparison model. The cure fraction emerges from the asymptote of the population survival function; there is no explicit incidence sub-model.
from insurance_cure import PromotionTimeCure
model = PromotionTimeCure(formula="ncb_years + age + vehicle_age")
model.fit(df)
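For intuition on how a cure fraction "emerges from the asymptote", the textbook promotion time form is S_pop(t | x) = exp(-theta(x) * F(t)), where F is a proper CDF and theta(x) = exp(x'beta). The sketch below assumes that form with a Weibull F; whether PromotionTimeCure uses exactly this parameterisation is not stated here, so treat the names and values as illustrative.

```python
import math

def weibull_cdf(t, shape, scale):
    return 1.0 - math.exp(-((t / scale) ** shape))

def promotion_time_survival(t, theta, shape, scale):
    """Textbook promotion time form: S_pop(t) = exp(-theta * F(t))."""
    return math.exp(-theta * weibull_cdf(t, shape, scale))

theta = math.exp(-0.2)            # hypothetical value of exp(x'beta)
cure_fraction = math.exp(-theta)  # limit of S_pop(t) as t grows, since F(t) -> 1

# Far beyond any realistic tenure the survival curve sits on the plateau.
long_run = promotion_time_survival(1e4, theta, shape=1.2, scale=36.0)
print(round(cure_fraction, 4), round(long_run, 4))
```

Because theta enters multiplicatively in the cumulative hazard, covariates shift the plateau and the timing together, which is why there is no separate incidence formula.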
Diagnostics
Sufficient follow-up test
The Maller-Zhou Qn test is mandatory. If the observation window is too short, many censored policyholders are simply susceptibles who have not yet claimed, not structural non-claimers. The cure fraction estimate will be upwardly biased.
from insurance_cure.diagnostics import sufficient_followup_test
result = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(result.summary())
# Maller-Zhou Sufficient Follow-Up Test
# ========================================
# Qn statistic : 3.2194
# p-value : 0.0006
# ...
# Conclusion: Sufficient follow-up: evidence for a genuine cure fraction.
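The upward bias from a short window is easy to reproduce. The sketch below is not the Maller-Zhou test; it is a simpler illustration that simulates a 40% cure fraction and reads the naive cure estimate off the Kaplan-Meier plateau for a 12-month and a 120-month observation window. The function names and simulation settings are assumptions made for this example.

```python
import math
import random

def simulate(n, cure_fraction, window, rng):
    """Immune policies never claim; susceptibles get Weibull claim times."""
    durations, events = [], []
    for _ in range(n):
        if rng.random() < cure_fraction:
            t = math.inf
        else:
            t = rng.weibullvariate(36.0, 1.2)  # scale 36 months, shape 1.2
        durations.append(min(t, window))        # administrative censoring
        events.append(1 if t <= window else 0)
    return durations, events

def km_plateau(durations, events):
    """Kaplan-Meier survival estimate at the largest observed time."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    surv, at_risk, i = 1.0, len(durations), 0
    while i < len(order):
        t = durations[order[i]]
        deaths = leaving = 0
        while i < len(order) and durations[order[i]] == t:  # handle ties
            deaths += events[order[i]]
            leaving += 1
            i += 1
        surv *= 1.0 - deaths / at_risk
        at_risk -= leaving
    return surv

rng = random.Random(42)
short = km_plateau(*simulate(5000, 0.40, window=12.0, rng=rng))
long_ = km_plateau(*simulate(5000, 0.40, window=120.0, rng=rng))
print(round(short, 3), round(long_, 3))
```

With the short window most susceptibles have not yet claimed, so the plateau sits far above the true 40%; with the long window it settles close to it.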
Cure scorecard
from insurance_cure.diagnostics import CureScorecard
scorecard = CureScorecard(model, bins=10).fit(df)
print(scorecard.summary())
# Decile 1 (lowest cure) should have highest event rates.
# Decile 10 (highest cure) should have lowest event rates.
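The idea behind the decile check can be sketched without the library: sort policies by predicted cure score, cut them into equal-sized bins and compare observed event rates. How CureScorecard bins and summarises internally is not documented here, so the helper below and its toy data are assumptions.

```python
import random

def decile_table(scores, events, bins=10):
    """Observed event rate per bin of predicted cure score (bin 1 = lowest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rows = []
    for b in range(bins):
        idx = order[b * n // bins:(b + 1) * n // bins]
        rows.append({
            "decile": b + 1,
            "mean_cure_score": sum(scores[i] for i in idx) / len(idx),
            "event_rate": sum(events[i] for i in idx) / len(idx),
        })
    return rows

rng = random.Random(0)
scores = [rng.random() for _ in range(5000)]
# Toy assumption: claim probability in the window is 0.6 * (1 - cure score).
events = [1 if rng.random() < 0.6 * (1.0 - s) else 0 for s in scores]

table = decile_table(scores, events)
for row in table:
    print(row["decile"], round(row["mean_cure_score"], 2), round(row["event_rate"], 3))
```

A well-ranked model shows event rates falling steadily from decile 1 to decile 10; a flat or non-monotone pattern suggests the incidence sub-model is not separating the groups.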
Insurance applications
UK motor: First at-fault claim in policy tenure. Event = first claim, time axis = tenure in months. Incidence covariates: NCB years, driver age, vehicle age, occupation. A policyholder with 9 years NCB is a plausible structural non-claimer; a first-year policyholder is not.
Pet insurance: First claim by condition type. Breed, age, indoor/outdoor status drive susceptibility. Indoor cats in early life have very high cure fractions for accidental injury.
Travel insurance: Single-trip non-claimers. Destination, duration, age, trip type (business vs leisure) drive susceptibility.
Where MCM does NOT apply: Buildings (flood, subsidence). Return periods exceed practical follow-up windows, so the Qn test will not find evidence of sufficient follow-up. Use flood zone categories as structural zero covariates in a standard GLM instead.
Synthetic data
from insurance_cure.simulate import simulate_motor_panel, simulate_pet_panel
# Motor panel: multi-year structure with NCB, age, vehicle age
df = simulate_motor_panel(
    n_policies=5000,
    n_years=5,
    cure_fraction=0.40,
    weibull_shape=1.2,
    weibull_scale=36.0,   # months to first claim for susceptibles
    censoring_rate=0.15,  # annual lapse rate
    seed=42,
)
# Pet panel: cross-sectional
df_pet = simulate_pet_panel(n_policies=2000, cure_fraction=0.35, seed=42)
The true latent immune status is included as is_immune for validation. This column is not available in real data.
EM algorithm details
The EM algorithm decouples into two standard sub-problems at each iteration:
E-step: For censored observation i:
w_i = pi(z_i) * S_u(t_i|x_i) / [pi(z_i) * S_u(t_i|x_i) + (1 - pi(z_i))]
For observed events: w_i = 1 (certainly susceptible).
M-step:
- Logistic regression for gamma using w_i as soft labels
- Weighted Weibull/log-normal MLE for latency parameters, using w_i as case weights
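The two steps can be sketched end to end on a deliberately simplified model. The version below assumes no covariates (a single pi) and an exponential latency, so both M-step updates have closed forms; the library's logistic and Weibull/log-normal M-steps need numerical solvers instead. All names are illustrative.

```python
import math
import random

def em_exponential_cure(durations, events, n_iter=200):
    """EM for an intercept-only mixture cure model with exponential latency."""
    n = len(durations)
    pi = 0.5
    rate = sum(events) / sum(durations)  # crude starting value
    loglik_path = []
    for _ in range(n_iter):
        # E-step: posterior probability of being susceptible.
        w = []
        for t, d in zip(durations, events):
            if d:
                w.append(1.0)  # observed event: certainly susceptible
            else:
                s_u = math.exp(-rate * t)
                w.append(pi * s_u / (pi * s_u + (1.0 - pi)))
        # M-step: closed-form updates given the soft labels w.
        pi = sum(w) / n
        rate = sum(events) / sum(wi * t for wi, t in zip(w, durations))
        # Observed-data log-likelihood at the updated parameters.
        ll = 0.0
        for t, d in zip(durations, events):
            if d:
                ll += math.log(pi) + math.log(rate) - rate * t
            else:
                ll += math.log(pi * math.exp(-rate * t) + (1.0 - pi))
        loglik_path.append(ll)
    return pi, rate, loglik_path

# Simulated check: 60% susceptible, mean 20 months to claim, 120-month window.
rng = random.Random(1)
durations, events = [], []
for _ in range(4000):
    t = rng.expovariate(1 / 20.0) if rng.random() < 0.6 else math.inf
    durations.append(min(t, 120.0))
    events.append(1 if t <= 120.0 else 0)

pi_hat, rate_hat, path = em_exponential_cure(durations, events)
print(round(pi_hat, 3), round(rate_hat, 4))
```

The recorded log-likelihood path never decreases from one iteration to the next, which is the property that makes EM attractive for this problem.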
The w_i weights are interpretable posterior susceptibility probabilities. This transparency is a key advantage over direct MLE of the full log-likelihood, which converges less reliably and provides no intermediate interpretation.
Design choices
EM over direct MLE. Direct maximisation of the full MCM log-likelihood tends to run into an ill-conditioned or near-singular Hessian near the boundaries (cure fraction near 0 or 1). EM increases the observed-data likelihood at every iteration when each M-step is solved exactly. The M-step delegates to proven scipy/sklearn solvers for each sub-problem separately. This is the approach taken by smcure in R.
Separate incidence and latency formulae. Following smcure's cureform / formula convention. In practice, all covariates typically enter the incidence sub-model; only timing-relevant covariates enter the latency.
Multiple restarts. The MCM log-likelihood is multimodal, especially when the cure fraction is near 0 or 1. Five restarts (mix of smart and random initialisations) is a practical default. Increase for production models.
Bootstrap SEs. EM does not directly yield standard errors. The Louis (1982) observed information matrix requires second derivatives of the complete-data log-likelihood, which is numerically involved. The bootstrap is the approach smcure uses for variance estimation; it is implemented here via joblib parallel, with B=200 in the example above.
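The bootstrap itself is simple to sketch: resample policies with replacement, re-estimate, and take the standard deviation of the estimates. In the example below the estimator is a stand-in (the share of policies that never claimed) rather than a full MCM refit, and the run is sequential rather than parallel; both are simplifications for illustration.

```python
import random
import statistics

def bootstrap_se(rows, estimator, n_bootstrap=200, seed=0):
    """Nonparametric bootstrap standard error of estimator(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_bootstrap):
        # Resample whole policies (rows) with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)

def never_claimed_share(rows):
    """Stand-in estimator; a real run would refit the cure model here."""
    return sum(1 for _, claimed in rows if not claimed) / len(rows)

# Toy data: (tenure in months, claimed flag) with roughly 60% claimers.
rng = random.Random(7)
rows = [(rng.uniform(1, 60), rng.random() < 0.6) for _ in range(1000)]

se = bootstrap_se(rows, never_claimed_share)
print(round(se, 4))
```

Each replicate is independent of the others, which is why the refits parallelise cleanly across workers.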
References
- Farewell (1982), Biometrics 38:1041-1046 — canonical covariate MCM
- Maller & Zhou (1996), Survival Analysis with Long-Term Survivors, Wiley — identifiability, Qn test
- Peng & Dear (2000), Biometrics 56:237-243 — EM algorithm, semiparametric
- Sy & Taylor (2000), Biometrics 56:227-236 — EM algorithm, Cox latency
- Tsodikov (1998), JRSS-B 60:195-207 — promotion time / non-mixture model
Burning Cost — actuarial Python for UK pricing teams.