Mixture cure models for insurance non-claimer scoring: covariate-aware logistic incidence, parametric and semiparametric latency, EM estimation

insurance-cure

Mixture cure models for insurance non-claimer scoring.

The problem

Frequency GLMs treat all zero-claim policyholders the same. They do not distinguish between:

  1. Structural non-claimers: policyholders who would never claim regardless of how long you observed them. Example: a 60-year-old with 9 years of no-claims bonus (NCB) driving 5,000 miles a year.
  2. Lucky susceptibles: policyholders who are genuinely at risk but happened not to claim this year.

These two groups behave differently over multi-year retention horizons. The structural immune cohort will never generate claim cost regardless of tenure. The low-hazard susceptible will eventually claim.

A Poisson GLM cannot tell them apart. A mixture cure model (MCM) can.

What this library does

insurance-cure fits covariate-aware MCMs with a logistic incidence sub-model (who is susceptible?) and a parametric or semiparametric latency sub-model (when do susceptibles claim?). The primary output is a per-policyholder susceptibility score.

The population survival function:

S_pop(t | x, z) = pi(z) * S_u(t | x) + [1 - pi(z)]
  • pi(z) = P(susceptible), logistic regression incidence sub-model
  • S_u(t | x) = survival for susceptibles, Weibull/log-normal/Cox latency
  • [1 - pi(z)] = cure fraction: P(never experiences event)
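
As a minimal sketch of the formula above (not the library's internals), the snippet below evaluates S_pop for a single policyholder with a Weibull latency. The values pi=0.6, shape=1.2 and scale=36.0 are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

def weibull_survival(t, shape, scale):
    """S_u(t) = exp(-(t / scale) ** shape) for susceptibles."""
    t = np.asarray(t, dtype=float)
    return np.exp(-(t / scale) ** shape)

def population_survival(t, pi, shape, scale):
    """S_pop(t) = pi * S_u(t) + (1 - pi)."""
    return pi * weibull_survival(t, shape, scale) + (1.0 - pi)

# Illustrative values only: 60% susceptible, so the cure fraction is 40%.
times = np.array([0.0, 12.0, 36.0, 120.0, 1200.0])
s_pop = population_survival(times, pi=0.6, shape=1.2, scale=36.0)
print(np.round(s_pop, 4))
```

The curve starts at 1 and flattens at the cure fraction 1 - pi rather than decaying to zero, which is what separates an MCM from a standard survival model.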

Estimation via EM algorithm (Peng & Dear 2000; Sy & Taylor 2000). Multiple restarts to handle multimodality. Bootstrap standard errors available.

To our knowledge, no other pip-installable Python package provides covariate-aware MCMs with actuarial output. R has smcure, flexsurvcure and cuRe; this package fills the equivalent gap in Python.

Installation

pip install insurance-cure

Dependencies: numpy, scipy, pandas, scikit-learn, lifelines, joblib.

Quick start

import pandas as pd
from insurance_cure import WeibullMixtureCure
from insurance_cure.diagnostics import sufficient_followup_test, CureScorecard
from insurance_cure.simulate import simulate_motor_panel

# Generate synthetic motor panel with known cure fraction 40%
df = simulate_motor_panel(n_policies=3000, cure_fraction=0.40, seed=42)

# ALWAYS check sufficient follow-up before fitting
qn = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(qn.summary())

# Fit Weibull MCM
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
print(model.result_.summary())

# Outputs
cure_scores = model.predict_cure_fraction(df)      # P(immune) per policy
suscept = model.predict_susceptibility(df)          # 1 - cure_fraction
pop_surv = model.predict_population_survival(df, times=[12, 24, 36, 60])

# Validate with scorecard
scorecard = CureScorecard(model, bins=10).fit(df, duration_col="tenure_months", event_col="claimed")
print(scorecard.summary())

Models

WeibullMixtureCure (recommended)

Weibull AFT latency. Clean parametric extrapolation. Best default choice.

from insurance_cure import WeibullMixtureCure

model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,        # EM restarts — use >=5 for production
    bootstrap_se=True,    # Bootstrap SEs — slow but rigorous
    n_bootstrap=200,
    n_jobs=-1,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")

LogNormalMixtureCure

Log-normal latency. Better when the conditional hazard peaks then falls — sometimes fits pet or travel data better than Weibull.

from insurance_cure import LogNormalMixtureCure

model = LogNormalMixtureCure(
    incidence_formula="pet_age + breed_risk + indoor",
    latency_formula="pet_age + breed_risk",
)
model.fit(df)

CoxMixtureCure

Semiparametric Cox PH latency. Nonparametric baseline hazard — most flexible. Cannot extrapolate beyond the observation window. Use for exploration, not production pricing projection.

from insurance_cure import CoxMixtureCure

model = CoxMixtureCure(
    incidence_formula="ncb_years + age",
    latency_formula="ncb_years",
)
model.fit(df)

PromotionTimeCure

Non-mixture (promotion time) cure model. Population-level proportional hazards structure. Included as a comparison model. The cure fraction emerges from the asymptote of the population survival function; there is no explicit incidence sub-model.

from insurance_cure import PromotionTimeCure

model = PromotionTimeCure(formula="ncb_years + age + vehicle_age")
model.fit(df)
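
To make the asymptote concrete, here is a small sketch of the standard promotion time form S_pop(t) = exp(-theta * F(t)), where F is a proper CDF. Using a Weibull CDF for F and the value of theta shown are assumptions for illustration; the parametrisation inside PromotionTimeCure may differ.

```python
import numpy as np

def promotion_time_survival(t, theta, shape, scale):
    """S_pop(t) = exp(-theta * F(t)), with F a Weibull CDF (an assumption)."""
    t = np.asarray(t, dtype=float)
    cdf = 1.0 - np.exp(-(t / scale) ** shape)
    return np.exp(-theta * cdf)

theta = np.exp(-0.2)  # hypothetical exp(linear predictor) for one policyholder
times = np.array([0.0, 12.0, 60.0, 5000.0])
surv = promotion_time_survival(times, theta, shape=1.2, scale=36.0)

# As t grows, F(t) -> 1, so the survival curve levels off at exp(-theta).
cure_fraction = np.exp(-theta)
print(np.round(surv, 4), round(float(cure_fraction), 4))
```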

Diagnostics

Sufficient follow-up test

The Maller-Zhou Qn test is mandatory. If the observation window is too short, many censored policyholders are simply susceptibles who have not yet claimed, not structural non-claimers. The cure fraction estimate will be upwardly biased.

from insurance_cure.diagnostics import sufficient_followup_test

result = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(result.summary())
# Maller-Zhou Sufficient Follow-Up Test
# ========================================
#   Qn statistic      : 3.2194
#   p-value           : 0.0006
#   ...
#   Conclusion: Sufficient follow-up: evidence for a genuine cure fraction.
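
An informal complement to the formal test is to look at the Kaplan-Meier tail: a long, flat plateau with many censored observations after the last event is what sufficient follow-up tends to look like. The sketch below is a hand-rolled Kaplan-Meier estimator in NumPy, not the Qn statistic and not the library's implementation; the function names and toy data are hypothetical.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival) for the Kaplan-Meier estimator."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(durations[events])
    surv = np.empty(len(event_times))
    s = 1.0
    for k, t in enumerate(event_times):
        at_risk = np.sum(durations >= t)           # still under observation at t
        d = np.sum((durations == t) & events)      # events exactly at t
        s *= 1.0 - d / at_risk
        surv[k] = s
    return event_times, surv

def plateau_summary(durations, events):
    """Height of the KM tail and how many policies are censored beyond it."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times, surv = kaplan_meier(durations, events)
    last_event = times[-1]
    n_after = int(np.sum((durations > last_event) & ~events))
    return {
        "last_event_time": float(last_event),
        "km_tail": float(surv[-1]),
        "n_censored_after_last_event": n_after,
    }

# Toy data: tenure in months and a claim indicator.
toy_durations = [2, 3, 3, 5, 8, 10, 12, 12, 12, 12]
toy_events = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
summary = plateau_summary(toy_durations, toy_events)
print(summary)
```

The tail height is only a crude upper bound on the cure fraction, so treat it as a sanity check alongside the test rather than a replacement for it.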

Cure scorecard

from insurance_cure.diagnostics import CureScorecard

scorecard = CureScorecard(model, bins=10).fit(df)
print(scorecard.summary())
# Decile 1 (lowest cure) should have highest event rates.
# Decile 10 (highest cure) should have lowest event rates.
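
The check described in the comments above can be sketched independently with pandas: bin policies by predicted cure fraction and compare observed event rates across bins. This is an illustration of the idea, not how CureScorecard is implemented; `decile_table` and the simulated scores are hypothetical.

```python
import numpy as np
import pandas as pd

def decile_table(cure_scores, events, bins=10):
    """Observed event rate per bin of predicted cure fraction."""
    frame = pd.DataFrame({
        "cure_score": np.asarray(cure_scores, dtype=float),
        "event": np.asarray(events, dtype=int),
    })
    # Rank first so tied scores cannot collapse the bin edges.
    ranks = frame["cure_score"].rank(method="first")
    frame["decile"] = pd.qcut(ranks, q=bins, labels=False) + 1
    return frame.groupby("decile").agg(
        n=("event", "size"),
        mean_cure_score=("cure_score", "mean"),
        event_rate=("event", "mean"),
    )

# Simulated scores where a higher cure score really does mean fewer events.
rng = np.random.default_rng(0)
scores = rng.uniform(0.05, 0.95, size=5000)
events = rng.random(5000) < 0.5 * (1.0 - scores)
table = decile_table(scores, events)
print(table)
```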

Insurance applications

UK motor: First at-fault claim in policy tenure. Event = first claim, time axis = tenure in months. Incidence covariates: NCB years, driver age, vehicle age, occupation. A policyholder with 9 years NCB is a plausible structural non-claimer; a first-year policyholder is not.

Pet insurance: First claim by condition type. Breed, age, indoor/outdoor status drive susceptibility. Indoor cats in early life have very high cure fractions for accidental injury.

Travel insurance: Single-trip non-claimers. Destination, duration, age, trip type (business vs leisure) drive susceptibility.

Where MCM does NOT apply: Buildings (flood, subsidence). Return periods exceed practical follow-up windows, so the Qn test will not find evidence of sufficient follow-up. Use flood zone categories as structural zero covariates in a standard GLM instead.

Synthetic data

from insurance_cure.simulate import simulate_motor_panel, simulate_pet_panel

# Motor panel: multi-year structure with NCB, age, vehicle age
df = simulate_motor_panel(
    n_policies=5000,
    n_years=5,
    cure_fraction=0.40,
    weibull_shape=1.2,
    weibull_scale=36.0,    # months to first claim for susceptibles
    censoring_rate=0.15,   # annual lapse rate
    seed=42,
)

# Pet panel: cross-sectional
df_pet = simulate_pet_panel(n_policies=2000, cure_fraction=0.35, seed=42)

The true latent immune status is included as is_immune for validation. This column is not available in real data.
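
For readers who want to see the generative process behind such a panel, the sketch below draws from a plain mixture cure model in NumPy. It is not the code behind simulate_motor_panel: there are no covariates, the lapse mechanism (exponential lapse at a constant monthly rate, capped at the end of the window) is an assumption, and `simulate_cure_data` is a hypothetical name.

```python
import numpy as np

def simulate_cure_data(n, cure_fraction, shape, scale,
                       max_followup, monthly_lapse_rate, seed=0):
    """Draw (duration, claimed, is_immune) from a Weibull mixture cure model."""
    rng = np.random.default_rng(seed)
    is_immune = rng.random(n) < cure_fraction
    # Susceptibles get a Weibull time to first claim; immune policies never claim.
    event_time = scale * rng.weibull(shape, size=n)
    event_time[is_immune] = np.inf
    # Censoring: lapse (assumed exponential) or end of the observation window.
    lapse_time = rng.exponential(1.0 / monthly_lapse_rate, size=n)
    censor_time = np.minimum(lapse_time, max_followup)
    duration = np.minimum(event_time, censor_time)
    claimed = (event_time <= censor_time).astype(int)
    return duration, claimed, is_immune

duration, claimed, is_immune = simulate_cure_data(
    n=5000, cure_fraction=0.40, shape=1.2, scale=36.0,
    max_followup=60.0, monthly_lapse_rate=0.15 / 12, seed=42,
)
print(claimed.mean(), is_immune.mean())
```

Note that censored policies are a mix of immune policyholders and susceptibles who lapsed or ran out of follow-up before claiming, which is exactly the ambiguity the EM algorithm has to resolve.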

EM algorithm details

The EM algorithm decouples into two standard sub-problems at each iteration:

E-step: For censored observation i:

w_i = pi(z_i) * S_u(t_i|x_i) / [pi(z_i) * S_u(t_i|x_i) + (1 - pi(z_i))]

For observed events: w_i = 1 (certainly susceptible).
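
The E-step above is a one-liner in NumPy. The sketch below assumes pi and S_u have already been evaluated per policy under the current parameters; `e_step_weights` is a hypothetical helper, not part of the library's public API.

```python
import numpy as np

def e_step_weights(pi, surv_u, event):
    """Posterior probability of being susceptible, given current parameters."""
    pi = np.asarray(pi, dtype=float)
    surv_u = np.asarray(surv_u, dtype=float)
    event = np.asarray(event, dtype=bool)
    w_censored = pi * surv_u / (pi * surv_u + (1.0 - pi))
    # Anyone who has claimed is susceptible with certainty.
    return np.where(event, 1.0, w_censored)

# Three policies with the same prior susceptibility of 0.6.
w = e_step_weights(pi=[0.6, 0.6, 0.6], surv_u=[0.9, 0.5, 0.1], event=[0, 0, 1])
print(np.round(w, 4))
```

For the two censored policies, the weight drops as S_u(t) falls: the longer a policy survives past the point where a susceptible would usually have claimed, the more likely it is immune.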

M-step:

  1. Logistic regression for the incidence coefficients gamma, where pi(z) = 1 / (1 + exp(-gamma'z)), using w_i as soft labels
  2. Weighted Weibull/log-normal MLE for latency parameters, using w_i as case weights
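
The two M-step sub-problems can be sketched with SciPy as follows. This is an assumption-laden illustration rather than the library's code: the latency has no covariates, both fits use generic optimisers, and the function names are hypothetical. The demo at the end is only a sanity check on inputs where the answer is known.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_incidence(Z, w):
    """Logistic regression on soft labels w; Z includes an intercept column."""
    Z = np.asarray(Z, dtype=float)
    w = np.asarray(w, dtype=float)

    def neg_loglik(gamma):
        eta = Z @ gamma
        # Bernoulli negative log-likelihood with fractional targets.
        return np.sum(np.logaddexp(0.0, eta) - w * eta)

    return minimize(neg_loglik, np.zeros(Z.shape[1]), method="BFGS").x

def fit_weibull_latency(t, event, w):
    """Weighted Weibull MLE: sum of event * log h(t) + w * log S_u(t)."""
    t, event, w = (np.asarray(a, dtype=float) for a in (t, event, w))

    def neg_loglik(params):
        shape, scale = np.exp(params)  # optimise on the log scale
        log_hazard = np.log(shape / scale) + (shape - 1.0) * np.log(t / scale)
        cum_hazard = (t / scale) ** shape
        return -np.sum(event * log_hazard - w * cum_hazard)

    start = np.array([0.0, np.log(t.mean())])
    return np.exp(minimize(neg_loglik, start, method="Nelder-Mead").x)

# Sanity check: constant soft labels of 0.3 should give pi = 0.3 and no slope.
rng = np.random.default_rng(0)
n = 2000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
gamma_hat = fit_incidence(Z, np.full(n, 0.3))

# Sanity check: fully observed Weibull(shape=1.5, scale=20) event times.
t = 20.0 * rng.weibull(1.5, size=n)
shape_hat, scale_hat = fit_weibull_latency(t, np.ones(n), np.ones(n))
print(expit(gamma_hat[0]), gamma_hat[1], shape_hat, scale_hat)
```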

The w_i weights are interpretable posterior susceptibility probabilities. This transparency is a key advantage over direct MLE of the full log-likelihood, which converges less reliably and provides no intermediate interpretation.

Design choices

EM over direct MLE. Direct MLE of the full MCM log-likelihood suffers from ill-conditioned or indefinite Hessians near the boundaries (cure fraction near 0 or 1). EM increases the observed-data log-likelihood at every iteration. The M-step delegates to proven scipy/sklearn solvers for each sub-problem separately. This is the approach taken by smcure in R.
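
The monotone behaviour is easy to see on a deliberately simplified case. The sketch below assumes a constant susceptibility pi and an exponential latency, so both M-step updates are closed-form; it is a toy model for illustration, not any of the library's estimators, and the names are hypothetical.

```python
import numpy as np

def observed_loglik(t, event, pi, rate):
    """Observed-data log-likelihood of an exponential mixture cure model."""
    surv = np.exp(-rate * t)
    return np.sum(np.where(event,
                           np.log(pi * rate * surv),
                           np.log(pi * surv + 1.0 - pi)))

def em_exponential_cure(t, event, pi=0.5, n_iter=200, tol=1e-10):
    t = np.asarray(t, dtype=float)
    event = np.asarray(event, dtype=bool)
    rate = event.sum() / t.sum()  # crude start that ignores the cure fraction
    history = [observed_loglik(t, event, pi, rate)]
    for _ in range(n_iter):
        surv = np.exp(-rate * t)
        w = np.where(event, 1.0, pi * surv / (pi * surv + 1.0 - pi))  # E-step
        pi = w.mean()                        # M-step: intercept-only incidence
        rate = event.sum() / np.sum(w * t)   # M-step: weighted exponential MLE
        history.append(observed_loglik(t, event, pi, rate))
        if history[-1] - history[-2] < tol:
            break
    return pi, rate, np.array(history)

# Simulated data: 40% immune, mean 24 months to claim, 120 months of follow-up.
rng = np.random.default_rng(1)
n = 4000
immune = rng.random(n) < 0.4
latent = rng.exponential(24.0, size=n)
latent[immune] = np.inf
t_obs = np.minimum(latent, 120.0)
event_obs = latent <= 120.0

pi_hat, rate_hat, history = em_exponential_cure(t_obs, event_obs)
print(pi_hat, rate_hat, len(history))
```

The `history` array never decreases (up to floating-point noise), which is the property that makes EM easier to debug than a general-purpose optimiser on the same likelihood.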

Separate incidence and latency formulae. Following smcure's cureform / formula convention. In practice, all covariates typically enter the incidence sub-model; only timing-relevant covariates enter the latency.

Multiple restarts. The MCM log-likelihood is multimodal, especially when the cure fraction is near 0 or 1. Five restarts (mix of smart and random initialisations) is a practical default. Increase for production models.
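
The restart pattern itself is generic: fit from several starting points and keep the fit with the best objective value. The sketch below uses a one-dimensional toy objective with two local optima as a stand-in for the MCM negative log-likelihood; the objective, the start values and `best_of_restarts` are all hypothetical and say nothing about how the library chooses its initialisations.

```python
import numpy as np
from scipy.optimize import minimize

def toy_neg_loglik(x):
    """Toy bimodal objective: a shallow minimum near -1, a deeper one near +1."""
    x0 = x[0]
    return (x0 ** 2 - 1.0) ** 2 - 0.3 * x0

def best_of_restarts(neg_objective, starts):
    """Run a local optimiser from each start and keep the lowest objective."""
    fits = [minimize(neg_objective, x0=np.atleast_1d(s), method="BFGS")
            for s in starts]
    best = min(fits, key=lambda r: r.fun)
    return best, fits

# One "smart" start plus four random ones, mirroring the mix described above.
rng = np.random.default_rng(0)
starts = np.concatenate([[1.0], rng.uniform(-2.0, 2.0, size=4)])
best, fits = best_of_restarts(toy_neg_loglik, starts)
print(best.x, best.fun, [round(float(r.fun), 3) for r in fits])
```

Printing the objective for every restart is a cheap diagnostic: if the restarts disagree, the surface is multimodal for this data set and more starts are warranted.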

Bootstrap SEs. EM does not directly yield standard errors. The Louis (1982) observed information matrix requires second derivatives of the complete-data log-likelihood, which is numerically involved. The nonparametric bootstrap is the approach smcure uses for variance estimation; it is implemented here via joblib parallel, with B=200 replicates in the example above.
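
The bootstrap idea can be sketched in a few lines: resample policies with replacement, re-estimate, and take the standard deviation across replicates. The version below is serial and uses a trivial estimator (the claim proportion) so that the result can be compared with a known formula; in practice the estimator would be a full model refit, and the library's joblib-based implementation may differ. `bootstrap_se` is a hypothetical name.

```python
import numpy as np

def bootstrap_se(data, estimator, n_bootstrap=200, seed=0):
    """Nonparametric bootstrap standard error of estimator(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    replicates = np.array([
        estimator(data[rng.integers(0, n, size=n)])  # resample rows with replacement
        for _ in range(n_bootstrap)
    ])
    return replicates.std(axis=0, ddof=1)

# Toy check: the SE of a proportion has a closed form to compare against.
rng = np.random.default_rng(123)
claimed = (rng.random(1000) < 0.3).astype(float)
se_boot = bootstrap_se(claimed, np.mean, n_bootstrap=200, seed=0)
p_hat = claimed.mean()
se_analytic = np.sqrt(p_hat * (1.0 - p_hat) / len(claimed))
print(se_boot, se_analytic)
```

Each replicate is independent of the others, which is why the refits parallelise cleanly across workers.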

References

  • Farewell (1982), Biometrics 38:1041-1046 — canonical covariate MCM
  • Maller & Zhou (1996), Survival Analysis with Long-Term Survivors, Wiley — identifiability, Qn test
  • Peng & Dear (2000), Biometrics 56:237-243 — EM algorithm, semiparametric
  • Sy & Taylor (2000), Biometrics 56:227-236 — EM algorithm, Cox latency
  • Tsodikov (1998), JRSS-B 60:195-207 — promotion time / non-mixture model

Burning Cost — actuarial Python for UK pricing teams.
