Mixture cure models for insurance non-claimer scoring: covariate-aware logistic incidence, parametric and semiparametric latency, EM estimation

insurance-cure

Mixture cure models for insurance non-claimer scoring.

The problem

Frequency GLMs treat all zero-claim policyholders the same. They do not distinguish between:

Structural non-claimers: policyholders who would never claim, however long you observed them. Example: a 60-year-old with 9 years' no-claims bonus (NCB) driving 5,000 miles a year.
Lucky susceptibles: policyholders who are genuinely at risk but happened not to claim this year.

These two groups behave differently over multi-year retention horizons. The structurally immune cohort never generates claim cost, whatever the tenure. A low-hazard susceptible will eventually claim if observed for long enough.

A Poisson GLM cannot tell them apart. A mixture cure model (MCM) can.

What this library does

insurance-cure fits covariate-aware MCMs with a logistic incidence sub-model (who is susceptible?) and a parametric or semiparametric latency sub-model (when do susceptibles claim?). The primary output is a per-policyholder susceptibility score.

The population survival function:

S_pop(t | x, z) = pi(z) * S_u(t | x) + [1 - pi(z)]

pi(z) = P(susceptible | z), from the logistic regression incidence sub-model on incidence covariates z
S_u(t | x) = survival function for susceptibles given latency covariates x, from the Weibull/log-normal/Cox latency sub-model
[1 - pi(z)] = cure fraction: P(never experiences the event)
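The formula above can be evaluated by hand. The following is a minimal sketch in plain Python, not the library's implementation: the function names and the coefficient values are hypothetical, and the latency part is a Weibull with no covariates for brevity. It shows the population survival curve flattening at the cure fraction rather than falling to zero.

```python
import math

def susceptibility(z, gamma, intercept):
    """pi(z): logistic incidence, the probability of being susceptible."""
    eta = intercept + sum(g * v for g, v in zip(gamma, z))
    return 1.0 / (1.0 + math.exp(-eta))

def weibull_survival(t, shape, scale):
    """S_u(t): Weibull survival for susceptibles (no latency covariates here)."""
    return math.exp(-((t / scale) ** shape))

def population_survival(t, pi, shape, scale):
    """S_pop(t) = pi * S_u(t) + (1 - pi)."""
    return pi * weibull_survival(t, shape, scale) + (1.0 - pi)

# Hypothetical coefficients: a single incidence covariate (NCB years = 9).
pi = susceptibility(z=[9.0], gamma=[-0.3], intercept=1.0)
times = [0, 12, 60, 600]
curve = [population_survival(t, pi, shape=1.2, scale=36.0) for t in times]

for t, s in zip(times, curve):
    print(f"t={t:>3} months  S_pop={s:.4f}")
print(f"cure fraction 1 - pi = {1.0 - pi:.4f}")
```

At very long horizons S_u(t) is effectively zero, so S_pop(t) settles at 1 - pi(z).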

Estimation via EM algorithm (Peng & Dear 2000; Sy & Taylor 2000). Multiple restarts to handle multimodality. Bootstrap standard errors available.

To our knowledge, no other pip-installable Python package provides covariate-aware MCMs with actuarial output. R has smcure, flexsurvcure and cuRe; insurance-cure aims to fill the equivalent gap in Python.

Installation

pip install insurance-cure

Dependencies: numpy, scipy, pandas, scikit-learn, lifelines, joblib.

Quick start

import pandas as pd
from insurance_cure import WeibullMixtureCure
from insurance_cure.diagnostics import sufficient_followup_test, CureScorecard
from insurance_cure.simulate import simulate_motor_panel

# Generate synthetic motor panel with known cure fraction 40%
df = simulate_motor_panel(n_policies=3000, cure_fraction=0.40, seed=42)

# ALWAYS check sufficient follow-up before fitting
qn = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(qn.summary())

# Fit Weibull MCM
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
print(model.result_.summary())

# Outputs
cure_scores = model.predict_cure_fraction(df)      # P(immune) per policy
suscept = model.predict_susceptibility(df)          # 1 - cure_fraction
pop_surv = model.predict_population_survival(df, times=[12, 24, 36, 60])

# Validate with scorecard
scorecard = CureScorecard(model, bins=10).fit(df, duration_col="tenure_months", event_col="claimed")
print(scorecard.summary())

Models

WeibullMixtureCure (recommended)

Weibull AFT latency. Clean parametric extrapolation. Best default choice.

from insurance_cure import WeibullMixtureCure

model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,        # EM restarts — use >=5 for production
    bootstrap_se=True,    # Bootstrap SEs — slow but rigorous
    n_bootstrap=200,
    n_jobs=-1,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")

LogNormalMixtureCure

Log-normal latency. Better when the conditional hazard peaks then falls — sometimes fits pet or travel data better than Weibull.

from insurance_cure import LogNormalMixtureCure

model = LogNormalMixtureCure(
    incidence_formula="pet_age + breed_risk + indoor",
    latency_formula="pet_age + breed_risk",
)
model.fit(df)
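To make the "peaks then falls" remark concrete, here is a small sketch that evaluates the two hazard functions directly from their textbook definitions. It does not use insurance-cure, and the parameter values are illustrative assumptions only.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard: monotone in t (increasing when shape > 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def lognormal_hazard(t, mu, sigma):
    """Log-normal hazard = density / survival; rises to a peak, then declines."""
    z = (math.log(t) - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))
    sf = 0.5 * math.erfc(z / math.sqrt(2.0))
    return pdf / sf

times = [1, 6, 12, 24, 48, 96]  # months
weib = [weibull_hazard(t, shape=1.2, scale=36.0) for t in times]
logn = [lognormal_hazard(t, mu=math.log(12.0), sigma=0.8) for t in times]
peak_index = logn.index(max(logn))

for t, hw, hl in zip(times, weib, logn):
    print(f"t={t:>2}  weibull={hw:.4f}  lognormal={hl:.4f}")
print(f"log-normal hazard is highest at t={times[peak_index]} on this grid")
```

A Weibull latency cannot reproduce an interior hazard peak, which is the case where the log-normal alternative is worth trying.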

CoxMixtureCure

Semiparametric Cox PH latency. Nonparametric baseline hazard — most flexible. Cannot extrapolate beyond the observation window. Use for exploration, not production pricing projection.

from insurance_cure import CoxMixtureCure

model = CoxMixtureCure(
    incidence_formula="ncb_years + age",
    latency_formula="ncb_years",
)
model.fit(df)

PromotionTimeCure

Non-mixture (promotion time) cure model with a population-level proportional hazards structure. Included as a comparison model. The cure fraction emerges from the asymptote of the population survival function; there is no explicit incidence sub-model.

from insurance_cure import PromotionTimeCure

model = PromotionTimeCure(formula="ncb_years + age + vehicle_age")
model.fit(df)
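As a sketch of how the cure fraction "emerges from the asymptote", the standard promotion time form is S_pop(t | x) = exp(-theta(x) * F(t)) with theta(x) = exp(x'beta) and F a proper distribution function, so S_pop tends to exp(-theta(x)) as t grows. The code below assumes a Weibull F and hypothetical coefficients; whether PromotionTimeCure uses this exact parameterisation is an assumption.

```python
import math

def promotion_time_survival(t, theta, shape, scale):
    """S_pop(t) = exp(-theta * F(t)) with F a Weibull distribution function."""
    cdf = 1.0 - math.exp(-((t / scale) ** shape))
    return math.exp(-theta * cdf)

# Hypothetical linear predictor: intercept 0.5, coefficient -0.1 on NCB years = 9.
theta = math.exp(0.5 - 0.1 * 9.0)
times = [0, 12, 60, 600]
curve = [promotion_time_survival(t, theta, shape=1.2, scale=36.0) for t in times]
cure_fraction = math.exp(-theta)  # limit of S_pop(t) as t grows

for t, s in zip(times, curve):
    print(f"t={t:>3} months  S_pop={s:.4f}")
print(f"implied cure fraction = {cure_fraction:.4f}")
```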

Diagnostics

Sufficient follow-up test

Always run the Maller-Zhou Qn test before fitting. If the observation window is too short, many censored policyholders are simply susceptibles who have not yet claimed, not structural non-claimers, and the cure fraction estimate will be biased upward.

from insurance_cure.diagnostics import sufficient_followup_test

result = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(result.summary())
# Maller-Zhou Sufficient Follow-Up Test
# ========================================
#   Qn statistic      : 3.2194
#   p-value           : 0.0006
#   ...
#   Conclusion: Sufficient follow-up: evidence for a genuine cure fraction.
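An informal companion to the formal test is to look at the Kaplan-Meier curve: with sufficient follow-up it levels off, with a reasonable number of censored policies observed beyond the last event. The sketch below is a plain-Python Kaplan-Meier estimator on toy data. It is not the Qn statistic and not part of insurance-cure; it only illustrates the plateau idea.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct event time."""
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group tied observations at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            surv *= 1.0 - n_events / n_at_risk
            curve.append((t, surv))
        n_at_risk -= n_leaving
    return curve

# Toy data (months, claimed flag); not from any real portfolio.
durations = [2, 3, 5, 8, 12, 20, 30, 40, 50, 60]
events = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

curve = kaplan_meier(durations, events)
last_event_time, plateau = curve[-1]
n_censored_after = sum(
    1 for t, d in zip(durations, events) if d == 0 and t > last_event_time
)
print(f"last event at {last_event_time} months, KM plateau {plateau:.3f}")
print(f"{n_censored_after} censored policies observed beyond the last event")
```

A long flat tail carrying many censored observations is what the formal test looks for; a curve that is still dropping at the end of the window is a warning sign.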

Cure scorecard

from insurance_cure.diagnostics import CureScorecard

scorecard = CureScorecard(model, bins=10).fit(df)
print(scorecard.summary())
# Decile 1 (lowest cure) should have highest event rates.
# Decile 10 (highest cure) should have lowest event rates.
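The decile logic described in the comments can be sketched without the library. The helper below is hypothetical (it is not CureScorecard): it sorts policies by predicted cure score, splits them into equal-sized bins and reports the observed event rate in each. The toy data are constructed so that the event rate falls by exactly 0.1 per decile.

```python
def decile_event_rates(cure_scores, events, bins=10):
    """Observed event rate per bin, bins ordered from lowest to highest cure score."""
    order = sorted(range(len(cure_scores)), key=lambda i: cure_scores[i])
    rates = []
    for b in range(bins):
        lo = b * len(order) // bins
        hi = (b + 1) * len(order) // bins
        idx = order[lo:hi]
        rates.append(sum(events[i] for i in idx) / len(idx))
    return rates

# Toy data: 100 policies with cure scores 0.00 .. 0.99.
# Within decile b, exactly 10 - b of the 10 policies are marked as claimed.
cure_scores = [i / 100 for i in range(100)]
events = [1 if (i * 7) % 10 >= i // 10 else 0 for i in range(100)]

rates = decile_event_rates(cure_scores, events)
for b, r in enumerate(rates, start=1):
    print(f"decile {b:>2}: event rate {r:.1f}")
```

On real data the decline will not be this clean, but a well-calibrated model should still show event rates falling as the predicted cure score rises.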

Insurance applications

UK motor: First at-fault claim in policy tenure. Event = first claim, time axis = tenure in months. Incidence covariates: NCB years, driver age, vehicle age, occupation. A policyholder with 9 years NCB is a plausible structural non-claimer; a first-year policyholder is not.

Pet insurance: First claim by condition type. Breed, age, indoor/outdoor status drive susceptibility. Indoor cats in early life have very high cure fractions for accidental injury.

Travel insurance: Single-trip non-claimers. Destination, duration, age, trip type (business vs leisure) drive susceptibility.

Where MCM does NOT apply: Buildings (flood, subsidence). Return periods exceed practical follow-up windows, so the Qn test will not find evidence of sufficient follow-up. Use flood zone categories as structural zero covariates in a standard GLM instead.

Synthetic data

from insurance_cure.simulate import simulate_motor_panel, simulate_pet_panel

# Motor panel: multi-year structure with NCB, age, vehicle age
df = simulate_motor_panel(
    n_policies=5000,
    n_years=5,
    cure_fraction=0.40,
    weibull_shape=1.2,
    weibull_scale=36.0,    # Weibull scale (months) of time to first claim for susceptibles
    censoring_rate=0.15,   # annual lapse rate
    seed=42,
)

# Pet panel: cross-sectional
df_pet = simulate_pet_panel(n_policies=2000, cure_fraction=0.35, seed=42)

The true latent immune status is included as is_immune for validation. This column is not available in real data.
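For readers who want to see the data-generating idea without installing anything, here is a minimal stand-in simulator using only the standard library. It is an assumption-laden sketch, not simulate_motor_panel: there are no covariates, and the only censoring is an administrative cut-off at max_followup (no lapses). The function and field names are hypothetical.

```python
import random

def simulate_cure_panel(n, cure_fraction, shape, scale, max_followup, seed):
    """Simulate durations with a latent immune group and administrative censoring."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_immune = rng.random() < cure_fraction
        # Immune policyholders never claim: their event time is infinite.
        # random.weibullvariate takes (scale, shape) in that order.
        event_time = float("inf") if is_immune else rng.weibullvariate(scale, shape)
        rows.append({
            "duration": min(event_time, max_followup),
            "claimed": int(event_time <= max_followup),
            "is_immune": int(is_immune),
        })
    return rows

rows = simulate_cure_panel(
    n=2000, cure_fraction=0.40, shape=1.2, scale=36.0, max_followup=60.0, seed=42
)
immune_share = sum(r["is_immune"] for r in rows) / len(rows)
claim_share = sum(r["claimed"] for r in rows) / len(rows)
print(f"immune share {immune_share:.3f}, claimed within window {claim_share:.3f}")
```

The share that never claims within the window is larger than the immune share, because some susceptibles are censored before they claim. That gap is exactly what the cure model has to untangle.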

EM algorithm details

The EM algorithm decouples into two standard sub-problems at each iteration:

E-step: For censored observation i:

w_i = pi(z_i) * S_u(t_i|x_i) / [pi(z_i) * S_u(t_i|x_i) + (1 - pi(z_i))]

For observed events: w_i = 1 (certainly susceptible).
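The E-step weight is simple enough to compute by hand. A small sketch (the function name is hypothetical) with two worked cases: a censored policy observed early, when susceptibles would mostly not have claimed yet, and one observed late, when nearly all susceptibles would have claimed.

```python
def e_step_weight(pi, surv_u, event):
    """Posterior probability of being susceptible given the observed outcome."""
    if event:
        return 1.0  # a claim was observed, so the policyholder is certainly susceptible
    return pi * surv_u / (pi * surv_u + (1.0 - pi))

# Prior susceptibility pi = 0.6 in both cases.
w_early = e_step_weight(pi=0.6, surv_u=0.50, event=False)  # 0.30 / 0.70
w_late = e_step_weight(pi=0.6, surv_u=0.05, event=False)   # 0.03 / 0.43
w_event = e_step_weight(pi=0.6, surv_u=0.50, event=True)

print(f"censored early: w = {w_early:.4f}")
print(f"censored late:  w = {w_late:.4f}")
print(f"observed claim: w = {w_event:.1f}")
```

The longer a policyholder goes without claiming, the smaller S_u(t) becomes and the more the weight shifts towards the immune group.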

M-step:

Logistic regression for the incidence coefficients gamma (the parameters of pi(z)), using w_i as soft labels
Weighted Weibull/log-normal MLE for the latency parameters, using w_i as case weights

The w_i weights are interpretable posterior susceptibility probabilities. This transparency is a key advantage over direct MLE of the full log-likelihood, which converges less reliably and provides no intermediate interpretation.
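The whole loop can be sketched in a few lines for the simplest possible case: no covariates and an exponential latency, where both M-step updates have closed forms (pi is the mean weight; the rate is the number of events divided by the weighted exposure). This is a deliberately reduced illustration under those assumptions, not the library's estimator, which uses logistic regression and Weibull/log-normal/Cox latency.

```python
import math
import random

def fit_exponential_cure_em(durations, events, pi=0.5, rate=0.1, n_iter=200):
    """EM for a covariate-free mixture cure model with exponential latency."""
    for _ in range(n_iter):
        # E-step: posterior susceptibility weights.
        w = []
        for t, d in zip(durations, events):
            if d:
                w.append(1.0)
            else:
                s = math.exp(-rate * t)
                w.append(pi * s / (pi * s + 1.0 - pi))
        # M-step: closed-form updates for this simple model.
        pi = sum(w) / len(w)
        rate = sum(events) / sum(wi * t for wi, t in zip(w, durations))
    return pi, rate

# Simulated data: 60% susceptible, mean time to claim 36 months, 240-month window.
rng = random.Random(1)
durations, events = [], []
for _ in range(4000):
    susceptible = rng.random() < 0.6
    t = rng.expovariate(1.0 / 36.0) if susceptible else float("inf")
    durations.append(min(t, 240.0))
    events.append(int(t <= 240.0))

pi_hat, rate_hat = fit_exponential_cure_em(durations, events)
print(f"estimated susceptible share {pi_hat:.3f} (true 0.600)")
print(f"estimated mean months to claim {1.0 / rate_hat:.1f} (true 36.0)")
```

The long simulated window is deliberate: with a short window the same code returns a much less stable split between pi and the rate, which is the identifiability issue the sufficient follow-up test guards against.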

Design choices

EM over direct MLE. Direct MLE of the full MCM log-likelihood runs into ill-conditioned or non-invertible Hessians near the boundaries (cure fraction near 0 or 1). EM never decreases the observed-data log-likelihood from one iteration to the next. The M-step delegates to proven scipy/sklearn solvers for each sub-problem separately. smcure in R also estimates by EM.

Separate incidence and latency formulae. This follows smcure's cureform / formula convention. In practice, all covariates typically enter the incidence sub-model; only timing-relevant covariates enter the latency sub-model.

Multiple restarts. The MCM log-likelihood is multimodal, especially when the cure fraction is near 0 or 1. Five restarts (mix of smart and random initialisations) is a practical default. Increase for production models.
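The restart mechanics can be sketched as follows: run EM from several random starting points, score each converged fit with the observed-data log-likelihood, and keep the best. The example reuses a covariate-free exponential cure model purely to stay short; all function names are hypothetical, and on such a simple, well-identified problem the restarts will usually agree.

```python
import math
import random

def loglik(durations, events, pi, rate):
    """Observed-data log-likelihood of the exponential mixture cure model."""
    total = 0.0
    for t, d in zip(durations, events):
        if d:
            total += math.log(pi) + math.log(rate) - rate * t
        else:
            total += math.log(pi * math.exp(-rate * t) + 1.0 - pi)
    return total

def em(durations, events, pi, rate, n_iter=100):
    """Plain EM with closed-form M-step (exponential latency, no covariates)."""
    for _ in range(n_iter):
        s = [math.exp(-rate * t) for t in durations]
        w = [1.0 if d else pi * si / (pi * si + 1.0 - pi) for si, d in zip(s, events)]
        pi = sum(w) / len(w)
        rate = sum(events) / sum(wi * t for wi, t in zip(w, durations))
    return pi, rate

def fit_with_restarts(durations, events, n_starts=5, seed=0):
    """Keep the restart with the highest observed-data log-likelihood."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        pi0 = rng.uniform(0.05, 0.95)
        rate0 = math.exp(rng.uniform(math.log(0.001), math.log(1.0)))
        pi, rate = em(durations, events, pi0, rate0)
        ll = loglik(durations, events, pi, rate)
        if best is None or ll > best[0]:
            best = (ll, pi, rate)
    return best

# Simulated data: 60% susceptible, mean 36 months to claim, 240-month window.
rng = random.Random(7)
times = [rng.expovariate(1 / 36) if rng.random() < 0.6 else float("inf") for _ in range(1000)]
durations = [min(t, 240.0) for t in times]
events = [int(t <= 240.0) for t in times]

best_ll, best_pi, best_rate = fit_with_restarts(durations, events)
print(f"best log-likelihood {best_ll:.2f}, pi {best_pi:.3f}, rate {best_rate:.4f}")
```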

Bootstrap SEs. EM does not directly yield standard errors. The Louis (1982) observed information matrix requires second derivatives of the complete-data log-likelihood, which is numerically involved. smcure also obtains its standard errors by bootstrap; here the bootstrap (B=200 in the example above) runs in parallel via joblib.
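The bootstrap itself is a short loop: resample policies with replacement, re-estimate, and take the standard deviation of the estimates. The sketch below is serial and uses a deliberately crude stand-in estimator (the share of policies with no claim) so that it runs instantly. In practice the estimator would be a full EM refit, which is why the library parallelises the loop. The names are hypothetical.

```python
import random
import statistics

def bootstrap_se(rows, estimator, n_bootstrap=200, seed=0):
    """Nonparametric bootstrap standard error of estimator(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_bootstrap):
        # Resample whole policies (rows) with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)

def no_claim_share(rows):
    """Stand-in estimator: share of policies with no observed claim."""
    return sum(1 for _, claimed in rows if not claimed) / len(rows)

# Toy data as (duration in months, claimed flag): 400 non-claimers, 600 claimers.
rows = [(60.0, 0)] * 400 + [(12.0, 1)] * 600

se = bootstrap_se(rows, no_claim_share, n_bootstrap=200, seed=0)
print(f"point estimate {no_claim_share(rows):.3f}, bootstrap SE {se:.4f}")
```

For this toy estimator the analytic standard error is sqrt(0.4 * 0.6 / 1000), about 0.0155, which gives a sanity check on the bootstrap figure.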

References

Farewell (1982), Biometrics 38:1041-1046 — canonical covariate MCM
Maller & Zhou (1996), Survival Analysis with Long-Term Survivors, Wiley — identifiability, Qn test
Peng & Dear (2000), Biometrics 56:237-243 — EM algorithm, semiparametric
Sy & Taylor (2000), Biometrics 56:227-236 — EM algorithm, Cox latency
Tsodikov (1998), JRSS-B 60:195-207 — promotion time / non-mixture model

Burning Cost — actuarial Python for UK pricing teams.

Version 0.1.0, released Mar 11, 2026
