Zero-Inflated Tweedie Double GLM with CatBoost gradient boosting for insurance pricing
Project description
insurance-zit-dglm
Zero-Inflated Tweedie Double GLM with CatBoost gradient boosting, built for UK insurance pricing.
The problem
Standard compound Poisson Tweedie models handle the probability of zero claims through the Poisson component: Pr(Y=0) = exp(-lambda). This is fine when all policyholders are genuinely exposed — they just happen to have no claims in the period.
UK personal lines has a structural problem that breaks this assumption: strategic non-claimers. Under the No Claims Discount system, policyholders without protected NCD face up to 65% premium uplift for a single claim. For repairs costing less than the excess plus NCD impact, rational policyholders never claim. These are not Poisson draws — they are a distinct regime with genuinely zero claim probability.
The same phenomenon appears in home accidental damage (below-excess events unreported), motor fleet (seasonal/off-road vehicles), and subsidence (geological zero-risk properties).
The standard Tweedie conflates these two regimes. A hurdle (two-part) model goes too far in the other direction — it treats all zeros as structural, losing the compound Poisson structure that gives you the correct aggregate loss distribution.
Zero-Inflated Tweedie is the middle ground: some zeros are structural, the rest are Poisson draws. This library implements it.
The model
The ZIT distribution mixes a point mass at zero with a standard compound Poisson-Gamma Tweedie:
f(y) = q * I(y=0) + (1-q) * Tweedie(mu, phi, p)
Parameters:
qin [0,1]: structural zero probability (learned per-policy from features)mu> 0: Tweedie mean conditional on non-structural-zerophi> 0: Tweedie dispersion (DGLM extension: this is covariate-driven, not fixed)pin (1,2): Tweedie power parameter
Expected aggregate loss: E[Y] = (1-q) * mu
Three separate CatBoost models are fitted inside an EM loop (Gu arXiv:2405.14990):
- Mean head (log link): ZIT Tweedie custom loss with exposure-weighted gradients
- Dispersion head (log link): Smyth-Jorgensen gamma pseudo-likelihood on unit deviances
- Zero-inflation head (logit link): EM-weighted logistic regression with soft labels
The EM algorithm handles the unobserved indicator z_i (whether observation i is a structural zero). E-step computes posterior Pi_i = P(z_i=1 | y_i=0, x_i). M-step updates all three models with EM weights that down-weight observations likely to be structural zeros.
Why the DGLM matters: the E-step depends on mu^(2-p) / (phi*(2-p)). If phi is misspecified as constant, the posterior weights Pi_i are wrong, contaminating all three models through the EM loop. Modelling phi as covariate-driven is not optional.
Installation
pip install insurance-zit-dglm
Quick start
import polars as pl
from insurance_zit_dglm import ZITModel, ZITReport, check_balance
# Fit
model = ZITModel(
tweedie_power=1.5,
n_estimators=200,
em_iterations=20,
exposure_col="exposure_years",
)
model.fit(X_train, y_train)
# Predict aggregate expected loss E[Y] = (1-q)*mu
e_y = model.predict(X_test)
# All components
components = model.predict_components(X_test)
# components: mu, phi, q, E_Y
# Full P(Y=0) = q + (1-q)*exp(-mu^(2-p)/(phi*(2-p)))
prob_zero = model.predict_proba_zero(X_test)
# Balance check
result = check_balance(model, X_test, y_test, groups=age_band_series)
print(result.ratio) # sum(E[Y]) / sum(y)
print(result.is_balanced)
Diagnostic reports
report = ZITReport(model)
# Calibration: observed vs predicted E[Y] by decile
report.calibration_plot(X_test, y_test)
# Zero calibration: Pr(Y=0) predicted vs empirical
report.zero_calibration_plot(X_test, y_test)
# Dispersion diagnostic: D(y;mu)/phi should be ~1
report.dispersion_plot(X_test, y_test)
# Lorenz curve and Gini
fig, gini = report.lorenz_curve(X_test, y_test)
# Vuong test: is ZIT significantly better than standard Tweedie?
from insurance_zit_dglm import ZITModel
tweedie_only = ZITModel(tweedie_power=1.5)
tweedie_only.fit(X_train, y_train)
result = report.vuong_test(model, tweedie_only, X_test, y_test)
print(result.preferred_model) # 'model_1' | 'model_2' | 'indeterminate'
# Feature importance per head
report.feature_importance("mean") # mu model
report.feature_importance("dispersion") # phi model
report.feature_importance("zero") # pi model
Link scenarios
Independent (default, recommended): three separate trees for mu, phi, and pi. The most general form — no structural relationship assumed between q and mu.
model = ZITModel(link_scenario="independent")
Linked: single tree for mu; q derived as q = 1/(1 + mu^gamma). This enforces the economic intuition that higher-risk policies are less likely to be structural zeros (So & Valdez arXiv:2406.16206 Scenario 2). If gamma=None, it is estimated by grid search.
model = ZITModel(link_scenario="linked", gamma=1.0)
Power parameter
The Tweedie power p is not gradient-boosted — it is estimated separately by profile likelihood. Use estimate_power() to select it before fitting:
from insurance_zit_dglm import estimate_power
# Quick grid search with initial mu estimates
p_hat = estimate_power(y_train.to_numpy(), mu_initial, p_grid=[1.2, 1.3, 1.4, 1.5, 1.6, 1.7])
model = ZITModel(tweedie_power=p_hat)
Autocalibration
Gradient boosting minimising ZIT deviance does not automatically satisfy the balance property sum(E[Y_i]) = sum(y_i). For FCA Consumer Duty compliance, check this explicitly:
result = check_balance(model, X_val, y_val, tolerance=0.02)
if not result.is_balanced:
from insurance_zit_dglm import recalibrate
recal_model = recalibrate(model, X_val, y_val)
# recal_model applies a multiplicative intercept correction
Mathematical foundation
- Gu (arXiv:2405.14990): ZIT with dispersion modelling and generalised EM algorithm
- So & Valdez (arXiv:2406.16206 / NAAJ Vol 29(4):887-904, 2025): ZIT boosted trees, CatBoost implementation, Vuong test
- Delong & Wuthrich (arXiv:2103.03635): balance property and autocalibration
UK peril guidance
| Peril | ZIT recommended? | Reason |
|---|---|---|
| Motor AD (non-protected NCD) | Yes | NCD behavioural zeros |
| Home accidental damage | Yes | Sub-excess strategic non-claiming |
| Subsidence | Yes | Geological regime effect |
| Commercial fleet | Yes | Seasonal/off-road structural zeros |
| Comprehensive motor (protected NCD) | Marginal | Standard Tweedie often sufficient |
| Home escape of water | No | Genuine compound Poisson |
| Motor windscreen | No | Low excess, few strategic zeros |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file insurance_zit_dglm-0.1.0.tar.gz.
File metadata
- Download URL: insurance_zit_dglm-0.1.0.tar.gz
- Upload date:
- Size: 38.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.8 {"installer":{"name":"uv","version":"0.10.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ab221d77afef2942af28218c755a3a40bee46ce3643fa829029ee9b8d2fe5b0b
|
|
| MD5 |
553260049d4acc3b3b6e6c19fcf7ae7a
|
|
| BLAKE2b-256 |
21676385e2843e4937ac52e0743163dc44308cfb76e1396b6476f48b899ea144
|
File details
Details for the file insurance_zit_dglm-0.1.0-py3-none-any.whl.
File metadata
- Download URL: insurance_zit_dglm-0.1.0-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.8 {"installer":{"name":"uv","version":"0.10.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f21e0b2d70d2ef63be6f05847c9cb46178f20596d8fbe64684220557d6b97ab3
|
|
| MD5 |
451aa6e49f5101aa9bf6a9baed8804bd
|
|
| BLAKE2b-256 |
fd6f60dc8d92d3e920fae297aa59ad732d6f8396ab109f9178d57e52209f0f32
|