# insurance-distill

Distil GBM models into multiplicative GLM factor tables for insurance rating engines.
## The problem
Your CatBoost model outperforms your GLM on Gini, but your rating engine (Radar, Emblem, or any multiplicative system) needs factor tables, not a black box. You cannot load a gradient-boosted tree into Radar.
This library bridges that gap. It fits a Poisson or Gamma GLM using the GBM's predictions as the target (pseudo-predictions), bins continuous variables optimally, and exports the result as factor tables that a rating engine can consume directly.
The GLM surrogate will not match the GBM's Gini coefficient exactly. A well-tuned distillation typically retains 90-97% of the GBM's discrimination. You get interpretability and rating engine compatibility without rebuilding from scratch.
## Installation

```shell
uv add insurance-distill
```

With CatBoost support:

```shell
uv add "insurance-distill[catboost]"
```
## Quick start

```python
from insurance_distill import SurrogateGLM

# fitted_catboost: any sklearn-compatible model (CatBoost, sklearn GBM, etc.)
surrogate = SurrogateGLM(
    model=fitted_catboost,
    X_train=X_train,        # Polars DataFrame
    y_train=y_train,        # actual claim counts or amounts
    exposure=exposure_arr,  # earned car-years (or None for unit exposure)
    family="poisson",       # or "gamma" for severity
)

surrogate.fit(
    max_bins=10,                                   # bins per continuous variable
    interaction_pairs=[("driver_age", "region")],  # optional interaction terms
)

# Validation
report = surrogate.report()
print(report.metrics.summary())
# Gini (GBM):             0.3241
# Gini (GLM surrogate):   0.3087
# Gini ratio:             95.2%
# Deviance ratio:         0.9143
# Max segment deviation:  8.3%
# Mean segment deviation: 2.1%
# Segments evaluated:     312

# Inspect a single factor table
driver_age_table = surrogate.factor_table("driver_age")
print(driver_age_table)
# shape: (8, 3)
# | level          | log_coefficient | relativity |
# | [-inf, 21.00)  | 0.412           | 1.510      |
# | [21.00, 25.00) | 0.218           | 1.244      |
# ...

# Export all factor tables as CSV (one file per variable)
surrogate.export_csv("output/factors/", prefix="motor_freq_")
# Writes: motor_freq_driver_age.csv, motor_freq_vehicle_value.csv, ...
```
## Binning strategies

Three binning methods are available. The default (`tree`) is the right choice for most variables.

| Method | Description | When to use |
|---|---|---|
| `tree` | CART decision tree on GBM pseudo-predictions | Default. Finds statistically meaningful cut-points. |
| `quantile` | Equal-frequency bins | Fallback when the tree produces degenerate splits. |
| `isotonic` | Change-points from isotonic regression | Monotone variables (e.g. no-claims discount, years held). |
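For intuition, the default `tree` method can be sketched as follows: fit a shallow CART regressor on the GBM's pseudo-predictions and read its split thresholds back as bin edges. This is an illustrative sketch under assumed tuning values (`max_leaf_nodes`, `min_samples_leaf=50`), not the library's implementation; `tree_cut_points` is a hypothetical helper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_cut_points(x, pseudo_pred, max_bins=10):
    """Return sorted interior cut-points for one continuous feature."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins, min_samples_leaf=50)
    tree.fit(x.reshape(-1, 1), pseudo_pred)
    # Internal nodes have feature index >= 0; leaves are marked with -2.
    thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
    return np.sort(thresholds)

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 5000)
pseudo = np.exp(-0.03 * age) + rng.normal(0, 0.01, 5000)  # stand-in for GBM output
edges = tree_cut_points(age, pseudo, max_bins=6)
print(edges)  # up to 5 interior cut-points for 6 bins
```

Constraining `max_leaf_nodes` caps the number of bins, while `min_samples_leaf` keeps each bin credible enough to estimate a factor.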
You can mix methods per variable:

```python
surrogate.fit(
    max_bins=10,
    binning_method="tree",
    method_overrides={
        "ncd_years": "isotonic",
        "vehicle_age": "quantile",
    },
)
```
## Validation metrics

After fitting, `surrogate.report()` returns a `DistillationReport` with:
- **Gini ratio**: how much of the GBM's discrimination the GLM retains. Above 0.90 is generally acceptable; above 0.95 is excellent.
- **Deviance ratio**: analogous to R-squared for GLMs. Measures how well the GLM explains the GBM's predictions.
- **Max segment deviation**: the maximum relative difference between GBM and GLM predictions, across all combinations of binned levels. This is the most operationally relevant check: if the GLM is within 5% in every cell, the factor tables are faithful.
- **Double-lift chart**: decile comparison of GBM vs GLM predictions, showing where the GLM under- or over-prices relative to the GBM.
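For concreteness, the Gini ratio can be computed as below on synthetic data. The `gini` helper is an assumption: a simple Lorenz-curve construction, not necessarily the library's exact (possibly exposure-weighted) implementation.

```python
import numpy as np

def gini(y_true, y_pred):
    """Gini as twice the area between the Lorenz curve and the diagonal."""
    order = np.argsort(y_pred)  # rank policies from lowest to highest prediction
    cum_losses = np.cumsum(y_true[order]) / y_true.sum()
    cum_share = np.arange(1, len(y_true) + 1) / len(y_true)
    return float(2.0 * np.mean(cum_share - cum_losses))

rng = np.random.default_rng(2)
risk = rng.gamma(2.0, 0.05, 20000)                      # true underlying frequency
claims = rng.poisson(risk)                              # observed claim counts
gbm_pred = risk * np.exp(rng.normal(0, 0.05, 20000))    # near-perfect model
glm_pred = risk * np.exp(rng.normal(0, 0.30, 20000))    # noisier surrogate

g_gbm, g_glm = gini(claims, gbm_pred), gini(claims, glm_pred)
print(f"Gini ratio: {g_glm / g_gbm:.2%}")
```

The noisier surrogate ranks policies less sharply, so its Gini, and hence the ratio, falls below the GBM's.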
## Design choices

**Why glum, not statsmodels?** glum is purpose-built for the kind of large, sparse GLMs that insurance pricing produces. It is 10-100x faster than statsmodels for problems with many one-hot encoded features, and it handles L1/L2 regularisation natively. The coefficient estimates are identical to statsmodels for the unregularised case.

**Why Polars?** We use Polars for data handling because it is faster and more memory-efficient than pandas for the aggregation operations (segment deviation, lift charts) that this library relies on. The GLM fitting itself uses numpy arrays internally, as glum requires.

**Why pseudo-predictions, not actual claims?** Fitting the GLM on GBM predictions rather than actual claims eliminates the noise from individual claim events. The GBM has already smoothed over that noise. Fitting the surrogate on the GBM's output gives a cleaner signal for the GLM to learn from, resulting in better-preserved Gini.

**Multiplicative by construction.** The GLM always uses a log link function. This means the factor tables are multiplicative: the final premium is the product of the base rate and each factor. This is the convention used by Radar, Emblem, Guidewire, and most other UK personal lines rating engines.
## Factor table format

Each factor table is a Polars DataFrame with three columns:

| Column | Type | Description |
|---|---|---|
| `level` | str | Bin label (e.g. `[25.00, 40.00)`) or category value |
| `log_coefficient` | float | Raw GLM coefficient on log scale (0.0 for base level) |
| `relativity` | float | Multiplicative factor = `exp(log_coefficient)` |
The base level (reference category) always has `relativity = 1.0`. All other levels are expressed relative to it.
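How a rating engine consumes these tables can be illustrated with a hypothetical lookup; the `premium` helper and the relativity values below are made up for the example, not part of this library.

```python
base_rate = 182.40  # hypothetical annual base premium

# Hypothetical relativity look-ups, shaped like the exported factor tables.
factor_tables = {
    "driver_age": {"[21.00, 25.00)": 1.244, "[25.00, 40.00)": 1.000},
    "region": {"urban": 1.180, "rural": 0.920},
}

def premium(base, tables, risk):
    """Multiply the base rate by the relativity for each factor level."""
    result = base
    for factor, level in risk.items():
        result *= tables[factor][level]
    return result

p = premium(base_rate, factor_tables,
            {"driver_age": "[21.00, 25.00)", "region": "urban"})
print(round(p, 2))  # 182.40 * 1.244 * 1.18
```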
## Requirements
- Python >= 3.10
- polars >= 0.20
- numpy >= 1.24
- scikit-learn >= 1.3
- glum >= 2.0