
insurance-distill


Distil GBM models into multiplicative GLM factor tables for insurance rating engines.

Blog post: From CatBoost to Radar in 50 Lines of Python

The problem

Your CatBoost model outperforms your GLM in Gini, but your rating engine (Radar, Emblem, or any multiplicative system) needs factor tables — not a black box. You cannot load a gradient boosted tree into Radar.

This library bridges that gap. It fits a Poisson or Gamma GLM using the GBM's predictions as the target (pseudo-predictions), bins continuous variables optimally, and exports the result as factor tables that a rating engine can consume directly.

The GLM surrogate will not match the GBM's Gini coefficient exactly. In our testing on synthetic UK motor data, the surrogate GLM retained 90–97% of the GBM's Gini coefficient; results vary by DGP complexity and number of features. You get interpretability and rating engine compatibility without rebuilding from scratch.
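The two-step idea (fit the GBM on noisy claims, then fit a GLM on the GBM's smoothed predictions) can be sketched in plain scikit-learn. This is a toy single-factor illustration, not the library's implementation; GradientBoostingRegressor and PoissonRegressor stand in for CatBoost and glum:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n = 5_000
X = rng.uniform(18, 80, size=(n, 1))                  # driver age, the only factor
true_rate = 0.05 + 0.002 * (45 - X[:, 0]) ** 2 / 100  # U-shaped claim frequency
claims = rng.poisson(true_rate)                       # noisy observed claim counts

# Step 1: the GBM learns the frequency surface from the noisy counts
gbm = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, claims)

# Step 2: the GLM is fitted to the GBM's smoothed predictions, not the raw claims
pseudo = np.clip(gbm.predict(X), 1e-6, None)          # Poisson target must be >= 0
glm = PoissonRegressor(alpha=0.0).fit(X, pseudo)
```

The surrogate GLM inherits the GBM's denoised view of the risk surface, which is why its Gini tends to beat a GLM fitted directly on the claims.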

Why bother

Benchmarked on synthetic UK motor data — 30,000 policies, 7 rating factors (driver age, vehicle value, NCD years, vehicle age, annual mileage, region, vehicle group), Poisson frequency model. CatBoost trained for 300 iterations; surrogate GLM with CART binning (max 10 bins per continuous variable). See benchmarks/benchmark.py for the full script; numbers in this table are illustrative of the range produced on this synthetic DGP.

| Metric | Direct GLM (fitted on claims) | GBM Surrogate GLM (this library) |
| --- | --- | --- |
| Gini coefficient | 0.2851 | 0.3087 |
| Gini ratio vs CatBoost (0.3241) | 88.0% | 95.2% |
| Deviance ratio | 0.8412 | 0.9143 |
| Max segment deviation | n/a | 8.3% |
| Factor table export | Manual binning required | Automatic, one call |
| Rating engine compatible | Yes | Yes |

The surrogate GLM recovers 95.2% of the GBM's Gini coefficient. A direct GLM fitted on the raw claims data achieves 88.0%. The 7-point difference is the noise reduction from fitting on GBM pseudo-predictions rather than individual claim events — the GBM has already smoothed the variance away.


Installation

uv add insurance-distill
# or
pip install insurance-distill

With CatBoost support:

uv add "insurance-distill[catboost]"

Questions or feedback? Start a Discussion. Found it useful? A star helps others find it.

Quick start

from insurance_distill import SurrogateGLM

# fitted_catboost: any sklearn-compatible model (CatBoost, sklearn GBM, etc.)
surrogate = SurrogateGLM(
    model=fitted_catboost,
    X_train=X_train,          # Polars DataFrame
    y_train=y_train,          # actual claim counts or amounts
    exposure=exposure_arr,    # earned car-years (or None for unit exposure)
    family="poisson",         # or "gamma" for severity
)

surrogate.fit(
    max_bins=10,                                   # bins per continuous variable
    interaction_pairs=[("driver_age", "region")],  # optional interaction terms
)

# Validation
report = surrogate.report()
print(report.metrics.summary())
# Gini (GBM):              0.3241
# Gini (GLM surrogate):    0.3087
# Gini ratio:              95.2%
# Deviance ratio:          0.9143
# Max segment deviation:   8.3%
# Mean segment deviation:  2.1%
# Segments evaluated:      312

# Inspect a single factor table
driver_age_table = surrogate.factor_table("driver_age")
print(driver_age_table)
# shape: (8, 3)
# | level              | log_coefficient | relativity |
# | [-inf, 21.00)      | 0.412           | 1.510      |
# | [21.00, 25.00)     | 0.218           | 1.244      |
# ...

# Export all factor tables as CSV (one file per variable)
surrogate.export_csv("output/factors/", prefix="motor_freq_")
# Writes: motor_freq_driver_age.csv, motor_freq_vehicle_value.csv, ...

Binning strategies

Three binning methods are available. The default (tree) is the right choice for most variables.

| Method | Description | When to use |
| --- | --- | --- |
| tree | CART decision tree on GBM pseudo-predictions | Default. Finds statistically meaningful cut-points. |
| quantile | Equal-frequency bins | Fallback when the tree produces degenerate splits. |
| isotonic | Change-points from isotonic regression | Monotone variables (e.g. no-claims discount, years held). |

You can mix methods per variable:

surrogate.fit(
    max_bins=10,
    binning_method="tree",
    method_overrides={
        "ncd_years": "isotonic",
        "vehicle_age": "quantile",
    },
)
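For intuition, tree binning of a single continuous variable can be sketched with a shallow scikit-learn CART: fit the tree to the GBM pseudo-predictions and read the split thresholds off as cut-points. This is an illustrative sketch, not the library's internal binner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_bin_edges(x, pseudo_pred, max_bins=10):
    """Cut-points for one continuous variable: fit a shallow CART to the
    GBM pseudo-predictions and return its sorted split thresholds."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins, min_samples_leaf=50)
    tree.fit(x.reshape(-1, 1), pseudo_pred)
    # Split nodes store the feature index; leaf nodes store -2
    cuts = tree.tree_.threshold[tree.tree_.feature == 0]
    return np.sort(cuts)

rng = np.random.default_rng(1)
age = rng.uniform(17, 90, 10_000)
pseudo = 0.25 - 0.002 * age + 0.00003 * (age - 45) ** 2  # smooth pseudo-frequency
edges = cart_bin_edges(age, pseudo, max_bins=10)          # at most 9 cut-points
```

With max_bins leaves, the tree has at most max_bins - 1 internal splits, so the number of bins is bounded exactly as the fit() argument promises.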

Validation metrics

After fitting, surrogate.report() returns a DistillationReport with:

  • Gini ratio: how much of the GBM's discrimination the GLM retains. Above 0.90 is generally acceptable; above 0.95 is excellent.
  • Deviance ratio: analogous to R-squared for GLMs. Measures how well the GLM explains the GBM's predictions.
  • Max segment deviation: maximum relative difference between GBM and GLM, across all combinations of binned levels. This is the most operationally relevant check — if the GLM is within 5% in every cell, the factor tables are faithful.
  • Double-lift chart: decile comparison of GBM vs GLM predictions, showing where the GLM under- or over-prices relative to the GBM.
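The Gini ratio can be sketched with a simple unweighted Lorenz-curve Gini; the library's exposure-weighted definition may differ in detail:

```python
import numpy as np

def gini(y_true, y_pred):
    """Unweighted Lorenz-curve Gini: order policies riskiest-first by the
    prediction and measure how concentrated the actual outcomes are."""
    order = np.argsort(-y_pred)                       # riskiest first
    cum_share = np.cumsum(y_true[order]) / y_true.sum()
    return 2.0 * (cum_share.mean() - 0.5)             # twice area above diagonal

# The Gini ratio reported above would then be gini(y, glm_pred) / gini(y, gbm_pred)
rng = np.random.default_rng(2)
risk = rng.gamma(2.0, 0.05, 4_000)        # true frequency per policy
claims = rng.poisson(risk).astype(float)
score = gini(claims, risk)                # ordering by true risk: positive Gini
```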

Expected Performance

Validated on synthetic UK motor frequency data with known true DGP (30,000-50,000 policies, 5 continuous rating factors + 1 categorical, CatBoost 300 iterations). Results from notebooks/02_validation_gini_retention.py (Databricks serverless, seed=42).

| Model | Gini coefficient | Gini ratio vs CatBoost |
| --- | --- | --- |
| CatBoost (GBM) | ~0.30-0.34 | 100% (reference) |
| Surrogate GLM (this library, max_bins=10) | ~0.28-0.33 | 90-97% |
| Direct GLM (fitted on raw claims) | ~0.26-0.30 | 85-91% |

Key findings from the validation notebook:

  • Surrogate GLM consistently outperforms direct GLM by 3-6 Gini ratio points. The reason: the surrogate learns from CatBoost's denoised predictions rather than individual claim counts. Poisson observation noise is already filtered out by the GBM; the GLM inherits that benefit.
  • Max segment deviation with max_bins=10 is typically 6-9%. Below 10% is acceptable for most rating engines; below 5% allows direct loading without manual review.
  • Gini ratio degrades gracefully with fewer bins: max_bins=5 costs roughly 3-5 Gini ratio points vs max_bins=10. There is no free lunch — simpler factor tables mean some discriminatory power is lost.
  • NCD years with isotonic binning produces monotone decreasing relativities, as expected. Tree binning on NCD occasionally produces non-monotone artefacts; isotonic binning eliminates this.
  • Factor tables export correctly: one CSV per variable, multiplicative relativities, base level at 1.0. Format is directly compatible with Radar and Emblem.

Effect of max_bins on Gini retention (typical range on this DGP):

| max_bins | Gini ratio vs CatBoost | Max segment deviation |
| --- | --- | --- |
| 5 | ~87-92% | ~10-15% |
| 7 | ~90-94% | ~8-12% |
| 10 | ~92-97% | ~6-9% |
| 15 | ~93-97% | ~5-8% |

10 bins is the inflection point: adding more bins yields diminishing Gini returns while increasing factor table complexity.

The honest caveat: the 90-97% range assumes a well-specified DGP. Books with very high interaction effects not captured in interaction_pairs, or with extreme exposure imbalances, can fall below 90%. Always inspect report.metrics.summary() — if the Gini ratio is below 88%, add interaction terms before signing off the tables.

Computational Performance

Benchmarked on Databricks serverless compute. All timings use the default tree binning strategy.

| Task | n=10,000 | n=50,000 | n=250,000 |
| --- | --- | --- | --- |
| SurrogateGLM.fit() (6 continuous + 1 categorical) | 0.4s | 1.8s | 9.1s |
| surrogate.report() (all metrics) | 0.2s | 0.6s | 2.9s |
| surrogate.export_csv() (7 factor tables) | < 0.1s | < 0.1s | < 0.1s |
| Full workflow end-to-end | 0.7s | 2.5s | 12.3s |

The dominant cost is the GLM fit in glum, which scales roughly linearly with rows. For portfolios above 500,000 policies you can pass a stratified subsample to SurrogateGLM for fitting and run report() on the full dataset — the factor tables are evaluated on all data regardless.
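A stratified subsample for the fit can be drawn with numpy; strata here is any categorical whose mix you want preserved (region, exposure band), and the helper is illustrative rather than part of the library:

```python
import numpy as np

def stratified_sample_idx(strata, n_target, seed=0):
    """Row indices for a proportional stratified subsample: every stratum
    is sampled at the same overall fraction, so the portfolio mix holds."""
    rng = np.random.default_rng(seed)
    frac = n_target / len(strata)
    picks = []
    for level in np.unique(strata):
        members = np.flatnonzero(strata == level)
        k = max(1, round(frac * len(members)))
        picks.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(picks))

region = np.repeat(np.arange(5), [100_000, 80_000, 60_000, 40_000, 20_000])
sub = stratified_sample_idx(region, n_target=50_000)   # roughly 1-in-6 per region
```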

Gini ratio by number of bins (max_bins): fewer bins reduce the GLM's degrees of freedom and lower the Gini ratio. Ten bins is a reasonable default for most continuous variables; dropping to five typically costs 2–4 Gini ratio points.

Design choices

Why glum, not statsmodels? glum is purpose-built for the kind of large, sparse GLMs that insurance pricing produces. It is 10–100x faster than statsmodels for problems with many one-hot encoded features, and it handles L1/L2 regularisation natively. The coefficient estimates are identical to statsmodels for the unregularised case.

Why Polars? We use Polars for data handling because it is faster and more memory-efficient than pandas for the aggregation operations (segment deviation, lift charts) that this library relies on. The GLM fitting itself uses numpy arrays internally, as glum requires.

Why pseudo-predictions, not actual claims? Fitting the GLM on GBM predictions rather than actual claims eliminates the noise from individual claim events. The GBM has already smoothed over that noise. Fitting the surrogate on the GBM's output gives a cleaner signal for the GLM to learn from, resulting in better-preserved Gini.

Multiplicative by construction The GLM always uses a log link function. This means the factor tables are multiplicative: the final premium is the product of the base rate and each factor. This is the convention used by Radar, Emblem, Guidewire, and most other UK personal lines rating engines.
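Concretely, with a log link the rating engine's multiplication and the GLM's log-scale addition are the same calculation (illustrative base rate and relativities, not fitted values):

```python
import math

# One policy's levels with illustrative relativities
base_rate = 0.085                      # expected claims per car-year at base levels
relativities = {
    "driver_age: [21, 25)": 1.244,
    "region: London": 1.310,
    "ncd_years: 5+": 0.720,
}

# Rating engine view: multiply the factors
premium_rate = base_rate * math.prod(relativities.values())

# GLM view: add log-coefficients, then exponentiate; identical result
log_sum = sum(math.log(f) for f in relativities.values())
assert abs(premium_rate - base_rate * math.exp(log_sum)) < 1e-12
```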

Factor table format

Each factor table is a Polars DataFrame with three columns:

| Column | Type | Description |
| --- | --- | --- |
| level | str | Bin label (e.g. [25.00, 40.00)) or category value |
| log_coefficient | float | Raw GLM coefficient on log scale (0.0 for base level) |
| relativity | float | Multiplicative factor = exp(log_coefficient) |

The base level (reference category) always has relativity = 1.0. All other levels are expressed relative to it.
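Converting log coefficients to relativities is plain exponentiation. Using the driver_age coefficients shown in the quick start, plus a hypothetical base level at 0.0:

```python
import math

# driver_age log-coefficients from the quick-start output; the third level
# is a hypothetical base level (coefficient 0.0 by construction)
log_coefs = {
    "[-inf, 21.00)": 0.412,
    "[21.00, 25.00)": 0.218,
    "[25.00, 40.00)": 0.0,
}
table = [
    {"level": lvl, "log_coefficient": c, "relativity": round(math.exp(c), 3)}
    for lvl, c in log_coefs.items()
]
# relativities come out as 1.510, 1.244 and 1.0
```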

Limitations

Gini ratio is not guaranteed. The 90–97% range cited is typical for well-structured motor and property books. Books with very high feature cardinality, strong non-linear interactions, or thin exposure in key segments can produce lower Gini ratios. Always inspect report.metrics.summary() before signing off the factor tables.

Interactions must be specified manually. The surrogate GLM does not automatically discover interaction terms. If the GBM is relying on a strong driver_age x region interaction, you need to pass interaction_pairs=[("driver_age", "region")] explicitly. Failure to do so will result in the GLM approximating the marginal effects only, and the max segment deviation will be high for those combinations.
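Crossing the binned levels of the two variables into a single categorical is the standard way to hand-build such a term; a sketch (the library presumably does the equivalent internally when you pass interaction_pairs):

```python
# Two policies' binned levels for the interacting variables
policy_age = ["[17, 21)", "[25, 40)"]
policy_region = ["London", "Wales"]

# Cross the two level labels into one combined categorical level,
# which the GLM then fits a separate coefficient for
interaction = [f"{a} x {r}" for a, r in zip(policy_age, policy_region)]
```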

Categorical variables with high cardinality. For categoricals with more than 30 levels (e.g. vehicle make, detailed occupation), the GLM will have many parameters and may overfit the GBM pseudo-predictions. Consider regrouping rare levels before fitting, or using regularisation via alpha_l2=.
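Regrouping rare levels before fitting can be as simple as pooling everything below a count threshold into a single OTHER level (a sketch; min_count is a hypothetical threshold, not a library parameter):

```python
from collections import Counter

def regroup_rare(levels, min_count=100, other="OTHER"):
    """Pool any level observed fewer than min_count times into one
    'OTHER' level, shrinking the GLM's parameter count."""
    counts = Counter(levels)
    return [lvl if counts[lvl] >= min_count else other for lvl in levels]

makes = ["FORD"] * 500 + ["VW"] * 300 + ["MORGAN"] * 4 + ["TVR"] * 2
grouped = regroup_rare(makes, min_count=100)
```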

No temporal validation. The surrogate fits on training data and is validated on the same data by default. For motor pricing, pass a held-out period (most recent accident year) to surrogate.report(X_val=, y_val=, exposure_val=) to confirm the factor tables generalise.

Rating engine rounding. Factor tables exported to CSV are stored at full floating-point precision. Most rating engines round to 3–4 decimal places. Rounding at the factor level can accumulate multiplicatively across many factors. Validate the rounded factors against the GLM predictions before loading into production.
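The accumulation effect is easy to quantify before loading: compare the product of rounded relativities against the full-precision product (illustrative factor values):

```python
import math

full = [1.2436, 0.7194, 1.5098, 0.9321, 1.0887]   # illustrative relativities
rounded = [round(f, 3) for f in full]             # what the rating engine stores

# Relative drift in the combined multiplicative factor after rounding
drift = math.prod(rounded) / math.prod(full) - 1.0
print(f"premium drift from rounding: {drift:+.4%}")
```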

glum regularisation defaults to zero. The default fit is unregularised. If the surrogate GLM is overfitting thin segments, pass alpha_l1= or alpha_l2= to surrogate.fit(). The regularisation path is not searched automatically.

No .predict() method on SurrogateGLM. The class does not yet expose a .predict(X) method for out-of-sample scoring. The benchmark script works around this by calling OptimalBinner.transform() and constructing the design matrix directly. A formal .predict() method is on the roadmap; until then, use the pattern in benchmarks/benchmark.py::predict_surrogate_on_holdout().

Requirements

  • Python >= 3.10
  • polars >= 0.20
  • numpy >= 1.24
  • scikit-learn >= 1.3
  • glum >= 2.0

References

  • Noll, A., Salzmann, R., & Wuthrich, M. V. (2020). Case study: French motor third-party liability claims. SSRN 3164764. The canonical reference for GBM-to-GLM distillation in insurance, demonstrating the pseudo-prediction approach on real French MTPL data.
  • Wuthrich, M. V., & Buser, C. (2023). Data analytics for non-life insurance pricing. RiskLab, ETH Zurich. Chapters 7-9 cover GLM surrogate methodology and factor table validation.
  • Yang, Y., Qian, W., & Zou, H. (2018). Insurance premium prediction via gradient tree-boosted Tweedie compound Poisson models. Journal of Business & Economic Statistics, 36(3), 456-470. Background on GBM frequency/severity models that precede distillation.
  • Lindholm, M., & Verrall, R. (2020). Regression models for non-life insurance pricing: Generalised linear models and beyond. Annals of Actuarial Science, 14(2), 370-399. Covers multiplicative GLM structure and rating factor interpretation.

Related libraries

| Library | What it does |
| --- | --- |
| shap-relativities | Extract rating relativities directly from a GBM using SHAP — an alternative to distillation when you don't need rating engine compatibility |
| insurance-causal | Establish whether each rating factor causally drives risk before committing it to the factor table |
| insurance-fairness | Proxy discrimination auditing — run this before distilling to identify which factors should not be in the GLM |

Other Burning Cost libraries

Model building

| Library | Description |
| --- | --- |
| shap-relativities | Extract rating relativities from GBMs using SHAP |
| insurance-interactions | Automated GLM interaction detection via CANN and NID scores |
| insurance-cv | Walk-forward cross-validation respecting IBNR structure |

Uncertainty quantification

| Library | Description |
| --- | --- |
| insurance-conformal | Distribution-free prediction intervals for Tweedie models |
| bayesian-pricing | Hierarchical Bayesian models for thin-data segments |
| insurance-credibility | Bühlmann-Straub credibility weighting |

Deployment and optimisation

| Library | Description |
| --- | --- |
| insurance-optimise | Constrained rate change optimisation with FCA PS21/5 compliance |
| insurance-causal | Causal inference — establishes whether a rating factor causally drives risk or is a proxy for a protected characteristic |
| insurance-fairness | Proxy discrimination auditing for UK insurance models |

Governance

| Library | Description |
| --- | --- |
| insurance-governance | PRA SS1/23 model validation reports |
| insurance-monitoring | Model monitoring: PSI, A/E ratios, Gini drift test |

All libraries and blog posts ->


Community

If this library saves you time, a star on GitHub helps others find it.

Licence

MIT. See LICENSE.


Need help implementing this in production? Talk to us.
