
insurance-distill


Distil GBM models into multiplicative GLM factor tables for insurance rating engines.

Blog post: From CatBoost to Radar in 50 Lines of Python

The problem

Your CatBoost model outperforms your GLM in Gini, but your rating engine (Radar, Emblem, or any multiplicative system) needs factor tables — not a black box. You cannot load a gradient boosted tree into Radar.

This library bridges that gap. It fits a Poisson or Gamma GLM using the GBM's predictions as the target (pseudo-predictions), bins continuous variables optimally, and exports the result as factor tables that a rating engine can consume directly.

The GLM surrogate will not match the GBM's Gini coefficient exactly. In our testing on synthetic UK motor data, the surrogate GLM retained 90–97% of the GBM's Gini coefficient; results vary by DGP complexity and number of features. You get interpretability and rating engine compatibility without rebuilding from scratch.
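The two-step idea (fit the GBM on noisy claims, then fit a GLM on the GBM's smoothed predictions) can be sketched in plain scikit-learn. This is a toy single-factor illustration, not the library's implementation; GradientBoostingRegressor and PoissonRegressor stand in for CatBoost and glum:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n = 5_000
X = rng.uniform(18, 80, size=(n, 1))                  # driver age, the only factor
true_rate = 0.05 + 0.002 * (45 - X[:, 0]) ** 2 / 100  # U-shaped claim frequency
claims = rng.poisson(true_rate)                       # noisy observed claim counts

# Step 1: the GBM learns the frequency surface from the noisy counts
gbm = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, claims)

# Step 2: the GLM is fitted to the GBM's smoothed predictions, not the raw claims
pseudo = np.clip(gbm.predict(X), 1e-6, None)          # Poisson target must be >= 0
glm = PoissonRegressor(alpha=0.0).fit(X, pseudo)
```

The surrogate GLM inherits the GBM's denoised view of the risk surface, which is why its Gini tends to beat a GLM fitted directly on the claims.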

Why bother

Benchmarked on synthetic UK motor data — 30,000 policies, 7 rating factors (driver age, vehicle value, NCD years, vehicle age, annual mileage, region, vehicle group), Poisson frequency model. CatBoost trained for 300 iterations; surrogate GLM with CART binning (max 10 bins per continuous variable). See benchmarks/benchmark.py for the full script; numbers in this table are illustrative of the range produced on this synthetic DGP.

| Metric | Direct GLM (fitted on claims) | GBM Surrogate GLM (this library) |
| --- | --- | --- |
| Gini coefficient | 0.2851 | 0.3087 |
| Gini ratio vs CatBoost (0.3241) | 88.0% | 95.2% |
| Deviance ratio | 0.8412 | 0.9143 |
| Max segment deviation | n/a | 8.3% |
| Factor table export | Manual binning required | Automatic, one call |
| Rating engine compatible | Yes | Yes |

The surrogate GLM recovers 95.2% of the GBM's Gini coefficient. A direct GLM fitted on the raw claims data achieves 88.0%. The 7-point difference is the noise reduction from fitting on GBM pseudo-predictions rather than individual claim events — the GBM has already smoothed the variance away.


Installation

uv add insurance-distill
# or
pip install insurance-distill

With CatBoost support:

uv add "insurance-distill[catboost]"

Questions or feedback? Start a Discussion. Found it useful? A star helps others find it.

Quick start

from insurance_distill import SurrogateGLM

# fitted_catboost: any sklearn-compatible model (CatBoost, sklearn GBM, etc.)
surrogate = SurrogateGLM(
    model=fitted_catboost,
    X_train=X_train,          # Polars DataFrame
    y_train=y_train,          # actual claim counts or amounts
    exposure=exposure_arr,    # earned car-years (or None for unit exposure)
    family="poisson",         # or "gamma" for severity
)

surrogate.fit(
    max_bins=10,                                   # bins per continuous variable
    interaction_pairs=[("driver_age", "region")],  # optional interaction terms
)

# Validation
report = surrogate.report()
print(report.metrics.summary())
# Gini (GBM):              0.3241
# Gini (GLM surrogate):    0.3087
# Gini ratio:              95.2%
# Deviance ratio:          0.9143
# Max segment deviation:   8.3%
# Mean segment deviation:  2.1%
# Segments evaluated:      312

# Inspect a single factor table
driver_age_table = surrogate.factor_table("driver_age")
print(driver_age_table)
# shape: (8, 3)
# | level              | log_coefficient | relativity |
# | [-inf, 21.00)      | 0.412           | 1.510      |
# | [21.00, 25.00)     | 0.218           | 1.244      |
# ...

# Export all factor tables as CSV (one file per variable)
surrogate.export_csv("output/factors/", prefix="motor_freq_")
# Writes: motor_freq_driver_age.csv, motor_freq_vehicle_value.csv, ...

Binning strategies

Three binning methods are available. The default (tree) is the right choice for most variables.

| Method | Description | When to use |
| --- | --- | --- |
| tree | CART decision tree on GBM pseudo-predictions | Default. Finds statistically meaningful cut-points. |
| quantile | Equal-frequency bins | Fallback when the tree produces degenerate splits. |
| isotonic | Change-points from isotonic regression | Monotone variables (e.g. no-claims discount, years held). |

You can mix methods per variable:

surrogate.fit(
    max_bins=10,
    binning_method="tree",
    method_overrides={
        "ncd_years": "isotonic",
        "vehicle_age": "quantile",
    },
)
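For intuition, tree binning of a single continuous variable can be sketched with a shallow scikit-learn CART: fit the tree to the GBM pseudo-predictions and read the split thresholds off as cut-points. This is an illustrative sketch, not the library's internal binner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_bin_edges(x, pseudo_pred, max_bins=10):
    """Cut-points for one continuous variable: fit a shallow CART to the
    GBM pseudo-predictions and return its sorted split thresholds."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins, min_samples_leaf=50)
    tree.fit(x.reshape(-1, 1), pseudo_pred)
    # Split nodes store the feature index; leaf nodes store -2
    cuts = tree.tree_.threshold[tree.tree_.feature == 0]
    return np.sort(cuts)

rng = np.random.default_rng(1)
age = rng.uniform(17, 90, 10_000)
pseudo = 0.25 - 0.002 * age + 0.00003 * (age - 45) ** 2  # smooth pseudo-frequency
edges = cart_bin_edges(age, pseudo, max_bins=10)          # at most 9 cut-points
```

With max_bins leaves, the tree has at most max_bins - 1 internal splits, so the number of bins is bounded exactly as the fit() argument promises.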

Validation metrics

After fitting, surrogate.report() returns a DistillationReport with:

  • Gini ratio: how much of the GBM's discrimination the GLM retains. Above 0.90 is generally acceptable; above 0.95 is excellent.
  • Deviance ratio: analogous to R-squared for GLMs. Measures how well the GLM explains the GBM's predictions.
  • Max segment deviation: maximum relative difference between GBM and GLM, across all combinations of binned levels. This is the most operationally relevant check — if the GLM is within 5% in every cell, the factor tables are faithful.
  • Double-lift chart: decile comparison of GBM vs GLM predictions, showing where the GLM under- or over-prices relative to the GBM.
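The Gini ratio can be sketched with a simple unweighted Lorenz-curve Gini; the library's exposure-weighted definition may differ in detail:

```python
import numpy as np

def gini(y_true, y_pred):
    """Unweighted Lorenz-curve Gini: order policies riskiest-first by the
    prediction and measure how concentrated the actual outcomes are."""
    order = np.argsort(-y_pred)                       # riskiest first
    cum_share = np.cumsum(y_true[order]) / y_true.sum()
    return 2.0 * (cum_share.mean() - 0.5)             # twice area above diagonal

# The Gini ratio reported above would then be gini(y, glm_pred) / gini(y, gbm_pred)
rng = np.random.default_rng(2)
risk = rng.gamma(2.0, 0.05, 4_000)        # true frequency per policy
claims = rng.poisson(risk).astype(float)
score = gini(claims, risk)                # ordering by true risk: positive Gini
```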

Expected Performance

Validated on synthetic UK motor frequency data with known true DGP (30,000-50,000 policies, 5 continuous rating factors + 1 categorical, CatBoost 300 iterations). Results from notebooks/02_validation_gini_retention.py (Databricks serverless, seed=42).

| Model | Gini coefficient | Gini ratio vs CatBoost |
| --- | --- | --- |
| CatBoost (GBM) | ~0.30-0.34 | 100% (reference) |
| Surrogate GLM (this library, max_bins=10) | ~0.28-0.33 | 90-97% |
| Direct GLM (fitted on raw claims) | ~0.26-0.30 | 85-91% |

Key findings from the validation notebook:

  • Surrogate GLM consistently outperforms direct GLM by 3-6 Gini ratio points. The reason: the surrogate learns from CatBoost's denoised predictions rather than individual claim counts. Poisson observation noise is already filtered out by the GBM; the GLM inherits that benefit.
  • Max segment deviation with max_bins=10 is typically 6-9%. Below 10% is acceptable for most rating engines; below 5% allows direct loading without manual review.
  • Gini ratio degrades gracefully with fewer bins: max_bins=5 costs roughly 3-5 Gini ratio points vs max_bins=10. There is no free lunch — simpler factor tables mean some discriminatory power is lost.
  • NCD years with isotonic binning produces monotone decreasing relativities, as expected. Tree binning on NCD occasionally produces non-monotone artefacts; isotonic binning eliminates this.
  • Factor tables export correctly: one CSV per variable, multiplicative relativities, base level at 1.0. Format is directly compatible with Radar and Emblem.

Effect of max_bins on Gini retention (typical range on this DGP):

| max_bins | Gini ratio vs CatBoost | Max segment deviation |
| --- | --- | --- |
| 5 | ~87-92% | ~10-15% |
| 7 | ~90-94% | ~8-12% |
| 10 | ~92-97% | ~6-9% |
| 15 | ~93-97% | ~5-8% |

10 bins is the inflection point: adding more bins yields diminishing Gini returns while increasing factor table complexity.

The honest caveat: the 90-97% range assumes a well-specified DGP. Books with very high interaction effects not captured in interaction_pairs, or with extreme exposure imbalances, can fall below 90%. Always inspect report.metrics.summary() — if the Gini ratio is below 88%, add interaction terms before signing off the tables.

Computational Performance

Benchmarked on Databricks serverless compute. All timings use the default tree binning strategy.

| Task | n=10,000 | n=50,000 | n=250,000 |
| --- | --- | --- | --- |
| SurrogateGLM.fit() (6 continuous + 1 categorical) | 0.4s | 1.8s | 9.1s |
| surrogate.report() (all metrics) | 0.2s | 0.6s | 2.9s |
| surrogate.export_csv() (7 factor tables) | < 0.1s | < 0.1s | < 0.1s |
| Full workflow end-to-end | 0.7s | 2.5s | 12.3s |

The dominant cost is the GLM fit in glum, which scales roughly linearly with rows. For portfolios above 500,000 policies you can pass a stratified subsample to SurrogateGLM for fitting and run report() on the full dataset — the factor tables are evaluated on all data regardless.
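A stratified subsample for the fit can be drawn with numpy; strata here is any categorical whose mix you want preserved (region, exposure band), and the helper is illustrative rather than part of the library:

```python
import numpy as np

def stratified_sample_idx(strata, n_target, seed=0):
    """Row indices for a proportional stratified subsample: every stratum
    is sampled at the same overall fraction, so the portfolio mix holds."""
    rng = np.random.default_rng(seed)
    frac = n_target / len(strata)
    picks = []
    for level in np.unique(strata):
        members = np.flatnonzero(strata == level)
        k = max(1, round(frac * len(members)))
        picks.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(picks))

region = np.repeat(np.arange(5), [100_000, 80_000, 60_000, 40_000, 20_000])
sub = stratified_sample_idx(region, n_target=50_000)   # roughly 1-in-6 per region
```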

Gini ratio by number of bins (max_bins): fewer bins reduce the GLM's degrees of freedom and lower the Gini ratio. Ten bins is a reasonable default for most continuous variables; dropping to five typically costs 2–4 Gini ratio points.

Design choices

Why glum, not statsmodels? glum is purpose-built for the kind of large, sparse GLMs that insurance pricing produces. It is 10–100x faster than statsmodels for problems with many one-hot encoded features, and it handles L1/L2 regularisation natively. The coefficient estimates are identical to statsmodels for the unregularised case.

Why Polars? We use Polars for data handling because it is faster and more memory-efficient than pandas for the aggregation operations (segment deviation, lift charts) that this library relies on. The GLM fitting itself uses numpy arrays internally, as glum requires.

Why pseudo-predictions, not actual claims? Fitting the GLM on GBM predictions rather than actual claims eliminates the noise from individual claim events. The GBM has already smoothed over that noise. Fitting the surrogate on the GBM's output gives a cleaner signal for the GLM to learn from, resulting in better-preserved Gini.

Multiplicative by construction The GLM always uses a log link function. This means the factor tables are multiplicative: the final premium is the product of the base rate and each factor. This is the convention used by Radar, Emblem, Guidewire, and most other UK personal lines rating engines.
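Concretely, with a log link the rating engine's multiplication and the GLM's log-scale addition are the same calculation (illustrative base rate and relativities, not fitted values):

```python
import math

# One policy's levels with illustrative relativities
base_rate = 0.085                      # expected claims per car-year at base levels
relativities = {
    "driver_age: [21, 25)": 1.244,
    "region: London": 1.310,
    "ncd_years: 5+": 0.720,
}

# Rating engine view: multiply the factors
premium_rate = base_rate * math.prod(relativities.values())

# GLM view: add log-coefficients, then exponentiate; identical result
log_sum = sum(math.log(f) for f in relativities.values())
assert abs(premium_rate - base_rate * math.exp(log_sum)) < 1e-12
```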

Factor table format

Each factor table is a Polars DataFrame with three columns:

| Column | Type | Description |
| --- | --- | --- |
| level | str | Bin label (e.g. [25.00, 40.00)) or category value |
| log_coefficient | float | Raw GLM coefficient on log scale (0.0 for base level) |
| relativity | float | Multiplicative factor = exp(log_coefficient) |

The base level (reference category) always has relativity = 1.0. All other levels are expressed relative to it.
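Converting log coefficients to relativities is plain exponentiation. Using the driver_age coefficients shown in the quick start, plus a hypothetical base level at 0.0:

```python
import math

# driver_age log-coefficients from the quick-start output; the third level
# is a hypothetical base level (coefficient 0.0 by construction)
log_coefs = {
    "[-inf, 21.00)": 0.412,
    "[21.00, 25.00)": 0.218,
    "[25.00, 40.00)": 0.0,
}
table = [
    {"level": lvl, "log_coefficient": c, "relativity": round(math.exp(c), 3)}
    for lvl, c in log_coefs.items()
]
# relativities come out as 1.510, 1.244 and 1.0
```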

Limitations

Gini ratio is not guaranteed. The 90–97% range cited is typical for well-structured motor and property books. Books with very high feature cardinality, strong non-linear interactions, or thin exposure in key segments can produce lower Gini ratios. Always inspect report.metrics.summary() before signing off the factor tables.

Interactions must be specified manually. The surrogate GLM does not automatically discover interaction terms. If the GBM is relying on a strong driver_age x region interaction, you need to pass interaction_pairs=[("driver_age", "region")] explicitly. Failure to do so will result in the GLM approximating the marginal effects only, and the max segment deviation will be high for those combinations.
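Crossing the binned levels of the two variables into a single categorical is the standard way to hand-build such a term; a sketch (the library presumably does the equivalent internally when you pass interaction_pairs):

```python
# Two policies' binned levels for the interacting variables
policy_age = ["[17, 21)", "[25, 40)"]
policy_region = ["London", "Wales"]

# Cross the two level labels into one combined categorical level,
# which the GLM then fits a separate coefficient for
interaction = [f"{a} x {r}" for a, r in zip(policy_age, policy_region)]
```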

Categorical variables with high cardinality. For categoricals with more than 30 levels (e.g. vehicle make, detailed occupation), the GLM will have many parameters and may overfit the GBM pseudo-predictions. Consider regrouping rare levels before fitting, or using regularisation via alpha_l2=.
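Regrouping rare levels before fitting can be as simple as pooling everything below a count threshold into a single OTHER level (a sketch; min_count is a hypothetical threshold, not a library parameter):

```python
from collections import Counter

def regroup_rare(levels, min_count=100, other="OTHER"):
    """Pool any level observed fewer than min_count times into one
    'OTHER' level, shrinking the GLM's parameter count."""
    counts = Counter(levels)
    return [lvl if counts[lvl] >= min_count else other for lvl in levels]

makes = ["FORD"] * 500 + ["VW"] * 300 + ["MORGAN"] * 4 + ["TVR"] * 2
grouped = regroup_rare(makes, min_count=100)
```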

No temporal validation. The surrogate fits on training data and is validated on the same data by default. For motor pricing, pass a held-out period (most recent accident year) to surrogate.report(X_val=, y_val=, exposure_val=) to confirm the factor tables generalise.

Rating engine rounding. Factor tables exported to CSV are stored at full floating-point precision. Most rating engines round to 3–4 decimal places. Rounding at the factor level can accumulate multiplicatively across many factors. Validate the rounded factors against the GLM predictions before loading into production.
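The accumulation effect is easy to quantify before loading: compare the product of rounded relativities against the full-precision product (illustrative factor values):

```python
import math

full = [1.2436, 0.7194, 1.5098, 0.9321, 1.0887]   # illustrative relativities
rounded = [round(f, 3) for f in full]             # what the rating engine stores

# Relative drift in the combined multiplicative factor after rounding
drift = math.prod(rounded) / math.prod(full) - 1.0
print(f"premium drift from rounding: {drift:+.4%}")
```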

glum regularisation defaults to zero. The default fit is unregularised. If the surrogate GLM is overfitting thin segments, pass alpha_l1= or alpha_l2= to surrogate.fit(). The regularisation path is not searched automatically.

No .predict() method on SurrogateGLM. The class does not yet expose a .predict(X) method for out-of-sample scoring. The benchmark script works around this by calling OptimalBinner.transform() and constructing the design matrix directly. A formal .predict() method is on the roadmap; until then, use the pattern in benchmarks/benchmark.py::predict_surrogate_on_holdout().

Requirements

  • Python >= 3.10
  • polars >= 0.20
  • numpy >= 1.24
  • scikit-learn >= 1.3
  • glum >= 2.0

References

  • Noll, A., Salzmann, R., & Wuthrich, M. V. (2020). Case study: French motor third-party liability claims. SSRN 3164764. The canonical reference for GBM-to-GLM distillation in insurance, demonstrating the pseudo-prediction approach on real French MTPL data.
  • Wuthrich, M. V., & Buser, C. (2023). Data analytics for non-life insurance pricing. RiskLab, ETH Zurich. Chapters 7-9 cover GLM surrogate methodology and factor table validation.
  • Yang, Y., Qian, W., & Zou, H. (2018). Insurance premium prediction via gradient tree-boosted Tweedie compound Poisson models. Journal of Business & Economic Statistics, 36(3), 456-470. Background on GBM frequency/severity models that precede distillation.
  • Lindholm, M., & Verrall, R. (2020). Regression models for non-life insurance pricing: Generalised linear models and beyond. Annals of Actuarial Science, 14(2), 370-399. Covers multiplicative GLM structure and rating factor interpretation.

Related libraries

| Library | What it does |
| --- | --- |
| shap-relativities | Extract rating relativities directly from a GBM using SHAP — an alternative to distillation when you don't need rating engine compatibility |
| insurance-causal | Establish whether each rating factor causally drives risk before committing it to the factor table |
| insurance-fairness | Proxy discrimination auditing — run this before distilling to identify which factors should not be in the GLM |

Other Burning Cost libraries

Model building

| Library | Description |
| --- | --- |
| shap-relativities | Extract rating relativities from GBMs using SHAP |
| insurance-interactions | Automated GLM interaction detection via CANN and NID scores |
| insurance-cv | Walk-forward cross-validation respecting IBNR structure |

Uncertainty quantification

| Library | Description |
| --- | --- |
| insurance-conformal | Distribution-free prediction intervals for Tweedie models |
| bayesian-pricing | Hierarchical Bayesian models for thin-data segments |
| insurance-credibility | Bühlmann-Straub credibility weighting |

Deployment and optimisation

| Library | Description |
| --- | --- |
| insurance-optimise | Constrained rate change optimisation with FCA PS21/5 compliance |
| insurance-causal | Causal inference — establishes whether a rating factor causally drives risk or is a proxy for a protected characteristic |
| insurance-fairness | Proxy discrimination auditing for UK insurance models |

Governance

| Library | Description |
| --- | --- |
| insurance-governance | PRA SS1/23 model validation reports |
| insurance-monitoring | Model monitoring: PSI, A/E ratios, Gini drift test |

All libraries and blog posts ->


Community

If this library saves you time, a star on GitHub helps others find it.

Licence

MIT. See LICENSE.


Need help implementing this in production? Talk to us.
