insurance-distill

Distil GBM models into multiplicative GLM factor tables for insurance rating engines.

The problem

Your CatBoost model outperforms your GLM in Gini, but your rating engine (Radar, Emblem, or any multiplicative system) needs factor tables - not a black box. You cannot load a gradient boosted tree into Radar.

This library bridges that gap. It fits a Poisson or Gamma GLM using the GBM's predictions as the target (pseudo-predictions), bins continuous variables optimally, and exports the result as factor tables that a rating engine can consume directly.

The GLM surrogate will not match the GBM's Gini coefficient exactly. A well-tuned distillation typically retains 90-97% of the GBM's discrimination. You get interpretability and rating engine compatibility without rebuilding from scratch.

Installation

uv add insurance-distill

With CatBoost support:

uv add "insurance-distill[catboost]"

Quick start

from insurance_distill import SurrogateGLM

# fitted_catboost: any sklearn-compatible model (CatBoost, sklearn GBM, etc.)
surrogate = SurrogateGLM(
    model=fitted_catboost,
    X_train=X_train,          # Polars DataFrame
    y_train=y_train,          # actual claim counts or amounts
    exposure=exposure_arr,    # earned car-years (or None for unit exposure)
    family="poisson",         # or "gamma" for severity
)

surrogate.fit(
    max_bins=10,                                   # bins per continuous variable
    interaction_pairs=[("driver_age", "region")],  # optional interaction terms
)

# Validation
report = surrogate.report()
print(report.metrics.summary())
# Gini (GBM):              0.3241
# Gini (GLM surrogate):    0.3087
# Gini ratio:              95.2%
# Deviance ratio:          0.9143
# Max segment deviation:   8.3%
# Mean segment deviation:  2.1%
# Segments evaluated:      312

# Inspect a single factor table
driver_age_table = surrogate.factor_table("driver_age")
print(driver_age_table)
# shape: (8, 3)
# | level              | log_coefficient | relativity |
# | [-inf, 21.00)      | 0.412           | 1.510      |
# | [21.00, 25.00)     | 0.218           | 1.244      |
# ...

# Export all factor tables as CSV (one file per variable)
surrogate.export_csv("output/factors/", prefix="motor_freq_")
# Writes: motor_freq_driver_age.csv, motor_freq_vehicle_value.csv, ...

Binning strategies

Three binning methods are available. The default (tree) is the right choice for most variables.

  • tree: CART decision tree on the GBM pseudo-predictions. The default; finds statistically meaningful cut-points.
  • quantile: equal-frequency bins. A fallback when the tree produces degenerate splits.
  • isotonic: change-points from isotonic regression. For monotone variables (e.g. no-claims discount, years held).

You can mix methods per variable:

surrogate.fit(
    max_bins=10,
    binning_method="tree",
    method_overrides={
        "ncd_years": "isotonic",
        "vehicle_age": "quantile",
    },
)
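The isotonic method's change-point idea can be sketched with scikit-learn's IsotonicRegression: fit a monotone step function to the pseudo-predictions and take bin edges where it jumps. (An illustration under assumed details - the jump tolerance and edge rule here are hypothetical, not the library's exact algorithm.)

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
ncd_years = rng.uniform(0, 9, 2_000)
# Pseudo-predictions that step down at 3 and 6 NCD years, plus noise
pseudo = np.exp(-0.1 * np.floor(ncd_years / 3)) + rng.normal(0, 0.02, 2_000)

# Fit a non-increasing step function to the pseudo-predictions
iso = IsotonicRegression(increasing=False)
fitted = iso.fit_transform(ncd_years, pseudo)

# Candidate cut-points: where the fitted function drops by more than a tolerance
order = np.argsort(ncd_years)
x_sorted, f_sorted = ncd_years[order], fitted[order]
jumps = np.flatnonzero(np.abs(np.diff(f_sorted)) > 0.03)
edges = x_sorted[jumps + 1]  # bin edges for the factor table
```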

Validation metrics

After fitting, surrogate.report() returns a DistillationReport with:

  • Gini ratio: how much of the GBM's discrimination the GLM retains. Above 0.90 is generally acceptable; above 0.95 is excellent.
  • Deviance ratio: analogous to R-squared for GLMs. Measures how well the GLM explains the GBM's predictions.
  • Max segment deviation: maximum relative difference between GBM and GLM, across all combinations of binned levels. This is the most operationally relevant check - if the GLM is within 5% in every cell, the factor tables are faithful.
  • Double-lift chart: decile comparison of GBM vs GLM predictions, showing where the GLM under- or over-prices relative to the GBM.
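For reference, an exposure-weighted Gini and the resulting Gini ratio can be computed like this (one common Lorenz-curve definition; the library's exact formula is not documented here, and the data below is synthetic):

```python
import numpy as np

def gini(actual, pred, exposure):
    """Area between the Lorenz curve of actual losses, ordered by
    predicted risk (riskiest first), and the diagonal, times two."""
    order = np.argsort(pred / exposure)[::-1]
    a, e = actual[order], exposure[order]
    cum_loss = np.cumsum(a) / a.sum()
    cum_exp = np.cumsum(e) / e.sum()
    # trapezoid rule for the area under the Lorenz curve
    widths = np.diff(np.concatenate([[0.0], cum_exp]))
    heights = (cum_loss + np.concatenate([[0.0], cum_loss[:-1]])) / 2
    return 2 * (np.sum(widths * heights) - 0.5)

rng = np.random.default_rng(2)
n = 10_000
risk = rng.gamma(2.0, 0.5, n)
exposure = np.ones(n)
actual = rng.poisson(risk * exposure)
gbm_pred = risk * np.exp(rng.normal(0, 0.05, n))  # sharp model
glm_pred = risk * np.exp(rng.normal(0, 0.40, n))  # blunter surrogate

gini_ratio = gini(actual, glm_pred, exposure) / gini(actual, gbm_pred, exposure)
```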

Design choices

Why glum, not statsmodels? glum is purpose-built for the kind of large, sparse GLMs that insurance pricing produces. It is 10-100x faster than statsmodels for problems with many one-hot encoded features, and it handles L1/L2 regularisation natively. The coefficient estimates are identical to statsmodels for the unregularised case.

Why Polars? We use Polars for data handling because it is faster and more memory-efficient than pandas for the aggregation operations (segment deviation, lift charts) that this library relies on. The GLM fitting itself uses numpy arrays internally, as glum requires.

Why pseudo-predictions, not actual claims? Fitting the GLM on GBM predictions rather than actual claims eliminates the noise from individual claim events. The GBM has already smoothed over that noise. Fitting the surrogate on the GBM's output gives a cleaner signal for the GLM to learn from, resulting in better-preserved Gini.

Multiplicative by construction. The GLM always uses a log link function, so the factor tables are multiplicative: the final premium is the product of the base rate and each factor's relativity. This is the convention used by Radar, Emblem, Guidewire, and most other UK personal lines rating engines.
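Concretely, pricing from the exported tables reduces to a product of looked-up relativities (the base rate and factor values below are made up):

```python
import math

base_rate = 120.0  # hypothetical annual base premium
# Relativities looked up from the exported factor tables (illustrative values)
factors = {"driver_age": 1.244, "region": 0.950, "vehicle_value": 1.080}

premium = base_rate * math.prod(factors.values())
# Equivalent on the log scale: exp(log(base_rate) + sum of log_coefficients)
```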

Factor table format

Each factor table is a Polars DataFrame with three columns:

  • level (str): bin label (e.g. [25.00, 40.00)) or a category value
  • log_coefficient (float): raw GLM coefficient on the log scale (0.0 for the base level)
  • relativity (float): multiplicative factor, exp(log_coefficient)

The base level (reference category) always has relativity = 1.0. All other levels are expressed relative to it.

Requirements

  • Python >= 3.10
  • polars >= 0.20
  • numpy >= 1.24
  • scikit-learn >= 1.3
  • glum >= 2.0
