Automated GLM factor-level clustering for insurance pricing using the R2VF algorithm

insurance-glm-cluster

Automated GLM factor-level clustering for UK motor insurance pricing.

The problem

Every motor pricing actuary knows this: you have a factor with 16 vehicle age bands, and you need to work out which ones can be merged without losing predictive signal. Do you merge band 8 with 7 or with 9? What about the extremes where there are three policies and one claim?

Today this is done manually. You plot the loss ratios, eyeball the pattern, argue about it in a model governance meeting, and end up with something defensible but not optimal. With 20 rating factors and hundreds of levels between them, that process takes weeks.

This library automates it using the R2VF algorithm (Ben Dror 2025, arXiv:2503.01521). The core idea: for ordinal factors, the fused lasso (which merges adjacent levels) reduces to a standard L1 lasso after a change of basis. That means you can use existing, well-tested solvers rather than writing a custom optimiser.

What it does

R2VF Step 2: fits a penalised GLM on the split-coded design matrix. When the lasso shrinks a "difference" coefficient to zero, it's merging two adjacent levels. BIC picks the regularisation strength automatically.
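To make the "zero difference means merged" idea concrete, here is a minimal sketch of how group labels could be read off a vector of fitted difference coefficients. The helper name `groups_from_deltas` is hypothetical and is not part of this library's API; the threshold mirrors the `tol` parameter described in the API section.

```python
def groups_from_deltas(deltas, tol=1e-8):
    """Map K ordinal levels to group labels from K-1 difference coefficients.

    Hypothetical helper for illustration only. A delta with |delta| <= tol is
    treated as zero, so that level stays in the same group as the level below
    it; any other delta starts a new group.
    """
    labels = [0]  # the lowest level always opens group 0
    for d in deltas:
        labels.append(labels[-1] + (1 if abs(d) > tol else 0))
    return labels


# Six levels, five differences; two of them are clearly non-zero.
deltas = [0.0, 3e-10, 0.312, 0.0, -0.2]
labels = groups_from_deltas(deltas)  # [0, 0, 0, 1, 1, 2]
```

Levels 0 to 2 share a group because the differences between them are below `tol`; the first non-zero difference opens group 1.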

R2VF Step 3: refits an unpenalised GLM on the merged encoding. This removes the shrinkage bias from Step 2 and gives you proper MLE estimates.
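For intuition about what the unpenalised refit produces, consider the simplest case: a Poisson frequency model with a log link and a single merged factor. There the MLE has a closed form, namely total claims divided by total exposure within each group. The sketch below assumes that one-factor setting and uses a hypothetical function name; with several factors the refit has to be solved iteratively, which is what a GLM fitting routine does.

```python
import math
from collections import defaultdict


def refit_single_factor(groups, claims, exposure):
    """Closed-form Poisson MLE for one merged factor (illustrative sketch).

    Returns {group: log relativity versus the lowest-numbered group}.
    Assumes every group has at least one claim, so the log is defined.
    """
    total_claims = defaultdict(float)
    total_exposure = defaultdict(float)
    for g, y, e in zip(groups, claims, exposure):
        total_claims[g] += y
        total_exposure[g] += e
    rates = {g: total_claims[g] / total_exposure[g] for g in total_claims}
    base_rate = rates[min(rates)]
    return {g: math.log(r / base_rate) for g, r in rates.items()}


groups = [0, 0, 1, 1, 1]            # merged group label per policy
claims = [1, 3, 4, 2, 6]            # claim counts
exposure = [10.0, 10.0, 10.0, 5.0, 5.0]

coefs = refit_single_factor(groups, claims, exposure)
# Group 0 rate is 4/20, group 1 rate is 12/20, so group 1 has coefficient log(3).
```

Because no penalty is involved, these estimates are not pulled towards each other the way the Step 2 coefficients are.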

MVP scope: ordinal factors, Poisson and Gamma families, BIC lambda selection, min-exposure constraint.

Installation

pip install insurance-glm-cluster

Quick start

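import pandas as pd
from insurance_glm_cluster import FactorClusterer

# X, y and exposure are assumed to exist already and to be row-aligned:
#   X        - pandas DataFrame of rating factors, including the ordinal columns below
#   y        - claim counts
#   exposure - years at risk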

fc = FactorClusterer(
    family='poisson',     # claim frequency
    lambda_='bic',        # automatic lambda selection
    min_exposure=500.0,   # merge groups with < 500 exposure years
)

fc.fit(
    X,
    y,                           # claim counts
    exposure=exposure,           # years at risk
    ordinal_factors=['vehicle_age', 'ncd_years'],
)

# Inspect the groupings
lm = fc.level_map('vehicle_age')
print(lm.to_df())
#  original_level  merged_group  coefficient  group_exposure
#               0             0        0.000        2341.2
#               1             0        0.000        2287.8
#               2             0        0.000        2319.4
#               3             1        0.312        2201.3
#               ...

# Recode and refit
X_merged = fc.transform(X)
result = fc.refit_glm(X_merged, y, exposure=exposure)

API

FactorClusterer

Parameter     Type                 Description
family        'poisson' | 'gamma'  Response distribution
lambda_       float | 'bic'        Regularisation strength, or auto-select
n_lambda      int                  Grid size for BIC search (default 50)
min_exposure  float                Minimum group exposure (default 0, disabled)
tol           float                Zero-threshold for delta coefficients (default 1e-8)

.fit(X, y, exposure, ordinal_factors)

Fits Step 2 (penalised fusion) and determines merged groups.

.transform(X)

Returns a copy of X with factor columns replaced by integer group labels.

.refit_glm(X, y, exposure)

Fits Step 3 (unpenalised refit) and returns a statsmodels GLMResults object.

.level_map(factor)

Returns a LevelMap for the named factor.

.diagnostic_path

DiagnosticPath object with BIC, deviance, and n_groups per lambda. None if lambda was fixed.

LevelMap

lm = fc.level_map('vehicle_age')
lm.n_levels         # 16
lm.n_groups         # 3
lm.to_df()          # tidy DataFrame: original_level, merged_group, coefficient, group_exposure
lm.group_summary()  # one row per group with list of constituent levels
lm.apply(series)    # recode a series of original values to group labels

Algorithm notes

Split-coding: for an ordinal factor with K levels and coefficients β, define δⱼ = βⱼ - βⱼ₋₁. The fused lasso penalty λ·Σ|δⱼ| is a plain L1 penalty on the deltas. Build the design matrix so column j has 1s for all observations with level ≥ j. Now the lasso on this matrix is equivalent to the fused lasso on the original one-hot matrix.
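A small sketch of the split-coded design described above, in plain Python. It assumes levels are coded as integers 0..K-1 and that level 0 is the base level absorbed by the intercept (so there are K-1 columns); both are illustrative assumptions, and the function names are hypothetical.

```python
def split_code(levels, n_levels):
    """Split-coded rows for integer levels 0..n_levels-1.

    Column j (j = 1..n_levels-1) is 1 when level >= j, else 0.
    """
    return [[1 if lvl >= j else 0 for j in range(1, n_levels)] for lvl in levels]


def betas_from_deltas(deltas):
    """Recover level effects: beta_0 = 0, beta_j = delta_1 + ... + delta_j."""
    betas = [0.0]
    for d in deltas:
        betas.append(betas[-1] + d)
    return betas


levels = [0, 1, 2, 3]
Z = split_code(levels, n_levels=4)
# Z == [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]

deltas = [0.0, 0.3, 0.0]            # only the step from level 1 to 2 is non-zero
betas = betas_from_deltas(deltas)   # [0.0, 0.0, 0.3, 0.3]

# Each split-coded row times the deltas gives that level's own beta,
# which is why an L1 penalty on the deltas acts as a fused penalty on the betas.
linear_pred = [sum(z * d for z, d in zip(row, deltas)) for row in Z]
```

With two of the three deltas at zero, the four levels collapse into two groups: {0, 1} and {2, 3}.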

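Exposure handling for Poisson: fitting on the observed rate y/exposure with weight=exposure gives the same coefficient estimates as a Poisson GLM with a log(exposure) offset, because the two log-likelihoods differ only by a term that does not depend on the coefficients. sklearn's PoissonRegressor has no offset argument, so exposure is supplied to it this way, through sample_weight.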
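The equivalence can be checked numerically. The sketch below writes out both log-likelihoods for arbitrary made-up data and shows that the gap between them is the same whatever rates are plugged in, so both forms are maximised by the same coefficients. The function names and data are illustrative only.

```python
import math


def loglik_offset(counts, exposures, rates):
    """Poisson log-likelihood of counts with mean exposure * rate (offset form)."""
    return sum(
        y * math.log(e * r) - e * r - math.lgamma(y + 1)
        for y, e, r in zip(counts, exposures, rates)
    )


def loglik_weighted(counts, exposures, rates):
    """Exposure-weighted Poisson kernel for the observed rate y / e."""
    return sum(
        e * ((y / e) * math.log(r) - r)
        for y, e, r in zip(counts, exposures, rates)
    )


counts = [0, 2, 1, 5]
exposures = [0.5, 1.0, 0.25, 2.0]

# Two unrelated sets of candidate fitted rates.
rates_a = [0.8, 1.5, 2.0, 2.2]
rates_b = [1.1, 0.9, 3.0, 2.6]

gap_a = loglik_offset(counts, exposures, rates_a) - loglik_weighted(counts, exposures, rates_a)
gap_b = loglik_offset(counts, exposures, rates_b) - loglik_weighted(counts, exposures, rates_b)
# gap_a and gap_b agree: the difference depends on the data only, not on the rates.
```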

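BIC lambda selection: fits n_lambda values (default 50) on a log-spaced grid from lambda_max down to lambda_max/1000, where lambda_max is the smallest penalty at which all factors collapse to a single group. The lambda with the lowest BIC is kept, with BIC = -2·ℓ + K_eff·log(n), where ℓ is the log-likelihood, n is the number of observations, and K_eff counts distinct groups across all factors.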
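As a rough sketch of the selection rule, the snippet below builds the log-spaced grid and picks the lowest-BIC entry from a path of fits. The helper names are hypothetical, and the three path entries are invented numbers used purely to illustrate the trade-off; they are not results from this library.

```python
import math


def lambda_grid(lambda_max, n_lambda=50, ratio=1000.0):
    """Log-spaced grid from lambda_max down to lambda_max / ratio."""
    step = math.log(ratio) / (n_lambda - 1)
    return [lambda_max * math.exp(-i * step) for i in range(n_lambda)]


def bic(log_likelihood, k_eff, n_obs):
    """BIC = -2 * loglik + K_eff * log(n)."""
    return -2.0 * log_likelihood + k_eff * math.log(n_obs)


def select_lambda(path, n_obs):
    """path: list of (lambda, log_likelihood, k_eff); returns the lowest-BIC entry."""
    return min(path, key=lambda row: bic(row[1], row[2], n_obs))


grid = lambda_grid(lambda_max=1.0)

# Invented path: heavier penalties give fewer groups but a worse fit.
path = [
    (1.0, -520.0, 2),
    (0.1, -505.0, 4),
    (0.01, -503.5, 9),
]
best = select_lambda(path, n_obs=1000)  # the middle entry wins here
```

The middle entry wins because going from 4 to 9 groups improves the log-likelihood by only 1.5, which does not pay for five extra log(n) penalties.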

Min-exposure: after fusion, groups below min_exposure are absorbed into their nearest-coefficient neighbour rather than their nearest-level one, so a tiny group joins whichever group it already looks most like.
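One possible reading of this rule is sketched below. It assumes that "neighbour" means an adjacent group on the ordinal scale, that the smallest under-exposed group is handled first, and that the merged group temporarily takes an exposure-weighted mean coefficient. All three are assumptions for illustration, and the function name is hypothetical.

```python
def absorb_small_groups(groups, min_exposure):
    """Fold under-exposed groups into the adjacent group with the closest coefficient.

    groups: list of dicts in ordinal order with "levels", "coef", "exposure".
    Returns a new list; the input list and its dicts are left untouched.
    """
    groups = [dict(g, levels=list(g["levels"])) for g in groups]
    while len(groups) > 1:
        small = [i for i, g in enumerate(groups) if g["exposure"] < min_exposure]
        if not small:
            break
        i = min(small, key=lambda k: groups[k]["exposure"])  # smallest group first
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(groups)]
        j = min(neighbours, key=lambda k: abs(groups[k]["coef"] - groups[i]["coef"]))
        src, dst = groups[i], groups[j]
        total = src["exposure"] + dst["exposure"]
        # Placeholder coefficient only; an unpenalised refit would re-estimate it.
        dst["coef"] = (src["coef"] * src["exposure"] + dst["coef"] * dst["exposure"]) / total
        dst["exposure"] = total
        dst["levels"] = sorted(dst["levels"] + src["levels"])
        del groups[i]
    return groups


fused = [
    {"levels": [0, 1, 2], "coef": 0.00, "exposure": 2000.0},
    {"levels": [3], "coef": 0.28, "exposure": 40.0},
    {"levels": [4, 5], "coef": 0.31, "exposure": 1800.0},
]
merged = absorb_small_groups(fused, min_exposure=500.0)
# Level 3 joins levels 4 and 5 (coefficient 0.31 is closer to 0.28 than 0.00 is).
```

Level 3 sits equally close to both neighbours on the level scale, so it is the coefficient distance that decides where it goes.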

References

Ben Dror, R. (2025). R2VF: Regularized Ranking for Variable Fusion in GLMs. arXiv:2503.01521.
