Automated GLM factor-level clustering for insurance pricing using the R2VF algorithm

insurance-glm-cluster

Automated GLM factor-level clustering for UK motor insurance pricing.

The problem

Every motor pricing actuary knows this: you have a factor with 16 vehicle age bands, and you need to work out which ones can be merged without losing predictive signal. Do you merge band 8 with 7 or with 9? What about the extremes where there are three policies and one claim?

Today this is done manually. You plot the loss ratios, eyeball the pattern, argue about it in a model governance meeting, and end up with something defensible but not optimal. With 20 rating factors and hundreds of levels between them, that process takes weeks.

This library automates it using the R2VF algorithm (Ben Dror 2025, arXiv:2503.01521). The core idea: for ordinal factors, the fused lasso (which merges adjacent levels) reduces to a standard L1 lasso after a change of basis. That means you can use existing, well-tested solvers rather than writing a custom optimiser.

What it does

R2VF Step 2: fits a penalised GLM on the split-coded design matrix. When the lasso shrinks a "difference" coefficient to zero, it's merging two adjacent levels. BIC picks the regularisation strength automatically.
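
As a rough illustration of how zeroed difference coefficients translate into merged groups, the sketch below turns a vector of fitted deltas into group labels. The helper name `groups_from_deltas` and the example values are hypothetical and not part of the library's API; the default threshold mirrors the `tol` parameter.

```python
import numpy as np

def groups_from_deltas(delta, tol=1e-8):
    """Map fitted difference coefficients to merged-group labels.

    delta[j-1] holds the estimated difference between level j and level j-1,
    so a factor with K levels has K-1 deltas.
    """
    delta = np.asarray(delta, dtype=float)
    breaks = np.abs(delta) > tol          # True where a new group starts
    # Level 0 always opens group 0; each surviving delta opens the next group.
    return np.concatenate(([0], np.cumsum(breaks)))

# Hypothetical Step 2 output for a 6-level factor: two deltas survive.
delta_hat = [0.0, 0.0, 0.31, 0.0, 0.12]
labels = groups_from_deltas(delta_hat)
print(labels)  # [0 0 0 1 1 2]
```

Levels 0 to 2 share a group because the deltas between them are zero; the first non-zero delta opens group 1 at level 3.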

R2VF Step 3: refits an unpenalised GLM on the merged encoding. This removes the shrinkage bias from Step 2 and gives you proper MLE estimates.

MVP scope: ordinal factors, Poisson and Gamma families, BIC lambda selection, min-exposure constraint.

Installation

pip install insurance-glm-cluster

Quick start

import pandas as pd
from insurance_glm_cluster import FactorClusterer

fc = FactorClusterer(
    family='poisson',     # claim frequency
    lambda_='bic',        # automatic lambda selection
    min_exposure=500.0,   # merge groups with < 500 exposure years
)

fc.fit(
    X,                           # DataFrame with one column per rating factor
    y,                           # claim counts
    exposure=exposure,           # years at risk
    ordinal_factors=['vehicle_age', 'ncd_years'],
)

# Inspect the groupings
lm = fc.level_map('vehicle_age')
print(lm.to_df())
#  original_level  merged_group  coefficient  group_exposure
#               0             0        0.000        2341.2
#               1             0        0.000        2287.8
#               2             0        0.000        2319.4
#               3             1        0.312        2201.3
#               ...

# Recode and refit
X_merged = fc.transform(X)
result = fc.refit_glm(X_merged, y, exposure=exposure)
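
The quick start assumes `X`, `y` and `exposure` already exist. For experimenting, inputs of a plausible shape can be simulated as in the sketch below. The integer level coding, the 0 to 9 range for `ncd_years` and the stepped frequency pattern are assumptions made for illustration, not requirements documented by the library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Assumed layout: one integer-coded column per ordinal rating factor.
X = pd.DataFrame({
    "vehicle_age": rng.integers(0, 16, size=n),   # 16 bands, coded 0..15
    "ncd_years": rng.integers(0, 10, size=n),     # hypothetical 0..9 range
})
exposure = pd.Series(rng.uniform(0.1, 1.0, size=n), name="exposure")

# Frequency rises in steps with vehicle age, so adjacent bands share a rate.
step = np.select([X["vehicle_age"] < 3, X["vehicle_age"] < 10],
                 [0.0, 0.3], default=0.6)
rate = np.exp(-2.5 + step - 0.05 * X["ncd_years"])
y = pd.Series(rng.poisson(exposure * rate), name="claims")
```

With data like this, a sensible clustering of `vehicle_age` would recover roughly three groups (bands 0 to 2, 3 to 9 and 10 to 15).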

API

`FactorClusterer`

Parameter	Type	Description
`family`	`'poisson'` | `'gamma'`	Response distribution
`lambda_`	`float` | `'bic'`	Regularisation strength, or auto-select
`n_lambda`	`int`	Grid size for BIC search (default 50)
`min_exposure`	`float`	Minimum group exposure (default 0, disabled)
`tol`	`float`	Zero-threshold for delta coefficients (default 1e-8)

`.fit(X, y, exposure, ordinal_factors)`

Fits Step 2 (penalised fusion) and determines merged groups.

`.transform(X)`

Returns a copy of X with factor columns replaced by integer group labels.

`.refit_glm(X, y, exposure)`

Fits Step 3 (unpenalised refit) and returns the fitted statsmodels GLMResults object.

`.level_map(factor)`

Returns a LevelMap for the named factor.

`.diagnostic_path`

DiagnosticPath object with BIC, deviance, and n_groups per lambda. None if lambda_ was passed as a fixed float rather than 'bic'.

`LevelMap`

lm = fc.level_map('vehicle_age')
lm.n_levels         # 16
lm.n_groups         # 3
lm.to_df()          # tidy DataFrame: original_level, merged_group, coefficient, group_exposure
lm.group_summary()  # one row per group with list of constituent levels
lm.apply(series)    # recode a series of original values to group labels

Algorithm notes

Split-coding: for an ordinal factor with K levels and coefficients β, take the lowest level as the baseline and define δⱼ = βⱼ - βⱼ₋₁ for each higher level j. The fused lasso penalty λ·Σ|δⱼ| is then a plain L1 penalty on the deltas. Because βⱼ is the sum of all deltas up to level j, the design matrix is built so that column j has 1s for all observations with level ≥ j. The lasso on this matrix is equivalent to the fused lasso on the original one-hot matrix.
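
The change of basis can be checked numerically. The sketch below builds a split-coded matrix for a small example and confirms that it reproduces the one-hot linear predictor. The function name `split_code`, the 0-based integer level coding and the coefficient values are assumptions for illustration.

```python
import numpy as np

def split_code(levels, n_levels):
    """Column j (for j = 1..K-1) is 1 where the observation's level is >= j."""
    levels = np.asarray(levels)
    thresholds = np.arange(1, n_levels)
    return (levels[:, None] >= thresholds[None, :]).astype(float)

K = 5
levels = np.array([0, 1, 2, 3, 4, 2])
beta = np.array([0.0, 0.2, 0.2, 0.5, 0.9])   # level effects, level 0 as baseline
delta = np.diff(beta)                         # delta_j = beta_j - beta_{j-1}

one_hot = np.eye(K)[levels]                   # standard dummy coding
S = split_code(levels, K)

# Both parameterisations give the same linear predictor.
assert np.allclose(one_hot @ beta, S @ delta)
print(S)
```

Here the delta between levels 1 and 2 is exactly zero, which is the situation the lasso produces when it merges those two levels.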

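Exposure handling for Poisson: with a log link, fitting on (y/exposure, weight=exposure) is algebraically equivalent to a Poisson GLM on y with a log(exposure) offset, because the two log-likelihoods differ only by a term that does not depend on the coefficients. sklearn's PoissonRegressor has no offset argument, so this is how exposure is passed to it: the claim rate as the target and exposure as sample_weight.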
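
The equivalence can be verified with a small numerical experiment. The sketch below uses a hand-written IRLS fitter in NumPy (a stand-in chosen to keep the example dependency-free, not the solver this library uses) and fits the same simulated data both ways.

```python
import numpy as np

def fit_poisson_irls(X, y, weights=None, offset=None, n_iter=50):
    """Poisson GLM with log link, fitted by iteratively reweighted least squares."""
    n, p = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    off = np.zeros(n) if offset is None else np.asarray(offset, dtype=float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta + off
        mu = np.exp(eta)
        z = (eta - off) + (y - mu) / mu      # working response, offset removed
        W = w * mu                           # working weights
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
exposure = rng.uniform(0.1, 1.0, size=n)
y = rng.poisson(exposure * np.exp(-1.0 + 0.4 * x))

# Formulation 1: counts with a log(exposure) offset.
beta_offset = fit_poisson_irls(X, y, offset=np.log(exposure))
# Formulation 2: claim rate as the response, exposure as the weight.
beta_rate = fit_poisson_irls(X, y / exposure, weights=exposure)

assert np.allclose(beta_offset, beta_rate)
```

Both fits solve the same score equations, Σ(yᵢ - exposureᵢ·μᵢ)·xᵢ = 0, so the coefficients agree to numerical precision.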

BIC lambda selection: fits n_lambda values (default 50) from lambda_max down to lambda_max/1000 on a log scale. lambda_max is the smallest penalty at which all factors collapse to a single group. BIC = -2·ℓ + K_eff·log(n), where ℓ is the log-likelihood, n is the number of observations, and K_eff counts distinct groups across all factors.
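
A small sketch of the grid and the criterion described above. The function names are hypothetical, and the log-likelihood path is made-up illustrative data rather than output from a real fit.

```python
import numpy as np

def lambda_grid(lambda_max, n_lambda=50, ratio=1e-3):
    """Log-spaced grid from lambda_max down to lambda_max * ratio."""
    return np.geomspace(lambda_max, lambda_max * ratio, num=n_lambda)

def bic(log_likelihood, k_eff, n_obs):
    """BIC = -2 * loglik + K_eff * log(n)."""
    return -2.0 * log_likelihood + k_eff * np.log(n_obs)

grid = lambda_grid(lambda_max=2.0)

# Hypothetical path: (log-likelihood, number of distinct groups) per lambda.
path = [(-5210.0, 2), (-5150.0, 4), (-5146.0, 7), (-5144.5, 12)]
scores = [bic(ll, k, n_obs=10_000) for ll, k in path]
best = int(np.argmin(scores))
print(best, scores[best])
```

In this made-up path the second entry wins: going from 2 to 4 groups buys a large likelihood gain, while the later splits improve the fit by less than the log(n) cost per extra group.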

Min-exposure: after fusion, groups below min_exposure are absorbed into their nearest-coefficient neighbour (not nearest-level — a tiny group gets absorbed into whichever group it already looks most like).
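
One way the absorption rule could be sketched is shown below. It rests on two assumptions that the text does not confirm: the candidate targets are the groups immediately to the left and right of the small group, and the merged coefficient is an exposure-weighted mean used only as a placeholder until the refit. The function name and example values are hypothetical.

```python
def absorb_small_groups(coefs, exposures, min_exposure):
    """Merge under-exposed groups into the adjacent group with the closest coefficient.

    coefs and exposures are per-group lists in level order. Returns the merged
    (coefs, exposures, members), where members[i] lists the original group indices.
    """
    coefs, exposures = list(coefs), list(exposures)
    members = [[i] for i in range(len(coefs))]
    while len(coefs) > 1:
        small = min(range(len(coefs)), key=lambda i: exposures[i])
        if exposures[small] >= min_exposure:
            break
        # Candidates are the groups immediately to the left and right (an assumption).
        neighbours = [j for j in (small - 1, small + 1) if 0 <= j < len(coefs)]
        target = min(neighbours, key=lambda j: abs(coefs[j] - coefs[small]))
        total = exposures[target] + exposures[small]
        # Placeholder coefficient: exposure-weighted mean; a refit would replace it.
        coefs[target] = (coefs[target] * exposures[target]
                         + coefs[small] * exposures[small]) / total
        exposures[target] = total
        members[target] = sorted(members[target] + members[small])
        for seq in (coefs, exposures, members):
            del seq[small]
    return coefs, exposures, members

coefs, exposures, members = absorb_small_groups(
    coefs=[0.0, 0.31, 0.35, 0.80],
    exposures=[6900.0, 120.0, 4100.0, 2500.0],
    min_exposure=500.0,
)
print(members)  # [[0], [1, 2], [3]]
```

The group with 120 exposure years joins the group with coefficient 0.35 rather than the baseline group, because 0.31 is closer to 0.35 than to 0.0.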

References

Ben Dror, R. (2025). R2VF: Regularized Ranking for Variable Fusion in GLMs. arXiv:2503.01521.

Release history

Version	Released
0.2.0	Mar 14, 2026
0.1.1	Mar 15, 2026
0.1.0 (this version)	Mar 11, 2026

Download files

Source Distribution

insurance_glm_cluster-0.1.0.tar.gz (27.0 kB)

Uploaded Mar 11, 2026 Source

Built Distribution

insurance_glm_cluster-0.1.0-py3-none-any.whl (21.5 kB)

Uploaded Mar 11, 2026 Python 3

File details

Details for the file insurance_glm_cluster-0.1.0.tar.gz.

File metadata

Download URL: insurance_glm_cluster-0.1.0.tar.gz
Upload date: Mar 11, 2026
Size: 27.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.8 {"installer":{"name":"uv","version":"0.10.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for insurance_glm_cluster-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c258c04c036636d049b0d570ba42c8ea9b3be6237d0501c1d46222d3f2c41538`
MD5	`714e05a4706eb2080b60644234396cdc`
BLAKE2b-256	`a3bf24d410ce877b204208b3068e6ca838e2e189e253bf9886c17d7d31f27699`

File details

Details for the file insurance_glm_cluster-0.1.0-py3-none-any.whl.

File metadata

Download URL: insurance_glm_cluster-0.1.0-py3-none-any.whl
Upload date: Mar 11, 2026
Size: 21.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.8 {"installer":{"name":"uv","version":"0.10.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for insurance_glm_cluster-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8220c0cc19787e85071affffc0e9ac7ef7bfeac3570942343865c4362742e0a2`
MD5	`d4f22afe9cc27db8ca0691c45c31d780`
BLAKE2b-256	`ec6731d0880a7a41003cbc4f08ce16a867d59453dfe5c736a562b63af7c630bd`
