Automated GLM factor level clustering for insurance pricing using the R2VF algorithm
The problem
You've got 500 vehicle makes in your motor book. Your pricing GLM needs to handle them. You can't fit 500 dummies — the data is too thin, the model will overfit, and you'll end up with nonsense relativities for rare makes.
The traditional fix is manual grouping: spend a week in Excel, consult a book of makes and models, build a lookup table, argue with underwriters. This works but doesn't scale, introduces analyst bias, and has to be redone every model cycle.
insurance-glm-cluster automates this. It collapses high-cardinality categorical factors into pricing bands using regularised regression, with proper statistical underpinning and no arbitrary decisions.
How it works
The library implements the R2VF algorithm (Ben Dror, arXiv:2503.01521, 2025). The key insight is that the standard fused lasso approach — penalising differences between adjacent factor level coefficients — requires a natural ordering. Ordinal factors (vehicle age, NCD years) have one; nominal factors (vehicle make, occupation) don't.
R2VF solves this in three steps:
Step 1 — Ranking. Fit a Ridge GLM on all factor dummies simultaneously. The resulting coefficients give a data-driven ordering for each nominal factor: levels with similar risk profiles end up adjacent, levels with different profiles end up far apart.
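The ranking step can be sketched on a toy factor. This is an illustrative standalone example, not the library's internals; it uses a Gaussian ridge as a stand-in for the Poisson ridge GLM, and the make names and effects are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
makes = np.array(["AUDI", "BMW", "FORD", "KIA", "SEAT"])
df = pd.DataFrame({"make": rng.choice(makes, size=2000)})
true_effect = {"AUDI": -0.1, "BMW": -0.1, "FORD": 0.1, "KIA": 0.1, "SEAT": 0.3}
y = df["make"].map(true_effect) + rng.normal(0, 0.5, len(df))

# One dummy per level; the ridge penalty shrinks thin levels toward the mean
X = pd.get_dummies(df["make"], dtype=float)
ridge = Ridge(alpha=10.0).fit(X, y)

# The ridge coefficients induce a data-driven ordering of the nominal levels:
# similar-risk makes (here AUDI/BMW, then FORD/KIA) end up adjacent
ranking = X.columns[np.argsort(ridge.coef_)]
print(list(ranking))
```

Once the levels are ranked this way, the factor can be treated as ordinal in Step 2.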
Step 2 — Fusion. Re-encode each nominal factor as ordinal using the Step 1 ranking. Apply a standard fused lasso (via the split-coding trick) to all factors. Where the fused lasso penalty drives adjacent-level differences to zero, those levels are merged.
Step 3 — Refit. Fit an unpenalised GLM on the merged groupings to remove shrinkage bias from Step 2.
The split-coding trick is what makes this practical without cvxpy or specialised solvers: transform the design matrix so that standard L1 (sklearn Lasso) achieves the fused lasso objective. No quadratic programming required.
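A minimal sketch of the split-coding trick on a single already-ranked factor; the sizes and signal are made up for illustration. Column k of the split-coded design is 1 when the observation's level is at or above k, so the k-th coefficient equals the jump between adjacent levels and plain L1 on those jumps is exactly the 1-D fused lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
K = 6
levels = rng.integers(0, K, 3000)                 # factor levels already ranked 0..5
beta = np.array([0.0, 0.0, 0.0, 0.5, 0.5, 0.5])   # two true groups of levels
y = beta[levels] + rng.normal(0, 0.2, 3000)

# Split coding: column k is 1{level >= k} for k = 1..K-1, so coefficient k
# estimates the jump beta_k - beta_{k-1} between adjacent levels
Z = (levels[:, None] >= np.arange(1, K)[None, :]).astype(float)

# Standard L1 on the split-coded design == fused lasso on the original levels
lasso = Lasso(alpha=0.05).fit(Z, y)
jumps = lasso.coef_
merged = np.concatenate([[0], np.cumsum(np.abs(jumps) > 1e-8)])
print(merged)   # adjacent levels with a zero estimated jump share a group
```

Levels whose estimated jump is driven to exactly zero collapse into one group, which is the merge rule Step 2 describes.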
Installation
pip install insurance-glm-cluster
With the faster glum backend:
pip install insurance-glm-cluster[fast]
With plotting:
pip install insurance-glm-cluster[plot]
Quick start
from insurance_glm_cluster import FactorClusterer
clusterer = FactorClusterer(
    family='poisson',
    link='log',
    lambda_='bic',                  # select regularisation via BIC
    min_exposure=500,               # merge groups with < 500 earned years
    monotone_factors=['ncd'],       # enforce monotonicity on NCD
    monotone_direction={'ncd': 'decreasing'},
)
clusterer.fit(
    X,
    y,
    exposure=exposure,
    ordinal_factors=['vehicle_age', 'ncd'],
    nominal_factors=['vehicle_make', 'occupation'],
)
# Merged group codes — drop-in replacement for original columns
X_merged = clusterer.transform(X)
# Inspect the groupings
lm = clusterer.level_map('vehicle_make')
print(lm.to_df())
# original_level merged_group coefficient exposure
# 0 AUDI 0 -0.12 4521.3
# 1 BMW 0 -0.12 3892.1
# 2 FORD 1 0.08 18920.4
# ...
# Unpenalised GLM on merged factors
result = clusterer.refit_glm(X_merged, y, exposure=exposure)
print(result.summary())
# Diagnostics
diag = clusterer.diagnostics()
print(f"Vehicle make: {diag['n_levels_before']['vehicle_make']} → "
      f"{diag['n_levels_after']['vehicle_make']} groups")
print(f"AIC before: {diag['aic_before']:.1f}, after: {diag['aic_after']:.1f}")
API
FactorClusterer
| Parameter | Type | Default | Description |
|---|---|---|---|
| family | str | 'poisson' | GLM family: 'poisson', 'gamma', 'tweedie' |
| link | str | 'log' | Link function |
| method | str | 'r2vf' | Clustering method (only 'r2vf' in Phase 1) |
| lambda_ | float or 'bic' | 'bic' | Regularisation strength, or 'bic' to select via BIC |
| n_ordinal_bins | int | 30 | Initial bins for numeric/ordinal factors |
| m_nominal_bins | int | 75 | Maximum dummy levels for nominal factors in Step 1 |
| alpha | float | 2.0 | 1.0 = Lasso, 2.0 = Ridge for nominal Step 1 |
| min_exposure | float | None | Minimum exposure per merged group |
| min_claims | int | None | Minimum claims per merged group |
| monotone_factors | list | [] | Factors to enforce monotonicity on |
| monotone_direction | dict | {} | Per-factor direction: 'increasing' or 'decreasing' |
| backend | str | 'statsmodels' | GLM backend for refit: 'statsmodels' or 'glum' |
| random_state | int | 42 | Random seed |
LevelMap
Returned by clusterer.level_map(factor_name).
lm.to_df() # DataFrame: original_level | merged_group | coefficient | exposure
lm.n_groups() # int: number of merged groups
lm.n_levels_original() # int: original cardinality
lm.compression_ratio() # float: levels / groups
lm.validate_monotone('increasing') # bool
lm.plot() # matplotlib Figure (requires [plot] extra)
Design decisions
Why R2VF and not generalised fused lasso directly? GFL with all-pairs penalties is O(K²) in the number of levels. For 500 vehicle makes, that's 124,750 penalty terms. R2VF reduces this to O(K) by using the Step 1 ranking to impose an ordering, then running standard (1D) fused lasso over the K − 1 adjacent pairs.
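The penalty-term counts are straightforward to verify:

```python
# Penalty-term counts for K levels: all-pairs fusion vs. the ranked 1-D chain
K = 500
all_pairs = K * (K - 1) // 2   # generalised fused lasso over every pair of levels
chain = K - 1                  # fused lasso along the ridge-ranked ordering
print(all_pairs, chain)        # → 124750 499
```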
Why sklearn Lasso for the fusion step, not statsmodels? Statsmodels' penalised GLM support (elastic net via fit_regularized) doesn't cover fused penalties. The split-coding trick converts the fused lasso into a standard L1 problem on a transformed design matrix, which sklearn Lasso solves efficiently via coordinate descent. The regression target is exposure-adjusted (y/exposure with exposure as sample weights) to approximate the Poisson log-likelihood within sklearn's Gaussian-only Lasso.
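A sketch of that exposure-adjusted working target, as an assumption about the approach described above rather than the library's internals; the data and effect sizes are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
exposure = rng.uniform(0.1, 2.0, 1000)            # earned vehicle-years
x = rng.normal(size=(1000, 3))
claims = rng.poisson(exposure * np.exp(0.2 * x[:, 0]))

# Gaussian Lasso on the empirical frequency, weighted by exposure, so that
# high-exposure (low-variance) rows dominate — a working approximation to
# the Poisson log-likelihood within sklearn's Gaussian-only Lasso
freq = claims / exposure
lasso = Lasso(alpha=0.01).fit(x, freq, sample_weight=exposure)
print(lasso.coef_)
```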
Why BIC for lambda selection? Cross-validation on insurance data is methodologically awkward: policies across years are correlated, and CV folds will contain leakage from multi-year policyholders. BIC selects a lambda that balances fit and complexity in-sample, which is appropriate when the goal is factor grouping rather than held-out prediction.
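A hedged sketch of BIC-based lambda selection on a Gaussian working model; the BIC formula (using nonzero coefficients as the degrees of freedom) and the grid are illustrative assumptions, not the library's code:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = X[:, 0] * 1.0 + rng.normal(0, 1.0, n)   # one true signal, nine noise factors

def bic(model, X, y):
    resid = y - model.predict(X)
    df = np.count_nonzero(model.coef_)       # nonzero coefficients as model df
    return n * np.log(np.mean(resid ** 2)) + df * np.log(n)

# Pick the lambda that balances in-sample fit against complexity
grid = [0.001, 0.01, 0.05, 0.1, 0.5]
fits = [Lasso(alpha=a).fit(X, y) for a in grid]
best = min(zip(grid, fits), key=lambda t: bic(t[1], X, y))
print(best[0], np.count_nonzero(best[1].coef_))
```

No held-out fold is needed, which sidesteps the multi-year policyholder leakage problem described above.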
Why is the refit step separate from fit()?
Actuaries need to review the groupings before committing to a refit. The level_map() output is designed for this: you can inspect, challenge, and manually adjust the groups before running refit_glm(). Keeping the steps separate also means the clustering output is backend-agnostic.
References
- Ben Dror, I. (2025). Variable Fusion for Insurance Pricing: R2VF Algorithm. arXiv:2503.01521.
- Tibshirani, R. J., & Taylor, J. (2011). The solution path of the generalized lasso. Annals of Statistics, 39(3), 1335–1371.