
insurance-glm-cluster

Automated GLM factor level clustering for insurance pricing.

The problem

You've got 500 vehicle makes in your motor book. Your pricing GLM needs to handle them. You can't fit 500 dummies — the data is too thin, the model will overfit, and you'll end up with nonsense relativities for rare makes.

The traditional fix is manual grouping: spend a week in Excel, consult a book of makes and models, build a lookup table, argue with underwriters. This works but doesn't scale, introduces analyst bias, and has to be redone every model cycle.

insurance-glm-cluster automates this. It collapses high-cardinality categorical factors into pricing bands using regularised regression, with proper statistical underpinning and no arbitrary decisions.

How it works

The library implements the R2VF algorithm (Ben Dror, arXiv:2503.01521, 2025). The key insight is that the standard fused lasso approach — penalising differences between adjacent factor level coefficients — requires a natural ordering. Ordinal factors (vehicle age, NCD years) have one; nominal factors (vehicle make, occupation) don't.

R2VF solves this in three steps:

Step 1 — Ranking. Fit a Ridge GLM on all factor dummies simultaneously. The resulting coefficients give a data-driven ordering for each nominal factor: levels with similar risk profiles end up adjacent, levels with different profiles end up far apart.

Step 2 — Fusion. Re-encode each nominal factor as ordinal using the Step 1 ranking. Apply a standard fused lasso (via the split-coding trick) to all factors. Where the fused lasso penalty drives adjacent-level differences to zero, those levels are merged.

Step 3 — Refit. Fit an unpenalised GLM on the merged groupings to remove shrinkage bias from Step 2.

The split-coding trick is what makes this practical without cvxpy or specialised solvers: transform the design matrix so that standard L1 (sklearn Lasso) achieves the fused lasso objective. No quadratic programming required.

Installation

pip install insurance-glm-cluster

With the faster glum backend:

pip install insurance-glm-cluster[fast]

With plotting:

pip install insurance-glm-cluster[plot]

Quick start

from insurance_glm_cluster import FactorClusterer

clusterer = FactorClusterer(
    family='poisson',
    link='log',
    lambda_='bic',              # select regularisation via BIC
    min_exposure=500,           # merge groups with < 500 earned years
    monotone_factors=['ncd'],   # enforce NCD to be monotone decreasing
    monotone_direction={'ncd': 'decreasing'},
)

clusterer.fit(
    X,
    y,
    exposure=exposure,
    ordinal_factors=['vehicle_age', 'ncd'],
    nominal_factors=['vehicle_make', 'occupation'],
)

# Merged group codes — drop-in replacement for original columns
X_merged = clusterer.transform(X)

# Inspect the groupings
lm = clusterer.level_map('vehicle_make')
print(lm.to_df())
#   original_level  merged_group  coefficient  exposure
# 0          AUDI             0        -0.12    4521.3
# 1           BMW             0        -0.12    3892.1
# 2          FORD             1         0.08   18920.4
# ...

# Unpenalised GLM on merged factors
result = clusterer.refit_glm(X_merged, y, exposure=exposure)
print(result.summary())

# Diagnostics
diag = clusterer.diagnostics()
print(f"Vehicle make: {diag['n_levels_before']['vehicle_make']} → "
      f"{diag['n_levels_after']['vehicle_make']} groups")
print(f"AIC before: {diag['aic_before']:.1f}, after: {diag['aic_after']:.1f}")

API

FactorClusterer

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| family | str | 'poisson' | GLM family: 'poisson', 'gamma', 'tweedie' |
| link | str | 'log' | Link function |
| method | str | 'r2vf' | Clustering method (only 'r2vf' in Phase 1) |
| lambda_ | float or 'bic' | 'bic' | Regularisation strength, or 'bic' for automatic selection |
| n_ordinal_bins | int | 30 | Initial bins for numeric/ordinal factors |
| m_nominal_bins | int | 75 | Maximum dummy levels for nominal factors in Step 1 |
| alpha | float | 2.0 | Penalty exponent for nominal Step 1: 1.0 = Lasso, 2.0 = Ridge |
| min_exposure | float | None | Minimum exposure per merged group |
| min_claims | int | None | Minimum claims per merged group |
| monotone_factors | list | [] | Factors to enforce monotonicity on |
| monotone_direction | dict | {} | Per-factor direction: 'increasing' or 'decreasing' |
| backend | str | 'statsmodels' | GLM backend for refit: 'statsmodels' or 'glum' |
| random_state | int | 42 | Random seed |

LevelMap

Returned by clusterer.level_map(factor_name).

lm.to_df()                      # DataFrame: original_level | merged_group | coefficient | exposure
lm.n_groups()                   # int: number of merged groups
lm.n_levels_original()          # int: original cardinality
lm.compression_ratio()          # float: levels / groups
lm.validate_monotone('increasing')  # bool
lm.plot()                       # matplotlib Figure (requires [plot] extra)

Design decisions

Why R2VF and not generalised fused lasso directly? GFL with all-pairs penalties is O(K²) in the number of levels: for 500 vehicle makes, that's 500 × 499 / 2 = 124,750 penalty terms. R2VF reduces this to O(K) by using the Step 1 ranking to impose an ordering, then running a standard one-dimensional fused lasso with only K − 1 adjacent-pair terms.
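The penalty-count arithmetic for K = 500:

```python
# Number of penalty terms for K = 500 levels
K = 500
all_pairs = K * (K - 1) // 2   # generalised fused lasso: every pair of levels
adjacent = K - 1               # R2VF: only adjacent pairs in the ranked order
print(all_pairs, adjacent)     # 124750 499
```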

Why sklearn Lasso for the fusion step, not statsmodels? Statsmodels has no fused lasso support (its fit_regularized handles only plain elastic-net penalties). The split-coding trick converts the fused lasso into a standard L1 problem on a transformed design matrix, which sklearn Lasso solves efficiently via coordinate descent. The regression target is exposure-adjusted (y/exposure, with exposure as sample weights) to approximate the Poisson log-likelihood within sklearn's Gaussian-only Lasso.
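A sketch of that exposure-weighted approximation on toy data (the library's actual preprocessing may differ):

```python
# Gaussian Lasso approximating a Poisson frequency fit:
# regress observed frequency claims/exposure on X, weighting by exposure.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 3))
exposure = rng.uniform(0.2, 1.0, size=n)       # earned years per policy
lam = 0.2 * np.exp(0.5 * X[:, 0])              # true frequency: only X0 matters
claims = rng.poisson(lam * exposure)

freq = claims / exposure                        # exposure-adjusted target
fit = Lasso(alpha=0.001).fit(X, freq, sample_weight=exposure)
print(fit.coef_)                                # X0 effect recovered, rest near 0
```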

Why BIC for lambda selection? Cross-validation on insurance data is methodologically awkward: policies across years are correlated, and CV folds will contain leakage from multi-year policyholders. BIC selects a lambda that balances fit and complexity in-sample, which is appropriate when the goal is factor grouping rather than held-out prediction.
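The mechanics of BIC selection over a lambda grid, sketched on toy Gaussian data (the library applies the same idea to the fused lasso path):

```python
# BIC selection over a lambda grid: penalise in-sample fit by model complexity.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)          # one real effect, nine noise

best = None
for alpha in np.logspace(-3, 0, 20):
    fit = Lasso(alpha=alpha).fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    k = np.count_nonzero(fit.coef_) + 1          # free parameters (+ intercept)
    bic = n * np.log(rss / n) + k * np.log(n)    # Gaussian BIC
    if best is None or bic < best[0]:
        best = (bic, alpha, fit)

print(f"selected alpha = {best[1]:.4f}")
```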

Why is the refit step separate from fit()? Actuaries need to review the groupings before committing to a refit. The level_map() output is designed for this: you can inspect, challenge, and manually adjust the groups before running refit_glm(). Keeping the steps separate also means the clustering output is backend-agnostic.

References

  • Ben Dror, I. (2025). Variable Fusion for Insurance Pricing: R2VF Algorithm. arXiv:2503.01521.
  • Tibshirani, R. J., & Taylor, J. (2011). The solution path of the generalized lasso. Annals of Statistics, 39(3), 1335–1371.
