Nested GLM with neural network entity embeddings and spatially constrained territory clustering for insurance ratemaking
Project description
insurance-nested-glm
GLM ratemaking is well understood. The problem is what to do with the variables that don't fit cleanly into it: vehicle make/model has thousands of levels, postcode sector has even more, and the standard GLM response — group them or drop them — throws away real signal.
This library implements the nested GLM framework from Wang, Shi, Cao (NAAJ 2025). The idea is a four-phase pipeline:
- Fit a base GLM on the structured factors you trust (age band, NCD, vehicle group, etc.).
- Train a shallow neural network with entity embeddings to encode the high-cardinality categoricals. The base GLM log-prediction enters as an offset — the network learns a correction, not a replacement.
- Cluster spatial units (postcode sectors, output areas) into territory bands using spatially constrained clustering (SKATER), fed by the learned embedding coordinates. Every territory is geographically contiguous by construction.
- Fit an outer GLM on the structured factors, the embedding vectors (as continuous regressors), and the territory fixed effect. The result is a GLM you can read — relativities table, deviance, AIC — not a black box.
The full cycle runs in a single pipeline.fit() call.
Install
Core (requires PyTorch and statsmodels):
pip install insurance-nested-glm
With spatial clustering (geopandas, libpysal, spopt):
pip install insurance-nested-glm[spatial]
With plotting:
pip install insurance-nested-glm[plot]
Everything:
pip install insurance-nested-glm[all]
Quick start
import pandas as pd
import numpy as np
from insurance_nested_glm import NestedGLMPipeline
# policies: one row per policy
df = pd.read_parquet("policies.parquet")
y = df["claim_count"].to_numpy()
exposure = df["earned_exposure"].to_numpy()
pipeline = NestedGLMPipeline(
base_formula="age_band + ncb + vehicle_group",
family="poisson",
n_territories=200,
min_territory_exposure=500,
embedding_epochs=50,
)
pipeline.fit(
df,
y,
exposure,
high_card_cols=["vehicle_make_model"],
base_formula_cols=["age_band", "ncb", "vehicle_group"],
)
# Multiplicative relativities — readable like a standard GLM
print(pipeline.relativities())
# Predictions
pred = pipeline.predict(df, exposure)
With spatial clustering
import geopandas as gpd
# geo_gdf: one row per postcode sector with polygon geometries
geo_gdf = gpd.read_file("postcode_sectors.gpkg")
pipeline.fit(
df,
y,
exposure,
geo_gdf=geo_gdf,
geo_id_col="postcode_sector",
high_card_cols=["vehicle_make_model"],
base_formula_cols=["age_band", "ncb"],
)
fig = pipeline.plot_territories(geo_gdf, geo_id_col="postcode_sector")
fig.savefig("territories.png", dpi=150)
API
NestedGLMPipeline
The main entry point. Parameters:
| Parameter | Default | Notes |
|---|---|---|
base_formula |
None |
Patsy rhs formula for structured base GLM |
family |
'poisson' |
'poisson' or 'gamma' |
n_territories |
200 |
Target territory count |
min_territory_exposure |
None |
Credibility filter: merge territories below this exposure |
embedding_epochs |
50 |
Training epochs for embedding network |
embedding_hidden_sizes |
(64,) |
Dense layer sizes in embedding net |
embedding_lr |
1e-3 |
Adam learning rate |
cluster_method |
'skater' |
'skater' or 'maxp' |
EmbeddingTrainer
If you want to use the embedding step in isolation:
from insurance_nested_glm import EmbeddingTrainer
trainer = EmbeddingTrainer(
cat_cols=["vehicle_make_model"],
epochs=50,
hidden_sizes=(64, 32),
)
trainer.fit(df, y, exposure=exposure, offset=base_log_pred)
# Dense vectors, shape (n, total_embedding_dim)
emb = trainer.transform(df)
# DataFrame: category → embedding coordinates, one per col
frames = trainer.get_embedding_frame()
print(frames["vehicle_make_model"].head())
Embedding dimension defaults to min(50, ceil(n_levels / 2)) per column. Override with embedding_dims={"vehicle_make_model": 20}.
TerritoryClusterer
from insurance_nested_glm import TerritoryClusterer
tc = TerritoryClusterer(n_clusters=200, min_exposure=500, method="skater")
tc.fit(geo_gdf, feature_cols=["emb_0", "emb_1", ...], exposure=unit_exposure)
# pd.Series of 1-indexed territory labels, aligned with geo_gdf
print(tc.labels_)
Island handling: disconnected components in the adjacency graph (Channel Islands, Isle of Man, Orkney, Shetland) are detected automatically and clustered independently.
NestedGLM
The outer GLM, available separately:
from insurance_nested_glm import NestedGLM
glm = NestedGLM(family="poisson", formula="age_band + ncb")
glm.fit(X_with_embeddings_and_territory, y, exposure)
print(glm.relativities())
print(glm.aic(), glm.bic())
Utility functions
from insurance_nested_glm import credibility_report, build_adjacency
# Exposure / claims summary per territory
report = credibility_report(labels, exposure, claims)
# Build Queen contiguity weights directly
w = build_adjacency(geo_gdf)
Design notes
Why CANN-style offset? The embedding network takes the base GLM log-prediction as an offset. This means the structured factors are not re-learned from scratch — the network only corrects what the GLM misses. It trains faster and is less prone to overfitting on the high-cardinality features.
Why SKATER for territories? SKATER (Spatial K-luster Analysis by Tree Edge Removal) builds a minimum spanning tree over the spatial units and prunes edges to form k subtrees. Every territory is a connected subgraph, which is a regulatory requirement in UK motor pricing. MaxP is available as an alternative for threshold-based approaches.
Why statsmodels for the outer GLM? Because pricing teams need the coefficient table, standard errors, LRT results, and AIC. A sklearn model gives you none of that. The outer GLM wraps statsmodels.formula.api.glm and exposes relativities() in multiplicative form.
Embedding dimension heuristic. min(50, ceil(n_levels / 2)) follows the standard entity embedding rule of thumb from Guo and Berkhahn (2016). Override it if you have a reason to.
References
Wang R, Shi H, Cao J (2025). A Nested GLM Framework with Neural Network Encoding and Spatially Constrained Clustering in Non-Life Insurance Ratemaking. North American Actuarial Journal, 29(3).
Guo C, Berkhahn F (2016). Entity Embeddings of Categorical Variables. arXiv:1604.06737.
Asselman D, Schelldorfer J, Wüthrich M V (2022). CANN: Combined Actuarial Neural Networks. SSRN.
License
MIT. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file insurance_nested_glm-0.1.0.tar.gz.
File metadata
- Download URL: insurance_nested_glm-0.1.0.tar.gz
- Upload date:
- Size: 31.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.8 {"installer":{"name":"uv","version":"0.10.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a8e22d8f49d1562c91a88926ca1e6fa9835dce45845faee5ce739b7bd42cd66e
|
|
| MD5 |
088f1b8a04fe8825866bbbdf7f0d7128
|
|
| BLAKE2b-256 |
a63955b5d130df7b8b81c0a9aec0d256eb168c78c8bc9610dc2c7932ed1078ed
|
File details
Details for the file insurance_nested_glm-0.1.0-py3-none-any.whl.
File metadata
- Download URL: insurance_nested_glm-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.8 {"installer":{"name":"uv","version":"0.10.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
27adc3f4e188a65dc1e2f0fae085c209cca7fff837a7038d4ce4608c7a76b154
|
|
| MD5 |
87b3ecf469a7f77d738079f44ffc9bb5
|
|
| BLAKE2b-256 |
5b68ddcc6dca0212c73b63a6d9c27e08368f64c890fff6c096eac5aa82c48883
|