Nested GLM with neural network entity embeddings and spatially constrained territory clustering for insurance ratemaking

insurance-nested-glm

GLM ratemaking is well understood. The problem is what to do with the variables that don't fit cleanly into it: vehicle make/model has thousands of levels, postcode sector has even more, and the standard GLM response — group them or drop them — throws away real signal.

This library implements the nested GLM framework from Wang, Shi, Cao (NAAJ 2025). The idea is a four-phase pipeline:

  1. Fit a base GLM on the structured factors you trust (age band, NCB, vehicle group, etc.).
  2. Train a shallow neural network with entity embeddings to encode the high-cardinality categoricals. The base GLM log-prediction enters as an offset — the network learns a correction, not a replacement.
  3. Cluster spatial units (postcode sectors, output areas) into territory bands using spatially constrained clustering (SKATER), fed by the learned embedding coordinates. Every territory is geographically contiguous by construction.
  4. Fit an outer GLM on the structured factors, the embedding vectors (as continuous regressors), and the territory fixed effect. The result is a GLM you can read — relativities table, deviance, AIC — not a black box.

The full cycle runs in a single pipeline.fit() call.

Install

Core (requires PyTorch and statsmodels):

pip install insurance-nested-glm

With spatial clustering (geopandas, libpysal, spopt):

pip install insurance-nested-glm[spatial]

With plotting:

pip install insurance-nested-glm[plot]

Everything:

pip install insurance-nested-glm[all]

Quick start

import pandas as pd
import numpy as np
from insurance_nested_glm import NestedGLMPipeline

# policies: one row per policy
df = pd.read_parquet("policies.parquet")
y = df["claim_count"].to_numpy()
exposure = df["earned_exposure"].to_numpy()

pipeline = NestedGLMPipeline(
    base_formula="age_band + ncb + vehicle_group",
    family="poisson",
    n_territories=200,
    min_territory_exposure=500,
    embedding_epochs=50,
)

pipeline.fit(
    df,
    y,
    exposure,
    high_card_cols=["vehicle_make_model"],
    base_formula_cols=["age_band", "ncb", "vehicle_group"],
)

# Multiplicative relativities — readable like a standard GLM
print(pipeline.relativities())

# Predictions
pred = pipeline.predict(df, exposure)

With spatial clustering

import geopandas as gpd

# geo_gdf: one row per postcode sector with polygon geometries
geo_gdf = gpd.read_file("postcode_sectors.gpkg")

pipeline.fit(
    df,
    y,
    exposure,
    geo_gdf=geo_gdf,
    geo_id_col="postcode_sector",
    high_card_cols=["vehicle_make_model"],
    base_formula_cols=["age_band", "ncb"],
)

fig = pipeline.plot_territories(geo_gdf, geo_id_col="postcode_sector")
fig.savefig("territories.png", dpi=150)

API

NestedGLMPipeline

The main entry point. Parameters:

Parameter                Default     Notes
-----------------------  ----------  ------------------------------------------------------
base_formula             None        Patsy RHS formula for the structured base GLM
family                   'poisson'   'poisson' or 'gamma'
n_territories            200         Target territory count
min_territory_exposure   None        Credibility filter: territories below this exposure are merged
embedding_epochs         50          Training epochs for the embedding network
embedding_hidden_sizes   (64,)       Dense layer sizes in the embedding net
embedding_lr             1e-3        Adam learning rate
cluster_method           'skater'    'skater' or 'maxp'

EmbeddingTrainer

If you want to use the embedding step in isolation:

from insurance_nested_glm import EmbeddingTrainer

trainer = EmbeddingTrainer(
    cat_cols=["vehicle_make_model"],
    epochs=50,
    hidden_sizes=(64, 32),
)
trainer.fit(df, y, exposure=exposure, offset=base_log_pred)

# Dense vectors, shape (n, total_embedding_dim)
emb = trainer.transform(df)

# Dict of DataFrames, one per categorical column: level → embedding coordinates
frames = trainer.get_embedding_frame()
print(frames["vehicle_make_model"].head())

Embedding dimension defaults to min(50, ceil(n_levels / 2)) per column. Override with embedding_dims={"vehicle_make_model": 20}.
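As an illustration (not the library's source), the default-width rule can be written as a one-line helper:

```python
import math

def default_embedding_dim(n_levels: int, cap: int = 50) -> int:
    """Default embedding width: min(cap, ceil(n_levels / 2))."""
    return min(cap, math.ceil(n_levels / 2))

# A 30-level factor gets 15 dimensions; a 4,000-level one is capped at 50.
print(default_embedding_dim(30))    # 15
print(default_embedding_dim(4000))  # 50
```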

TerritoryClusterer

from insurance_nested_glm import TerritoryClusterer

tc = TerritoryClusterer(n_clusters=200, min_exposure=500, method="skater")
tc.fit(geo_gdf, feature_cols=["emb_0", "emb_1", ...], exposure=unit_exposure)

# pd.Series of 1-indexed territory labels, aligned with geo_gdf
print(tc.labels_)

Island handling: disconnected components in the adjacency graph (Channel Islands, Isle of Man, Orkney, Shetland) are detected automatically and clustered independently.
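The idea behind island detection can be sketched with a plain breadth-first search over an adjacency dict; the library itself operates on the spatial weights graph, but the component logic is the same:

```python
def connected_components(adjacency):
    """Label each node with a component id via BFS over an adjacency dict."""
    labels, next_label = {}, 0
    for start in adjacency:
        if start in labels:
            continue
        queue = [start]
        labels[start] = next_label
        while queue:
            node = queue.pop()
            for nbr in adjacency[node]:
                if nbr not in labels:
                    labels[nbr] = next_label
                    queue.append(nbr)
        next_label += 1
    return labels

# Mainland sectors form one component; an island pair forms another,
# so it gets clustered on its own.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "IOM1": ["IOM2"], "IOM2": ["IOM1"]}
labels = connected_components(adj)
print(labels)  # {'A': 0, 'B': 0, 'C': 0, 'IOM1': 1, 'IOM2': 1}
```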

NestedGLM

The outer GLM, available separately:

from insurance_nested_glm import NestedGLM

glm = NestedGLM(family="poisson", formula="age_band + ncb")
glm.fit(X_with_embeddings_and_territory, y, exposure)

print(glm.relativities())
print(glm.aic(), glm.bic())

Utility functions

from insurance_nested_glm import credibility_report, build_adjacency

# Exposure / claims summary per territory
report = credibility_report(labels, exposure, claims)

# Build Queen contiguity weights directly
w = build_adjacency(geo_gdf)
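For intuition, a per-territory exposure/claims summary of this shape can be sketched with plain-Python aggregation. `credibility_summary` below is a hypothetical stand-in, not the packaged function:

```python
from collections import defaultdict

def credibility_summary(labels, exposure, claims):
    """Aggregate exposure and claim counts per territory label (illustrative)."""
    totals = defaultdict(lambda: [0.0, 0])
    for lab, exp_, clm in zip(labels, exposure, claims):
        totals[lab][0] += exp_
        totals[lab][1] += clm
    return {lab: {"exposure": e, "claims": c, "frequency": c / e}
            for lab, (e, c) in totals.items()}

summary = credibility_summary([1, 1, 2], [100.0, 300.0, 250.0], [2, 6, 5])
print(summary[1])  # {'exposure': 400.0, 'claims': 8, 'frequency': 0.02}
```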

Design notes

Why CANN-style offset? The embedding network takes the base GLM log-prediction as an offset. This means the structured factors are not re-learned from scratch — the network only corrects what the GLM misses. It trains faster and is less prone to overfitting on the high-cardinality features.
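A toy numpy illustration of that composition (all values made up): the network's output is an additive correction on the log scale, so the final Poisson mean is the base GLM mean rescaled by exp(correction).

```python
import numpy as np

exposure = np.array([1.0, 0.5, 2.0])
base_rate = np.array([0.08, 0.12, 0.05])     # base GLM frequency per unit exposure
nn_correction = np.array([0.0, 0.2, -0.1])   # learned log-scale adjustment

# log(mu) = log(exposure) + log(base GLM rate) + network correction
log_mu = np.log(exposure) + np.log(base_rate) + nn_correction
mu = np.exp(log_mu)  # == exposure * base_rate * exp(nn_correction)

# A zero correction recovers the base GLM prediction (up to roundoff).
print(mu)
```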

Why SKATER for territories? SKATER (Spatial K-luster Analysis by Tree Edge Removal) builds a minimum spanning tree over the spatial units and prunes edges to form k subtrees. Every territory is a connected subgraph, which is a regulatory requirement in UK motor pricing. MaxP is available as an alternative for threshold-based approaches.
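The mechanics can be sketched on a toy graph. This is a simplified stand-in: it cuts the k-1 heaviest MST edges by raw weight, whereas real SKATER chooses cuts by a within-cluster heterogeneity criterion. Either way, every resulting cluster is a connected subgraph.

```python
def skater_sketch(nodes, edges, k):
    """Toy SKATER: Kruskal MST, cut the k-1 heaviest MST edges,
    return contiguous cluster labels. Edges are (weight, u, v)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):          # Kruskal: cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))

    kept = sorted(mst)[:len(mst) - (k - 1)]  # drop the k-1 heaviest MST edges
    adj = {n: [] for n in nodes}
    for _, u, v in kept:
        adj[u].append(v)
        adj[v].append(u)

    labels, cluster = {}, 0                # label each remaining component
    for start in nodes:
        if start in labels:
            continue
        stack = [start]
        labels[start] = cluster
        while stack:
            for nbr in adj[stack.pop()]:
                if nbr not in labels:
                    labels[nbr] = cluster
                    stack.append(nbr)
        cluster += 1
    return labels

nodes = ["a", "b", "c", "d"]
edges = [(1.0, "a", "b"), (5.0, "b", "c"), (1.0, "c", "d"), (4.0, "a", "d")]
print(skater_sketch(nodes, edges, k=2))  # {'a': 0, 'b': 0, 'c': 1, 'd': 1}
```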

Why statsmodels for the outer GLM? Because pricing teams need the coefficient table, standard errors, LRT results, and AIC. A sklearn model gives you none of that. The outer GLM wraps statsmodels.formula.api.glm and exposes relativities() in multiplicative form.
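For example, under a log link each coefficient exponentiates to a multiplicative relativity. The coefficient values below are invented for illustration:

```python
import math

# A log-link GLM coefficient table converted to multiplicative relativities.
coefs = {
    "Intercept": -2.5,
    "age_band[T.17-21]": 0.45,
    "ncb[T.5+]": -0.30,
}
relativities = {name: math.exp(beta) for name, beta in coefs.items()}

# A +0.45 coefficient is roughly a 1.57x loading; -0.30 is about a 0.74x discount.
print(round(relativities["age_band[T.17-21]"], 2))  # 1.57
print(round(relativities["ncb[T.5+]"], 2))          # 0.74
```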

Embedding dimension heuristic. min(50, ceil(n_levels / 2)) follows the standard entity embedding rule of thumb from Guo and Berkhahn (2016). Override it if you have a reason to.

References

Wang R, Shi H, Cao J (2025). A Nested GLM Framework with Neural Network Encoding and Spatially Constrained Clustering in Non-Life Insurance Ratemaking. North American Actuarial Journal, 29(3).

Guo C, Berkhahn F (2016). Entity Embeddings of Categorical Variables. arXiv:1604.06737.

Asselman D, Schelldorfer J, Wüthrich MV (2022). CANN: Combined Actuarial Neural Networks. SSRN.

License

MIT. See LICENSE.
