LDA-based probabilistic risk profiling for insurance portfolios
insurance-lda-risk
What problem does this solve for a UK pricing actuary?
Your GLM has 400 rating cells. Your portfolio has 60,000 policies. You have a reasonable fit for the policies you understand, but you have no clean answer to these questions:
- What kind of risk is this book, really? Not "vehicle group B, area suburban, age 36-50" — but in broad terms, how many distinct risk archetypes exist, and what mix does your portfolio have?
- Is the composition shifting? Renewals, aggregator volumes, and underwriting appetite changes all move the mix. If your 2024 book has proportionally more high-risk policies than 2023, a flat renewal rate change will under-price it.
- What makes a good book-transfer candidate? When you're pricing a TPA transfer or a binder, you want to know whether the incoming book looks like your existing portfolio or a different risk universe.
Standard tools — GLM cells, k-means, decision trees — all give hard cluster assignments. A policy is either in segment 3 or it is not. Real portfolios don't work that way. A young driver with a group A vehicle and low mileage is mostly low-risk but partly high-risk. Soft membership matters.
This library applies Latent Dirichlet Allocation to tabular insurance data, following Jamotton & Hainaut (2024). Each policy gets a probability vector across K latent risk profiles (topics). You can use that vector as features for downstream models, as a drift signal, or as a segmentation for actuarial review.
How it works
LDA was designed for text. The analogy to insurance is direct:
| NLP concept | Insurance equivalent |
|---|---|
| Corpus | Portfolio (D × V matrix) |
| Document | Policy |
| Word | A specific modality value (e.g. area=urban, age_band=17-25) |
| Topic | Latent risk profile |
| Document-topic distribution θ_d | Policy's soft membership across K risk profiles |
| Topic-word distribution β_k | Risk profile's characteristic modality mix |
Each policy gets a vector θ_d summing to 1. θ_d = [0.8, 0.15, 0.05] means "80% like profile 0, 15% like profile 1, 5% like profile 2".
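To make the soft-membership idea concrete, here is a minimal sketch in plain Python. The θ values are made up for illustration and are not library output, and the helper names `dominant_profile` and `portfolio_mix` are hypothetical. It reads off the dominant profile per policy and averages the rows to get a portfolio-level mix.

```python
# Illustrative soft memberships for three policies over K = 3 profiles.
# These numbers are invented for the example; they are not library output.
theta = [
    [0.80, 0.15, 0.05],
    [0.10, 0.30, 0.60],
    [0.40, 0.45, 0.15],
]

def dominant_profile(theta_d):
    """Index of the profile with the largest membership weight."""
    return max(range(len(theta_d)), key=lambda k: theta_d[k])

def portfolio_mix(theta):
    """Average membership per profile across all policies."""
    n = len(theta)
    n_profiles = len(theta[0])
    return [sum(row[k] for row in theta) / n for k in range(n_profiles)]

dominant = [dominant_profile(row) for row in theta]
mix = portfolio_mix(theta)
```

The third policy shows why the hard assignment loses information: its dominant profile is 1, but profile 0 carries almost as much weight.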
Inference uses sklearn's online variational Bayes (Hoffman, Bach & Blei 2010).
Reference: Jamotton, C. & Hainaut, D. (2024). Topic Modelling for Insurance Losses. LIDAM Discussion Paper ISBA 2024/008, UCLouvain. https://dial.uclouvain.be/pr/boreal/object/boreal:285770
Installation
pip install insurance-lda-risk
Dependencies: scikit-learn, scipy, numpy, pandas, matplotlib.
Quick start
import pandas as pd
from insurance_lda_risk import InsuranceLDAEncoder, LDARiskProfiler, TopicValidator
# 1. Encode portfolio to (D x V) sparse matrix
# df is a pandas DataFrame with one row per policy
enc = InsuranceLDAEncoder()
X = enc.fit_transform(
    df,
    cat_cols=["vehicle_group", "area", "age_band", "ncb_band"],
    cont_cols=["vehicle_age", "annual_mileage"],
    n_bins=10,
)
# 2. Fit LDA and get soft memberships
profiler = LDARiskProfiler(n_topics=8, random_state=42)
theta = profiler.fit_transform(X) # shape (n_policies, 8)
# 3. Validate topics against claims
validator = TopicValidator(distribution="poisson")
result = validator.validate(theta, y_claims=df["n_claims"], exposure=df["exposure"])
print(result.summary)
# claim_frequency total_exposure total_claims pct_policies
# topic
# 0 0.041234 12450.23 512.34 23.14
# 1 0.072891 8234.11 600.12 15.43
# ...
result.plot_frequencies()
API reference
InsuranceLDAEncoder
Converts a portfolio DataFrame to a (D × V) sparse count matrix.
enc = InsuranceLDAEncoder(missing_as_modality=True)
enc.fit(df, cat_cols, cont_cols=None, n_bins=10)
X = enc.transform(df) # scipy.sparse.csr_matrix (D, V)
X = enc.fit_transform(df, ...) # fit + transform in one call
enc.vocabulary_ # dict: "variable__modality" -> index
enc.feature_names_ # list of vocabulary terms
enc.variable_ranges_ # dict: variable -> list of modalities
enc.decode_topic(weights, top_n=10) # explain a topic in terms of modalities
Continuous variables are discretised into equal-frequency bins. Missing values become the __MISSING__ modality by default, so you do not lose policies with sparse covariate data.
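The encoding step can be sketched in plain Python as follows. This is an illustration of the idea, not the library's implementation: the functions `quantile_edges`, `bin_label`, and `encode`, and the `bin0`/`bin1` labels, are hypothetical. Only the `variable__modality` token format and the `__MISSING__` modality come from the description above.

```python
MISSING = "__MISSING__"

def quantile_edges(values, n_bins):
    """Interior cut points giving roughly equal-frequency bins."""
    xs = sorted(v for v in values if v is not None)
    return [xs[(len(xs) * i) // n_bins] for i in range(1, n_bins)]

def bin_label(value, edges):
    """Map a continuous value to a bin label, or MISSING if absent."""
    if value is None:
        return MISSING
    idx = sum(value >= e for e in edges)  # number of cut points at or below value
    return f"bin{idx}"

def encode(policies, cat_cols, cont_cols, n_bins):
    """Return (vocabulary, rows); each row maps token index -> count."""
    edges = {c: quantile_edges([p.get(c) for p in policies], n_bins) for c in cont_cols}
    vocab, rows = {}, []
    for p in policies:
        tokens = [f"{c}__{p[c] if p.get(c) is not None else MISSING}" for c in cat_cols]
        tokens += [f"{c}__{bin_label(p.get(c), edges[c])}" for c in cont_cols]
        row = {}
        for t in tokens:
            j = vocab.setdefault(t, len(vocab))  # assign the next free index
            row[j] = row.get(j, 0) + 1
        rows.append(row)
    return vocab, rows

policies = [
    {"area": "urban", "annual_mileage": 4000},
    {"area": "rural", "annual_mileage": 12000},
    {"area": None, "annual_mileage": 8000},
    {"area": "urban", "annual_mileage": None},
]
vocab, rows = encode(policies, ["area"], ["annual_mileage"], n_bins=2)
```

Every policy contributes exactly one token per variable, so each row sums to the number of variables. Policies with missing values still produce a full row.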
LDARiskProfiler
profiler = LDARiskProfiler(
    n_topics=8,
    alpha=None,        # Dirichlet prior on policy-topic dist (default: 1/K)
    eta=None,          # Dirichlet prior on topic-modality dist (default: 1/K)
    max_iter=50,
    random_state=42,
)
profiler.fit(X)
theta = profiler.transform(X) # np.ndarray (D, K)
theta = profiler.fit_transform(X)
profiler.components_ # (K, V) unnormalised β from sklearn
profiler.topic_modality_dist_ # (K, V) normalised β
profiler.perplexity_ # float (lower is better, but use deviance for K selection)
profiler.get_dominant_topic(theta) # (D,) argmax topic per policy
profiler.top_modalities_per_topic(enc.feature_names_, top_n=10)
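exposure_weighted=True is planned for v0.2. It will weight each policy's contribution to inference by its exposure e_d, following the modified γ update: γ_{d,k} = α + e_d · Σ_v n_{d,v} ϕ_{d,v,k}, where n_{d,v} is the count of modality v in policy d and ϕ_{d,v,k} is the variational probability that modality v in policy d is assigned to profile k.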
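As a rough illustration of the γ update quoted above, the sketch below evaluates it for one policy in plain Python. The function name `gamma_update` and the toy values are assumptions for the example. The feature is not in v0.1, so this is not the library's code.

```python
def gamma_update(alpha, exposure, counts, phi):
    """gamma_{d,k} = alpha + e_d * sum_v n_{d,v} * phi_{d,v,k} for one policy d.

    counts: n_{d,v} for each vocabulary entry v
    phi:    phi[v][k], with each row summing to 1 across topics
    """
    n_topics = len(phi[0])
    return [
        alpha + exposure * sum(n_v * phi_v[k] for n_v, phi_v in zip(counts, phi))
        for k in range(n_topics)
    ]

counts = [1, 0, 1]                           # two active modalities out of V = 3
phi = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # toy responsibilities, K = 2

gamma_full_year = gamma_update(alpha=0.5, exposure=1.0, counts=counts, phi=phi)
gamma_one_month = gamma_update(alpha=0.5, exposure=1 / 12, counts=counts, phi=phi)
```

With exposure 1/12 the data term shrinks and γ stays close to the prior α, so a one-month policy moves the fitted profiles far less than a full-year one.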
TopicValidator
validator = TopicValidator(distribution="poisson") # or 'binomial'
result = validator.validate(theta, y_claims, exposure=None)
result.deviance # float
result.null_deviance # float
result.deviance_reduction # float, 1 - deviance/null_deviance
result.summary # pd.DataFrame with per-topic stats
result.plot_frequencies() # bar chart of claim frequency by topic
The deviance metric is Poisson deviance (or Binomial cross-entropy for binary outcomes). The null model is the portfolio mean frequency applied uniformly. A positive deviance reduction means the topics are discriminating claim experience.
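The deviance comparison can be sketched as below. Two things here are assumptions rather than documented behaviour: per-topic frequencies are computed by weighting each policy's claims and exposure by θ, and a policy's expected frequency is its θ-weighted mix of topic frequencies. The function names are hypothetical and the data is a toy example.

```python
import math

def poisson_deviance(y, mu):
    """Total Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        term = y_i * math.log(y_i / mu_i) if y_i > 0 else 0.0
        total += term - (y_i - mu_i)
    return 2.0 * total

def topic_frequencies(theta, y, exposure):
    """Soft-weighted claim frequency per topic (assumed weighting scheme)."""
    n_topics = len(theta[0])
    claims = [sum(t[k] * y_i for t, y_i in zip(theta, y)) for k in range(n_topics)]
    expo = [sum(t[k] * e_i for t, e_i in zip(theta, exposure)) for k in range(n_topics)]
    return [c / e for c, e in zip(claims, expo)]

theta = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [0, 0, 1, 2]
exposure = [1.0, 1.0, 1.0, 1.0]

# Null model: portfolio mean frequency applied to every policy.
mean_freq = sum(y) / sum(exposure)
mu_null = [mean_freq * e for e in exposure]

# Topic model: each policy's expected claims use its theta-weighted frequency.
freq = topic_frequencies(theta, y, exposure)
mu_topic = [e * sum(t_k * f_k for t_k, f_k in zip(t, freq)) for t, e in zip(theta, exposure)]

null_deviance = poisson_deviance(y, mu_null)
deviance = poisson_deviance(y, mu_topic)
deviance_reduction = 1.0 - deviance / null_deviance
```

This toy calculation is in-sample. For choosing K, the held-out version is the one that matters.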
TopicSelector
selector = TopicSelector(k_range=range(2, 21), cv=5, distribution="poisson")
optimal_k = selector.select(X, y_claims=y, exposure=exp)
selector.optimal_k_ # int
selector.scores_ # pd.DataFrame: k, mean_deviance, std_deviance
selector.plot_elbow() # elbow curve
Uses held-out Poisson deviance rather than perplexity. Perplexity is the NLP metric; for insurance, you want to know whether more topics improve claim frequency discrimination.
If you do not have claim labels, omit y_claims to fall back to perplexity.
PortfolioDrift
drift = PortfolioDrift(profiler, alert_threshold=0.05)
result = drift.compute_drift(theta_t0, theta_t1, labels=("2023", "2024"))
result.jsd # float [0, 1]: Jensen-Shannon divergence
result.per_topic_shift # pd.Series: t1 - t0 per topic
result.alert # bool: True if jsd > alert_threshold
result.plot_shift() # horizontal bar chart of composition change
# Multi-period drift
df = drift.compute_drift_series(
    [theta_2021, theta_2022, theta_2023, theta_2024],
    labels=["2021", "2022", "2023", "2024"],
)
# Stacked area chart
fig = drift.plot_composition(
    [theta_2021, theta_2022, theta_2023, theta_2024],
    labels=["2021", "2022", "2023", "2024"],
)
JSD is symmetric and, when computed with base-2 logarithms, bounded in [0, 1]. A JSD of 0.05 between consecutive years is a practical alert threshold for UK personal lines. Values above 0.15 indicate a material composition shift that warrants pricing review.
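For readers who want to see the arithmetic, here is a minimal plain-Python sketch of the drift calculation. It assumes each period is summarised by the mean of its θ rows, which is an assumption about how the compositions are aggregated, and the function names are hypothetical. Base-2 logarithms keep the divergence in [0, 1].

```python
import math

def mean_composition(theta):
    """Average membership per topic across policies in one period."""
    n = len(theta)
    return [sum(row[k] for row in theta) / n for k in range(len(theta[0]))]

def kl_bits(p, q):
    """KL divergence in bits; terms with p_k = 0 contribute 0."""
    return sum(p_k * math.log2(p_k / q_k) for p_k, q_k in zip(p, q) if p_k > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(p_k + q_k) / 2 for p_k, q_k in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Toy memberships for two periods (two policies each, K = 3).
theta_t0 = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
theta_t1 = [[0.3, 0.3, 0.4], [0.2, 0.2, 0.6]]

p, q = mean_composition(theta_t0), mean_composition(theta_t1)
divergence = jsd(p, q)
per_topic_shift = [q_k - p_k for p_k, q_k in zip(p, q)]  # t1 - t0 per topic
alert = divergence > 0.05
```

The per-topic shifts always sum to zero, because both compositions sum to 1. A rise in one profile has to be offset by falls elsewhere.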
Selecting K
There is no ground truth K for a portfolio. The right approach:
- Run TopicSelector over a plausible range (e.g. K = 2 to 20).
- Look at the deviance elbow: the K after which additional topics produce diminishing claim frequency discrimination.
- Check that topics are interpretable: use top_modalities_per_topic to read the dominant modalities for each topic. If topic 3 is "young, urban, high vehicle group" and topic 7 is "rural, mature, low vehicle group", the topics are meaningful.
- Prefer smaller K if the elbow is ambiguous. Ten interpretable topics beat twenty noisy ones.
Jamotton & Hainaut (2024) found K = 10 optimal for a 62,000-policy Swedish motorcycle portfolio.
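One way to make the elbow step less subjective is a simple relative-gain rule, sketched below. The function `pick_elbow`, the 1% threshold, and the deviance values are all hypothetical. This is one heuristic among several and not necessarily what TopicSelector does internally.

```python
def pick_elbow(scores, min_gain=0.01):
    """Smallest K after which the relative deviance gain drops below min_gain.

    scores: list of (k, mean_deviance) pairs, sorted by k.
    """
    for (k, dev), (_, next_dev) in zip(scores, scores[1:]):
        relative_gain = (dev - next_dev) / dev
        if relative_gain < min_gain:
            return k
    return scores[-1][0]  # no elbow found inside the range

# Made-up held-out deviances, for illustration only.
scores = [(2, 1.000), (4, 0.930), (6, 0.890), (8, 0.885), (10, 0.884)]
chosen_k = pick_elbow(scores, min_gain=0.01)
```

The rule returns the smaller K when the curve flattens, which matches the preference for fewer, interpretable topics. It does not replace reading the topics themselves.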
Worked example
See notebooks/insurance_lda_risk_demo.ipynb for a complete walkthrough on a synthetic UK motor portfolio, including topic interpretation, K selection, and multi-year drift analysis.
Limitations
- Mutual exclusivity: Standard LDA ignores the fact that modalities within a single variable are mutually exclusive (a policy cannot be simultaneously in age band 17-25 and 36-50). The paper acknowledges this and uses standard LDA anyway because it is computationally tractable and empirically effective.
- Exposure weighting: v0.1 does not weight policies by exposure. A 3-year fleet policy and a 1-month private car policy contribute equally to the inference. Exposure weighting is planned for v0.2.
- Topic instability: LDA results depend on the random seed and can vary across runs. Fix random_state for reproducibility, and consider running PortfolioDrift across multiple seeds to assess stability.
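When comparing runs across seeds, keep in mind that topic indices are arbitrary: topic 0 under one seed may correspond to topic 5 under another. The sketch below shows one possible way to pair topics before comparing them, using greedy matching on cosine similarity of the topic-modality rows. The function names and the toy matrices are hypothetical, and this matching is an illustration rather than a library feature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def align_topics(beta_a, beta_b):
    """Greedily pair each topic in run A with its most similar unused topic in run B."""
    pairs = sorted(
        ((cosine(a, b), i, j) for i, a in enumerate(beta_a) for j, b in enumerate(beta_b)),
        reverse=True,
    )
    used_a, used_b, mapping = set(), set(), {}
    for sim, i, j in pairs:
        if i not in used_a and j not in used_b:
            mapping[i] = (j, sim)  # topic i in run A matches topic j in run B
            used_a.add(i)
            used_b.add(j)
    return mapping

# Two made-up topic-modality matrices (K = 2, V = 3) from different seeds.
beta_seed1 = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
beta_seed2 = [[0.15, 0.25, 0.6], [0.65, 0.25, 0.1]]
mapping = align_topics(beta_seed1, beta_seed2)
```

Low similarity for a matched pair is itself a warning sign: that profile did not reproduce across seeds and should be interpreted with caution.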
Licence
MIT