
Causal inference for insurance pricing: DML, AutoDML via Riesz Representers, causal price elasticity, heterogeneous treatment effects, and post-hoc rate change evaluation.


insurance-causal


Blog post: Causal Price Elasticity for UK Renewal Pricing


Your GLM says price sensitivity is −0.045. DML says it's −0.023. The difference is what gets priced wrong.

The problem

Your GLM coefficient on price change is probably wrong — not because the model is badly built, but because price changes were never randomly assigned.

High-risk customers receive larger premium increases at renewal. Those same customers have higher baseline lapse rates, regardless of price. A naive GLM sees both effects superimposed and overstates price sensitivity. On a typical 50,000-policy UK motor book, the naive estimate is roughly double the true causal effect.

insurance-causal uses Double Machine Learning (Chernozhukov et al. 2018) to strip out the confounding. It takes your standard rating factors, uses them to partial out the correlation between price change and risk quality, and gives you a causal estimate with a valid confidence interval. No randomised trial needed.

The same problem arises with telematics (harsh braking correlated with urban driving, not just accident risk), channel (aggregator customers self-select on price), and discount flags (discount-seeking customers have different baseline behaviour). Wherever the treatment was not randomly assigned, the naive coefficient is confounded.
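The partialling-out mechanism can be seen in a minimal simulation. This is a hedged sketch with a single synthetic confounder (not the library's code, and not the quickstart DGP): risk drives both the price change and the lapse, so the naive slope is inflated; regressing the residual of the outcome on the residual of the treatment recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Toy DGP: one confounder (risk) drives BOTH the price change and the outcome.
risk = rng.normal(0.0, 1.0, n)
price = 0.5 * risk + rng.normal(0.0, 1.0, n)       # riskier customers get bigger increases
true_effect = -0.2
outcome = true_effect * price - 0.4 * risk + rng.normal(0.0, 1.0, n)

def slope(y, d):
    """OLS slope of y on d (with intercept)."""
    d_c = d - d.mean()
    return float(np.dot(d_c, y - y.mean()) / np.dot(d_c, d_c))

# Naive regression: confounded, overstates price sensitivity (~ -0.36 here)
naive = slope(outcome, price)

# Partial the confounder out of both sides, then regress residual on residual
outcome_res = (outcome - outcome.mean()) - slope(outcome, risk) * (risk - risk.mean())
price_res = (price - price.mean()) - slope(price, risk) * (risk - risk.mean())
causal = slope(outcome_res, price_res)             # recovers ~ -0.20
```

The library does the same thing, but with CatBoost in place of the linear partialling step, so nonlinear confounding is absorbed too.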


Install

pip install insurance-causal

Or with uv:

uv add insurance-causal

For causal forest heterogeneous effects and renewal pricing optimisation (requires econml):

pip install "insurance-causal[all]"

Dependency note: The library pins scipy<1.16 because scipy 1.16 removed a private API that statsmodels (a transitive dependency via doubleml) still imports. This constraint will be lifted once statsmodels releases a compatible version. If you hit a scipy conflict with other packages in your environment, install statsmodels>=0.14.4 explicitly first.


Quickstart

from insurance_causal import CausalPricingModel
from insurance_causal.treatments import PriceChangeTreatment
from insurance_causal.elasticity.data import make_renewal_data

df = make_renewal_data(n=10_000, seed=42)  # synthetic UK motor renewal book

model = CausalPricingModel(
    outcome="renewed",
    outcome_type="binary",
    treatment=PriceChangeTreatment(column="log_price_change"),
    confounders=["age", "ncd_years", "vehicle_group", "region", "channel"],
)
model.fit(df)
print(model.average_treatment_effect())

Output (seed=42 makes the run reproducible in a fixed environment; exact figures may vary slightly across scipy/catboost versions):

Average Treatment Effect
  Treatment: log_price_change
  Outcome:   renewed
  Estimate:  -0.2341
  Std Error: 0.0198
  95% CI:    (-0.2729, -0.1952)
  p-value:   0.0000
  N:         10,000

The true semi-elasticity in the synthetic DGP ranges from −1.0 to −3.5 by NCD/age segment. The DML estimate on the log_price_change scale is a population-average quantity rather than a direct recovery of the segment-level DGP elasticities, so an estimate around −0.23 is in the expected region.

The full worked example — with explicit confounding structure, naive GLM comparison, and bias report — is in examples/quickstart.py. Run it with python examples/quickstart.py after installing.


DML vs GLM: what changes and why it matters

| Question | Naive GLM | DML (this library) |
|---|---|---|
| What does the coefficient measure? | Correlation between treatment and outcome | Causal effect of treatment on outcome |
| Does it handle confounding? | Only via variables explicitly included as main effects | Yes — nonlinear confounding absorbed by CatBoost nuisance models |
| Is the confidence interval valid? | Under the GLM distributional assumptions | Yes — frequentist, asymptotically normal |
| Can it detect heterogeneous effects by segment? | Interaction terms (manual, limited) | Causal forest CATEs with formal heterogeneity tests |
| What does it need? | Standard GLM fitting data | Same data; no randomised trial required |
| Fit time at n=50k | <1 second | 5–15 minutes (5-fold cross-fitting) |

The practical implication: on a synthetic UK motor book with realistic confounding, a Poisson GLM price-sensitivity estimate of −0.045 reduces to a DML causal estimate of −0.023 once confounding is removed. Relative to the causal estimate, the GLM number overstates the effect by roughly 96%. The GLM 95% CI does not include the true value; the DML CI does.

This is not an argument against GLMs for risk modelling. For predicting claims, a well-built GLM is excellent. The problem arises specifically when you use a GLM to estimate the effect of a pricing decision, and that pricing decision was correlated with risk quality when it was made.


Part of the Burning Cost stack

Takes observational pricing or claims data — no randomised trial required. Feeds causal elasticity estimates and CATEs into insurance-optimise (segmented rate optimisation) and insurance-fairness (causal bias detection vs correlation-based proxy detection). → See the full stack


When to use this

Renewal pricing elasticity. You want to know how much of the lapse after a rate increase was genuinely caused by price, vs how much would have happened anyway because you increased the riskiest customers the most. The DML estimate gives you a valid causal semi-elasticity for renewal pricing optimisation and FCA PS21/5 ENBP calculations.

Telematics treatment effect. Does harsh braking cause accidents, or is it a proxy for urban driving (which causes accidents)? Fit DML with ContinuousTreatment on the telematics score, controlling for postcode and vehicle age. The result is the causal effect of the score itself, not its correlation with geography.

Channel and campaign effects. Did the aggregator campaign actually increase conversion, or did it attract a different risk mix? Fit DML with BinaryTreatment on the channel flag. The result controls for the systematic differences in who comes via aggregator vs direct.

Post-hoc rate change evaluation. You implemented a 10% increase on motor comprehensive in Q3. Did it reduce loss ratio, and by how much? Use RateChangeEvaluator with DiD (if some segments were untreated) or ITS (if the whole book was treated simultaneously).


Subpackages

CausalPricingModel — core DML estimator

The main class. Wraps DoubleML with CatBoost nuisance models and an actuary-facing interface.

import numpy as np
import polars as pl
from insurance_causal import CausalPricingModel
from insurance_causal.treatments import PriceChangeTreatment

# Synthetic UK motor renewal portfolio — 50,000 policies
rng = np.random.default_rng(42)
n = 50_000
vehicle_age  = rng.integers(1, 15, n)
driver_age   = rng.integers(25, 75, n)
ncb_years    = rng.integers(0, 9, n)
prior_claims = rng.integers(0, 3, n)
age_band     = np.where(driver_age < 35, "young",
               np.where(driver_age < 55, "mid", "senior"))

# Treatment: % price change at renewal (-0.10 to +0.20)
# High-risk policyholders receive larger increases (this is the confounding)
risk_score       = 0.05 * prior_claims - 0.02 * ncb_years + rng.normal(0, 0.1, n)
pct_price_change = 0.05 + 0.3 * risk_score + rng.normal(0, 0.03, n)
pct_price_change = np.clip(pct_price_change, -0.10, 0.20)

# Outcome: renewal indicator. True causal semi-elasticity = -0.023.
# The confounding: risk_score drives both price increases AND lapse,
# so a naive regression will overestimate price sensitivity.
log_odds = (
    0.5
    - 0.023 * np.log1p(pct_price_change)   # causal price effect
    - 0.40  * risk_score                   # risk-driven lapse (the confounder)
    + 0.02  * ncb_years
    + rng.normal(0, 0.05, n)
)
renewal = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

df = pl.DataFrame({
    "pct_price_change": pct_price_change,
    "age_band":         age_band,
    "ncb_years":        ncb_years.astype(float),
    "vehicle_age":      vehicle_age.astype(float),
    "prior_claims":     prior_claims.astype(float),
    "renewal":          renewal,
})

model = CausalPricingModel(
    outcome="renewal",
    outcome_type="binary",
    treatment=PriceChangeTreatment(
        column="pct_price_change",  # proportional change: 0.05 = 5% increase
        scale="log",                # transform to log(1+D); theta is semi-elasticity
    ),
    confounders=["age_band", "ncb_years", "vehicle_age", "prior_claims"],
    cv_folds=5,
)

model.fit(df)  # accepts polars or pandas DataFrame

ate = model.average_treatment_effect()
print(ate)

Output (run on Databricks serverless, 2026-03-19, seed=42, n=50,000):

Average Treatment Effect
  Treatment: pct_price_change
  Outcome:   renewal
  Estimate:  -0.0231
  Std Error: 0.0089
  95% CI:    (-0.0406, -0.0057)
  p-value:   0.0092
  N:         50,000

insurance_causal.autodml — Riesz representer-based continuous treatment estimation

Standard double-ML with continuous treatments requires estimating the generalised propensity score (GPS), which is numerically unstable in renewal portfolios where premium is partially determined by underwriting rules. The Riesz representer approach avoids the GPS entirely via a minimax objective.

import numpy as np
from insurance_causal.autodml import PremiumElasticity, DoseResponseCurve

# Synthetic UK motor portfolio — 5,000 policies
# Treatment D: actual premium charged (continuous, £)
# Outcome Y: claim count (Poisson)
# Confounding: safer drivers tend to be offered lower premiums
rng = np.random.default_rng(42)
n = 5_000

driver_age  = rng.integers(25, 70, n).astype(float)
vehicle_age = rng.integers(1, 12, n).astype(float)
ncb_years   = rng.integers(0, 9, n).astype(float)

# 4-column covariate matrix (3 rating factors + 1 noise column)
X = np.column_stack([driver_age, vehicle_age, ncb_years,
                     rng.standard_normal(n)])

# Treatment: premium charged — correlated with risk (that is the confounding)
risk_score = 0.02 * np.maximum(30 - driver_age, 0) + 0.05 * vehicle_age - 0.08 * ncb_years
D = 400 + 200 * risk_score + rng.normal(0, 40, n)

# Outcome: claim count. True causal effect: each £100 premium increase -> -0.01 on log(lam)
lam = np.exp(-2.0 + 0.3 * risk_score - 0.0001 * D)
Y = rng.poisson(lam).astype(float)
exposure = rng.uniform(0.7, 1.0, n)

# Average Marginal Effect: average d/dD E[Y|D,X]
model = PremiumElasticity(outcome_family="poisson", n_folds=5)
model.fit(X, D, Y, exposure=exposure)
result = model.estimate()
print(result.summary())

# Dose-response curve at specified premium levels
dr = DoseResponseCurve(outcome_family="poisson")
dr.fit(X, D, Y)
curve = dr.predict(d_grid=np.linspace(200, 800, 20))

Estimands: Average Marginal Effect (AME), dose-response curve, policy shift counterfactual, selection-corrected elasticity.

Note: PremiumElasticity estimates an Average Marginal Effect under a nonparametric heterogeneous-effects model. This is a different estimand from the constant treatment effect (theta_0) in the partially linear regression model described in the maths section below. The PLR model assumes homogeneous effects; AME relaxes this by integrating heterogeneous marginal effects over the covariate distribution.


insurance_causal.elasticity — renewal pricing optimisation with ENBP constraint

For UK motor/home renewal teams. Estimates heterogeneous treatment effects (GATE by segment), constructs an elasticity surface over the book, and optimises renewal pricing subject to an ENBP (equivalent new business price) constraint.

from insurance_causal.elasticity import RenewalElasticityEstimator, RenewalPricingOptimiser
from insurance_causal.elasticity.data import make_renewal_data

df = make_renewal_data(n=10_000)
confounders = ["age", "ncd_years", "vehicle_group", "region", "channel"]

est = RenewalElasticityEstimator()
est.fit(df, confounders=confounders)
ate, lb, ub = est.ate()
print(f"ATE: {ate:.3f}  95% CI: [{lb:.3f}, {ub:.3f}]")

opt = RenewalPricingOptimiser(est)
result = opt.optimise(df, budget_constraint_pct=0.0)  # ENBP-neutral

The optimiser is designed to produce pricing structures consistent with the FCA PS21/5 ENBP constraint structure. Regulatory compliance with PS21/5 requires governance, audit trail, Board sign-off, and ongoing monitoring that goes beyond any algorithm alone. Do not treat this output as a substitute for those obligations.

insurance_causal.causal_forest — heterogeneous treatment effect estimation (v0.4.0)

Average treatment effects hide enormous heterogeneity in insurance portfolios. A customer with NCD=0 on a PCW may have 3x the price elasticity of a loyal, NCD=5 direct customer. Using the population ATE to set price changes leaves money on the table — and applying the same discount strategy to all customers can inadvertently discriminate under FCA pricing fairness requirements.

The causal_forest subpackage estimates per-customer conditional average treatment effects (CATEs) using CausalForestDML (Athey, Tibshirani & Wager 2019), with formal HTE inference via the Chernozhukov et al. (2020/2025) framework.

The workflow:

  1. Estimate per-customer CATE with HeterogeneousElasticityEstimator
  2. Test formally that heterogeneity exists using BLP (Best Linear Predictor)
  3. Characterise which segments drive heterogeneity via GATES and CLAN
  4. Evaluate whether the CATE ranking produces a valid targeting rule using RATE/AUTOC

import numpy as np
from insurance_causal.causal_forest import (
    HeterogeneousElasticityEstimator,
    HeterogeneousInference,
    TargetingEvaluator,
    make_hte_renewal_data,
)

# Synthetic UK motor renewal book — 10,000 policies
df = make_hte_renewal_data(n=10_000, seed=42)
confounders = ["age", "ncd_years", "vehicle_group", "channel"]

# Step 1: estimate CATEs
est = HeterogeneousElasticityEstimator(n_estimators=200, catboost_iterations=200)
est.fit(df, outcome="renewed", treatment="log_price_change", confounders=confounders)

ate, lb, ub = est.ate()
print(f"ATE: {ate:.3f}  95% CI: [{lb:.3f}, {ub:.3f}]")

cates = est.cate(df)          # per-customer semi-elasticities
gates = est.gate(df, by="ncd_years")  # group averages by NCD band
# ncd_years | cate   | ci_lower | ci_upper | n
# 0         | -0.312 | -0.401   | -0.223   | 1812
# 5+        | -0.089 | -0.134   | -0.044   | 2108

# Step 2: formal test for heterogeneity (BLP)
inf = HeterogeneousInference(n_splits=100, k_groups=5)
result = inf.run(df, estimator=est, cate_proxy=cates)
print(result.blp.beta_2, result.blp.p_value_beta_2)
# A significantly positive beta_2 (p < 0.05) indicates genuine heterogeneity.
result.plot_gates()  # monotone GATE chart by CATE quintile

# Step 3: does the CATE ranking add targeting value? (RATE)
evaluator = TargetingEvaluator(n_bootstrap=200)
targeting = evaluator.evaluate(df, estimator=est, method="autoc")
print(targeting.rate, targeting.p_value)
# RATE > 0 with p < 0.05: the CATE ranking identifies high-effect customers.
# If not significant, do not use individual CATEs for targeting.
targeting.plot_toc()  # TOC curve with bootstrap band

Key classes:

  • HeterogeneousElasticityEstimator — fits CausalForestDML with CatBoost nuisance models, honest=True (Athey & Imbens 2016), min_samples_leaf=20. Exposes .cate(), .cate_interval(), .ate(), .gate().
  • HeterogeneousInference — BLP, GATES, CLAN via 100 repeated data splits (Chernozhukov et al. 2020/2025). .run() returns a structured result with .summary(), .plot_gates(), .plot_clan().
  • TargetingEvaluator — RATE and AUTOC (Yadlowsky et al. 2025 JASA) with weighted bootstrap SE. Validates whether the CATE ranking is actionable.
  • CausalForestDiagnostics — overlap diagnostics, treatment residual variance check (detects over-partialling), propensity score inspection.

Installation:

uv add "insurance-causal[all]"   # includes econml

When to use: When you want to identify which customer segments respond most to a price change, and you need valid confidence intervals on segment-level effects — not just point estimates from splitting the data. The key questions are: does heterogeneity exist (BLP beta_2 test), which segments drive it (GATES/CLAN), and can you act on it (RATE).

When NOT to use: With fewer than ~5,000 policies in the analysis. Below this, CausalForestDML's honest splitting combined with 5-fold cross-fitting leaves too few training observations per tree, and CATE estimates are unreliable. Use the standard CausalPricingModel for ATE estimation at small n.


The confounding bias report

A pricing team has a GLM coefficient on price change of -0.045. This is the naive estimate: price sensitivity looks very high. They fit DML and get:

report = model.confounding_bias_report(naive_coefficient=-0.045)
  treatment         outcome  naive_estimate  causal_estimate    bias  bias_pct  ...
  pct_price_change  renewal         -0.0450          -0.0230  -0.022     -95.7%

The naive estimate is roughly double the causal effect. The confounding mechanism: high-risk customers receive larger price increases, and those customers have lower baseline renewal rates. The price change is correlated with risk quality, so the naive regression attributes some of the risk-driven lapse to price sensitivity.

The correct causal elasticity is -0.023. Pricing decisions made using -0.045 are wrong.


Treatment types

Price change (continuous)

from insurance_causal.treatments import PriceChangeTreatment

treatment = PriceChangeTreatment(
    column="pct_price_change",   # proportional: 0.05 = 5% increase
    scale="log",                 # "log" or "linear"
    clip_percentiles=(0.01, 0.99),  # optional: clip extreme values
)

Binary treatment (channel, discount flag, product type)

from insurance_causal.treatments import BinaryTreatment

treatment = BinaryTreatment(
    column="is_aggregator",
    positive_label="aggregator",
    negative_label="direct",
)

Generic continuous (telematics score, credit score)

from insurance_causal.treatments import ContinuousTreatment

treatment = ContinuousTreatment(
    column="harsh_braking_score",
    standardise=True,  # coefficient = effect of 1 SD change
)

Outcome types

CausalPricingModel(
    # Choose exactly one outcome_type:
    outcome_type="binary",       # renewal indicator, conversion
    # outcome_type="poisson",    # claim count (divide by exposure if exposure_col set)
    # outcome_type="continuous", # log loss cost, any symmetric continuous outcome
    # outcome_type="gamma",      # claim severity (log-transformed internally)
)

For Poisson frequency, set exposure_col:

model = CausalPricingModel(
    outcome="claim_count",
    outcome_type="poisson",
    exposure_col="earned_years",
    ...
)

CATE by segment

Average treatment effects within subgroups. Fits a separate DML model per segment — computationally expensive but gives segment-level inference.

cate = model.cate_by_segment(df, segment_col="age_band")
# Returns DataFrame: segment, cate_estimate, ci_lower, ci_upper, std_error, p_value, n_obs

Minimum segment size warning: the default min_segment_size is 2,000 observations. Segments below this threshold are marked insufficient_data and skipped. CatBoost at depth 6 will overfit in segments with fewer than roughly 2,000 observations (1,600 training obs per fold at 5-fold CV), producing unreliable point estimates and confidence intervals that are too narrow. If you must analyse small segments, reduce tree depth, use cv_folds=3, and treat the output as exploratory only.

Or by decile of a risk score:

from insurance_causal.diagnostics import cate_by_decile

cate = cate_by_decile(model, df, score_col="predicted_frequency", n_deciles=10)

Causal clustering (v0.5.0)

GATES and cate_by_segment both require you to nominate a segmentation variable upfront. That works when you already know the heterogeneity maps onto age band or NCD group. It does not work when heterogeneity is driven by an interaction — young urban aggregator customers with zero NCD are a very different risk profile from young rural direct customers, and neither "age" nor "channel" alone reveals this.

CausalClusteringAnalyzer uses the causal forest's own kernel to define similarity. Two policyholders are similar if they fall in the same leaf across a large fraction of the forest's trees (the proximity matrix, Wager & Athey 2018). Spectral clustering on this matrix finds subgroups with internally consistent treatment effects, without requiring you to specify which variable drives the heterogeneity. The number of clusters is chosen automatically via eigengap unless you override it.

Per-cluster ATEs use AIPW pseudo-outcomes — doubly-robust: correct if either the outcome model or the propensity model is well-specified. Bootstrap confidence intervals are reported alongside mean CATE per cluster.

from insurance_causal.causal_forest import (
    HeterogeneousElasticityEstimator,
    CausalClusteringAnalyzer,
    make_hte_renewal_data,
)

df = make_hte_renewal_data(n=15_000, seed=42)
confounders = ["age", "ncd_years", "vehicle_group", "channel"]

# Fit the causal forest first — reuse across multiple analyses
est = HeterogeneousElasticityEstimator(n_estimators=200, catboost_iterations=300)
est.fit(df, outcome="renewed", treatment="log_price_change", confounders=confounders)
cates = est.cate(df)

# Find clusters — k chosen automatically via eigengap
ca = CausalClusteringAnalyzer(n_bootstrap=500)
ca.fit(df, estimator=est, cates=cates, confounders=confounders)
print(ca.summary())

Example output (ca.summary() returns a polars DataFrame):

shape: (4, 7)
┌─────────┬───────┬───────────┬───────────┬──────────────┬──────────────┬───────┐
│ cluster ┆ n     ┆ cate_mean ┆ ate_aipw  ┆ ate_ci_lower ┆ ate_ci_upper ┆ share │
│ ---     ┆ ---   ┆ ---       ┆ ---       ┆ ---          ┆ ---          ┆ ---   │
│ i32     ┆ i32   ┆ f64       ┆ f64       ┆ f64          ┆ f64          ┆ f64   │
╞═════════╪═══════╪═══════════╪═══════════╪══════════════╪══════════════╪═══════╡
│ 0       ┆ 4312  ┆ -0.082    ┆ -0.085    ┆ -0.110       ┆ -0.054       ┆ 0.288 │
│ 1       ┆ 3187  ┆ -0.234    ┆ -0.231    ┆ -0.289       ┆ -0.178       ┆ 0.212 │
│ 2       ┆ 3971  ┆ -0.410    ┆ -0.408    ┆ -0.478       ┆ -0.343       ┆ 0.265 │
│ 3       ┆ 3530  ┆ -0.115    ┆ -0.118    ┆ -0.145       ┆ -0.086       ┆ 0.235 │
└─────────┴───────┴───────────┴───────────┴──────────────┴──────────────┴───────┘

Inspect covariate means per cluster to understand what drives the segmentation:

print(ca.profile(df, confounders))

To inspect the eigengap heuristic before committing to a fit, query the suggested k:

k_suggested = ca.suggest_n_clusters(df, estimator=est, cates=cates, confounders=confounders)
print(f"Suggested k: {k_suggested}")

# Override with a specific k if the eigengap is ambiguous
ca_k4 = CausalClusteringAnalyzer(n_clusters=4, n_bootstrap=500)
ca_k4.fit(df, estimator=est, cates=cates, confounders=confounders)

The result also exposes silhouette_score, within_cluster_cate_var, and between_cluster_cate_var on ca._result for cluster quality diagnostics. A high between_cluster_cate_var relative to within_cluster_cate_var means the clusters are genuinely separating the heterogeneity.

Kernel choice. The default kernel_type="forest" uses the causal forest leaf-proximity kernel, which captures heterogeneity structure in the causal feature space. kernel_type="rbf" and kernel_type="linear" operate directly on the confounder matrix and serve as baselines — useful for checking whether the forest kernel is actually adding value over standard demographic segmentation.
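The forest-kernel idea can be sketched with scikit-learn components. This is an illustrative stand-in, not the library's implementation: a plain regression forest fit to a toy CATE proxy plays the role of the causal forest, leaf co-occurrence gives the proximity matrix, and SpectralClustering with a precomputed affinity finds the effect regimes.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 600
X = rng.normal(size=(n, 4))
# Toy CATE proxy with two latent effect regimes driven by the first covariate
tau = np.where(X[:, 0] > 0, -0.3, -0.1) + rng.normal(0.0, 0.02, n)

# Stand-in for the causal forest: a regression forest fit to the CATE proxy
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
forest.fit(X, tau)

# Proximity kernel: share of trees in which two observations land in the same leaf
leaves = forest.apply(X)                    # shape (n, n_trees) of leaf indices
prox = np.zeros((n, n))
for t in range(leaves.shape[1]):
    prox += leaves[:, t][:, None] == leaves[:, t][None, :]
prox /= leaves.shape[1]

# Spectral clustering on the precomputed affinity recovers the two regimes
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(prox)
cluster_means = [tau[labels == k].mean() for k in (0, 1)]
```

The clusters fall out of the forest's own partitioning, without nominating a segmentation variable — which is the point of the forest kernel over an rbf or linear baseline on raw confounders.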

Scalability. For n > 10,000, a warning is emitted and the kernel is computed on a 5,000-observation subsample. The remaining observations are assigned to clusters via nearest-neighbour. This is an approximation — cluster boundaries may shift slightly relative to the full-data solution.

Installation. CausalClusteringAnalyzer is part of the causal_forest subpackage:

uv add "insurance-causal[causal_forest]"

Rate change evaluation (v0.6.0)

DML and causal forests answer a forward-looking question: given the data we have, what is the causal effect of treatment? The rate_change sub-package answers a different question: we implemented a rate change six months ago — did it work, and by how much?

This is post-hoc causal evaluation. The methods — Difference-in-Differences and Interrupted Time Series — are standard in policy evaluation and health econometrics. The insurance application has specific wrinkles: loss ratios are exposure-weighted, treatment selection is correlated with risk quality (segments with deteriorating loss ratios get larger rate increases), and the usual parallel trends assumption needs checking against UK market shocks such as the Ogden rate change or whiplash reform.

RateChangeEvaluator handles both methods through a single interface. It selects DiD automatically when a control group is present (segments or territories that did not receive the rate change), and falls back to ITS when the entire book was treated simultaneously.

from insurance_causal.rate_change import RateChangeEvaluator, make_rate_change_data

# Synthetic panel: 10,000 policies, 12 quarters, rate change in Q7
# treated=1 for segments that received a 10% rate increase
df = make_rate_change_data(n_policies=10_000, true_att=-0.03, random_state=42)

evaluator = RateChangeEvaluator(
    method="auto",           # DiD if control group present, ITS otherwise
    outcome_col="loss_ratio",
    period_col="period",
    treated_col="treated",
    change_period=7,         # the quarter the rate change took effect
    exposure_col="exposure",
    unit_col="segment_id",
)

result = evaluator.fit(df).summary()
print(result)

Example output:

Rate Change Evaluation Result
  Method:          DiD (Difference-in-Differences)
  Outcome:         loss_ratio
  ATT:             -0.0298
  ATT (%):         -4.8% of pre-treatment mean
  SE:               0.0091
  95% CI:          (-0.0477, -0.0120)
  p-value:          0.001
  Parallel trends: p=0.412 (pre-treatment test passes)
  Pre-treatment mean (treated): 0.621
  N treated obs:   72  |  N control obs: 48
  Periods pre/post: 6 / 6

Per-segment analysis and diagnostics:

# Event study: pre-treatment coefficients should cluster near zero
evaluator.plot_event_study()

# Pre/post observed outcomes: treated vs control over time
evaluator.plot_pre_post()

# Formal parallel trends test: joint F-test on pre-treatment period dummies
pt = evaluator.parallel_trends_test()
print(pt.joint_pt_fstat, pt.joint_pt_pvalue)

ITS (whole-book evaluation). When no control group exists — the entire book received the rate change simultaneously — use ITS. Set method="its" or leave method="auto" and omit treated_col:

from insurance_causal.rate_change import make_its_data

df_ts = make_its_data(n_periods=16, true_level_shift=-0.04, random_state=42)

evaluator_its = RateChangeEvaluator(
    method="its",
    outcome_col="loss_ratio",
    period_col="quarter",
    change_period="2023Q3",   # accepts quarter strings or integers
    exposure_col="earned_years",
)
result_its = evaluator_its.fit(df_ts).summary()

ITS fits a segmented regression (level shift + slope change) with Newey-West HAC standard errors for autocorrelation, and quarterly seasonality dummies. The level shift is the primary estimate — the immediate effect of the rate change on the outcome, holding the pre-treatment trend constant.
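The segmented-regression structure can be reproduced directly in statsmodels. This is a hedged sketch on hypothetical quarterly data, not the RateChangeEvaluator internals: a trend term, a post-change level dummy, a post-change slope term, and Newey-West HAC standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
periods = np.arange(16)
change_at = 8
post = (periods >= change_at).astype(int)

# Hypothetical quarterly loss ratio: mild upward trend, -0.04 level shift at the change
loss_ratio = 0.62 + 0.002 * periods - 0.04 * post + rng.normal(0, 0.005, 16)

df = pd.DataFrame({
    "y": loss_ratio,
    "t": periods,                                  # pre-existing trend
    "post": post,                                  # level shift at the rate change
    "t_post": np.maximum(periods - change_at, 0),  # slope change after the break
})

# Segmented regression with Newey-West HAC standard errors for autocorrelation
fit = smf.ols("y ~ t + post + t_post", data=df).fit(
    cov_type="HAC", cov_kwds={"maxlags": 2})
level_shift = fit.params["post"]   # the immediate effect, holding the trend constant
```

Seasonality dummies would be added the same way as extra regressors; they are omitted here to keep the sketch short.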

Key classes:

  • RateChangeEvaluator — main entry point; fits DiD or ITS; exposes .fit(), .summary(), .plot_event_study(), .plot_pre_post(), .parallel_trends_test()
  • RateChangeResult — structured result dataclass with ATT, SE, CI, p-value, method metadata, and list of any estimation warnings
  • DiDResult — detailed DiD output including event study coefficients, staggered adoption detection flag, cluster SE details
  • ITSResult — detailed ITS output including level shift, slope change, and counterfactual trend parameters
  • UK_INSURANCE_SHOCKS — reference dict of known UK market shocks for confounder warnings (Ogden rate changes, whiplash reform, FCA pricing review)

When to use DiD vs ITS. If your portfolio has segments, territories, or channels that were unaffected by the rate change, use DiD — the control group absorbs time trends and macro shocks that would otherwise be attributed to the rate change. ITS is appropriate when the change was book-wide and simultaneous; it relies on the pre-treatment trend being stable and well-estimated, which requires at least 4-6 pre-treatment periods (the default min_pre_periods=4 enforces this).

Known limitation. Both DiD and ITS assume no spillover effects (the SUTVA assumption). In renewal pricing, if control segments and treated segments compete for the same customers via aggregators, a rate change in treated segments can shift demand to control segments, biasing the control group outcome. Check for volume changes in control segments alongside loss ratio changes.


Sensitivity analysis

How strong would an unobserved confounder need to be to overturn the result?

WARNING — heuristic approximation. The sensitivity_analysis() function uses a simplified bound: bias_bound = log(gamma) * se. This is not the classical Rosenbaum rank-based test on matched studies — it is a heuristic applied to the DML point estimate and standard error. For a rigorous sensitivity analysis, see the sensemakr package (Python port available), which implements the Cinelli-Hazlett (2020) partial R-squared bounds. The heuristic here is sufficient for directional guidance but should not be cited as a formal Rosenbaum bound.

from insurance_causal.diagnostics import sensitivity_analysis

ate = model.average_treatment_effect()
report = sensitivity_analysis(
    ate=ate.estimate,
    se=ate.std_error,
    gamma_values=[1.0, 1.25, 1.5, 2.0, 3.0],
)
print(report[["gamma", "conclusion_holds", "ci_lower", "ci_upper"]])

The sensitivity parameter gamma represents the odds ratio of treatment for two units with identical observed confounders. Gamma = 1 is no unobserved confounding; gamma = 2 means an unobserved factor doubles the treatment odds for some units. If conclusion_holds becomes False at gamma = 1.25, the result is fragile. If it holds to gamma = 2.0, the result is robust.



The maths, briefly

DML estimates the partially linear model:

Y = theta_0 * D + g_0(X) + epsilon
D = m_0(X) + V

Where theta_0 is the causal effect of treatment D on outcome Y, g_0(X) is an unknown nonlinear confounder effect, and m_0(X) is the conditional expectation of treatment given confounders.

The estimation procedure:

  1. Fit E[Y|X] using CatBoost (with 5-fold cross-fitting). Compute residuals Y_tilde = Y - E_hat[Y|X].
  2. Fit E[D|X] using CatBoost (with 5-fold cross-fitting). Compute residuals D_tilde = D - E_hat[D|X].
  3. Regress Y_tilde on D_tilde via OLS. The coefficient is theta_hat.

Step 3 is just OLS, which gives valid standard errors and confidence intervals. The cross-fitting in steps 1-2 ensures that nuisance estimation errors are asymptotically orthogonal to the score, so they do not bias theta_hat. This is the Neyman orthogonality property that makes DML valid even when the nuisance models are regularised ML estimators.

The result: theta_hat is root-n-consistent and asymptotically normal, with a valid 95% CI. This is not possible with naive ML plug-in estimators.
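The three steps can be sketched end-to-end in a few lines. This is a hedged illustration of the procedure, not the library's internals: scikit-learn gradient boosting stands in for CatBoost, and a known DGP lets us check that theta_hat lands near the true effect.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n = 4_000
X = rng.normal(size=(n, 3))
g = np.sin(X[:, 0]) + X[:, 1] ** 2        # nonlinear confounder effect g_0(X)
m = 0.5 * X[:, 0] + 0.3 * X[:, 1]         # treatment model m_0(X)
D = m + rng.normal(0, 1, n)
theta_true = -0.2
Y = theta_true * D + g + rng.normal(0, 1, n)

# Steps 1-2: cross-fitted residuals — each fold's nuisances are predicted by
# models trained on the OTHER folds, so estimation error is not reused
Y_res, D_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    Y_res[test] = Y[test] - GradientBoostingRegressor().fit(X[train], Y[train]).predict(X[test])
    D_res[test] = D[test] - GradientBoostingRegressor().fit(X[train], D[train]).predict(X[test])

# Step 3: OLS of residual on residual gives theta_hat with a valid SE
theta_hat = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
se = float(np.sqrt(np.mean((Y_res - theta_hat * D_res) ** 2) / np.dot(D_res, D_res)))
```

A naive boosted model of Y on (D, X) would not give a valid CI for the D coefficient; the residual-on-residual regression does, which is the whole point of the orthogonalisation.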

This PLR model assumes a constant treatment effect theta_0. The autodml subpackage and PremiumElasticity estimator go further, estimating heterogeneous effects and Average Marginal Effects using the Riesz representer minimax approach (Chernozhukov et al. 2022). These are different estimands: PLR gives a single number theta_0; AME integrates heterogeneous marginal effects over the covariate distribution. For most pricing applications the AME is the more useful quantity.


Why CatBoost for nuisance models?

The nuisance models E[Y|X] and E[D|X] need to be flexible nonlinear estimators that converge at n^{-1/4} or faster — a condition satisfied by well-tuned gradient boosted trees. A 2024 systematic evaluation (ArXiv 2403.14385) found that gradient boosted trees outperform LASSO in the DML nuisance step when confounding is genuinely nonlinear — which it is for insurance data with postcode effects and interaction of age with vehicle type.

CatBoost is the default because it handles categorical features natively (postcode band, vehicle group, occupation class) without label encoding, and its ordered boosting reduces target leakage from high-cardinality categoricals.

From v0.3.0, the nuisance model capacity is sample-size adaptive. The default configuration at 20k observations is 350 trees, depth 6; at 5k observations it drops to 150 trees, depth 5 with L2 regularisation (l2_leaf_reg=5.0). This prevents over-partialling — where CatBoost absorbs treatment signal into the nuisance residuals on small samples, leaving the final DML regression with near-zero treatment variance to identify from. The capacity schedule:

| n range | iterations | depth | l2_leaf_reg |
|---------------|------------|-------|-------------|
| < 2,000 | 100 | 4 | 10.0 |
| 2,000–5,000 | 150 | 5 | 5.0 |
| 5,000–10,000 | 200 | 5 | 3.0 |
| 10,000–50,000 | 350 | 6 | 3.0 |
| ≥ 50,000 | 500 | 6 | 3.0 |

To override: CausalPricingModel(..., nuisance_params={"iterations": 200, "depth": 4}).
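The schedule reduces to a small lookup. `nuisance_capacity` below is a hypothetical helper mirroring the table (the library applies an equivalent rule internally; the exact boundary handling at 2k, 5k, 10k and 50k is inferred from the surrounding text):

```python
def nuisance_capacity(n: int) -> dict:
    """Sample-size-adaptive CatBoost capacity, mirroring the schedule above.
    Hypothetical helper -- not the library's internal function."""
    if n < 2_000:
        return {"iterations": 100, "depth": 4, "l2_leaf_reg": 10.0}
    if n <= 5_000:
        return {"iterations": 150, "depth": 5, "l2_leaf_reg": 5.0}
    if n < 10_000:
        return {"iterations": 200, "depth": 5, "l2_leaf_reg": 3.0}
    if n < 50_000:
        return {"iterations": 350, "depth": 6, "l2_leaf_reg": 3.0}
    return {"iterations": 500, "depth": 6, "l2_leaf_reg": 3.0}

nuisance_capacity(6_000)   # {'iterations': 200, 'depth': 5, 'l2_leaf_reg': 3.0}
```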


Expected performance

On a 50,000-policy synthetic UK motor book with multiplicative confounding (age x NCB x region interaction driving both pricing decisions and renewal probability):

  • Naive GLM overestimates the treatment effect by 50–90% in confounded segments, and its 95% CI does not cover the true effect
  • DML reduces bias to 10–20% of the true effect with valid confidence intervals that cover the true value
  • Per-policy CATE estimates from causal_forest enable individual targeting vs segment averages, with formal heterogeneity tests (BLP, GATES, AUTOC)

The confounding mechanism: high-risk customers receive larger price increases and have lower baseline renewal rates independently of price. A GLM with main effects sees both effects superimposed and overstates price sensitivity. CatBoost nuisance models in the DML step recover the multiplicative interaction and partial it out.

Run uv run python benchmarks/run_benchmark.py or import notebooks/databricks_validation.py into Databricks for the full comparison.

Small-sample performance (v0.3.0+)

The primary motivation for v0.3.0 was fixing DML's performance at typical UK insurance small-book sizes: 1k–10k policies. The original implementation (v0.2.x) used CatBoost with fixed parameters (500 iterations, depth 6) regardless of sample size. On small samples, this caused over-partialling: the nuisance model for E[Y|X] became flexible enough to absorb treatment signal, leaving the DML regression step with near-zero residual treatment variance. The result was a biased ATE estimate — in benchmark runs, DML was worse than a naive GLM at n=5k.

The fix is sample-size-adaptive nuisance parameters. At n=5k the library now uses 150 trees, depth 5, l2_leaf_reg=5.0 — aggressive enough to model confounding structure but not so flexible that it eliminates the treatment residual signal.

Small-sample sweep results (synthetic UK motor DGP, true effect = −0.15, 5 replication seeds):

| n | Naive GLM bias (typical) | DML v0.2.x bias | DML v0.3.0 bias | Improvement |
|--------|--------------------------|-----------------------|-----------------|-------------|
| 1,000 | 20–40% | 60–90% (over-partial) | 15–35% | 30–50 pp |
| 2,000 | 20–40% | 40–70% | 10–25% | 25–40 pp |
| 5,000 | 15–30% | 30–55% | 8–20% | 20–35 pp |
| 10,000 | 10–25% | 15–35% | 5–15% | 10–20 pp |
| 20,000 | 8–20% | 8–20% | 5–12% | ~5 pp |
| 50,000 | 5–15% | 5–10% | 4–10% | negligible |

Results show variance across seeds — run notebooks/benchmark.py (Section 12) for exact figures on your cluster.

Headline benchmark (n=20,000, unobserved confounder DGP)

Benchmarked against a naive Poisson GLM on synthetic UK motor data with a known ground-truth treatment effect of −0.15. Full methodology: notebooks/benchmark.py.

The DGP includes an unobserved driving behaviour score that drives both treatment selection (careful drivers self-select into telematics) and claim frequency. The GLM controls for all observed rating factors (age, vehicle value, postcode risk) but cannot see the latent driving score. DML's non-linear nuisance models partially proxy the unobserved channel through the observed covariates.

This produces a clear, commercially meaningful gap: a naive GLM overstates the treatment effect by 15–20%. A pricing team using the GLM estimate to calibrate the telematics discount would set it 15–20% too aggressively.

Run on Databricks serverless, 2026-03-21, seed=42, n=20,000:

| Metric | Naive Poisson GLM | DML (insurance-causal) |
|----------------------|---------------------|-------------------------|
| Estimate | biased towards −0.18 | converges to −0.15 |
| True DGP effect | −0.1500 | −0.1500 |
| Bias (% of true) | ~15–20% | ~2–5% |
| 95% CI covers truth? | No | Yes |
| Fit time | <1s | ~60s (5-fold CatBoost) |

Run notebooks/benchmark.py for exact figures — results vary slightly by seed.

When to use: When the treatment was not randomly assigned — which is almost always true in insurance (telematics, renewal pricing, channel, campaign). DML removes the confounding bias that a standard GLM carries silently.

When NOT to use: Genuinely random treatment (A/B test with proper randomisation). Also not appropriate when treatment variation is nearly deterministic — the residualised treatment will have near-zero variance and estimates will be unstable.

Minimum practical sample size: n ≈ 1,000 with cv_folds=3. Below this the confidence intervals are too wide to be commercially useful. At n < 500 per segment in cate_by_segment(), the library warns and reduces CatBoost iterations automatically.
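A quick pre-flight check for the near-deterministic case is to compare the cross-fitted residual variance of the treatment to its raw variance. The helper below is an illustrative sketch using scikit-learn; the 5% threshold is an assumption for demonstration, not a library default:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def residual_treatment_check(X, D, min_var_ratio=0.05):
    """Share of treatment variance left after partialling out X.
    If almost none remains, DML has little exogenous variation to
    identify from and estimates will be unstable. Illustrative
    sketch -- threshold is an assumption."""
    m_hat = cross_val_predict(GradientBoostingRegressor(random_state=0), X, D, cv=5)
    ratio = np.var(D - m_hat) / np.var(D)
    return ratio, bool(ratio >= min_var_ratio)

rng = np.random.default_rng(1)
X = rng.normal(size=(2_000, 3))
# Rule-based price change (a deterministic function of X) vs exogenous variation
ratio, ok = residual_treatment_check(X, 2.0 * X[:, 0])
ratio2, ok2 = residual_treatment_check(X, rng.normal(size=2_000))
```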

Causal forest vs GLM interaction model (v0.4.0+)

The relevant comparison for a pricing team evaluating causal forest adoption is not "forest vs ignoring heterogeneity". It is "forest vs the approach we already use": a Poisson GLM with treatment × segment interaction terms.

DGP: 20,000 UK motor policies, Poisson frequency outcome (12% base rate). True log-scale price semi-elasticities vary by age band × urban status: young urban −5.0, senior rural −0.8. Treatment (log price change) is confounded by risk profile.

Both estimators target the same estimand: the log-scale elasticity per segment. The GLM interaction model is well-matched to the DGP (the DGP is log-linear Poisson). This puts the causal forest at a disadvantage relative to real portfolios, where confounding interactions are genuinely nonlinear.


Limitations

  • Unobserved confounders invalidate the estimate. DML removes bias from observed confounders; if attitude to risk, actual mileage, or claim reporting behaviour are unobserved and correlated with both price change and outcome, the result is still biased. Run sensitivity_analysis() to understand how large an unobserved confounder would need to be to overturn your conclusion. Note: sensitivity_analysis() uses a heuristic bound — see the docstring warning before citing it as a formal Rosenbaum bound.
  • Near-deterministic treatment destroys identification. If your price changes are almost entirely rule-based, the DML cross-fitting step leaves near-zero residual treatment variance. The resulting confidence interval will be very wide — correctly so, because there is genuinely little exogenous variation to identify from.
  • Including mediators as confounders attenuates estimates. NCB is partly caused by the claim experience driven by the risk factors you are studying. Adding it to the confounder list blocks the causal channel. Draw the DAG before specifying the model.
  • Small samples produce unreliable CATE estimates from the causal forest. HeterogeneousElasticityEstimator requires at least 5,000 observations. Below this, honest splitting combined with 5-fold cross-fitting leaves too few training observations per tree.
  • Computation scales with portfolio size and cross-fitting folds. At 100k observations with 10 confounders and 5 folds, expect 5–15 minutes on a standard Databricks cluster. cv_folds=3 halves this at the cost of slightly noisier standard errors.

References

  1. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal, 21(1): C1-C68. ArXiv: 1608.00060

  2. Bach, P., Chernozhukov, V., Kurz, M.S., Spindler, M. and Klaassen, S. (2024). "DoubleML: An Object-Oriented Implementation of Double Machine Learning in R." Journal of Statistical Software, 108(3): 1-56. docs.doubleml.org

  3. Chernozhukov, V. et al. (2022). "Automatic Debiased Machine Learning of Causal and Structural Effects." Econometrica, 90(3): 967-1027. ArXiv: 2006.10576

  4. Guelman, L. and Guillen, M. (2014). "A causal inference approach to measure price elasticity in automobile insurance." Expert Systems with Applications, 41(2): 387-396.

  5. Chernozhukov, V. et al. (2024). "Applied Causal Inference Powered by ML and AI." causalml-book.org

  6. Cinelli, C. and Hazlett, C. (2020). "Making Sense of Sensitivity: Extending Omitted Variable Bias." Journal of the Royal Statistical Society: Series B, 82(1): 39-67.


Notebooks and examples

Quickstart (examples/quickstart.py): end-to-end worked example showing the confounding problem, DML estimation, and bias comparison on a 10,000-policy synthetic UK motor book. Runs in under 3 minutes on Colab or Databricks.

Full demo notebooks (Databricks .py format, import via Repos):

| Notebook | What it covers |
|---------------------------------------------|------------------------------------------------------|
| notebooks/01_insurance_causal_demo.py | Core DML, confounding bias report, sensitivity analysis |
| notebooks/02_autodml_demo.py | Riesz representer AME, dose-response curve |
| notebooks/03_elasticity_demo.py | Renewal pricing optimisation, ENBP constraint |
| notebooks/04_causal_forest_hte_demo.py | CATEs, BLP/GATES/CLAN, RATE/AUTOC |
| notebooks/05_rate_change_evaluator_demo.py | DiD and ITS post-hoc evaluation |

A ready-to-run Databricks notebook benchmarking this library against standard approaches is available in burning-cost-examples.


Other Burning Cost libraries

Model building

| Library | Description |
|------------------------|-------------------------------------------------------------|
| shap-relativities | Extract rating relativities from GBMs using SHAP |
| insurance-interactions | Automated GLM interaction detection via CANN and NID scores |
| insurance-cv | Walk-forward cross-validation respecting IBNR structure |

Uncertainty quantification

| Library | Description |
|----------------------|------------------------------------------------------|
| insurance-conformal | Distribution-free prediction intervals for Tweedie models |
| bayesian-pricing | Hierarchical Bayesian models for thin-data segments |
| insurance-credibility | Bühlmann-Straub credibility weighting |

Deployment and optimisation

| Library | Description |
|--------------------|--------------------------------------------------------------|
| insurance-optimise | Constrained rate change optimisation with FCA PS21/5 compliance |
| insurance-demand | Conversion, retention, and price elasticity modelling |

Governance

| Library | Description |
|----------------------|-----------------------------------------------------|
| insurance-fairness | Proxy discrimination auditing for UK insurance models |
| insurance-monitoring | Model monitoring: PSI, A/E ratios, Gini drift test |

Spatial

| Library | Description |
|-------------------|-----------------------------------------------------|
| insurance-spatial | BYM2 spatial territory ratemaking for UK personal lines |

All libraries ->
