insurance-nowcast

ML-Enhanced EM Nowcasting for insurance claims reporting delays.

Pricing actuaries routinely face a problem that has no good Python solution: the most recent 6–24 months of experience data is partially developed — claims have occurred but not yet been reported. Applying aggregate completion factors from a reserving triangle ignores that reporting delay varies by risk characteristics. A young driver making a motor BI claim has a different reporting delay than a fleet driver making a motor PD claim.

This library implements the Wilsens/Antonio/Claeskens (arXiv:2512.07335) ML-EM algorithm, adapted for insurance pricing, to produce covariate-conditioned completion factors and IBNR counts by risk segment.

The problem in concrete terms

You're fitting a frequency GLM on 3 years of motor BI data. Your training data extract is as of 31 December 2024. Policies from Q4 2024 have been exposed for 1–3 months — but motor BI claims have a median reporting delay of 4 months. This means roughly 50–60% of claims from Q4 2024 are still unreported. If you feed raw claim counts into your GLM, Q4 2024 will appear to be a low-frequency quarter, and your fitted frequencies will be biased downward.
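To put numbers on the bias (illustrative figures only: a true annualised frequency of 8% and 45% of Q4 claims reported by the extract date):

```python
# Illustrative only: how unreported claims bias the observed frequency.
true_frequency = 0.08      # true claims per policy-year
reported_share = 0.45      # fraction of Q4 2024 claims reported by 31 Dec 2024
exposure_years = 10_000.0  # policy-years of Q4 exposure

expected_claims = true_frequency * exposure_years   # 800 ultimate claims
observed_claims = expected_claims * reported_share  # 360 reported so far

observed_frequency = observed_claims / exposure_years
print(f"Observed frequency: {observed_frequency:.3f}")  # 0.036 vs true 0.080

# A completion factor corrects the count back to the ultimate level:
completion_factor = reported_share  # here: P(reported by eval date)
adjusted = observed_claims / completion_factor
print(f"Adjusted claim count: {adjusted:.0f}")  # back to 800
```

The library's contribution is making `completion_factor` depend on the risk covariates rather than being one aggregate number.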

Standard practice is to apply aggregate development factors from the reserving team's triangle. This is better than nothing, but:

  • The factors come from aggregate data and don't condition on risk mix
  • If your recent business has a different risk profile than historical average, the aggregate factor is wrong
  • You can't quantify the uncertainty in the completion factor

This library solves all three problems.

Install

pip install insurance-nowcast

For diagnostic plots:

pip install "insurance-nowcast[plots]"

Quick start

from insurance_nowcast import ReportingDelayModel, NowcastSimulator

# Generate synthetic data to test
sim = NowcastSimulator(
    n_occurrence_periods=24,
    max_delay_periods=12,
    base_frequency=0.08,
    delay_shape="geometric",
)
df = sim.generate(n_policies=2000, eval_period=23)

# Fit the model
model = ReportingDelayModel(
    occurrence_model="xgboost",
    delay_model="xgboost",
    max_delay_periods=12,
    verbose=True,
)
model.fit(
    df,
    occurrence_col="occurrence_period",
    report_col="report_period",
    exposure_col="exposure",
    feature_cols=["age_group", "risk_score", "channel"],
    eval_date=23,
)

# Get completion factors by occurrence period
cf = model.predict_completion_factors()
print(cf[["occurrence_period", "completion_factor", "ibnr_count"]])

# Get IBNR counts
ibnr = model.predict_ibnr()
print(f"Total IBNR: {ibnr['ibnr_count'].sum():.1f} claims")

# Segment-level completion factors (for GLM adjustment)
cf_by_channel = model.predict_completion_factors(df=df, by=["channel"])

Input data format

The model expects individual claims data with one row per claim event:

Column              Type                  Description
occurrence_period   int                   Period when the claim occurred (e.g., month as integer)
report_period       float/int, nullable   Period when the claim was reported; null = IBNR
exposure            float                 Policy exposure for this claim (policy-years at risk)
feature columns     float/int             Risk covariates (must be numeric)

This mirrors individual claims data that pricing teams already maintain. No triangle aggregation required.
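A minimal conforming extract, built with pandas (values illustrative):

```python
import pandas as pd

# One row per claim event; report_period is null for claims that are
# still unreported (IBNR) as of the evaluation date.
claims = pd.DataFrame({
    "occurrence_period": [20, 21, 22, 23],
    "report_period":     [21, 23, None, None],   # None = IBNR
    "exposure":          [1.0, 0.5, 1.0, 0.25],  # policy-years at risk
    "age_group":         [2, 3, 1, 2],           # risk covariates, numeric
    "risk_score":        [0.7, 1.2, 0.9, 1.1],
})

print(claims["report_period"].isna().sum())  # 2 claims are IBNR
```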

Algorithm

The model implements the EM algorithm from Wilsens, Antonio & Claeskens (arXiv:2512.07335):

Joint Poisson-Multinomial model:

  • Occurrence: N_i ~ Poisson(λ(x_i) × exposure_i)
  • Delay: N_{i,j} | N_i ~ Multinomial(N_i, p_j(x_i))

E-step: For censored cells (j ≥ τ_i), impute: N̂_{i,j}^{(k)} = λ̂^{(k-1)}(x_i) × exposure_i × p̂_j^{(k-1)}(x_i)

M-step: Fit XGBoost (or GLM) on imputed complete data for:

  • Occurrence: Poisson regression with exposure offset
  • Delay: Multinomial softmax regression

XGBoost additive construction: New trees are added to the previous model at each EM iteration rather than refitting from scratch. This is the key contribution of the Wilsens paper — it provides de facto monotone likelihood improvement structurally similar to classical EM's guarantee.
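The E/M alternation is easiest to see in a covariate-free sketch, where both M-step updates have closed forms (no XGBoost, illustrative numbers; this is not the library's internals):

```python
# Toy covariate-free Poisson-multinomial EM with right-censored delays.
D = 3                                    # max delay (periods 0..3)
tau = 3                                  # eval period: cell (t, j) observed iff t + j <= tau
exposure = [100.0, 100.0, 100.0, 100.0]  # per occurrence period
observed = [                             # n[t][j]; None where censored
    [40, 20, 10, 5],
    [42, 19, 11, None],
    [38, 21, None, None],
    [41, None, None, None],
]

lam, p = 0.5, [1 / (D + 1)] * (D + 1)    # initial guesses
for _ in range(200):
    # E-step: impute censored cells from current parameters
    filled = [
        [observed[t][j] if observed[t][j] is not None
         else lam * exposure[t] * p[j]
         for j in range(D + 1)]
        for t in range(tau + 1)
    ]
    # M-step: closed-form Poisson rate and multinomial delay pmf
    total = sum(sum(row) for row in filled)
    lam = total / sum(exposure)
    p = [sum(row[j] for row in filled) / total for j in range(D + 1)]

print(f"lambda ~ {lam:.3f}, delay pmf ~ {[round(q, 3) for q in p]}")
```

In the full model, the two closed-form M-step updates are replaced by fitting the occurrence and delay learners on the imputed complete data.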

Insurance adaptation: The original paper has no exposure offset. This library adds log(exposure) as an offset in the Poisson occurrence model via XGBoost's base_margin parameter. This is essential for pricing use — without it, the occurrence model conflates claim frequency rate with exposure volume.
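Why the offset matters can be seen in a covariate-free sketch (closed-form intercept-only Poisson MLEs; no XGBoost):

```python
import math

# Two segments with the same true frequency (0.10 claims per policy-year)
# but very different exposure volumes.
exposure = [10.0, 10.0, 100.0, 100.0]  # policy-years per record
counts = [1, 1, 10, 10]                # observed claim counts

# With a log(exposure) offset, the intercept-only Poisson MLE recovers
# the *rate*: exp(beta0) = sum(counts) / sum(exposure).
rate = sum(counts) / sum(exposure)
print(f"rate with offset: {rate:.3f}")  # 0.100 claims per policy-year

# Without the offset, the MLE is the mean *count* per record, which
# mixes frequency with exposure volume.
naive = sum(counts) / len(counts)
print(f"mean count without offset: {naive:.2f}")  # 5.50, not a frequency

# In XGBoost the offset enters per row as base_margin = log(exposure_i),
# so the trees model log-rate rather than log-count.
log_offset = [math.log(e) for e in exposure]
```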

Model parameters

ReportingDelayModel(
    occurrence_model="xgboost",    # "glm" or "xgboost"
    delay_model="xgboost",         # "glm" or "xgboost"
    max_delay_periods=24,          # Set to 95th-99th percentile of observed delays
    exposure_offset=True,          # Always True for pricing use
    em_patience=10,                # Stop if LL doesn't improve for 10 iterations
    max_em_iterations=50,          # Hard upper limit
    convergence_tol=1e-4,          # Minimum LL improvement to reset patience
    n_bootstrap=100,               # Bootstrap replications for CIs; 0 to skip
    bootstrap_confidence=0.90,     # 90% CI by default
)
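The stopping rule implied by em_patience, max_em_iterations and convergence_tol can be read as a standard patience loop (a sketch, not the library's internals):

```python
def run_em(step, max_em_iterations=50, em_patience=10, convergence_tol=1e-4):
    """Run `step` (returns log-likelihood) until patience is exhausted.

    Patience resets only when the LL improves by more than convergence_tol.
    """
    best_ll, stale = float("-inf"), 0
    for it in range(max_em_iterations):
        ll = step(it)
        if ll > best_ll + convergence_tol:
            best_ll, stale = ll, 0
        else:
            stale += 1
            if stale >= em_patience:
                break
    return best_ll, it + 1

# Hypothetical LL trace that plateaus after iteration 5.
trace = [-100, -50, -30, -20, -19, -18.9] + [-18.9] * 50
best, n_iter = run_em(lambda i: trace[i], em_patience=3)
print(best, n_iter)  # stops 3 iterations after the plateau begins
```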

Choosing max_delay_periods

This is the most important parameter. Set it too small and you'll understate IBNR. Typical values by UK line:

Line                    Suggested max_delay_periods
Motor property damage   6 months
Motor bodily injury     18–24 months
Employers' liability    36–48 months
Public liability        24–36 months
Professional indemnity  36–60 months

The model will warn if >10% of observed delays are at or beyond the boundary.
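One way to choose the value empirically is the tail quantile of observed delays among reported claims (a sketch; the function name is illustrative, not part of the library):

```python
def suggest_max_delay(occurrence, report, quantile=0.99):
    """Suggest max_delay_periods from reported claims' empirical delays."""
    delays = sorted(r - o for o, r in zip(occurrence, report) if r is not None)
    if not delays:
        raise ValueError("no reported claims")
    idx = min(int(quantile * len(delays)), len(delays) - 1)
    return delays[idx]

# Mostly short delays with a heavy tail, as in motor BI.
occ = [0] * 100
rep = [1] * 60 + [3] * 25 + [6] * 10 + [12] * 4 + [20]
print(suggest_max_delay(occ, rep))  # 20: the tail drives the choice
```

Note that this only sees delays already observed, so on immature data it will understate the tail; round up rather than down.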

When to use GLM vs XGBoost

Use occurrence_model="glm", delay_model="glm" when:

  • Portfolio is small (<5,000 claims)
  • Interpretability is important
  • You want a baseline to compare against

Use occurrence_model="xgboost", delay_model="xgboost" when:

  • Portfolio is large (>10,000 claims)
  • You expect non-linear effects on delay speed (e.g., claim type × territory)
  • Per the Wilsens paper experiments, XGBoost outperforms GLM on non-linear data

Diagnostics

from insurance_nowcast import ReportingDelayDiagnostic

diag = ReportingDelayDiagnostic()
diag.plot_convergence(model)            # EM log-likelihood by iteration
diag.plot_development_pattern(model)    # Cumulative delay curves by period
diag.plot_ibnr_by_period(model)         # Observed vs IBNR bar chart
diag.plot_delay_distribution(model, X)  # Delay PMF by risk profile

What this is not

This is a pricing tool, not a reserving tool. The outputs are:

  • Completion factors for adjusting claim counts in a pricing GLM training dataset
  • IBNR counts for understanding development loading by segment

The numbers should be comparable to the reserving team's LDFs. If they diverge materially, that's worth investigating — but don't present these as financial reserves.

The model handles IBNR (unreported claims) only, not RBNS (reported but not settled). For pricing frequency models, this is sufficient — we need ultimate claim counts, not ultimate paid amounts.
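Downstream use, sketched: merge segment-level completion factors onto the training extract and gross up reported counts before fitting the frequency GLM (column names follow the quick start; factor values are illustrative):

```python
import pandas as pd

# Recent occurrence periods with partially reported counts, plus a
# completion factor per (channel, period) segment from the model.
train = pd.DataFrame({
    "occurrence_period": [22, 22, 23, 23],
    "channel":           ["direct", "broker", "direct", "broker"],
    "reported_claims":   [90, 60, 45, 28],
})
cf = pd.DataFrame({
    "occurrence_period": [22, 22, 23, 23],
    "channel":           ["direct", "broker", "direct", "broker"],
    "completion_factor": [0.90, 0.80, 0.50, 0.40],
})

adjusted = train.merge(cf, on=["occurrence_period", "channel"])
# Ultimate count estimate = reported / completion factor; use this (or
# the equivalent IBNR addition) as the GLM response for recent periods.
adjusted["ultimate_claims"] = (
    adjusted["reported_claims"] / adjusted["completion_factor"]
)
print(adjusted[["channel", "occurrence_period", "ultimate_claims"]])
```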

References

  • Wilsens, Antonio, Claeskens (2024): arXiv:2512.07335 — the ML-EM framework this implements
  • Verbelen, Antonio, Claeskens, Crevecoeur (2022): Statistical Science 37(3) — the foundational GLM-EM paper
  • Hiabu, Hofman, Pittarello (2023): arXiv:2312.14549 — parallel survival analysis approach (R package: ReSurv)

Development

git clone https://github.com/burning-cost/insurance-nowcast
cd insurance-nowcast
uv sync --all-extras
uv run pytest tests/ -v
