Telematics insurance pricing: HMM driving state classification and GLM risk scoring from raw trip data for usage-based insurance (UBI) and pay-how-you-drive (PHYD) products.
insurance-telematics
Turn raw GPS and accelerometer trip data into GLM-ready driver risk features using Hidden Markov Models — auditable, credibility-weighted, and explainable to the FCA.
Why this?
Raw telematics features — mean speed, harsh braking counts — treat a single motorway run as equivalent to a persistent driving style. HMM state classification separates trip-level noise from genuine behavioural regimes (cautious, normal, aggressive), and the fraction of time in the aggressive state is more predictive of claim frequency than raw averages alone (Jiang & Shi, 2024, NAAJ). Unlike vendor scores, every feature is auditable: you can show a regulator exactly which behaviours drive the output.
Blog post: HMM-Based Telematics Risk Scoring for Insurance Pricing
Quickstart
uv add insurance-telematics
from insurance_telematics import TripSimulator, TelematicsScoringPipeline
sim = TripSimulator(seed=42)
trips_df, claims_df = sim.simulate(n_drivers=100, trips_per_driver=50)
pipe = TelematicsScoringPipeline(n_hmm_states=3)
pipe.fit(trips_df, claims_df)
predictions = pipe.predict(trips_df)
No raw data yet? TripSimulator generates a realistic synthetic fleet — three driving regimes, Ornstein-Uhlenbeck speed processes, synthetic Poisson claims — so you can prototype before your data arrives.
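The simulator's internals are not documented in this README. As a rough illustration of what an Ornstein-Uhlenbeck speed process looks like, the sketch below uses an Euler-Maruyama discretisation at 1 Hz. The function name `simulate_ou_speed` and all parameter values are assumptions made for illustration; they are not part of the `insurance-telematics` API.

```python
import math
import random


def simulate_ou_speed(n_seconds, mean_kmh, reversion, volatility, seed=0):
    """Euler-Maruyama path of dv = reversion * (mean - v) dt + volatility dW.

    Hypothetical helper: parameter names and values are illustrative only.
    """
    rng = random.Random(seed)
    dt = 1.0  # one-second steps, matching 1 Hz telematics data
    speed = mean_kmh
    path = []
    for _ in range(n_seconds):
        shock = volatility * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        speed += reversion * (mean_kmh - speed) * dt + shock
        speed = max(speed, 0.0)  # speeds cannot be negative
        path.append(speed)
    return path


# Two illustrative regimes: a calm one and a faster, more volatile one.
cautious = simulate_ou_speed(600, mean_kmh=40.0, reversion=0.05, volatility=1.0)
aggressive = simulate_ou_speed(600, mean_kmh=90.0, reversion=0.05, volatility=4.0)
```

Mean reversion keeps each path hovering around its regime mean, while the volatility term controls how erratic the speed is; an HMM can then try to tell such regimes apart from the observed features.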
Use cases
1. Trip scoring for a new-to-telematics portfolio
Score each trip and aggregate to driver level with Bühlmann-Straub credibility weighting. Drivers with fewer than 10 trips fall back to portfolio means automatically.
from insurance_telematics import load_trips, clean_trips, extract_trip_features
from insurance_telematics import aggregate_to_driver
trips = load_trips("trips.parquet")
features = extract_trip_features(clean_trips(trips))
driver_risk = aggregate_to_driver(features, credibility_threshold=30)
# driver_risk: one row per driver_id, GLM-ready
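To make the credibility idea concrete, here is a minimal sketch of the standard Bühlmann credibility blend, Z = n / (n + k). How `aggregate_to_driver` maps `credibility_threshold` onto k is not documented here, so treating k as a single constant, along with the name `credibility_blend`, is an assumption of this sketch.

```python
def credibility_blend(driver_mean, n_trips, portfolio_mean, k):
    """Blend a driver's own mean with the portfolio mean.

    Z = n / (n + k) is the standard Bühlmann credibility factor; treating
    k as a single tuning constant is an assumption of this sketch.
    """
    z = n_trips / (n_trips + k)
    return z * driver_mean + (1.0 - z) * portfolio_mean


portfolio_rate = 0.20  # illustrative harsh-braking events per km
sparse_driver = credibility_blend(0.60, n_trips=3, portfolio_mean=portfolio_rate, k=30)
established_driver = credibility_blend(0.60, n_trips=300, portfolio_mean=portfolio_rate, k=30)
```

With only 3 trips the estimate stays close to the portfolio rate (about 0.24), whereas with 300 trips it moves most of the way to the driver's own 0.60 (about 0.56).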
2. HMM state classification — extracting driving regime features
Classify each trip into latent driving states and derive the regime fractions that feed your Poisson GLM.
from insurance_telematics import DrivingStateHMM
hmm = DrivingStateHMM(n_states=3)
hmm.fit(features)
states = hmm.predict_states(features)
hmm_features = hmm.driver_state_features(features, states)
# hmm_features includes state_0_fraction, state_1_fraction, state_2_fraction per driver
With three states the HMM typically recovers: state 0 = cautious (low speed, urban), state 1 = normal (mixed), state 2 = aggressive (high speed variance, high harsh event rate). The state_2_fraction is the primary GLM covariate.
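The state fractions themselves are simple to reason about: each is the share of a driver's decoded observations that fall in a given state. The sketch below shows that calculation in plain Python; `state_fractions` is a hypothetical helper written for illustration and is not the library's `driver_state_features` implementation.

```python
from collections import Counter, defaultdict


def state_fractions(driver_ids, states, n_states=3):
    """Share of decoded observations in each state, per driver."""
    counts = defaultdict(Counter)
    for driver, state in zip(driver_ids, states):
        counts[driver][state] += 1
    result = {}
    for driver, counter in counts.items():
        total = sum(counter.values())
        result[driver] = {
            f"state_{s}_fraction": counter[s] / total for s in range(n_states)
        }
    return result


# Toy input: six decoded observations across two drivers.
driver_ids = ["a", "a", "a", "a", "b", "b"]
decoded = [0, 2, 2, 1, 0, 0]
fractions = state_fractions(driver_ids, decoded)
```

In this toy input, driver "a" spends half of the observations in state 2 and driver "b" none, so only "a" would carry a non-zero `state_2_fraction` into the GLM.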
3. Variable trip length — continuous-time HMM
For portfolios where observation intervals are irregular (trips logged at variable Hz), use ContinuousTimeHMM to avoid biasing state estimates toward shorter trips.
from insurance_telematics import ContinuousTimeHMM
import numpy as np
time_deltas = np.array(features["trip_duration_min"])
cthmm = ContinuousTimeHMM(n_states=3)
cthmm.fit(features, time_deltas=time_deltas)
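The usual idea behind a continuous-time HMM is that the transition matrix over an interval Δt is P(Δt) = exp(QΔt), where Q is a generator matrix of switching rates. The sketch below illustrates that relationship with a small pure-Python matrix exponential. The generator values and helper names are assumptions for illustration; `ContinuousTimeHMM` may parameterise and compute this differently.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """exp(m) by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2 ** squarings
    scaled = [[x / scale for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(generator, dt):
    """P(dt) = exp(Q * dt) for a continuous-time Markov chain with generator Q."""
    return mat_exp([[q * dt for q in row] for row in generator])


# Illustrative generator (rates per minute): off-diagonal entries are
# switching rates and each row sums to zero.
Q = [
    [-0.10, 0.08, 0.02],
    [0.05, -0.10, 0.05],
    [0.02, 0.08, -0.10],
]
p_short = transition_matrix(Q, dt=1.0)
p_long = transition_matrix(Q, dt=30.0)
```

Over a one-minute gap the chain mostly stays where it is; over a thirty-minute gap the rows approach the stationary distribution. A discrete-time HMM that ignores the gap length would treat both steps identically.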
Full pipeline
Raw 1Hz trip data (CSV or Parquet)
→ load_trips() — load and schema-map
→ clean_trips() — GPS jump removal, acceleration derivation, road type
→ extract_trip_features() — harsh braking rate, speeding fraction, night fraction
→ DrivingStateHMM — classify each trip into latent driving states
→ aggregate_to_driver() — Bühlmann-Straub credibility weighting to driver level
→ TelematicsScoringPipeline — Poisson GLM producing predicted claim frequency
Input data format
One row per second (1Hz):
| Column | Type | Notes |
|---|---|---|
| trip_id | string | Unique per trip |
| timestamp | datetime | ISO 8601 or Unix epoch |
| latitude | float | Decimal degrees |
| longitude | float | Decimal degrees |
| speed_kmh | float | GPS speed |
| acceleration_ms2 | float | Optional; derived from speed if absent |
| heading_deg | float | Optional; used for cornering estimation |
| driver_id | string | Optional; "unknown" if absent |
Non-standard column names? Use schema:
trips = load_trips("raw_data.csv", schema={"gps_speed": "speed_kmh"})
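When `acceleration_ms2` is absent it is derived from speed. A minimal sketch of one way to do that at 1 Hz is a first difference of speed, converted from km/h to m/s. The function name and the choice to set the first sample to zero are assumptions; `clean_trips` may smooth or difference the signal differently.

```python
def derive_acceleration_ms2(speed_kmh, dt_seconds=1.0):
    """First-difference acceleration in m/s^2 from speeds in km/h.

    The first sample has no predecessor, so its acceleration is set to 0.0
    (an assumption of this sketch).
    """
    if not speed_kmh:
        return []
    accel = [0.0]
    for prev, curr in zip(speed_kmh, speed_kmh[1:]):
        # Divide by 3.6 to convert km/h to m/s before differencing over dt.
        accel.append((curr - prev) / 3.6 / dt_seconds)
    return accel


speeds = [50.0, 50.0, 36.0, 21.6, 21.6]
accel = derive_acceleration_ms2(speeds)
```

Here the drop from 36.0 to 21.6 km/h in one second corresponds to roughly −4 m/s², which would count as harsh braking under a −3.5 m/s² threshold.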
Features extracted per trip
- harsh_braking_rate: events/km where acceleration < −3.5 m/s² (hard deceleration)
- harsh_accel_rate: events/km where acceleration > +3.5 m/s²
- harsh_cornering_rate: events/km (estimated from heading-change rate)
- speeding_fraction: fraction of time exceeding the road-type speed limit
- night_driving_fraction: fraction of distance driven 23:00–05:00
- urban_fraction: fraction of time at speed < 50 km/h
- mean_speed_kmh, p95_speed_kmh, speed_variation_coeff
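As an illustration of how two of these features could be computed from 1 Hz samples, the sketch below counts harsh braking events per km and the time fraction below 50 km/h. Counting a run of consecutive below-threshold samples as one event is an assumption; the library's event definition may differ, and both function bodies are illustrative rather than the library's own.

```python
def harsh_braking_rate(speed_kmh, accel_ms2, threshold_ms2=-3.5, dt_seconds=1.0):
    """Harsh braking events per km.

    An "event" here is a run of consecutive samples below the threshold,
    which is an assumption; the library may count events differently.
    """
    distance_km = sum(v * dt_seconds / 3600.0 for v in speed_kmh)
    events = 0
    in_event = False
    for a in accel_ms2:
        harsh = a < threshold_ms2
        if harsh and not in_event:
            events += 1
        in_event = harsh
    return events / distance_km if distance_km > 0 else 0.0


def urban_fraction(speed_kmh, limit_kmh=50.0):
    """Fraction of samples (time, at a fixed sample rate) below the limit."""
    if not speed_kmh:
        return 0.0
    return sum(v < limit_kmh for v in speed_kmh) / len(speed_kmh)


# Synthetic, hand-built samples: 100 seconds at 72 km/h covers 2 km.
trip_speeds = [72.0] * 100
trip_accel = [0.0] * 100
trip_accel[10], trip_accel[11] = -4.0, -3.8  # one two-second braking event
trip_accel[50] = -5.0                        # a second, separate event
rate = harsh_braking_rate(trip_speeds, trip_accel)
urban = urban_fraction([30.0, 40.0, 60.0, 80.0])
```

The toy trip yields two events over 2 km, so a rate of about 1 event/km, and the four-sample speed list gives an urban fraction of 0.5.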
Compared to alternatives
| | Vendor black-box | Raw feature averages | Manual threshold scoring | insurance-telematics |
|---|---|---|---|---|
| Auditable methodology | No | Yes | Yes | Yes |
| Captures driving regimes | Possibly | No | Partial | Yes (HMM) |
| Handles sparse new drivers | Varies | No | No | Yes (credibility weighting) |
| GLM-ready output | Varies | Manual | Manual | Yes (Polars DataFrame) |
| FCA-explainable | No | Yes | Yes | Yes |
| Synthetic data for prototyping | No | No | No | Yes (TripSimulator) |
Validated performance
On a synthetic fleet of 5,000 drivers × 30 trips with a known 3-state DGP:
| Approach | Gini improvement | Feature computation |
|---|---|---|
| Raw summary features (mean speed, harsh events) | baseline | < 1s |
| Threshold-based scoring | +1–3pp | < 1s |
| HMM state fractions (this library) | +5–10pp | 30–90s |
state_2_fraction achieves Spearman rho ≥ 0.70 with the true aggressive fraction from the DGP. Correct identification of top-quartile high-risk drivers: > 50% (vs 25% at random). The HMM advantage is proportional to how regime-structured the true DGP is — on portfolios with continuously varying style, expect closer to 3pp.
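For readers who want to reproduce this kind of check on their own data, here is a minimal pure-Python sketch of the two metrics quoted above: Spearman's rho and a top-quartile hit rate. The exact definitions used in the validation notebook are not shown in this README, so the hit-rate definition and the toy numbers below are assumptions.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def top_quartile_hit_rate(estimated, truth):
    """Share of the true top-quartile drivers that the estimate also ranks top-quartile."""
    n_top = max(1, len(truth) // 4)
    top_true = set(sorted(range(len(truth)), key=lambda i: truth[i])[-n_top:])
    top_est = set(sorted(range(len(estimated)), key=lambda i: estimated[i])[-n_top:])
    return len(top_true & top_est) / n_top


# Made-up fractions for eight drivers, for illustration only.
true_fraction = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70]
estimated_fraction = [0.08, 0.06, 0.18, 0.17, 0.35, 0.33, 0.75, 0.60]
rho = spearman_rho(estimated_fraction, true_fraction)
hit_rate = top_quartile_hit_rate(estimated_fraction, true_fraction)
```

On a synthetic fleet, `true_fraction` would come from the known DGP and `estimated_fraction` from the fitted model's `state_2_fraction`.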
Fit time: 30–90 seconds for 5,000 drivers × 30 trips on Databricks serverless. For fleets above 50,000 drivers, batch by cohort or use Spark UDFs.
Full validation notebook: notebooks/databricks_validation.py.
Limitations
- Below 10 trips per driver, state estimation variance is high. Use credibility-weighted summary features below this threshold.
- HMM state labels are not portable across separately fitted models. Do not compare raw state fractions between models fitted on different fleets or time periods.
- urban_fraction is a time fraction, not a distance fraction. Document this before using it in ceded pricing, where some reinsurers define urban exposure on a distance basis.
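On the label-portability point, one common mitigation is to relabel states after fitting by a fixed criterion such as mean speed, so that the highest index always refers to the fastest regime within a given model. The sketch below shows that relabelling; whether the library already does this internally is not stated here, and the helper names are hypothetical. Relabelling only fixes the ordering: fractions from separately fitted models are still not directly comparable, because each model's states have different emission parameters.

```python
def canonical_state_order(state_mean_speeds):
    """Map each fitted state index to its rank by mean speed (0 = slowest)."""
    ordered = sorted(range(len(state_mean_speeds)), key=lambda s: state_mean_speeds[s])
    return {old: new for new, old in enumerate(ordered)}


def relabel(states, mapping):
    """Apply the mapping to a decoded state sequence."""
    return [mapping[s] for s in states]


# A hypothetical fit in which the fastest regime happened to be labelled 0.
fitted_means_kmh = [88.0, 31.0, 54.0]
mapping = canonical_state_order(fitted_means_kmh)
relabelled = relabel([0, 0, 1, 2, 1], mapping)
```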
Part of the Burning Cost stack
Takes raw trip sensor data (GPS, accelerometer). Feeds HMM-scored, credibility-weighted driver-level features into insurance-gam and insurance-causal.
| Library | Role |
|---|---|
| insurance-gam | Smooth non-linear telematics score effects without discretising into bands |
| insurance-causal | DML — separates causal driving style effects from correlated demographics |
| insurance-fairness | FCA proxy discrimination auditing — telematics scores can proxy for protected characteristics |
| insurance-monitoring | Drift detection — monitors whether telematics-derived GLM factors remain calibrated |
| insurance-governance | Model validation and MRM governance — sign-off pack for telematics models in production |
References
Hidden Markov Model foundations
- Baum, L.E. & Petrie, T. (1966). "Statistical Inference for Probabilistic Functions of Finite State Markov Chains." The Annals of Mathematical Statistics, 37(6), 1554–1563. doi:10.1214/aoms/1177699147 (Original HMM formulation.)
- Baum, L.E., Petrie, T., Soules, G. & Weiss, N. (1970). "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains." The Annals of Mathematical Statistics, 41(1), 164–171. doi:10.1214/aoms/1177697196 (Baum-Welch EM algorithm for HMM parameter estimation.)
- Rabiner, L.R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, 77(2), 257–286. doi:10.1109/5.18626 (Definitive reference for the forward-backward algorithm and Viterbi decoding.)
- Viterbi, A.J. (1967). "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm." IEEE Transactions on Information Theory, 13(2), 260–269. doi:10.1109/TIT.1967.1054010 (Viterbi algorithm for most-probable state sequence decoding.)
Telematics insurance literature
- Jiang, Q. & Shi, Y. (2024). "Auto Insurance Pricing Using Telematics Data: Application of a Hidden Markov Model." North American Actuarial Journal, 28(4), 822–839. doi:10.1080/10920277.2023.2256657
- Wüthrich, M.V. (2017). "Covariate Selection from Telematics Car Driving Data." European Actuarial Journal, 7, 89–108. doi:10.1007/s13385-017-0149-z
- Gao, G., Wang, H. & Wüthrich, M.V. (2021). "Boosting Poisson Regression Models with Telematics Car Driving Data." Machine Learning, 111, 1787–1827. doi:10.1007/s10994-021-05957-0
- Henckaerts, R. & Antonio, K. (2022). "The Added Value of Dynamically Updating Motor Insurance Prices with Telematics Data." Insurance: Mathematics and Economics, 103, 79–95. doi:10.1016/j.insmatheco.2021.12.003
Community
- Questions? Start a Discussion
- Found a bug? Open an Issue
- Blog and tutorials: burning-cost.github.io
- Training course: Insurance Pricing in Python — Module 7 covers telematics. £97 one-time.
Licence
MIT