Telematics insurance pricing: HMM driving state classification and GLM risk scoring from raw trip data for usage-based insurance (UBI) and pay-how-you-drive (PHYD) products.

insurance-telematics

Turn raw GPS and accelerometer trip data into GLM-ready driver risk features using Hidden Markov Models — auditable, credibility-weighted, and explainable to the FCA.

Why this?

Raw telematics features — mean speed, harsh braking counts — treat a single motorway run as equivalent to a persistent driving style. HMM state classification separates trip-level noise from genuine behavioural regimes (cautious, normal, aggressive), and the fraction of time in the aggressive state is more predictive of claim frequency than raw averages alone (Jiang & Shi, 2024, NAAJ). Unlike vendor scores, every feature is auditable: you can show a regulator exactly which behaviours drive the output.

Blog post: HMM-Based Telematics Risk Scoring for Insurance Pricing

Quickstart

uv add insurance-telematics

from insurance_telematics import TripSimulator, TelematicsScoringPipeline

sim = TripSimulator(seed=42)
trips_df, claims_df = sim.simulate(n_drivers=100, trips_per_driver=50)

pipe = TelematicsScoringPipeline(n_hmm_states=3)
pipe.fit(trips_df, claims_df)
predictions = pipe.predict(trips_df)

No raw data yet? TripSimulator generates a realistic synthetic fleet — three driving regimes, Ornstein-Uhlenbeck speed processes, synthetic Poisson claims — so you can prototype before your data arrives.
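
To make the simulator description concrete, here is a minimal sketch of an Ornstein-Uhlenbeck speed process at 1 Hz, using an Euler-Maruyama step. It is not the TripSimulator implementation: the function name and the regime parameters (`mean_kmh`, `theta`, `sigma`) are assumptions chosen for illustration.

```python
import math
import random

def simulate_ou_speed(n_seconds, mean_kmh, theta, sigma, seed=0, dt=1.0):
    """Euler-Maruyama discretisation of dX = theta * (mean - X) dt + sigma dW."""
    rng = random.Random(seed)
    speed = mean_kmh
    path = []
    for _ in range(n_seconds):
        noise = rng.gauss(0.0, 1.0)
        # Pull the speed back toward the regime mean, then add a random shock.
        speed += theta * (mean_kmh - speed) * dt + sigma * math.sqrt(dt) * noise
        speed = max(speed, 0.0)  # a vehicle speed cannot be negative
        path.append(speed)
    return path

# Hypothetical regime parameters, not taken from the library.
REGIMES = {
    "cautious": dict(mean_kmh=35.0, theta=0.05, sigma=1.0),
    "aggressive": dict(mean_kmh=90.0, theta=0.05, sigma=4.0),
}

cautious = simulate_ou_speed(600, seed=1, **REGIMES["cautious"])
aggressive = simulate_ou_speed(600, seed=1, **REGIMES["aggressive"])
```

A larger `sigma` gives the higher speed variance associated with the aggressive regime, while `theta` controls how quickly the speed reverts to the regime mean.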

Use cases

1. Trip scoring for a new-to-telematics portfolio

Score each trip and aggregate to driver level with Bühlmann-Straub credibility weighting. Drivers with fewer than 10 trips fall back to portfolio means automatically.

from insurance_telematics import load_trips, clean_trips, extract_trip_features
from insurance_telematics import aggregate_to_driver

trips = load_trips("trips.parquet")
features = extract_trip_features(clean_trips(trips))
driver_risk = aggregate_to_driver(features, credibility_threshold=30)
# driver_risk: one row per driver_id, GLM-ready
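
For readers new to credibility weighting, the sketch below shows the standard Bühlmann-Straub blend with trip count as the exposure weight: Z = n / (n + k), where k is the ratio of within-driver to between-driver variance. The function name and the value of `k` are assumptions for illustration; the rule `aggregate_to_driver` actually applies may differ.

```python
def credibility_blend(driver_mean, n_trips, portfolio_mean, k):
    """Shrink a driver's own mean toward the portfolio mean.

    z is the credibility factor: close to 0 for sparse drivers,
    approaching 1 as the number of trips grows.
    """
    z = n_trips / (n_trips + k)
    return z * driver_mean + (1.0 - z) * portfolio_mean

# Hypothetical numbers: harsh braking rate per km, with k assumed to be 20 trips.
sparse_driver = credibility_blend(0.30, n_trips=5, portfolio_mean=0.10, k=20.0)
mature_driver = credibility_blend(0.30, n_trips=200, portfolio_mean=0.10, k=20.0)
```

With 5 trips the driver's own experience gets a weight of 0.2, so the blended rate stays close to the portfolio mean; with 200 trips the weight is about 0.91.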

2. HMM state classification — extracting driving regime features

Classify each trip into latent driving states and derive the regime fractions that feed your Poisson GLM.

from insurance_telematics import DrivingStateHMM

hmm = DrivingStateHMM(n_states=3)
hmm.fit(features)
states = hmm.predict_states(features)
hmm_features = hmm.driver_state_features(features, states)
# hmm_features includes state_0_fraction, state_1_fraction, state_2_fraction per driver

With three states the HMM typically recovers: state 0 = cautious (low speed, urban), state 1 = normal (mixed), state 2 = aggressive (high speed variance, high harsh event rate). The state_2_fraction is the primary GLM covariate.
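
The sketch below illustrates how state fractions can be derived from a decoded state sequence. It uses a hand-written Viterbi decoder over one feature per trip with Gaussian emissions. It is not the DrivingStateHMM implementation: the emission means, standard deviations, and transition matrix are assumed values, whereas a fitted model would estimate them from data.

```python
import math

def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(obs, log_start, log_trans, means, stds):
    """Most probable state sequence for a 1-D Gaussian-emission HMM."""
    n_states = len(means)
    score = [log_start[s] + gaussian_logpdf(obs[0], means[s], stds[s])
             for s in range(n_states)]
    backptrs = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at this step.
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + gaussian_logpdf(x, means[s], stds[s]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def state_fractions(path, n_states):
    return [path.count(s) / len(path) for s in range(n_states)]

# Hypothetical harsh-event rates for one driver's consecutive trips.
obs = [0.1, 0.2, 0.1, 2.0, 2.2, 1.9, 0.2, 0.1]
means, stds = [0.2, 1.0, 2.0], [0.2, 0.3, 0.3]       # assumed emission parameters
log_start = [math.log(1 / 3)] * 3
log_trans = [[math.log(0.8 if i == j else 0.1) for j in range(3)] for i in range(3)]

path = viterbi(obs, log_start, log_trans, means, stds)
fractions = state_fractions(path, 3)  # fractions[2] plays the role of state_2_fraction
```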

3. Variable trip length — continuous-time HMM

For portfolios where observation intervals are irregular (for example, trips of varying duration or logged at variable sampling rates), use ContinuousTimeHMM to avoid biasing state estimates toward shorter trips.

from insurance_telematics import ContinuousTimeHMM
import numpy as np

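# Elapsed time per observation, taken here from the trip duration in minutes
time_deltas = np.array(features["trip_duration_min"])
cthmm = ContinuousTimeHMM(n_states=3)
# Pass the time gaps so that transitions can account for elapsed time
cthmm.fit(features, time_deltas=time_deltas)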
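
The idea behind a continuous-time HMM is that the transition matrix depends on the elapsed time between observations: P(dt) = exp(Q dt) for a rate matrix Q. The sketch below uses the closed form for a two-state chain to keep it dependency-free. It is an illustration only, not the ContinuousTimeHMM internals, and the rates are assumed values. For three or more states a general matrix exponential such as `scipy.linalg.expm` would be needed.

```python
import math

def transition_matrix(rate_01, rate_10, dt):
    """Closed-form exp(Q * dt) for Q = [[-rate_01, rate_01], [rate_10, -rate_10]]."""
    total = rate_01 + rate_10
    decay = math.exp(-total * dt)
    p01 = rate_01 / total * (1.0 - decay)  # probability of moving 0 -> 1 within dt
    p10 = rate_10 / total * (1.0 - decay)  # probability of moving 1 -> 0 within dt
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

# Hypothetical switching rates per minute.
short_gap = transition_matrix(0.02, 0.05, dt=5.0)
long_gap = transition_matrix(0.02, 0.05, dt=60.0)
```

A longer gap gives a higher switching probability, which is what stops a discrete-time model from treating a 5-minute trip and a 60-minute trip as equivalent steps.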

Full pipeline

Raw 1Hz trip data (CSV or Parquet)
  → load_trips()            — load and schema-map
  → clean_trips()           — GPS jump removal, acceleration derivation, road type
  → extract_trip_features() — harsh braking rate, speeding fraction, night fraction
  → DrivingStateHMM         — classify each trip into latent driving states
  → aggregate_to_driver()   — Bühlmann-Straub credibility weighting to driver level
  → TelematicsScoringPipeline — Poisson GLM producing predicted claim frequency
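
The final step of the pipeline is a Poisson GLM on driver-level features. As a rough sketch of what that fit involves, the code below estimates a log-link Poisson model with an intercept and a single covariate (standing in for state_2_fraction) by Newton-Raphson. The function name and data are assumptions; the claims are set to their expected values so that the fit can be checked against known coefficients. TelematicsScoringPipeline itself may use a different solver and more covariates.

```python
import math

def fit_poisson_glm(x, y, exposure, n_iter=25):
    """Newton-Raphson for y ~ Poisson(exposure * exp(b0 + b1 * x))."""
    b0 = math.log(sum(y) / sum(exposure))  # start from the portfolio frequency
    b1 = 0.0
    for _ in range(n_iter):
        mu = [e * math.exp(b0 + b1 * xi) for xi, e in zip(x, exposure)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system for the Newton step.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Noise-free synthetic check with hypothetical coefficients.
true_b0, true_b1 = math.log(0.05), 2.0
x = [0.0, 0.1, 0.2, 0.4, 0.6]      # aggressive-state fraction per driver
exposure = [1.0] * len(x)          # policy years
y = [e * math.exp(true_b0 + true_b1 * xi) for xi, e in zip(x, exposure)]

b0_hat, b1_hat = fit_poisson_glm(x, y, exposure)
predicted_frequency = [math.exp(b0_hat + b1_hat * xi) for xi in x]
```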

Input data format

One row per second (1Hz):

| Column | Type | Notes |
| --- | --- | --- |
| trip_id | string | Unique per trip |
| timestamp | datetime | ISO 8601 or Unix epoch |
| latitude | float | Decimal degrees |
| longitude | float | Decimal degrees |
| speed_kmh | float | GPS speed |
| acceleration_ms2 | float | Optional; derived from speed if absent |
| heading_deg | float | Optional; used for cornering estimation |
| driver_id | string | Optional; "unknown" if absent |

Non-standard column names? Use schema:

trips = load_trips("raw_data.csv", schema={"gps_speed": "speed_kmh"})

Features extracted per trip

  • harsh_braking_rate — events/km where deceleration < −3.5 m/s²
  • harsh_accel_rate — events/km where acceleration > +3.5 m/s²
  • harsh_cornering_rate — events/km (estimated from heading-change rate)
  • speeding_fraction — fraction of time exceeding road-type speed limit
  • night_driving_fraction — fraction of distance driven 23:00–05:00
  • urban_fraction — fraction of time at speed < 50 km/h
  • mean_speed_kmh, p95_speed_kmh, speed_variation_coeff
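
As an illustration of how one of these features can be computed from a 1 Hz speed trace, the sketch below derives acceleration from successive speed samples and counts harsh braking events per km. The function name is hypothetical, and counting a contiguous run of harsh samples as a single event is an assumption; `extract_trip_features` may define events differently.

```python
def harsh_braking_rate(speed_kmh, threshold_ms2=-3.5):
    """Harsh braking events per km from a 1 Hz speed trace."""
    speed_ms = [v / 3.6 for v in speed_kmh]
    # With 1 s spacing, the speed difference in m/s equals acceleration in m/s^2.
    accel = [b - a for a, b in zip(speed_ms, speed_ms[1:])]
    events = 0
    in_event = False
    for a in accel:
        harsh = a < threshold_ms2
        if harsh and not in_event:
            events += 1  # count the start of each contiguous harsh run once
        in_event = harsh
    distance_km = sum(speed_ms) / 1000.0  # metres travelled per second, summed
    return events / distance_km if distance_km > 0 else 0.0

# Hypothetical trip: 60 km/h, a sharp two-second slowdown to 30 km/h, then steady.
trace = [60.0] * 60 + [45.0, 30.0] + [30.0] * 58
rate = harsh_braking_rate(trace)
```

Each 15 km/h drop over one second is about 4.17 m/s² of deceleration, so the two consecutive drops form one event over roughly 1.5 km.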

Compared to alternatives

| | Vendor black-box | Raw feature averages | Manual threshold scoring | insurance-telematics |
| --- | --- | --- | --- | --- |
| Auditable methodology | No | Yes | Yes | Yes |
| Captures driving regimes | Possibly | No | Partial | Yes (HMM) |
| Handles sparse new drivers | Varies | No | No | Yes (credibility weighting) |
| GLM-ready output | Varies | Manual | Manual | Yes (Polars DataFrame) |
| FCA-explainable | No | Yes | Yes | Yes |
| Synthetic data for prototyping | No | No | No | Yes (TripSimulator) |

Validated performance

On a synthetic fleet of 5,000 drivers × 30 trips with a known 3-state DGP:

| Approach | Gini improvement | Feature computation |
| --- | --- | --- |
| Raw summary features (mean speed, harsh events) | baseline | < 1s |
| Threshold-based scoring | +1–3pp | < 1s |
| HMM state fractions (this library) | +5–10pp | 30–90s |

state_2_fraction achieves Spearman rho ≥ 0.70 with the true aggressive fraction from the DGP, and more than 50% of top-quartile high-risk drivers are correctly identified (versus 25% at random). The HMM advantage is proportional to how regime-structured the true DGP is: on portfolios with continuously varying style, expect an improvement closer to 3pp.
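
The two validation metrics quoted above can be computed as in the sketch below. It is a plain-Python illustration, not the code in the validation notebook; the function names and the example scores are assumptions.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman_rho(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def top_quartile_hit_rate(score, truth):
    """Share of the true top-quartile drivers that the score also ranks in its top quartile."""
    n_top = max(1, len(score) // 4)
    def top(values):
        return set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n_top])
    return len(top(score) & top(truth)) / n_top

# Hypothetical estimated and true aggressive fractions for eight drivers.
estimated = [0.1, 0.9, 0.3, 0.8, 0.2, 0.4, 0.5, 0.6]
true_frac = [0.2, 0.8, 0.1, 0.3, 0.9, 0.4, 0.5, 0.6]
rho = spearman_rho(estimated, true_frac)
hit_rate = top_quartile_hit_rate(estimated, true_frac)
```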

Fit time: 30–90 seconds for 5,000 drivers × 30 trips on Databricks serverless. For fleets above 50,000 drivers, batch by cohort or use Spark UDFs.

Full validation notebook: notebooks/databricks_validation.py.

Limitations

  • Below 10 trips per driver, state estimation variance is high. Use credibility-weighted summary features below this threshold.
  • HMM state labels are not portable across separately fitted models. Do not compare raw state fractions between models fitted on different fleets or time periods.
  • urban_fraction is a time-fraction, not a distance-fraction. Document this before using it in ceded pricing where some reinsurers define urban exposure on a distance basis.
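
To show why the time-versus-distance distinction in the last point matters, the sketch below computes both versions of the urban fraction from the same 1 Hz speed trace. The function name is hypothetical; the 50 km/h cut-off is the one stated in the feature list.

```python
def urban_fractions(speed_kmh, urban_limit_kmh=50.0):
    """Return (time_fraction, distance_fraction) below the urban speed cut-off."""
    urban = [v for v in speed_kmh if v < urban_limit_kmh]
    time_fraction = len(urban) / len(speed_kmh)
    # At 1 Hz, distance covered in each second is proportional to speed.
    total_distance = sum(speed_kmh)
    distance_fraction = sum(urban) / total_distance if total_distance > 0 else 0.0
    return time_fraction, distance_fraction

# Hypothetical trip: 10 minutes in town at 30 km/h, 10 minutes on a motorway at 100 km/h.
time_frac, dist_frac = urban_fractions([30.0] * 600 + [100.0] * 600)
```

Here half of the time is urban but only about 23% of the distance is, so the two definitions can give materially different exposure splits.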

Part of the Burning Cost stack

Takes raw trip sensor data (GPS, accelerometer). Feeds HMM-scored, credibility-weighted driver-level features into insurance-gam and insurance-causal.

| Library | Role |
| --- | --- |
| insurance-gam | Smooth non-linear telematics score effects without discretising into bands |
| insurance-causal | DML: separates causal driving style effects from correlated demographics |
| insurance-fairness | FCA proxy discrimination auditing: telematics scores can proxy for protected characteristics |
| insurance-monitoring | Drift detection: monitors whether telematics-derived GLM factors remain calibrated |
| insurance-governance | Model validation and MRM governance: sign-off pack for telematics models in production |

References

Hidden Markov Model foundations

  • Baum, L.E. & Petrie, T. (1966). "Statistical Inference for Probabilistic Functions of Finite State Markov Chains." The Annals of Mathematical Statistics, 37(6), 1554–1563. doi:10.1214/aoms/1177699147 (Original HMM formulation.)
  • Baum, L.E., Petrie, T., Soules, G. & Weiss, N. (1970). "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains." The Annals of Mathematical Statistics, 41(1), 164–171. doi:10.1214/aoms/1177697196 (Baum-Welch EM algorithm for HMM parameter estimation.)
  • Rabiner, L.R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, 77(2), 257–286. doi:10.1109/5.18626 (Definitive reference for the forward-backward algorithm and Viterbi decoding.)
  • Viterbi, A.J. (1967). "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm." IEEE Transactions on Information Theory, 13(2), 260–269. doi:10.1109/TIT.1967.1054010 (Viterbi algorithm for most-probable state sequence decoding.)

Telematics insurance literature

  • Jiang, Q. & Shi, Y. (2024). "Auto Insurance Pricing Using Telematics Data: Application of a Hidden Markov Model." North American Actuarial Journal, 28(4), 822–839. doi:10.1080/10920277.2023.2256657
  • Wüthrich, M.V. (2017). "Covariate Selection from Telematics Car Driving Data." European Actuarial Journal, 7, 89–108. doi:10.1007/s13385-017-0149-z
  • Gao, G., Wang, H. & Wüthrich, M.V. (2021). "Boosting Poisson Regression Models with Telematics Car Driving Data." Machine Learning, 111, 1787–1827. doi:10.1007/s10994-021-05957-0
  • Henckaerts, R. & Antonio, K. (2022). "The Added Value of Dynamically Updating Motor Insurance Prices with Telematics Data." Insurance: Mathematics and Economics, 103, 79–95. doi:10.1016/j.insmatheco.2021.12.003

Licence

MIT
