Linear Drift Detector
A lightweight, explainable concept drift detection library based on linear coefficient analysis using OLS (Ordinary Least Squares).
This package is designed for both regression and classification models to detect when the underlying data relationship between features and targets has changed over time — i.e., when concept drift occurs.
📖 Table of Contents
- Introduction
- Core Idea
- Mathematical Foundation
- Algorithm Overview
- Installation
- Quick Start Example (Regression)
- Example (Classification)
- Output Details
- Interpretation
- When to Use
- Limitations
- License
Introduction
In many deployed ML systems, the relationship between inputs (X) and target (y) evolves over time.
This evolution — often subtle — causes concept drift, where a model trained on historical data no longer reflects the true structure of incoming (production) data.
Instead of retraining blindly, it’s essential to detect when this drift occurs.
That’s where the linear-drift-detector helps: it quantifies the change in feature relationships using OLS regression coefficients.
Core Idea
Even if your production model is nonlinear (like Random Forest or XGBoost),
we can still proxy the structural relationship between features and targets using a simple linear fit.
We fit an OLS model on:
- The training dataset (`X_train`, `y_train`)
- The production dataset (`X_prod`, `y_prod_prediction` — actual labels or predicted outputs)
Then, we compare the learned coefficients.
If the coefficients shift significantly between the two datasets,
it indicates a potential concept drift in the data-generating process.
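This idea can be sketched in plain numpy (a simplified illustration of the comparison, not the library's implementation — the `linear_proxy` helper below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_proxy(X, y):
    # OLS fit with an intercept, solved via least squares
    Xc = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

# Training window: the target is driven mostly by feature 0
X_train = rng.normal(size=(300, 2))
y_train = 3.0 * X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(scale=0.1, size=300)

# Production window: the relationship has shifted toward feature 1
X_prod = rng.normal(size=(300, 2))
y_prod = 0.5 * X_prod[:, 0] + 3.0 * X_prod[:, 1] + rng.normal(scale=0.1, size=300)

# A large coefficient difference signals that the data-generating process changed
diff = linear_proxy(X_prod, y_prod) - linear_proxy(X_train, y_train)
print(np.round(diff, 2))
```

Even if the deployed model is a Random Forest or XGBoost, this linear probe of each window still exposes the shift in feature-target structure.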
Mathematical Foundation
Let the relationship between target y and features X be modeled as:
$$ y = X\beta + \epsilon $$
Where:
- $X$: Feature matrix
- $\beta$: Coefficient vector
- $\epsilon$: Random noise term
We fit two models:
$$ \hat{\beta}_{train} = (X_{train}^T X_{train})^{-1} X_{train}^T y_{train} $$
$$ \hat{\beta}_{prod} = (X_{prod}^T X_{prod})^{-1} X_{prod}^T y_{prod} $$
Then compute:
$$ \Delta \beta = \hat{\beta}_{prod} - \hat{\beta}_{train} $$
To statistically test if the difference is significant:
$$ Z_i = \frac{\hat{\beta}_{prod,i} - \hat{\beta}_{train,i}}{\sqrt{SE_{train,i}^2 + SE_{prod,i}^2}} $$
Where $SE$ is the standard error of each coefficient.
The two-tailed p-value is computed as:
$$ p_i = 2(1 - \Phi(|Z_i|)) $$
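As a worked example of this test, using only the standard library (the normal CDF $\Phi$ can be evaluated via `math.erf`; the coefficient and standard-error values below are illustrative):

```python
from math import erf, sqrt

def two_tailed_p(z):
    # Phi(|z|) via the error function, then p = 2 * (1 - Phi(|z|))
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - phi)

# Illustrative values: beta_train,i = 3.0 and beta_prod,i = 4.0, both with SE = 0.15
z = (4.0 - 3.0) / sqrt(0.15**2 + 0.15**2)
print(z, two_tailed_p(z))  # a |z| near 4.7 gives a p-value far below 0.05
```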
Algorithm Overview
- Input:
  - `X_train`, `y_train`: historical (training) data
  - `X_prod`, `y_prod_prediction`: production data and outputs (actual or predicted)
- Fit two OLS models:
  - `model_train = OLS(y_train, X_train)`
  - `model_prod = OLS(y_prod_prediction, X_prod)`
- Extract coefficients and standard errors
- Compute difference metrics:
- Δβ (coefficient shift)
- L2 norm distance
- Z-test and p-values for statistical significance
- Return diagnostic report
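The steps above can be sketched end to end in numpy (a simplified reimplementation for illustration; the packaged `linear_coefficient_shift` is the supported API, and the key names in the returned dict below are this sketch's own):

```python
import numpy as np
from math import erf, sqrt

def ols_with_se(X, y):
    # OLS with an intercept; returns coefficients and their standard errors
    Xc = np.column_stack([np.ones(len(X)), X])
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    beta = XtX_inv @ Xc.T @ y
    resid = y - Xc @ beta
    sigma2 = resid @ resid / (Xc.shape[0] - Xc.shape[1])  # residual variance
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return beta, se

def coefficient_shift_report(X_train, y_train, X_prod, y_prod):
    b_tr, se_tr = ols_with_se(X_train, y_train)
    b_pr, se_pr = ols_with_se(X_prod, y_prod)
    diff = b_pr - b_tr
    z = diff / np.sqrt(se_tr**2 + se_pr**2)
    # Two-tailed p-values from the standard normal CDF
    p = np.array([2 * (1 - 0.5 * (1 + erf(abs(zi) / sqrt(2)))) for zi in z])
    return {"coef_diff": diff, "l2_distance": float(np.linalg.norm(diff)),
            "z_values": z, "p_values": p}
```

Here `y_prod` can be either actual labels or model predictions, matching the inputs described above.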
Installation
```bash
pip install linear-drift-detector
```
Quick Start Example (Regression)
```python
import numpy as np
from linear_drift_detector import linear_coefficient_shift

# Generate training data
np.random.seed(42)
X_train = np.random.randn(200, 3)
y_train = 3*X_train[:, 0] - 2.5*X_train[:, 1] + 4*X_train[:, 2] + np.random.randn(200)*0.5

# Generate production data (shifted relationships)
X_prod = np.random.randn(200, 3)
y_prod_pred = 4*X_prod[:, 0] - 1.5*X_prod[:, 1] + 5*X_prod[:, 2] + np.random.randn(200)*0.5

# Run drift detection
result = linear_coefficient_shift(X_train, y_train, X_prod, y_prod_pred)

# Print diagnostic outputs
print(result["z_test"])
print("L2 Distance:", result["l2_distance"])
```
Example Output
| | coef_train | coef_prod | diff | z_value | p_value |
|---|---|---|---|---|---|
| const | 0.01234 | 0.02345 | 0.01111 | 0.28 | 0.776 |
| x1 | 2.98456 | 3.99123 | 1.00667 | 5.12 | 0.000 |
| x2 | -2.48721 | -1.49834 | 0.98887 | 4.97 | 0.000 |
| x3 | 4.01245 | 4.98678 | 0.97433 | 4.54 | 0.000 |
L2 Distance: 1.71
Interpretation: significant p-values (< 0.05) together with a large L2 distance indicate strong concept drift.
Example (Classification)
Even for classification tasks, OLS can be used as a proxy detector for internal data structure shifts.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from linear_drift_detector import linear_coefficient_shift

# Training data
X_train, y_train = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# Production data with changed separation
X_prod, y_prod = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    class_sep=1.5, random_state=99
)

# Simulate model predictions
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_prod_pred = clf.predict_proba(X_prod)[:, 1]

# Detect drift
result = linear_coefficient_shift(X_train, y_train, X_prod, y_prod_pred)
print(result["z_test"])
print("L2 Distance:", result["l2_distance"])
```
Here, the production dataset has a different internal structure, and the detector surfaces this through coefficient divergence. The output format matches the regression example.
Output Details
The function returns a dictionary:
| Key | Description |
|---|---|
| `coef_train` | Coefficients from training OLS |
| `coef_prod` | Coefficients from production OLS |
| `coef_diff` | Difference vector (production - training) |
| `l2_distance` | Magnitude (L2 norm) of the coefficient drift |
| `z_test` | DataFrame with z-values and p-values for each coefficient |
Interpretation
- High L2 distance: overall structural shift in the data
- Low p-values (< 0.05): statistically significant coefficient drift
- Large Δβ: the feature-target relationship has changed
- Stable coefficients: no significant drift
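These rules can be combined into a simple alerting check. The snippet below mocks the documented `z_test` DataFrame and `l2_distance` value using the example output above; the thresholds and the `drift_alert` helper are illustrative assumptions, not library defaults:

```python
import pandas as pd

# Mocked result mirroring the documented output keys (values are illustrative)
result = {
    "l2_distance": 1.71,
    "z_test": pd.DataFrame(
        {"z_value": [0.28, 5.12, 4.97], "p_value": [0.776, 0.000, 0.000]},
        index=["const", "x1", "x2"],
    ),
}

def drift_alert(result, p_threshold=0.05, l2_threshold=1.0):
    # Alert when any coefficient shift is significant or the overall drift is large
    significant = (result["z_test"]["p_value"] < p_threshold).any()
    return bool(significant or result["l2_distance"] > l2_threshold)

print(drift_alert(result))
```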
When to Use
- Monitor deployed regression or classification models
- Detect data drift when retraining is expensive
- Quantify how much the internal data relationship has changed
- Build interpretability into data drift detection pipelines
Limitations
- OLS assumes a linear relationship, which may not match nonlinear models
- Requires the same feature dimensionality (`X_train.shape[1] == X_prod.shape[1]`)
- Sensitive to feature scaling (consider standardizing features)
- Works best as a proxy detector, not as a perfect substitute for full statistical drift tests
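For the scaling caveat, one common approach is to standardize both windows with statistics computed from the training window only, so the two OLS fits see features on comparable scales (a sketch using scikit-learn; this preprocessing is an assumption about your pipeline, not a library requirement):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_prod = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Fit the scaler on the training window only, then reuse it for production,
# so production features are expressed in training-window units.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_prod_s = scaler.transform(X_prod)
```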
License
MIT License © 2025. Developed for the open-source data science community.