
Linear Drift Detector

A lightweight, explainable concept drift detection library based on linear coefficient analysis using OLS (Ordinary Least Squares).
This package is designed for both regression and classification models to detect when the underlying data relationship between features and targets has changed over time — i.e., when concept drift occurs.


📖 Table of Contents

  1. Introduction
  2. Core Idea
  3. Mathematical Foundation
  4. Algorithm Overview
  5. Installation
  6. Quick Start Example (Regression)
  7. Example (Classification)
  8. Output Details
  9. Interpretation
  10. When to Use
  11. Limitations
  12. License

Introduction

In many deployed ML systems, the relationship between inputs (X) and target (y) evolves over time.
This evolution — often subtle — causes concept drift, where a model trained on historical data no longer reflects the true structure of incoming (production) data.

Instead of retraining blindly, it’s essential to detect when this drift occurs.
That’s where the linear-drift-detector helps: it quantifies the change in feature relationships using OLS regression coefficients.


Core Idea

Even if your production model is nonlinear (e.g., Random Forest or XGBoost), we can still approximate the structural relationship between features and targets with a simple linear fit.

We fit an OLS model on:

  • The training dataset (X_train, y_train)
  • The production dataset (X_prod, y_prod_prediction — actual labels or predicted outputs)

Then, we compare the learned coefficients.

If the coefficients shift significantly between the two datasets, it indicates potential concept drift in the data-generating process.
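This idea can be sketched in a few lines of numpy, independently of the package (`ols_coefficients` is an illustrative helper, not part of the library's API):

```python
import numpy as np

def ols_coefficients(X, y):
    """Fit y ~ X @ beta (with an intercept) by least squares and return beta."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 2))
y_train = 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + 0.1 * rng.standard_normal(200)

X_prod = rng.standard_normal((200, 2))
# Relationship has shifted: the first feature's coefficient grew from 2.0 to 3.0
y_prod = 3.0 * X_prod[:, 0] - 1.0 * X_prod[:, 1] + 0.1 * rng.standard_normal(200)

delta = ols_coefficients(X_prod, y_prod) - ols_coefficients(X_train, y_train)
print(np.round(delta, 2))  # the first feature's coefficient shift dominates
```

A large entry in `delta` localizes the drift to a specific feature, which is what makes the approach explainable.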


Mathematical Foundation

Let the relationship between target y and features X be modeled as:

$$ y = X\beta + \epsilon $$

Where:

  • $X$: Feature matrix
  • $\beta$: Coefficient vector
  • $\epsilon$: Random noise term

We fit two models:

$$ \hat{\beta}_{train} = (X_{train}^T X_{train})^{-1} X_{train}^T y_{train} $$

$$ \hat{\beta}_{prod} = (X_{prod}^T X_{prod})^{-1} X_{prod}^T y_{prod} $$

Then compute:

$$ \Delta \beta = \hat{\beta}_{prod} - \hat{\beta}_{train} $$

To statistically test if the difference is significant:

$$ Z_i = \frac{\hat{\beta}_{prod,i} - \hat{\beta}_{train,i}}{\sqrt{SE_{train,i}^2 + SE_{prod,i}^2}} $$

Where $SE$ is the standard error of each coefficient.

The two-tailed p-value is computed as:

$$ p_i = 2(1 - \Phi(|Z_i|)) $$
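The z-test and p-value formulas translate directly into code. A sketch with made-up coefficient estimates and standard errors (Φ is evaluated via the stdlib `math.erf`, so no extra dependency is needed):

```python
import numpy as np
from math import erf, sqrt

# Illustrative coefficient estimates and standard errors from two OLS fits
beta_train = np.array([2.98, -2.49])
beta_prod = np.array([3.99, -1.50])
se_train = np.array([0.14, 0.14])
se_prod = np.array([0.14, 0.14])

z = (beta_prod - beta_train) / np.sqrt(se_train**2 + se_prod**2)
# Standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
phi = np.vectorize(lambda x: 0.5 * (1 + erf(x / sqrt(2))))
p = 2 * (1 - phi(np.abs(z)))  # two-tailed p-values
print(np.round(z, 2))
```

Here both coefficients shift by about one unit against a combined standard error of ~0.2, giving z-values above 5 and p-values that are effectively zero.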


Algorithm Overview

  1. Input:
    • X_train, y_train: historical (training) data
    • X_prod, y_prod_prediction: production data and outputs (actual or predicted)
  2. Fit two OLS models:
    • model_train = OLS(y_train, X_train)
    • model_prod = OLS(y_prod_prediction, X_prod)
  3. Extract coefficients and standard errors
  4. Compute difference metrics:
    • Δβ (coefficient shift)
    • L2 norm distance
    • Z-test and p-values for statistical significance
  5. Return diagnostic report
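Steps 2 through 5 amount to two least-squares fits plus the z-test. A self-contained numpy sketch of that computation (`fit_ols` and `coefficient_shift` are illustrative helpers approximating what the package does, not its actual API):

```python
import numpy as np
from math import erf, sqrt

def fit_ols(X, y):
    """Fit y = X @ beta + eps (intercept included); return coefficients and standard errors."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])           # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X1.T @ X1)))  # coefficient standard errors
    return beta, se

def coefficient_shift(X_train, y_train, X_prod, y_prod):
    """Steps 2-5: fit both models, compare coefficients, run the z-test."""
    b_tr, se_tr = fit_ols(X_train, y_train)
    b_pr, se_pr = fit_ols(X_prod, y_prod)
    diff = b_pr - b_tr
    z = diff / np.sqrt(se_tr**2 + se_pr**2)
    phi = np.vectorize(lambda t: 0.5 * (1 + erf(t / sqrt(2))))  # standard normal CDF
    return {"coef_diff": diff, "l2_distance": float(np.linalg.norm(diff)),
            "z": z, "p": 2 * (1 - phi(np.abs(z)))}

# Synthetic check: the first feature's true coefficient jumps from 1.0 to 2.0
rng = np.random.default_rng(7)
X_a = rng.standard_normal((300, 2))
y_a = 1.0 * X_a[:, 0] - 0.5 * X_a[:, 1] + 0.2 * rng.standard_normal(300)
X_b = rng.standard_normal((300, 2))
y_b = 2.0 * X_b[:, 0] - 0.5 * X_b[:, 1] + 0.2 * rng.standard_normal(300)
report = coefficient_shift(X_a, y_a, X_b, y_b)
```

The drifted coefficient dominates both `coef_diff` and `l2_distance`, while the stable coefficient's p-value stays large.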

Installation

pip install linear-drift-detector

Quick Start Example (Regression)

import numpy as np
from linear_drift_detector import linear_coefficient_shift

# Generate training data
np.random.seed(42)
X_train = np.random.randn(200, 3)
y_train = 3*X_train[:,0] - 2.5*X_train[:,1] + 4*X_train[:,2] + np.random.randn(200)*0.5

# Generate production data (shifted relationships)
X_prod = np.random.randn(200, 3)
y_prod_pred = 4*X_prod[:,0] - 1.5*X_prod[:,1] + 5*X_prod[:,2] + np.random.randn(200)*0.5

# Run drift detection
result = linear_coefficient_shift(X_train, y_train, X_prod, y_prod_pred)

# Print diagnostic outputs
print(result["z_test"])
print("L2 Distance:", result["l2_distance"])

Example Output

       coef_train  coef_prod     diff  z_value  p_value
const     0.01234    0.02345  0.01111     0.28    0.776
x1        2.98456    3.99123  1.00667     5.12    0.000
x2       -2.48721   -1.49834  0.98887     4.97    0.000
x3        4.01245    4.98678  0.97433     4.54    0.000

L2 Distance: 1.71

Interpretation: Significant p-values (< 0.05) and large L2 distance indicate a strong concept drift.


Example (Classification)

Even for classification tasks, OLS can be used as a proxy detector for internal data structure shifts.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from linear_drift_detector import linear_coefficient_shift

# Training data
X_train, y_train = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# Production data with changed separation
X_prod, y_prod = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, class_sep=1.5, random_state=99
)

# Simulate model predictions
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_prod_pred = clf.predict_proba(X_prod)[:, 1]

# Detect drift
result = linear_coefficient_shift(X_train, y_train, X_prod, y_prod_pred)
print(result["z_test"])
print("L2 Distance:", result["l2_distance"])

Here, the production dataset has a different internal structure, and the drift detector surfaces this as coefficient divergence. The output format is the same as in the regression case.


Output Details

The function returns a dictionary:

Key          Description
coef_train   Coefficients from the training OLS fit
coef_prod    Coefficients from the production OLS fit
coef_diff    Difference vector (production - training)
l2_distance  L2 norm of the coefficient drift
z_test       DataFrame with z-values and p-values for each coefficient

Interpretation

  • High L2 distance: overall structural shift in the data
  • Low p-values (< 0.05): statistically significant coefficient drift
  • Large Δβ: feature relationship changed
  • Stable coefficients: no significant drift
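These rules can be combined into a simple drift alarm on top of the detector's output. A sketch (`flag_drift` and its thresholds are illustrative, not part of the package):

```python
def flag_drift(p_values, l2_distance, alpha=0.05, l2_threshold=1.0):
    """Flag drift when any coefficient shifts significantly or the overall L2 shift is large."""
    return any(p < alpha for p in p_values) or l2_distance > l2_threshold

# Using the example output earlier in this README: x1-x3 all have p ~ 0, L2 distance 1.71
print(flag_drift([0.776, 0.000, 0.000, 0.000], 1.71))  # True
print(flag_drift([0.776, 0.412, 0.230, 0.551], 0.20))  # False
```

In practice, tune `alpha` and `l2_threshold` to your tolerance for false alarms versus missed drift.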


When to Use

  • Monitor deployed regression or classification models
  • Detect concept drift when retraining is expensive
  • Quantify how much the feature-target relationship has changed
  • Build interpretability into drift detection pipelines


Limitations

  • OLS assumes a linear relationship, which may not match nonlinear models
  • Requires the same feature dimensionality (X_train.shape[1] == X_prod.shape[1])
  • Sensitive to feature scaling (consider standardizing features)
  • Works best as a proxy detector, not as a perfect substitute for full statistical drift tests
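The scaling sensitivity noted above can be mitigated by standardizing both datasets with statistics from the training set, so that coefficients remain on comparable scales. A sketch (`standardize_pair` is an illustrative helper, not part of the package):

```python
import numpy as np

def standardize_pair(X_train, X_prod):
    """Scale BOTH datasets with the TRAINING mean/std so coefficients stay comparable."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_prod - mu) / sigma

rng = np.random.default_rng(3)
X_tr = rng.standard_normal((100, 2)) * np.array([10.0, 0.1])  # wildly different feature scales
X_pr = rng.standard_normal((100, 2)) * np.array([10.0, 0.1])
Z_tr, Z_pr = standardize_pair(X_tr, X_pr)
print(np.round(Z_tr.std(axis=0), 2))  # → [1. 1.]
```

Reusing the training statistics for production data matters: refitting the scaler on production data would silently absorb part of the covariate shift you are trying to detect.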


License

MIT License © 2025. Developed for the open-source data science community.
