DriftLens: an Unsupervised Drift Detection framework

Unsupervised Concept Drift Detection
from Deep Learning Representations on Unstructured Data in Real-time


DriftLens is an unsupervised drift detection framework for deep learning classifiers on unstructured data.

The DriftLens methodology and its evaluation are currently under review.

The preliminary idea was first proposed in the paper: Drift Lens: Real-time unsupervised Concept Drift detection by evaluating per-label embedding distributions (Greco and Cerquitelli, 2021).

DriftLens has also been implemented as a web application tool (available on GitHub).

Installation

DriftLens is available on PyPI and can be installed with pip for Python >= 3.

# Install latest stable version
pip install driftlens

# Alternatively, install latest development version
pip install git+https://github.com/grecosalvatore/drift-lens

Example of usage

from driftlens.driftlens import DriftLens

# DriftLens parameters
batch_n_pc = 150 # Number of principal components to reduce per-batch embeddings
per_label_n_pc = 75 # Number of principal components to reduce per-label embeddings
window_size = 1000 # Window size for drift detection
threshold_number_of_estimation_samples = 1000 # Number of sampled windows to estimate the threshold values

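# Inputs assumed to be prepared beforehand (they are not defined in this snippet):
#   E_train, E_test: embedding vectors extracted from the model for the baseline and threshold datasets
#   Y_predicted_train, Y_predicted_test: the corresponding labels predicted by the model
#   training_label_list: list of the labels used during training
#   E_windows, Y_predicted_windows: embeddings and predicted labels of the new data, split into windows

# Initialize DriftLens
dl = DriftLens()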

# Estimate the baseline (offline phase)
baseline = dl.estimate_baseline(E=E_train,
                                Y=Y_predicted_train,
                                label_list=training_label_list,
                                batch_n_pc=batch_n_pc,
                                per_label_n_pc=per_label_n_pc)

# Estimate the threshold values with DriftLens (offline phase)
per_batch_distances_sorted, per_label_distances_sorted = dl.random_sampling_threshold_estimation(
                                                            label_list=training_label_list,
                                                            E=E_test,
                                                            Y=Y_predicted_test,
                                                            batch_n_pc=batch_n_pc,
                                                            per_label_n_pc=per_label_n_pc,
                                                            window_size=window_size,
                                                            n_samples=threshold_number_of_estimation_samples,
                                                            flag_shuffle=True,
                                                            flag_replacement=True)

# Compute the window distribution distances (Frechet Distance) with DriftLens
dl_distance = dl.compute_window_distribution_distances(E_windows[0], Y_predicted_windows[0])

DriftLens Methodology

DriftLens is an unsupervised drift detection technique based on distribution distances within the embedding representations generated by deep learning models. The methodology consists of an offline phase and an online phase.

In the offline phase, DriftLens takes as input a historical dataset (i.e., the baseline and threshold datasets), then:

  1. Estimates the reference distributions from the baseline dataset (e.g., the training dataset). The reference distributions, called the baseline, represent the distribution of the features (i.e., the embeddings) that the model learned during training; that is, they represent the absence of drift.
  2. Estimates threshold distance values from the threshold dataset to discriminate between drift and no-drift conditions.
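The threshold step above can be illustrated with a minimal sketch in plain Python. It is not the DriftLens implementation: `estimate_threshold` and `toy_distance` are hypothetical names, the data is one-dimensional, the distance is a stand-in (absolute difference of means), and taking the largest sampled distance as the threshold is an assumption made only for this example.

```python
import random
from statistics import fmean

def estimate_threshold(threshold_data, distance_fn, window_size, n_samples, seed=0):
    """Sample windows at random and collect their distances from the baseline."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_samples):
        # Draw one window with replacement from the threshold dataset
        window = rng.choices(threshold_data, k=window_size)
        distances.append(distance_fn(window))
    distances.sort(reverse=True)  # largest distance first
    # Assumption: use the largest distance seen without drift as the threshold
    return distances[0], distances

# Toy stand-in for a distribution distance: gap between window mean and baseline mean
baseline_mean = 0.0

def toy_distance(window):
    return abs(fmean(window) - baseline_mean)

data_rng = random.Random(1)
threshold_data = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]
threshold, sorted_distances = estimate_threshold(
    threshold_data, toy_distance, window_size=200, n_samples=100
)
```

Because the threshold data is assumed to be drift-free, the sampled distances describe how far a window can be from the baseline by chance alone.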

In the online phase, the new data stream is processed in windows of fixed size. For each window, DriftLens:

  1. Estimates the distributions of the new data window.
  2. Computes the distribution distances with respect to the reference distributions.
  3. Evaluates the distances against the threshold values. If a distance exceeds its threshold, drift is predicted.
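The online steps above can be sketched as a simple loop over fixed-size windows. This is an illustration in plain Python, not the DriftLens API: `detect_drift` and `mean_shift` are hypothetical names, and the one-dimensional stream and the stand-in distance (distance of the window mean from a baseline mean of 0) are assumptions.

```python
from statistics import fmean

def detect_drift(stream, window_size, distance_fn, threshold):
    """Return (window_index, distance, drift_predicted) for each full window."""
    results = []
    for start in range(0, len(stream) - window_size + 1, window_size):
        window = stream[start:start + window_size]
        distance = distance_fn(window)
        # Drift is predicted when the distance exceeds the threshold
        results.append((start // window_size, distance, distance > threshold))
    return results

def mean_shift(window):
    # Toy distance from a baseline whose mean is assumed to be 0.0
    return abs(fmean(window) - 0.0)

stream = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
results = detect_drift(stream, window_size=4, distance_fn=mean_shift, threshold=1.0)
print(results)  # [(0, 0.0, False), (1, 5.0, True)]
```

Any trailing samples that do not fill a complete window are ignored in this sketch.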

In both phases, the distributions are estimated as multivariate normal distributions by computing the mean and the covariance over the embedding vectors.
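As a minimal sketch of this estimation step, the snippet below computes the mean vector and covariance matrix of a set of embedding vectors with NumPy. The function name `estimate_gaussian` and the random data are assumptions for illustration; the sketch does not reproduce the dimensionality reduction that DriftLens applies.

```python
import numpy as np

def estimate_gaussian(embeddings):
    """Estimate the mean and covariance of an (n_samples, dim) array."""
    embeddings = np.asarray(embeddings, dtype=float)
    mean = embeddings.mean(axis=0)
    # rowvar=False: rows are samples, columns are embedding dimensions
    cov = np.cov(embeddings, rowvar=False)
    return mean, cov

rng = np.random.default_rng(0)
window = rng.normal(size=(1000, 8))  # synthetic stand-in for a window of embeddings
mean, cov = estimate_gaussian(window)
print(mean.shape, cov.shape)  # (8,) (8, 8)
```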

DriftLens uses the Frechet Distance to measure the similarity between the reference (i.e., baseline) and the new window distributions.
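For two multivariate normal distributions, the Frechet distance has a commonly used closed form: d^2 = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). The sketch below computes this squared value with NumPy. It is an illustration, not the DriftLens code: `frechet_distance_squared` is a hypothetical name, and whether DriftLens reports the squared distance or its square root is not stated here.

```python
import numpy as np

def frechet_distance_squared(mean_a, cov_a, mean_b, cov_b):
    """Squared Frechet distance between two Gaussians N(mean_a, cov_a) and N(mean_b, cov_b)."""
    mean_a = np.asarray(mean_a, dtype=float)
    mean_b = np.asarray(mean_b, dtype=float)
    cov_a = np.asarray(cov_a, dtype=float)
    cov_b = np.asarray(cov_b, dtype=float)
    diff = mean_a - mean_b
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip to remove numerical noise.
    eigenvalues = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

identity = np.eye(3)
same = frechet_distance_squared(np.zeros(3), identity, np.zeros(3), identity)
shifted = frechet_distance_squared(np.zeros(3), identity, np.array([1.0, 2.0, 2.0]), identity)
```

With identical distributions the distance is zero, and with equal covariances it reduces to the squared Euclidean distance between the means.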

Experiments Reproducibility

Instructions and scripts for reproducing the experimental evaluation are located in the experiments folder.

References

If you use DriftLens, please cite the following papers:

  1. The DriftLens methodology and its evaluation are currently under review. The pre-print is available at:
@misc{greco2024unsupervisedconceptdriftdetection,
      title={Unsupervised Concept Drift Detection from Deep Learning Representations in Real-time}, 
      author={Salvatore Greco and Bartolomeo Vacchetti and Daniele Apiletti and Tania Cerquitelli},
      year={2024},
      eprint={2406.17813},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.17813}, 
}
  2. Preliminary idea:
@INPROCEEDINGS{driftlens,
  author={Greco, Salvatore and Cerquitelli, Tania},
  booktitle={2021 International Conference on Data Mining Workshops (ICDMW)}, 
  title={Drift Lens: Real-time unsupervised Concept Drift detection by evaluating per-label embedding distributions}, 
  year={2021},
  volume={},
  number={},
  pages={341-349},
  doi={10.1109/ICDMW53433.2021.00049}
  }
  3. Webapp tool:
@inproceedings{greco2024driftlens,
  title={DriftLens: A Concept Drift Detection Tool},
  author={Greco, Salvatore and Vacchetti, Bartolomeo and Apiletti, Daniele and Cerquitelli, Tania and others},
  booktitle={Advances in Database Technology},
  volume={27},
  pages={806--809},
  year={2024},
  organization={Open proceedings}
}

Authors

  • Salvatore Greco, Politecnico di Torino
  • Bartolomeo Vacchetti, Politecnico di Torino
  • Daniele Apiletti, Politecnico di Torino
  • Tania Cerquitelli, Politecnico di Torino
