drift

Project description

🔍 Overview

Eurybia is a Python library that helps detect drift and validate data before a model is put into production:

  • Data validation: validate that the data used for production prediction is similar to the training or test data before deploying the model to production
  • Data drift: evolution of the production data over time compared to the training or test data
  • Concept drift: evolution of model performance over time due to changes in the statistical properties of the target variable

  • Model deployment :
    • Data validation (image above -> step 6)
  • Model monitoring :
    • Detecting concept drift (image above -> step 8)
    • Detecting data drift (image above -> step 8)

Eurybia helps data analysts and data scientists collaborate through a report in which they can discuss drift monitoring and data validation before deploying a model into production. Eurybia also contributes to data science auditing by displaying useful information about any model and its data in a single report.

  • Readthedocs: documentation badge
  • Medium:

🔥 Features

  • Display a clear, understandable and insightful report

  • Let data scientists quickly explore drift through dynamic reports that make it easy to navigate between drift detection and dataset features

In a nutshell:

  • Monitor drift using a scheduler (such as Airflow)

  • Evaluate the level of data drift

  • Facilitate collaboration between data analysts and data scientists, and easily share and discuss results with non-data users

More precisely:

  • Render data drift and model drift over time through:
    • Feature importance: the features that best discriminate between the two datasets
    • Scatter plot: feature importance relative to drift importance
    • Dataset analysis: comparison of each variable's distribution between the baseline dataset and the newest one
    • Predicted values analysis: comparison of the target distribution between the baseline dataset and the newest one
    • Performance of the data drift classifier
    • Feature contributions for the data drift classifier
    • AUC evolution: comparison of the data drift classifier's AUC across periods
    • Model performance evolution: your model's performance over time

⚙️ How Eurybia works

Eurybia works mainly with a binary classification model (named datadrift classifier) that tries to predict whether a sample belongs to the training dataset (or baseline dataset) or to the production dataset (or current dataset).

As shown in the diagram below, there are two datasets: the baseline and the current one. These are the datasets we want to compare to assess whether data drift has occurred. To the first we add a column named “target” filled only with 0; to the second we add the same column, filled only with 1. The goal is to build a binary classification model on the concatenation of the two datasets. Once trained, this model indicates whether there is data drift, which we measure through its AUC: the higher the AUC, the greater the drift (an AUC of 0.5 means no data drift, and an AUC close to 1 means data drift is occurring).
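The labelling-and-AUC idea described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: each dataset has a single numeric feature, and the raw feature value stands in for the classifier's score, whereas Eurybia trains a real classifier on all features. The helper names `label_and_concat` and `auc` are hypothetical and are not part of Eurybia.

```python
import random


def label_and_concat(baseline, current):
    """Tag baseline values with target 0, current values with target 1, and concatenate."""
    return [(x, 0) for x in baseline] + [(x, 1) for x in current]


def auc(labelled):
    """AUC as the probability that a random target-1 score exceeds a random target-0 score.

    Ties count for one half. The score here is the raw feature value
    (an assumption made for this sketch, not a trained model's output).
    """
    ones = [score for score, target in labelled if target == 1]
    zeros = [score for score, target in labelled if target == 0]
    wins = 0.0
    for one in ones:
        for zero in zeros:
            if one > zero:
                wins += 1.0
            elif one == zero:
                wins += 0.5
    return wins / (len(ones) * len(zeros))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
baseline = [rng.gauss(0.0, 1.0) for _ in range(500)]
current_same = [rng.gauss(0.0, 1.0) for _ in range(500)]     # same distribution
current_shifted = [rng.gauss(3.0, 1.0) for _ in range(500)]  # mean shifted by 3

auc_no_drift = auc(label_and_concat(baseline, current_same))
auc_drift = auc(label_and_concat(baseline, current_shifted))
print(f"no drift: AUC = {auc_no_drift:.2f}")
print(f"drift:    AUC = {auc_drift:.2f}")
```

When both samples come from the same distribution the AUC should land near 0.5, and with the shifted sample it should approach 1. A trained classifier, unlike this raw-feature stand-in, would also pick up shifts in the opposite direction or in combinations of features.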

The explainability of this datadrift classifier makes it possible to prioritise the features that matter most for drift and to focus on those with the greatest impact on the model in production.

To monitor drift over time with Eurybia, you can use a scheduler to run the computations automatically and periodically. Apache Airflow is one such scheduler. To use it, see its official documentation and blog posts such as this one: Getting started with Apache Airflow
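For periodic monitoring, each scheduled run mainly needs to record a date and the drift AUC so that their evolution can be examined later. The sketch below shows that bookkeeping with the standard library only. The function names, the column names `date` and `auc`, and the file layout are assumptions made for illustration; they are not Eurybia's own file format, and the scheduler (Airflow or any other) would simply call something like `append_auc` once per run.

```python
import csv
import os
import tempfile


def append_auc(history_path, run_date, auc):
    """Append one (date, auc) row to a CSV history, writing the header on first use."""
    is_new = not os.path.exists(history_path) or os.path.getsize(history_path) == 0
    with open(history_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["date", "auc"])  # hypothetical column names
        writer.writerow([run_date, f"{auc:.4f}"])


def read_history(history_path):
    """Return the history as a list of (date, auc) tuples, in file order."""
    with open(history_path, newline="") as handle:
        return [(row["date"], float(row["auc"])) for row in csv.DictReader(handle)]


# Simulate three scheduled runs writing to a throwaway file.
# The dates and AUC values are made up for the example.
history_file = os.path.join(tempfile.mkdtemp(), "auc_history.csv")
for run_date, auc in [("01/01/2022", 0.52), ("01/02/2022", 0.55), ("01/03/2022", 0.71)]:
    append_auc(history_file, run_date, auc)

history = read_history(history_file)
print(history)
```

Appending rather than rewriting keeps each run independent: a failed run leaves earlier history untouched, and the file can be read at any time to follow the AUC across periods.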

🛠 Installation

Eurybia is intended to work with Python versions 3.7 to 3.9. Installation can be done with pip:

pip install eurybia

If you encounter compatibility issues, check the corresponding section of the Eurybia documentation.

🕐 Quickstart

The 3 steps to display results:

  • Step 1: Declare the SmartDrift object

    You need to pass at least 2 pandas DataFrames to instantiate the SmartDrift class (the current or production dataset, and the baseline or training dataset)

from eurybia import SmartDrift
sd = SmartDrift(
  df_current=df_current,
  df_baseline=df_baseline,
  deployed_model=my_model, # Optional: puts results in perspective with feature importance in the deployed model
  encoding=my_encoder # Optional: the encoder needed to use deployed_model, if any
  )

  • Step 2: Compile the model

    There are different ways to compile the SmartDrift object

sd.compile(
  full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.
  date_compile_auc='01/01/2022', # Optional: useful when computing the drift for a time that is not now
  datadrift_file="datadrift_auc.csv", # Optional: name of the csv file that contains the performance history of data drift
  )

  • Step 3: Generate the report

    The report's content is enriched if you provide the deployed data science model and its encoder. Note that providing deployed_model and encoding only produces useful results if both datasets are usable by the model (i.e. all features are present, dtypes are correct, etc.).

sd.generate_report(
  output_file='output/my_report_name.html',
  title_story="output/my_report_title",
  project_info_file='project_info.yml' # Optional: add information on report
  )
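
Since providing deployed_model only helps when both datasets are usable by the model, a quick pre-check of column names and dtypes can save a failed run. The sketch below is one way to do it under stated assumptions: `schema_mismatches` is a hypothetical helper, not part of Eurybia, and it works on plain `{column: dtype name}` dictionaries, which for a pandas DataFrame can be obtained with `df.dtypes.astype(str).to_dict()`.

```python
def schema_mismatches(baseline_schema, current_schema):
    """Compare two {column: dtype name} mappings and report the differences."""
    baseline_cols = set(baseline_schema)
    current_cols = set(current_schema)
    return {
        "missing_in_current": sorted(baseline_cols - current_cols),
        "missing_in_baseline": sorted(current_cols - baseline_cols),
        "dtype_conflicts": sorted(
            col
            for col in baseline_cols & current_cols
            if baseline_schema[col] != current_schema[col]
        ),
    }


# Made-up schemas for illustration.
baseline_schema = {"age": "int64", "income": "float64", "city": "object"}
current_schema = {"age": "float64", "income": "float64", "country": "object"}

report = schema_mismatches(baseline_schema, current_schema)
print(report)
```

An empty list under every key means the two datasets expose the same columns with the same dtypes; anything else is worth fixing before passing the model and encoder to SmartDrift.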

Report Example

📖 Tutorials

The GitHub repository offers many tutorials to help you get started with Eurybia in practice.

Overview

Validate Data before model deployment

Measure and analyze Data drift

Measure and analyze Model drift

More details about report and plots
