Two-Sample Testing and Distribution Shift Detection with AutoML

AutoML Two-Sample Test

autotst is a Python package for easy-to-use two-sample testing and distribution shift detection.

Given two datasets sample_P and sample_Q drawn from distributions $P$ and $Q$, the goal is to estimate a $p$-value for the null hypothesis $P=Q$. autotst achieves this by learning a witness function and taking its mean discrepancy as a test statistic (see References).
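In symbols, the test statistic can be sketched as follows. The notation is ours and is not taken from the package: $h$ denotes the learned witness function, and $x_1, \dots, x_n$ and $y_1, \dots, y_m$ are assumed to be held-out samples from $P$ and $Q$ that were not used to fit $h$.

$$
\hat{\tau} = \frac{1}{n} \sum_{i=1}^{n} h(x_i) - \frac{1}{m} \sum_{j=1}^{m} h(y_j)
$$

If $P=Q$, the two averages estimate the same quantity and $\hat{\tau}$ fluctuates around zero; a large positive value is evidence against the null hypothesis.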

The package provides functionality to prepare the data, an interface to train an ML model, and methods to evaluate $p$-values and interpret results.

By default, autotst uses the Tabular Predictor of AutoGluon, but it is easy to wrap and use your own favorite ML framework (see below).

The full documentation of the package can be found here.

Installation

Requires Python 3.7 or later. Because installing autotst also installs AutoGluon, the installation can take a few moments.

pip install autotst

How to use autotst

We provide worked-out examples in the 'Example' directory. In the following, assume that sample_P and sample_Q are two NumPy arrays containing samples from $P$ and $Q$.
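If you have no data at hand, a pair of synthetic samples is enough to try the calls that follow. The sketch below is only an illustration: the Gaussian distributions, the sample sizes, and the mean shift of 0.5 are arbitrary choices, and the layout (one row per observation, one column per feature) is an assumption about the expected input rather than something stated in this description.

```python
import numpy as np

# Seeded generator so the example is reproducible.
rng = np.random.default_rng(seed=0)

# Assumed layout: one row per observation, one column per feature.
# P is a standard Gaussian in two dimensions.
sample_P = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Q has the same shape but a shifted mean, so P != Q by construction.
sample_Q = rng.normal(loc=0.5, scale=1.0, size=(500, 2))
```

Setting loc=0.0 for both samples gives a pair for which the null hypothesis holds, which is useful for checking that the test does not reject too often.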

Default Usage:

The easiest way to compute a $p$-value is to use the default settings:

import autotst
tst = autotst.AutoTST(sample_P, sample_Q)
p_value = tst.p_value()

You would then reject the null hypothesis if p_value is smaller than or equal to your significance level.

Customizing the testing pipeline

We highly recommend running the pipeline step by step, which looks like this:

import autotst
from autotst.model import AutoGluonTabularPredictor

tst = autotst.AutoTST(sample_P, sample_Q, split_ratio=0.5, model=AutoGluonTabularPredictor)
tst.split_data()
tst.fit_witness(time_limit=60)  # time limit adjustable to your needs (in seconds)
p_value = tst.p_value_evaluate(permutations=10000)  # control number of permutations in the estimation

This lets you change the time limit for fitting the witness function. You can also pass other arguments to the model's fit step (see the AutoGluon documentation).

Since the permutations are very cheap, the default number of permutations is relatively high and should work for most use cases. If your significance level is, say, smaller than 1/1000, consider increasing the number of permutations further.
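To see why very small significance levels call for more permutations, here is a minimal sketch of a generic permutation $p$-value for a difference of means. It is not the package's implementation: the function permutation_p_value and the arrays h_p and h_q (witness values on held-out points from $P$ and $Q$) are hypothetical names, and the "+1" correction in the estimate is a common convention that we assume here.

```python
import numpy as np


def permutation_p_value(h_p, h_q, permutations=10000, seed=0):
    """Permutation p-value for the statistic mean(h_p) - mean(h_q).

    h_p, h_q: 1-D arrays of witness values on held-out points (hypothetical
    inputs for this sketch, not part of the autotst API).
    """
    rng = np.random.default_rng(seed)
    h_p = np.asarray(h_p, dtype=float)
    h_q = np.asarray(h_q, dtype=float)
    n = len(h_p)
    pooled = np.concatenate([h_p, h_q])

    # Observed mean discrepancy between the two groups.
    observed = h_p.mean() - h_q.mean()

    # Count how often a random relabelling is at least as extreme.
    exceed = 0
    for _ in range(permutations):
        shuffled = rng.permutation(pooled)
        stat = shuffled[:n].mean() - shuffled[n:].mean()
        if stat >= observed:
            exceed += 1

    # The "+1" terms keep the estimate strictly positive.
    return (1 + exceed) / (1 + permutations)


# Perfectly separated witness values: the estimate is limited by the
# number of permutations, not by the data.
p_small = permutation_p_value(np.ones(50), np.zeros(50), permutations=999)
```

With this estimator the smallest attainable value is 1/(permutations + 1), so a test at a level below 1/1000 can only reject if considerably more than 1000 permutations are used.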

Customizing the machine learning model

If you have good domain knowledge about your problem and believe that a specific ML framework will work well, it is easy to wrap your model. To do so, simply inherit from the class Model and implement its methods (see our implementation in model.py).
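As a rough illustration, a wrapper could look like the sketch below. Everything about the interface is an assumption: the method names fit and predict, their arguments, and the stand-in base class defined here (so that the example runs on its own) may differ from the real Model class, so check model.py before adapting it. LeastSquaresWitness is a hypothetical name, and plain least squares is used only because it needs nothing beyond NumPy.

```python
import numpy as np


class Model:
    """Stand-in for the package's Model base class (assumed interface).

    In practice, inherit from the real class in model.py instead.
    """

    def fit(self, x, y, **kwargs):
        raise NotImplementedError

    def predict(self, x):
        raise NotImplementedError


class LeastSquaresWitness(Model):
    """Hypothetical witness model: linear least squares on the labels."""

    def __init__(self):
        self.coef_ = None

    @staticmethod
    def _design(x):
        # Append a constant column so the model has an intercept.
        x = np.asarray(x, dtype=float)
        return np.column_stack([x, np.ones(len(x))])

    def fit(self, x, y, **kwargs):
        # y is assumed to hold 1 for points from P and 0 for points from Q.
        y = np.asarray(y, dtype=float)
        self.coef_, *_ = np.linalg.lstsq(self._design(x), y, rcond=None)
        return self

    def predict(self, x):
        # Witness values: larger means "looks more like P".
        return self._design(x) @ self.coef_
```

A linear witness like this can only pick up differences in the mean, which is the kind of trade-off to weigh against a more flexible model.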

You can then run the test simply by importing your model and initializing the test accordingly.

import autotst

tst = autotst.AutoTST(sample_P, sample_Q, model=YourCustomModel)
...
... etc.

We also provide a wrapper for AutoGluonImagePredictor. However, it does not seem to be suitable for small datasets and short training times.

References

If you use this package, please cite this paper:

Jonas M. Kübler, Vincent Stimper, Simon Buchholz, Krikamol Muandet, Bernhard Schölkopf: "AutoML Two-Sample Test", arXiv 2206.08843 (2022)

BibTeX:

@misc{kubler2022autotst,
  doi = {10.48550/ARXIV.2206.08843},
  url = {https://arxiv.org/abs/2206.08843},
  author = {Kübler, Jonas M. and Stimper, Vincent and Buchholz, Simon and Muandet, Krikamol and Schölkopf, Bernhard},  
  title = {AutoML Two-Sample Test},
  publisher = {arXiv},
  year = {2022},
}
