Two-sample Testing and Distribution Shift Detection with AutoML
Project description
AutoML Two-Sample Test
autotst is a Python package for easy-to-use two-sample testing and distribution shift detection.
Given two datasets sample_P and sample_Q drawn from distributions $P$ and $Q$, the goal is to estimate a $p$-value for the null hypothesis $P=Q$.
autotst achieves this by learning a witness function and taking its mean discrepancy as a test statistic (see References).
The package provides functionalities to prepare the data, an interface to train an ML model, and methods to evaluate $p$-values and interpret results.
By default, autotst uses the Tabular Predictor of AutoGluon, but it is easy to wrap and use your own favorite ML framework (see below).
The full documentation of the package can be found here.
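To make the idea of a witness-based test statistic concrete, here is a minimal NumPy sketch. It is not the implementation used by autotst: the helper name witness_statistic, the least-squares linear witness, and the 1/0 training labels are assumptions chosen for illustration, whereas autotst trains an AutoML model by default.

```python
import numpy as np


def witness_statistic(sample_p, sample_q, split_ratio=0.5, seed=0):
    """Illustrative only: fit a linear witness on one split and
    return its mean discrepancy on the held-out split."""
    rng = np.random.default_rng(seed)

    # Shuffle rows, then split each sample into a training and a test part.
    p = rng.permutation(sample_p)
    q = rng.permutation(sample_q)
    n_p = int(split_ratio * len(p))
    n_q = int(split_ratio * len(q))
    p_train, p_test = p[:n_p], p[n_p:]
    q_train, q_test = q[:n_q], q[n_q:]

    # Assumed witness: least-squares regression of label 1 (P) versus 0 (Q).
    x_train = np.vstack([p_train, q_train])
    y_train = np.concatenate([np.ones(len(p_train)), np.zeros(len(q_train))])
    design = np.hstack([x_train, np.ones((len(x_train), 1))])
    weights, *_ = np.linalg.lstsq(design, y_train, rcond=None)

    def witness(x):
        return np.hstack([x, np.ones((len(x), 1))]) @ weights

    # Test statistic: mean witness value on P minus mean witness value on Q,
    # evaluated on data the witness has not seen during training.
    h_p, h_q = witness(p_test), witness(q_test)
    return h_p.mean() - h_q.mean(), h_p, h_q


# Toy usage with a clear mean shift between the two samples.
rng = np.random.default_rng(1)
toy_p = rng.normal(0.0, 1.0, size=(400, 2))
toy_q = rng.normal(1.0, 1.0, size=(400, 2))
statistic, h_p, h_q = witness_statistic(toy_p, toy_q)
print(statistic)
```

A large positive statistic indicates that the learned witness separates the two held-out samples. Evaluating on a held-out split keeps the statistic independent of the data used for fitting.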
Installation
Requires at least Python 3.7. Since the installation also installs AutoGluon, it can take a few moments.
pip install autotst
How to use autotst
We provide worked-out examples in the 'Example' directory. In the following, assume that sample_P and sample_Q are two numpy arrays containing samples from $P$ and $Q$.
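For experimenting, two such arrays can be generated synthetically. The following is a minimal sketch, assuming that each array has shape (n_samples, n_features); the sample sizes, the dimension, and the mean shift are arbitrary illustrative choices, not defaults of autotst.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy data: 500 two-dimensional points per sample.
# sample_P follows a standard normal; sample_Q has its mean shifted by 0.5.
sample_P = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
sample_Q = rng.normal(loc=0.5, scale=1.0, size=(500, 2))

print(sample_P.shape, sample_Q.shape)
```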
Default Usage:
The easiest way to compute a $p$-value is to use the default settings:
import autotst
tst = autotst.AutoTST(sample_P, sample_Q)
p_value = tst.p_value()
You would then reject the null hypothesis if p_value is less than or equal to your significance level.
Customizing the testing pipeline
We highly recommend running the pipeline step by step, which looks like this:
import autotst
from autotst.model import AutoGluonTabularPredictor
tst = autotst.AutoTST(sample_P, sample_Q, split_ratio=0.5, model=AutoGluonTabularPredictor)
tst.split_data()
tst.fit_witness(time_limit=60) # time limit adjustable to your needs (in seconds)
p_value = tst.p_value_evaluate(permutations=10000) # control number of permutations in the estimation
This allows you to change the time limit for fitting the witness function. You can also pass other arguments on to the fit of the model (see the AutoGluon documentation).
Since the permutations are very cheap, the default number of permutations is relatively high and should work for most use cases. If your significance level is small, say below 1/1000, consider increasing the number of permutations further.
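The link between the number of permutations and the significance level can be seen in a small sketch of a permutation $p$-value. This assumes a one-sided test and the common (count + 1) / (permutations + 1) estimator; the function name permutation_p_value is hypothetical, and autotst's internal estimator may differ. Under this convention, the smallest $p$-value that can be reported is 1 / (permutations + 1).

```python
import numpy as np


def permutation_p_value(h_p, h_q, permutations=10000, seed=0):
    """Illustrative one-sided permutation p-value for the mean
    discrepancy of witness values h_p (from P) and h_q (from Q)."""
    rng = np.random.default_rng(seed)
    observed = h_p.mean() - h_q.mean()
    pooled = np.concatenate([h_p, h_q])
    n_p = len(h_p)

    # Count how often a random relabeling is at least as extreme as observed.
    count = 0
    for _ in range(permutations):
        shuffled = rng.permutation(pooled)
        if shuffled[:n_p].mean() - shuffled[n_p:].mean() >= observed:
            count += 1

    # The +1 terms count the observed labeling itself, so the result is never 0.
    return (count + 1) / (permutations + 1)


# Toy witness values with a moderate shift between the two groups.
rng = np.random.default_rng(1)
toy_h_p = rng.normal(0.3, 1.0, size=200)
toy_h_q = rng.normal(0.0, 1.0, size=200)
toy_p_value = permutation_p_value(toy_h_p, toy_h_q, permutations=2000)
print(toy_p_value, "smallest achievable:", 1 / (2000 + 1))
```

With 10000 permutations the resolution is roughly 1e-4, which is why a significance level far below 1/1000 calls for more permutations.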
Customizing the machine learning model
If you have good domain knowledge about your problem and believe that a specific ML framework will work well, it is easy to wrap your own model.
To do so, inherit from the class Model and wrap its methods (see our implementation in model.py).
You can then run the test by importing your model and initializing the test accordingly:
import autotst
tst = autotst.AutoTST(sample_P, sample_Q, model=YourCustomModel)
...
... etc.
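As a rough picture of what such a wrapper might look like, the sketch below defines a stand-in base class and a ridge-regression witness. Everything here is an assumption for illustration: the stand-in Model class, the fit and predict method names, and the RidgeWitness class are hypothetical, so check model.py for the actual interface that autotst expects before adapting it.

```python
from abc import ABC, abstractmethod

import numpy as np


class Model(ABC):
    """Stand-in for autotst's Model base class (assumed interface)."""

    @abstractmethod
    def fit(self, x, y, **kwargs):
        """Train the witness on features x and labels y."""

    @abstractmethod
    def predict(self, x):
        """Return one witness value per row of x."""


class RidgeWitness(Model):
    """Hypothetical custom model: ridge regression with an intercept column."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.weights = None

    @staticmethod
    def _design(x):
        # Append a column of ones so the model can learn an offset.
        return np.hstack([x, np.ones((len(x), 1))])

    def fit(self, x, y, **kwargs):
        design = self._design(x)
        gram = design.T @ design + self.alpha * np.eye(design.shape[1])
        self.weights = np.linalg.solve(gram, design.T @ y)
        return self

    def predict(self, x):
        return self._design(x) @ self.weights


# Toy check: labels 1 for one group and 0 for a shifted group.
rng = np.random.default_rng(0)
x_train = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(1, 1, (100, 3))])
y_train = np.concatenate([np.ones(100), np.zeros(100)])
predictions = RidgeWitness(alpha=0.1).fit(x_train, y_train).predict(x_train)
print(predictions.shape)
```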
We also provide a wrapper for AutoGluonImagePredictor. However, it appears that this wrapper should not be used with small datasets and short training times.
References
If you use this package, please cite this paper:
Jonas M. Kübler, Vincent Stimper, Simon Buchholz, Krikamol Muandet, Bernhard Schölkopf: "AutoML Two-Sample Test", arXiv 2206.08843 (2022)
BibTeX:
@misc{kubler2022autotst,
doi = {10.48550/ARXIV.2206.08843},
url = {https://arxiv.org/abs/2206.08843},
author = {Kübler, Jonas M. and Stimper, Vincent and Buchholz, Simon and Muandet, Krikamol and Schölkopf, Bernhard},
title = {AutoML Two-Sample Test},
publisher = {arXiv},
year = {2022},
}
File details
Details for the file autotst-1.2.tar.gz.
File metadata
- Download URL: autotst-1.2.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.14 CPython/3.7.13 Linux/5.15.0-43-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | cf8adeb752cea8c6da5938bf323fc977f4fdcc2cf03be4ba4224ff40e6d6dba0
MD5 | 036291787e2f60c106284b68e8f8c095
BLAKE2b-256 | cea4df1eedefcff8b57e999dc2c591abcb30d27a15d50275062bd410d353cf2c
File details
Details for the file autotst-1.2-py3-none-any.whl.
File metadata
- Download URL: autotst-1.2-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.14 CPython/3.7.13 Linux/5.15.0-43-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2ea2345084e4c5f221b14e0171462ac2386f461910edaf2635da36194f5ce65b
MD5 | b78929d539f62d2cf7ac78fad65bedf3
BLAKE2b-256 | 23ce750bee35dcb954f3c1e472616e4e7409b8a82ebb88edd4344ae05584fda8