pythresh

A Python Toolbox for Outlier Detection Thresholding

PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection likelihood scores in univariate and multivariate data. It was written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to that library: PyThresh is meant to threshold the likelihood scores generated by an outlier detector. Thresholding these scores replaces the need to set a contamination level or to guess beforehand how many outliers the dataset contains. These non-parametric methods were written to reduce user input and guesswork, relying on statistics instead to threshold the outlier likelihood scores. For thresholding to be applied correctly, the outlier detection likelihood scores must follow this rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array in which inliers are represented by 0 and outliers by 1.
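As a minimal sketch of the scoring rule above: some scoring functions use the opposite convention, where a lower value means more anomalous (scikit-learn's `IsolationForest.score_samples` is one example). Negating such scores restores the orientation required here. The helper `orient_scores` and its `higher_is_outlier` flag are hypothetical names used for illustration and are not part of PyThresh.

```python
def orient_scores(scores, higher_is_outlier=True):
    """Return scores oriented so that a higher value means more outlying.

    Hypothetical helper, not a PyThresh function.
    """
    if higher_is_outlier:
        return list(scores)
    # Flip the sign so the most anomalous point gets the largest score
    return [-s for s in scores]


# Scores where lower means more anomalous (illustrative values)
raw = [-0.40, -0.45, -0.42, -0.90]
oriented = orient_scores(raw, higher_is_outlier=False)
```

After the flip, the last point, which had the lowest raw score, carries the highest score and is therefore the strongest outlier candidate.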

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
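To make the idea of a simple statistical thresholder concrete, here is a rough sketch of a Z-score cut written with the standard library only. It is not PyThresh's ZSCORE implementation; the function name `zscore_labels` and the `cutoff` parameter are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=2.0):
    """Label a score 1 (outlier) when its Z-score exceeds `cutoff`, else 0.

    Illustrative sketch only; `cutoff` is an assumed parameter.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so nothing stands out
        return [0] * len(scores)
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


# Nine similar scores and one clearly larger score (illustrative values)
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12, 0.10, 0.95]
labels = zscore_labels(scores, cutoff=2.0)
```

Only the last score lies more than two standard deviations above the mean, so it is the only one labeled 1.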

Documentation & Citing

Visit PyThresh Docs for full documentation or see below for a quickstart installation and usage example.

To cite this work you can visit PyThresh Citation

Outlier Detection Thresholding with 7 Lines of Code:

# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER

clf = KNN()
clf.fit(X_train)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels
thres = FILTER()
labels = thres.eval(decision_scores)

Or, using multiple sets of outlier detection scores:

# train multiple detectors
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pythresh.thresholds.filter import FILTER

clfs = [KNN(), IForest(), PCA()]

# get outlier scores for each detector
scores = [clf.fit(X_train).decision_scores_ for clf in clfs]

# stack into one array of shape (n_samples, n_detectors)
scores = np.vstack(scores).T

# get outlier labels
thres = FILTER()
labels = thres.eval(scores)
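Because the labels are a binary array, the share of flagged points gives an estimate of the contamination level that would otherwise have to be guessed. The sketch below assumes `labels` is a stand-in for the output of `thres.eval(...)`; the helper name `estimated_contamination` is hypothetical.

```python
def estimated_contamination(labels):
    """Return the fraction of points labeled 1 (outlier).

    Hypothetical helper, not part of PyThresh.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


# Stand-in for the binary array returned by a thresholder
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
contamination = estimated_contamination(labels)
```

Here two of the ten points are flagged, so the estimate is 0.2.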

Installation

It is recommended to use pip or conda for installation:

pip install pythresh            # normal install
pip install --upgrade pythresh  # or update if needed

conda install -c conda-forge pythresh

Alternatively, you can get the version with the latest updates by cloning the repo and installing it from source:

git clone https://github.com/KulikDM/pythresh.git
cd pythresh
pip install .

Or directly with pip, without cloning:

pip install https://github.com/KulikDM/pythresh/archive/main.zip

Required Dependencies:

matplotlib
numpy>=1.13
pyod
scipy>=1.3.1
scikit_learn>=0.20.0

Optional Dependencies:

pyclustering (used in the CLUST thresholder)
ruptures (used in the CPD thresholder)
geomstats (used in the KARCH thresholder)
scikit-lego (used in the META thresholder)
joblib>=0.14.1 (used in the META thresholder and RANK)
pandas (used in the META thresholder)
torch (used in the VAE thresholder)
tqdm (used in the VAE thresholder)
xgboost>=2.0.0 (used in the RANK)
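A quick way to see which optional dependencies are missing from the current environment is to probe for them with the standard library. This is only a sketch: the import names are assumed to match the package names, except `sklego`, which is assumed to be the import name of scikit-lego. Versions are not checked.

```python
from importlib.util import find_spec

# Assumed import name -> where the list above says it is used
OPTIONAL_MODULES = {
    "pyclustering": "CLUST",
    "ruptures": "CPD",
    "geomstats": "KARCH",
    "sklego": "META",  # assumed import name for scikit-lego
    "joblib": "META, RANK",
    "pandas": "META",
    "torch": "VAE",
    "tqdm": "VAE",
    "xgboost": "RANK",
}


def missing_optional_modules(modules=OPTIONAL_MODULES):
    """Return the subset of `modules` that cannot be found for import."""
    return {
        name: used_by
        for name, used_by in modules.items()
        if find_spec(name) is None
    }


for name, used_by in missing_optional_modules().items():
    print(f"{name} is not installed (needed for {used_by})")
```

The check only looks modules up and does not import them, so heavy packages such as torch are not loaded.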

API Cheatsheet

eval(score): evaluate a single set or multiple sets of outlier detection likelihood scores.

Key Attributes of threshold:

thresh_: Return the threshold value that separates inliers from outliers. Outliers are considered all values above this threshold value. Note the threshold value has been derived from likelihood scores normalized between 0 and 1.
confidence_interval_: Return the lower and upper confidence interval of the contamination level. Only applies to the COMB thresholder.
dscores_: Return a 1D array of the TruncatedSVD-decomposed decision scores when multiple outlier detector score sets are passed.
mixture_: Return the fitted mixture model class of the selected model used for thresholding. Only applies to MIXMOD. Attributes include: components, weights, params. Functions include: fit, loglikelihood, pdf, and posterior.
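Since thresh_ is derived from scores normalized between 0 and 1, it cannot be compared directly with raw detector scores. The sketch below assumes plain min-max normalization, which the text above does not confirm, and shows how a normalized threshold could be mapped back to the raw score scale. The function names and the `thresh` value are hypothetical.

```python
def min_max_normalize(scores):
    """Scale scores to the range [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def threshold_to_raw_scale(thresh, scores):
    """Map a threshold on normalized scores back to the raw score scale."""
    lo, hi = min(scores), max(scores)
    return lo + thresh * (hi - lo)


raw_scores = [2.0, 3.0, 2.5, 4.0, 12.0]
thresh = 0.5  # hypothetical value standing in for thres.thresh_

normalized = min_max_normalize(raw_scores)
# Outliers are all values above the threshold
labels = [1 if s > thresh else 0 for s in normalized]
raw_cutoff = threshold_to_raw_scale(thresh, raw_scores)
```

With these values the normalized threshold of 0.5 corresponds to a raw score of 7.0, and only the last point lies above it.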

External Feature Cases

Towards Data Science: Thresholding Outlier Detection Scores with PyThresh

Towards Data Science: When Outliers are Significant: Weighted Linear Regression

ArXiv: Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection.

Available Thresholding Algorithms

Abbr	Description	References	Documentation
AUCP	Area Under Curve Percentage	[1]	pythresh.thresholds.aucp module
BOOT	Bootstrapping	[2]	pythresh.thresholds.boot module
CHAU	Chauvenet’s Criterion	[3]	pythresh.thresholds.chau module
CLF	Trained Linear Classifier	[4]	pythresh.thresholds.clf module
CLUST	Clustering Based	[5]	pythresh.thresholds.clust module
CPD	Change Point Detection	[6]	pythresh.thresholds.cpd module
DECOMP	Decomposition	[7]	pythresh.thresholds.decomp module
DSN	Distance Shift from Normal	[8]	pythresh.thresholds.dsn module
EB	Elliptical Boundary	[9]	pythresh.thresholds.eb module
FGD	Fixed Gradient Descent	[10]	pythresh.thresholds.fgd module
FILTER	Filtering Based	[11]	pythresh.thresholds.filter module
FWFM	Full Width at Full Minimum	[12]	pythresh.thresholds.fwfm module
GAMGMM	Bayesian Gamma GMM	[13]	pythresh.thresholds.gamgmm module
GESD	Generalized Extreme Studentized Deviate	[14]	pythresh.thresholds.gesd module
HIST	Histogram Based	[15]	pythresh.thresholds.hist module
IQR	Inter-Quartile Region	[16]	pythresh.thresholds.iqr module
KARCH	Karcher mean (Riemannian Center of Mass)	[17]	pythresh.thresholds.karch module
MAD	Median Absolute Deviation	[18]	pythresh.thresholds.mad module
MCST	Monte Carlo Shapiro Tests	[19]	pythresh.thresholds.mcst module
META	Meta-model Trained Classifier	[20]	pythresh.thresholds.meta module
MIXMOD	Normal & Non-Normal Mixture Models	[21]	pythresh.thresholds.mixmod module
MOLL	Friedrichs’ Mollifier	[22] [23]	pythresh.thresholds.moll module
MTT	Modified Thompson Tau Test	[24]	pythresh.thresholds.mtt module
OCSVM	One-Class Support Vector Machine	[25]	pythresh.thresholds.ocsvm module
QMCD	Quasi-Monte Carlo Discrepancy	[26]	pythresh.thresholds.qmcd module
REGR	Regression Based	[27]	pythresh.thresholds.regr module
VAE	Variational Autoencoder	[28]	pythresh.thresholds.vae module
WIND	Topological Winding Number	[29]	pythresh.thresholds.wind module
YJ	Yeo-Johnson Transformation	[30]	pythresh.thresholds.yj module
ZSCORE	Z-score	[31]	pythresh.thresholds.zscore module
COMB	Thresholder Combination	None	pythresh.thresholds.comb module

Implementations, Benchmarks, & Utilities

The comparison among implemented models and general implementation is made available below

Additional benchmarking across all the thresholders found that the MIXMOD thresholder performed best, while the CLF thresholder had the smallest uncertainty about its mean and was the most robust (its least accurate prediction was the best among the thresholders). However, for interpretability and general performance, the MIXMOD, FILTER, and META thresholders are good fits.

Further utilities are available to assist in selecting the best outlier detection and thresholding methods (ranking), as well as in determining the confidence with regard to the selected thresholding method (thresholding confidence).

For Jupyter Notebooks, please navigate to notebooks.

A quick look at all the thresholders' performance can be found at “/notebooks/Compare All Models.ipynb”.

Contributing

Anyone is welcome to contribute to PyThresh:

Please share your ideas and ask questions by opening an issue.
To contribute, first check the Issue list for the “help wanted” tag and comment on the one that you are interested in. The issue will then be assigned to you.
If the bug, feature, or documentation change is novel (not in the Issue list), you can either log a new issue or create a pull request for the new changes.
To start, fork the main branch and add your improvement/modification/fix.
To make sure the code has the same style and standard, please refer to qmcd.py as an example.
Create a pull request to the main branch and follow the pull request template (PR template).
Please make sure that all code changes are accompanied with proper new/updated test functions. Automatic tests will be triggered. Before the pull request can be merged, make sure that all the tests pass.

References

Please note that not all of the references’ exact methods have been employed in PyThresh. Rather, the references serve to demonstrate the validity of the threshold types available in PyThresh.

Release history

Version	Release date
0.3.6 (this version)	Feb 3, 2024
0.3.5	Nov 5, 2023
0.3.4	Sep 9, 2023
0.3.3	Aug 6, 2023
0.3.2	Jul 2, 2023
0.3.1	May 3, 2023
0.3.0	Mar 21, 2023
0.2.9	Jan 9, 2023
0.2.8	Nov 8, 2022
0.2.7	Oct 2, 2022
0.2.6	Sep 8, 2022
0.2.5	Aug 4, 2022
0.2.4	Jul 10, 2022
0.2.3	Jul 2, 2022
0.2.2	Jun 25, 2022
0.2.1	Jun 25, 2022
0.2.0	Jun 24, 2022
0.1.9	Jun 16, 2022
0.1.8	Jun 16, 2022
0.1.7	Jun 10, 2022
0.1.6	Jun 9, 2022
0.1.5	Jun 4, 2022
0.1.4	May 31, 2022
0.1.3	May 31, 2022
0.1.2	May 30, 2022
0.1.1	May 30, 2022
0.1.0	May 29, 2022

Download files

Source Distribution

pythresh-0.3.6.tar.gz (556.5 kB), uploaded Feb 3, 2024

Hashes for pythresh-0.3.6.tar.gz
Algorithm	Hash digest
SHA256	`8fba5938775bc374a83189e7c674125c3ccc97504787c78ee56c5387920c84a2`
MD5	`8c749a6b324d718eca53b5f7a00eeb5b`
BLAKE2b-256	`69b58c39f334dba91f19f1e732749f514d674e0f1771e086363cdbf3b87e6466`
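A downloaded archive can be checked against the SHA256 digest listed above before installing it. The following is a minimal sketch using the standard library; the helper names and the local file path in the usage note are assumptions.

```python
import hashlib

# SHA256 digest listed above for pythresh-0.3.6.tar.gz
EXPECTED_SHA256 = "8fba5938775bc374a83189e7c674125c3ccc97504787c78ee56c5387920c84a2"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals `expected`."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("pythresh-0.3.6.tar.gz")` should return True for an intact download, assuming the archive was saved under that name in the current directory.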