A Python Toolbox for Outlier Detection Thresholding
PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection scores in univariate/multivariate data. It has been written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to that library: PyThresh is meant to threshold the scores generated by an outlier detector. It does so without requiring a contamination level to be set, so the user does not have to guess beforehand how many outliers the dataset may contain. These non-parametric methods were written to reduce the user’s input and guesswork and to rely on statistics instead. For thresholding to be applied correctly, the outlier detection scores must follow this rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array in which inliers and outliers are represented by 0 and 1 respectively.
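The score-orientation rule matters when scores come from a source that uses the opposite convention, where lower values mean more anomalous. Below is a minimal sketch of flipping such scores before thresholding. The helper `to_outlier_scores` is hypothetical and not part of PyThresh.

```python
def to_outlier_scores(scores, higher_is_outlier=True):
    """Return scores oriented so that larger values mean 'more likely an outlier'.

    Hypothetical helper for illustration only; it is not part of PyThresh.
    """
    if higher_is_outlier:
        return list(scores)
    # Negating reverses the ordering, so the most anomalous point
    # ends up with the largest score.
    return [-s for s in scores]


# Example: a detector whose scores are lower for anomalies
raw = [0.9, 0.8, 0.85, -2.5]
oriented = to_outlier_scores(raw, higher_is_outlier=False)
```

After the flip, the last point (the anomaly) carries the largest score, which is the convention the thresholders expect.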
PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
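To make the idea of thresholding concrete, here is a rough, standard-library sketch of the simplest statistical approach mentioned above: flag a score as an outlier when its Z-score exceeds a cutoff. This illustrates the general technique and is not PyThresh's ZSCORE implementation. The function name and the cutoff of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=2.0):
    """Label a score 1 (outlier) if its z-score exceeds `cutoff`, else 0 (inlier).

    Illustrative only: the cutoff of 2.0 is an arbitrary assumption.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        # All scores are identical, so nothing stands out
        return [0 for _ in scores]
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


scores = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10, 0.12, 0.11, 0.10, 0.95]
labels = zscore_labels(scores)
```

Only the last score sits far enough above the mean to be labelled 1. Every other score falls below the mean and is labelled 0.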
Outlier Detection Thresholding with 7 Lines of Code:
# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.clust import CLUST
clf = KNN()
clf.fit(X_train)
# get outlier scores
decision_scores = clf.decision_scores_ # raw outlier scores on the train data
# get outlier labels
thres = CLUST()
labels = thres.eval(decision_scores)
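Because no contamination level is supplied up front, it can be read off the result instead: it is the fraction of points labelled 1. The sketch below uses a hypothetical helper `contamination_from_labels` and a hand-written label list standing in for the array returned by `thres.eval(decision_scores)`.

```python
def contamination_from_labels(labels):
    """Fraction of points labelled 1 (outlier) in a binary label sequence.

    Hypothetical helper for illustration; it is not part of PyThresh.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    if any(v not in (0, 1) for v in labels):
        raise ValueError("labels must contain only 0 and 1")
    return sum(labels) / len(labels)


# Stand-in for the binary array returned by a thresholder
example_labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
contamination = contamination_from_labels(example_labels)
```

Here two of the ten points are outliers, so the implied contamination level is 0.2.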
Installation
It is recommended to use pip for installation:
pip install pythresh # normal install
pip install --upgrade pythresh # or update if needed
Alternatively, you can get the version with the latest updates by cloning the repo and installing it from source:
git clone https://github.com/KulikDM/pythresh.git
cd pythresh
pip install .
Or with pip:
pip install https://github.com/KulikDM/pythresh/archive/main.zip
Required Dependencies:
matplotlib
numpy>=1.13
pyclustering
pyod
scipy>=1.3.1
scikit_learn>=0.20.0
six
Optional Dependencies:
ruptures (used in the CPD thresholder)
geomstats (used in the KARCH thresholder)
scikit-lego (used in the META thresholder)
joblib>=0.14.1 (used in the META thresholder)
pandas (used in the META thresholder)
torch (used in the VAE thresholder)
tqdm (used in the VAE thresholder)
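Since these packages are optional, it can help to check which ones are present before picking a thresholder. The sketch below uses only the standard library. The mapping restates the list above, and the helper name `missing_optional_deps` is hypothetical. Note that scikit-lego is imported under the module name `sklego`.

```python
from importlib.util import find_spec

# Import names of the optional packages, mapped to the thresholders that use them.
# scikit-lego is imported as `sklego`.
OPTIONAL_DEPS = {
    "ruptures": ["CPD"],
    "geomstats": ["KARCH"],
    "sklego": ["META"],
    "joblib": ["META"],
    "pandas": ["META"],
    "torch": ["VAE"],
    "tqdm": ["VAE"],
}


def missing_optional_deps():
    """Return {thresholder: [missing module names]} for the current environment."""
    missing = {}
    for module, thresholders in OPTIONAL_DEPS.items():
        if find_spec(module) is None:  # module cannot be found for import
            for name in thresholders:
                missing.setdefault(name, []).append(module)
    return missing


for thresholder, modules in missing_optional_deps().items():
    print(f"{thresholder}: install {', '.join(modules)}")
```

An empty result means every optional thresholder listed above has its dependencies available.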
API Cheatsheet
eval(score): evaluate the outlier scores and return binary labels (0 for inliers, 1 for outliers).
Key Attributes of threshold:
thresh_: Return the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from the normalized scores, not the raw ones.
confidence_interval_: Return the lower and upper confidence interval of the contamination level. Only applies to the ALL thresholder.
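Because `thresh_` lives on the normalized scale, comparing it directly against raw scores gives wrong labels. The sketch below shows the intended comparison under a stated assumption: the scores are min-max scaled to [0, 1]. The source only says "normalized", so the exact scaling is an assumption. Both helper names are hypothetical, and `0.5` stands in for a fitted `thres.thresh_`.

```python
def min_max_normalize(scores):
    """Scale scores to [0, 1] (assumed normalization; hypothetical helper)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Constant scores carry no ranking information
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def apply_threshold(scores, thresh):
    """Label normalized scores strictly above `thresh` as 1 (outlier), else 0."""
    return [1 if s > thresh else 0 for s in min_max_normalize(scores)]


decision_scores = [2.0, 3.0, 2.5, 10.0, 2.2]
thresh = 0.5  # stand-in for a fitted thres.thresh_
labels = apply_threshold(decision_scores, thresh)
```

Every raw score here is greater than 0.5, so skipping the normalization step would wrongly mark all five points as outliers. On the normalized scale only the fourth point exceeds the threshold.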
External Feature Cases
Towards Data Science: Thresholding Outlier Detection Scores with PyThresh
Towards Data Science: When Outliers are Significant: Weighted Linear Regression
ArXiv: Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection
Available Thresholding Algorithms
Abbr | Description | References | Documentation
---|---|---|---
AUCP | Area Under Curve Percentage | |
BOOT | Bootstrapping | |
CHAU | Chauvenet’s Criterion | |
CLF | Trained Linear Classifier | |
CLUST | Clustering Based | |
CPD | Change Point Detection | |
DECOMP | Decomposition | |
DSN | Distance Shift from Normal | |
EB | Elliptical Boundary | |
FGD | Fixed Gradient Descent | |
FILTER | Filtering Based | |
FWFM | Full Width at Full Minimum | |
GESD | Generalized Extreme Studentized Deviate | |
HIST | Histogram Based | |
IQR | Inter-Quartile Region | |
KARCH | Karcher mean (Riemannian Center of Mass) | |
MAD | Median Absolute Deviation | |
MCST | Monte Carlo Shapiro Tests | |
META | Meta-model Trained Classifier | |
MOLL | Friedrichs’ Mollifier | |
MTT | Modified Thompson Tau Test | |
OCSVM | One-Class Support Vector Machine | |
QMCD | Quasi-Monte Carlo Discrepancy | |
REGR | Regression Based | |
VAE | Variational Autoencoder | |
WIND | Topological Winding Number | |
YJ | Yeo-Johnson Transformation | |
ZSCORE | Z-score | |
ALL | All Thresholders Combined | None |
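One generic way to combine the outputs of several thresholders is a per-point majority vote over their binary labels. The sketch below illustrates that idea only. How the ALL thresholder actually combines its members is not described here, so this should not be read as its implementation. The helper name `majority_vote` is hypothetical.

```python
def majority_vote(label_sets):
    """Combine binary label lists: a point is an outlier if more than half vote 1.

    Hypothetical helper; an assumed combination rule, not the ALL thresholder.
    """
    if not label_sets:
        raise ValueError("need at least one label list")
    n = len(label_sets[0])
    if any(len(ls) != n for ls in label_sets):
        raise ValueError("all label lists must have the same length")
    half = len(label_sets) / 2
    # zip(*label_sets) yields, for each point, the votes from every thresholder
    return [1 if sum(votes) > half else 0 for votes in zip(*label_sets)]


# Stand-ins for the labels returned by three different thresholders
labels_a = [0, 0, 1, 1, 0]
labels_b = [0, 1, 1, 0, 0]
labels_c = [0, 0, 1, 1, 0]
combined = majority_vote([labels_a, labels_b, labels_c])
```

The second point is flagged by only one of the three label sets and is therefore kept as an inlier, while the third and fourth points receive at least two votes each.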
Implementations & Benchmarks
The comparison among implemented models and general implementation is made available below.
Additional benchmarking has been done on all the thresholders. The FILTER thresholder performed best, while the META thresholder had the smallest uncertainty about its mean and was the most robust: its least accurate prediction was better than the least accurate prediction of any other thresholder.
For Jupyter Notebooks, please navigate to notebooks.
A quick look at the performance of all the thresholders can be found at “/notebooks/Compare All Models.ipynb”.
References
Please note that not all of the references’ exact methods have been employed in PyThresh. Rather, the references serve to demonstrate the validity of the threshold types available in PyThresh.