A Python Toolbox for Outlier Detection Thresholding


PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection likelihood scores in univariate/multivariate data. It was written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to that library: PyThresh is meant to threshold the likelihood scores generated by any outlier detector. Thresholding these scores replaces the need to set a contamination level or to guess beforehand how many outliers the dataset contains. These non-parametric methods were written to reduce the user’s input and guesswork and to rely on statistics instead. For thresholding to be applied correctly, the outlier detection likelihood scores must follow this rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array in which inliers are represented by 0 and outliers by 1.
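The scoring rule and the binary output described above can be illustrated with a small stand-alone sketch. It uses only NumPy and does not call PyThresh; the helper names `to_outlier_scores` and `apply_threshold`, the example scores, and the fixed cut at 0.5 are all assumptions made for illustration.

```python
import numpy as np


def to_outlier_scores(raw_scores, higher_is_outlier=True):
    """Orient scores so that higher means more outlying, then min-max scale to [0, 1]."""
    scores = np.asarray(raw_scores, dtype=float)
    if not higher_is_outlier:
        scores = -scores  # flip detectors that give inliers the higher score
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)  # constant scores carry no ranking information
    return (scores - scores.min()) / span


def apply_threshold(scores, thresh):
    """Label scores strictly above `thresh` as outliers (1) and the rest as inliers (0)."""
    return (np.asarray(scores) > thresh).astype(int)


# A hypothetical detector that scores inliers HIGHER (for example a log-likelihood)
raw = [-1.0, -1.2, -0.9, -8.0, -1.1]
scores = to_outlier_scores(raw, higher_is_outlier=False)
labels = apply_threshold(scores, thresh=0.5)
print(labels)  # [0 0 0 1 0]
```

If a detector reports scores where lower means more anomalous, flipping them as above before thresholding keeps the "higher is more outlying" rule intact.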

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
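To make the contrast with a fixed contamination level concrete, here is a rough NumPy sketch of a Z-score style cut. It is not PyThresh's ZSCORE implementation; the synthetic scores, the 10% contamination, and the cut at 3 standard deviations are assumptions chosen for the example.

```python
import numpy as np

# Synthetic scores: 97 ordinary points plus 3 clearly extreme ones
scores = np.concatenate([np.linspace(0.0, 1.0, 97), [5.0, 6.0, 7.0]])

# Fixed contamination: always flags the top 10%, whatever the data look like
contamination = 0.10
cutoff = np.percentile(scores, 100 * (1 - contamination))
labels_contamination = (scores > cutoff).astype(int)

# Z-score rule: flags only points more than 3 standard deviations above the mean
z = (scores - scores.mean()) / scores.std()
labels_zscore = (z > 3).astype(int)

print(labels_contamination.sum(), labels_zscore.sum())  # 10 3
```

The contamination-based cut labels ten points because it was told to, while the statistic-based cut labels only the three extreme scores.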

What’s New in PyThresh V1

The transition of PyThresh to V1 sees many new features!

Sklearn Compatibility:

  • The fit and predict methods have been introduced, enhancing alignment with Sklearn compatibility.

  • These methods allow a thresholder to be fitted on training data and evaluated on unseen data using the predict method.

  • Previously, this functionality was cumbersome to implement using the eval method.

  • Full backward compatibility with the eval method has been maintained.

  • Checks ensure that results remain consistent between <V1 and V1.

  • The BaseEstimator has been integrated into the BaseThresholder.

  • This addition provides enhanced Sklearn compatibility to all thresholders and better integration with existing Sklearn pipelines.

Reproducibility Enhancements:

  • All thresholders now include a random seed to ensure better reproducibility.

  • Previously, some components in the thresholders differed due to randomness.

Improved Testing and Examples:

  • Much more robust tests have been added to ensure the functionality and reliability of the code.

  • These tests enhance confidence in the correctness of the implementation and prevent regressions.

  • All examples have been updated and new Jupyter notebooks have been added to introduce all the capabilities of PyThresh.

Documentation & Citing

Visit PyThresh Docs for full documentation or see below for a quickstart installation and usage example.

To cite this work, visit PyThresh Citation.


Outlier Detection Thresholding with 8 Lines of Code:

# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.karch import KARCH

clf = KNN()
clf.fit(X_train)

# get outlier likelihood scores
decision_scores = clf.decision_scores_

# get outlier labels
thres = KARCH()
thres.fit(decision_scores)

labels = thres.labels_ # or thres.predict(decision_scores)

Or using multiple outlier detection score sets:

# train multiple detectors
import numpy as np
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pythresh.thresholds.karch import KARCH

clfs = [KNN(), IForest(), PCA()]

# get outlier likelihood scores for each detector
scores = [clf.fit(X_train).decision_scores_ for clf in clfs]

# stack into an (n_samples, n_detectors) array
scores = np.vstack(scores).T

# get outlier labels from the stacked scores
thres = KARCH()
thres.fit(scores)

labels = thres.labels_ # or thres.predict(scores)
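When several score sets are passed, the dscores_ attribute is described as a 1D array of TruncatedSVD decomposed decision scores. The idea can be sketched outside PyThresh with scikit-learn, as below. This is not PyThresh's internal preprocessing: the min-max scaling step, the example scores, and the random seed are assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler

# Hypothetical scores from three detectors for six samples
# (rows = samples, columns = detectors, each on its own scale)
scores = np.array([
    [0.10, 1.2, 10.0],
    [0.12, 1.1, 11.0],
    [0.09, 1.3, 9.5],
    [0.11, 1.0, 10.5],
    [0.95, 7.8, 48.0],
    [0.13, 1.2, 10.2],
])

# Put every detector on the same [0, 1] scale so none dominates (assumed step)
scaled = MinMaxScaler().fit_transform(scores)

# Reduce the three columns to a single combined score per sample
svd = TruncatedSVD(n_components=1, random_state=0)
combined = svd.fit_transform(scaled).ravel()

print(combined.shape)  # (6,)
```

The result is one score per sample, which can then be thresholded like any single score set.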

Installation

It is recommended to use pip or conda for installation:

pip install pythresh            # normal install
pip install --upgrade pythresh  # or update if needed
conda install -c conda-forge pythresh

Alternatively, you can get the version with the latest updates by cloning the repo and installing it from source:

git clone https://github.com/KulikDM/pythresh.git
cd pythresh
pip install .

Or with pip:

pip install https://github.com/KulikDM/pythresh/archive/main.zip

Required Dependencies:

  • numpy>=1.19

  • pyod

  • scipy>=1.5.1

  • scikit_learn>=0.22.0

Optional Dependencies:

  • pyclustering (used in the CLUST thresholder)

  • ruptures (used in the CPD thresholder)

  • scikit-lego (used in the META thresholder)

  • joblib>=0.14.1 (used in the META thresholder and RANK)

  • pandas (used in the META thresholder)

  • torch (used in the VAE thresholder)

  • tqdm (used in the VAE thresholder)

  • xgboost>=2.0.0 (used in the RANK)
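Since the optional packages are only needed for specific thresholders, it can help to check which ones are importable before using them. The sketch below uses only the standard library; the function name `missing_optional` is hypothetical, and the dictionary simply restates the list above using import names (scikit-lego is imported as `sklego`).

```python
from importlib.util import find_spec

# Import names (not PyPI names) of the optional packages and what uses them
OPTIONAL = {
    "pyclustering": "CLUST",
    "ruptures": "CPD",
    "sklego": "META",  # installed as scikit-lego
    "joblib": "META, RANK",
    "pandas": "META",
    "torch": "VAE",
    "tqdm": "VAE",
    "xgboost": "RANK",
}


def missing_optional():
    """Return {import_name: used_by} for optional packages that cannot be imported."""
    return {name: used_by for name, used_by in OPTIONAL.items() if find_spec(name) is None}


for name, used_by in missing_optional().items():
    print(f"{name} is not installed (needed for {used_by})")
```

This check covers presence only; the version bounds listed above (for example xgboost>=2.0.0) would need a separate check.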

API Cheatsheet

  • eval(score): evaluate a single outlier detection likelihood score set or multiple score sets (legacy method).

  • fit(score): fit a thresholder on a single outlier detection likelihood score set or multiple score sets.

  • predict(score): predict the binary labels for a single outlier detection likelihood score set or multiple score sets using the fitted thresholder.

Key Attributes of a Thresholder:

  • thresh_: Return the threshold value that separates inliers from outliers. Outliers are considered all values above this threshold value. Note the threshold value has been derived from likelihood scores normalized between 0 and 1.

  • labels_: A binary array of labels for the fitted thresholder on the fitted dataset.

  • confidence_interval_: Return the lower and upper confidence interval of the contamination level. Only applies to the COMB thresholder.

  • dscores_: 1D array of the TruncatedSVD decomposed decision scores if multiple outlier detector score sets are passed.

  • mixture_: fitted mixture model class of the selected model used for thresholding. Only applies to MIXMOD. Attributes include: components, weights, params. Functions include: fit, loglikelihood, pdf, and posterior.

External Feature Cases

Towards Data Science: Thresholding Outlier Detection Scores with PyThresh

Towards Data Science: When Outliers are Significant: Weighted Linear Regression

ArXiv: Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection.

Available Thresholding Algorithms

| Abbr | Description | References | Documentation |
|------|-------------|------------|---------------|
| AUCP | Area Under Curve Percentage | [1] | pythresh.thresholds.aucp module |
| BOOT | Bootstrapping | [2] | pythresh.thresholds.boot module |
| CHAU | Chauvenet’s Criterion | [3] | pythresh.thresholds.chau module |
| CLF | Trained Linear Classifier | [4] | pythresh.thresholds.clf module |
| CLUST | Clustering Based | [5] | pythresh.thresholds.clust module |
| CPD | Change Point Detection | [6] | pythresh.thresholds.cpd module |
| DECOMP | Decomposition | [7] | pythresh.thresholds.decomp module |
| DSN | Distance Shift from Normal | [8] | pythresh.thresholds.dsn module |
| EB | Elliptical Boundary | [9] | pythresh.thresholds.eb module |
| FGD | Fixed Gradient Descent | [10] | pythresh.thresholds.fgd module |
| FILTER | Filtering Based | [11] | pythresh.thresholds.filter module |
| FWFM | Full Width at Full Minimum | [12] | pythresh.thresholds.fwfm module |
| GAMGMM | Bayesian Gamma GMM | [13] | pythresh.thresholds.gamgmm module |
| GESD | Generalized Extreme Studentized Deviate | [14] | pythresh.thresholds.gesd module |
| HIST | Histogram Based | [15] | pythresh.thresholds.hist module |
| IQR | Inter-Quartile Region | [16] | pythresh.thresholds.iqr module |
| KARCH | Karcher mean (Riemannian Center of Mass) | [17] | pythresh.thresholds.karch module |
| MAD | Median Absolute Deviation | [18] | pythresh.thresholds.mad module |
| MCST | Monte Carlo Shapiro Tests | [19] | pythresh.thresholds.mcst module |
| META | Meta-model Trained Classifier | [20] | pythresh.thresholds.meta module |
| MIXMOD | Normal & Non-Normal Mixture Models | [21] | pythresh.thresholds.mixmod module |
| MOLL | Friedrichs’ Mollifier | [22] [23] | pythresh.thresholds.moll module |
| MTT | Modified Thompson Tau Test | [24] | pythresh.thresholds.mtt module |
| OCSVM | One-Class Support Vector Machine | [25] | pythresh.thresholds.ocsvm module |
| QMCD | Quasi-Monte Carlo Discrepancy | [26] | pythresh.thresholds.qmcd module |
| REGR | Regression Based | [27] | pythresh.thresholds.regr module |
| VAE | Variational Autoencoder | [28] | pythresh.thresholds.vae module |
| WIND | Topological Winding Number | [29] | pythresh.thresholds.wind module |
| YJ | Yeo-Johnson Transformation | [30] | pythresh.thresholds.yj module |
| ZSCORE | Z-score | [31] | pythresh.thresholds.zscore module |
| COMB | Thresholder Combination | None | pythresh.thresholds.comb module |
| DUMMY | Dummy Percentile Thresholder | None | pythresh.thresholds.dummy module |

Implementations, Benchmarks, & Utilities

A comparison of the implemented models and the general implementation is available below.

Additional benchmarking of all the thresholders found that the MIXMOD thresholder performed best, while the CLF thresholder had the smallest uncertainty about its mean and was the most robust (its least accurate prediction was the best among the thresholders). For interpretability and general performance, the MIXMOD, FILTER, and META thresholders are good fits.

Further utilities are available to assist in selecting the best outlier detection and thresholding methods (ranking) and in determining the confidence in the selected thresholding method (thresholding confidence).


Tutorial Notebooks

| Notebook | Description |
|----------|-------------|
| Introduction | Basic intro into outlier thresholding |
| Advanced Thresholding | Additional thresholding options for more advanced use |
| Threshold Confidence | Calculating the confidence levels around the threshold point |
| Outlier Ranking | Assisting in selecting the best performing outlier and thresholding method combo using ranking |

A quick look at the performance of all the thresholders can be found at Compare Thresholders.

Figure: Comparison of all thresholders

Contributing

Anyone is welcome to contribute to PyThresh:

  • Please share your ideas and ask questions by opening an issue.

  • To contribute, first check the Issue list for the “help wanted” tag and comment on the one that you are interested in. The issue will then be assigned to you.

  • If the bug, feature, or documentation change is novel (not in the Issue list), you can either log a new issue or create a pull request for the new changes.

  • To start, fork the main branch and add your improvement/modification/fix.

  • To make sure the code has the same style and standard, please refer to qmcd.py as an example.

  • Create a pull request to the main branch and follow the PR template.

  • Please make sure that all code changes are accompanied by proper new or updated test functions. Automatic tests will be triggered. Before the pull request can be merged, make sure that all the tests pass.


References

Please note that not all of the references’ exact methods have been employed in PyThresh. Rather, the references serve to demonstrate the validity of the threshold types available in PyThresh.
