
A Python Toolbox for Outlier Detection Thresholding


PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection scores in univariate/multivariate data. It has been written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to this single library. PyThresh is meant to threshold the scores generated by an outlier detection model. It does so without the need to set a contamination level or have the user guess the number of outliers that may exist in the dataset beforehand. These non-parametric methods were written to reduce the user’s input and guesswork and to rely on statistics instead to threshold outlier scores. For thresholding to be applied correctly, the outlier detection scores must follow this rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array where inliers and outliers are represented by 0 and 1 respectively.
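Some detectors report scores where lower values mean "more abnormal", which is the opposite of the rule stated above. A minimal sketch of re-orienting such scores before thresholding follows; the helper name `orient_scores` and the sample values are hypothetical and are not part of PyThresh.

```python
def orient_scores(scores, higher_is_outlier=True):
    """Return scores oriented so that larger values mean 'more anomalous'."""
    if higher_is_outlier:
        return list(scores)
    # Negating reverses the ranking without changing the spacing between scores.
    return [-s for s in scores]


# Hypothetical scores from a detector where LOWER means more abnormal.
raw = [0.9, 0.8, 0.85, -0.6]
oriented = orient_scores(raw, higher_is_outlier=False)

# After re-orientation, the most anomalous point has the largest score.
most_anomalous = max(range(len(oriented)), key=oriented.__getitem__)
```

The re-oriented list can then be passed to a thresholder in place of the raw scores.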

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
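To give a feel for the simple end of that range, here is a rough sketch of Z-score thresholding written with the standard library only. It is an illustration of the general idea, not PyThresh's own implementation; the function name `zscore_labels` and the cutoff values are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=3.0):
    """Label a score 1 (outlier) when its Z-score exceeds `cutoff`, else 0."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so nothing stands out.
        return [0] * len(scores)
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


# Hypothetical outlier scores: nine ordinary points and one extreme one.
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12, 0.10, 0.95]

# With only ten points the largest possible Z-score is about 3,
# so a lower cutoff is used here purely for illustration.
labels = zscore_labels(scores, cutoff=2.0)
```

Only the upper tail is tested because, by the rule above, higher scores indicate outliers.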

Outlier Detection Thresholding with 7 Lines of Code:

from pyod.models.knn import KNN
from pythresh.thresholds.dsn import DSN

# train the KNN detector
# X_train is assumed to be an existing array of shape (n_samples, n_features)
clf = KNN()
clf.fit(X_train)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels (0 = inlier, 1 = outlier)
thres = DSN()
labels = thres.eval(decision_scores)

Installation

It is recommended to use pip for installation:

pip install pythresh            # normal install
pip install --upgrade pythresh  # or update if needed

Alternatively, you can get the version with the latest updates by cloning the repo and installing it from source:

git clone https://github.com/KulikDM/pythresh.git
cd pythresh
pip install .

Or with pip:

pip install https://github.com/KulikDM/pythresh/archive/main.zip

Required Dependencies:

  • matplotlib

  • numpy>=1.13

  • pyclustering

  • pyod

  • scipy>=1.3.1

  • scikit_learn>=0.20.0

  • six

Optional Dependencies:

  • geomstats (used in the KARCH thresholder)

  • scikit-lego (used in the META thresholder)

  • joblib>=0.14.1 (used in the META thresholder)

  • pandas (used in the META thresholder)
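One way to see which of the packages listed above are present in the current environment is to query their installed metadata. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the package names are copied from the lists above, and the helper `installed_versions` is a hypothetical name for this example. It reports versions but does not check them against the stated minimums.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the dependency lists above.
REQUIRED = ["matplotlib", "numpy", "pyclustering", "pyod",
            "scipy", "scikit_learn", "six"]
OPTIONAL = ["geomstats", "scikit-lego", "joblib", "pandas"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED + OPTIONAL)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```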

API Cheatsheet

  • eval(score): evaluate the outlier scores and return a binary array of labels (0 for inliers, 1 for outliers).

Key Attributes of threshold:

  • thresh_: Return the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from the normalized scores.

  • confidence_interval_: Return the lower and upper confidence interval of the contamination level. Only applies to the ALL thresholder.
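Because thresh_ is derived from normalized scores, it should be compared against normalized scores rather than the raw ones. The sketch below shows the idea under the assumption that normalization means min-max scaling to [0, 1]; that scaling, the helper names, and the value 0.5 standing in for thresh_ are all assumptions for illustration.

```python
def min_max_normalize(scores):
    """Scale scores to the range [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Constant scores carry no ranking information.
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def labels_from_threshold(scores, thresh):
    """Label as 1 every score whose normalized value lies above `thresh`."""
    normalized = min_max_normalize(scores)
    return [1 if s > thresh else 0 for s in normalized]


# Hypothetical raw scores and a hypothetical threshold value.
raw_scores = [2.0, 3.0, 2.5, 10.0, 2.2]
labels = labels_from_threshold(raw_scores, thresh=0.5)
```

Comparing the raw value 3.0 directly with 0.5 would wrongly flag it; after scaling it becomes 0.125 and stays an inlier.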

External Feature Cases

Towards Data Science: Thresholding Outlier Detection Scores with PyThresh

Available Thresholding Algorithms

Abbr    Description                               References  Documentation
------  ----------------------------------------  ----------  ---------------------------------
AUCP    Area Under Curve Percentage               [1]         pythresh.thresholds.aucp module
BOOT    Bootstrapping                             [2]         pythresh.thresholds.boot module
CHAU    Chauvenet’s Criterion                     [3]         pythresh.thresholds.chau module
CLF     Trained Linear Classifier                 [4]         pythresh.thresholds.clf module
CLUST   Clustering Based                          [5]         pythresh.thresholds.clust module
DECOMP  Decomposition                             [6]         pythresh.thresholds.decomp module
DSN     Distance Shift from Normal                [7]         pythresh.thresholds.dsn module
EB      Elliptical Boundary                       [8]         pythresh.thresholds.eb module
FGD     Fixed Gradient Descent                    [9]         pythresh.thresholds.fgd module
FILTER  Filtering Based                           [10]        pythresh.thresholds.filter module
FWFM    Full Width at Full Minimum                [11]        pythresh.thresholds.fwfm module
GESD    Generalized Extreme Studentized Deviate   [12]        pythresh.thresholds.gesd module
HIST    Histogram Based                           [13]        pythresh.thresholds.hist module
IQR     Inter-Quartile Region                     [14]        pythresh.thresholds.iqr module
KARCH   Karcher mean (Riemannian Center of Mass)  [15]        pythresh.thresholds.karch module
MAD     Median Absolute Deviation                 [16]        pythresh.thresholds.mad module
MCST    Monte Carlo Shapiro Tests                 [17]        pythresh.thresholds.mcst module
META    Meta-model Trained Classifier             [18]        pythresh.thresholds.meta module
MOLL    Friedrichs’ Mollifier                     [19] [20]   pythresh.thresholds.moll module
MTT     Modified Thompson Tau Test                [21]        pythresh.thresholds.mtt module
OCSVM   One-Class Support Vector Machine          [22]        pythresh.thresholds.ocsvm module
QMCD    Quasi-Monte Carlo Discrepancy             [23]        pythresh.thresholds.qmcd module
REGR    Regression Based                          [24]        pythresh.thresholds.regr module
WIND    Topological Winding Number                [25]        pythresh.thresholds.wind module
YJ      Yeo-Johnson Transformation                [26]        pythresh.thresholds.yj module
ZSCORE  Z-score                                   [27]        pythresh.thresholds.zscore module
ALL     All Thresholders Combined                 None        pythresh.thresholds.all module
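As an illustration of what one of the simpler entries in this list does conceptually, the following sketch labels scores using an inter-quartile rule. It uses the common Tukey upper fence (Q3 + 1.5 × IQR), which is an assumption and may differ from the exact rule in pythresh.thresholds.iqr; the function name `iqr_labels` is hypothetical.

```python
from statistics import quantiles


def iqr_labels(scores, k=1.5):
    """Label as 1 every score above the upper fence Q3 + k * (Q3 - Q1)."""
    q1, _, q3 = quantiles(scores, n=4)  # quartile cut points
    upper_fence = q3 + k * (q3 - q1)
    return [1 if s > upper_fence else 0 for s in scores]


# Hypothetical outlier scores: nine ordinary points and one extreme one.
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12, 0.10, 0.95]
labels = iqr_labels(scores)
```

Unlike a fixed contamination level, the fence adapts to the spread of the scores themselves, which is the general motivation behind the methods listed here.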

Implementations & Benchmarks

A comparison of the implemented thresholders and their general implementation is available below.

For Jupyter Notebooks, please navigate to notebooks.

A quick look at the performance of all the thresholders can be found at “/notebooks/Compare All Models.ipynb”.

[Figure: Comparison of all thresholders]

References

Please note that not all of the references’ exact methods have been employed in PyThresh. Rather, the references serve to demonstrate the validity of the threshold types available in PyThresh.
