
A Python Toolbox for Outlier Detection Thresholding


PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection scores in univariate/multivariate data. It has been written to work in tandem with PyOD and uses similar syntax and data structures, but it is not limited to that library. PyThresh is meant to threshold the scores generated by an outlier detector. It does so without requiring a contamination level to be set, so the user does not have to guess beforehand how many outliers the dataset may contain. These non-parametric methods were written to reduce the user's input and guesswork and to rely on statistics instead. The scores must follow one rule for thresholding to apply correctly: the higher the score, the higher the probability that the point is an outlier. All threshold functions return a binary array in which 0 marks an inlier and 1 marks an outlier.

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
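To give a feel for the simplest end of that range, here is a minimal pure-Python sketch of one-sided Z-score thresholding. It is an illustration of the general idea only and is not PyThresh's ZSCORE implementation; the function name `zscore_labels` and the `cutoff` parameter are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=3.0):
    """Return 1 for scores whose Z-score exceeds `cutoff`, else 0.

    Illustrative only: `zscore_labels` and `cutoff` are hypothetical names,
    not part of PyThresh.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so nothing stands out
        return [0 for _ in scores]
    # One-sided test: only unusually HIGH scores count as outliers
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.10, 0.95]
labels = zscore_labels(scores, cutoff=2.0)
print(labels)
```

Only the last score is flagged here. Note that the cutoff still has to be chosen by hand in this sketch, which is the kind of user input the more elaborate methods aim to avoid.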

Outlier Detection Thresholding with 7 Lines of Code:

# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.dsn import DSN

clf = KNN()
clf.fit(X_train)  # X_train: your training data, shape (n_samples, n_features)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels
thres = DSN()
labels = thres.eval(decision_scores)
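Because the labels are a 0/1 array, the contamination level implied by the thresholder can be read back from them afterwards. The helper below is a small sketch of that bookkeeping; `contamination` is a hypothetical name and not a PyThresh function, and the example labels are made up.

```python
def contamination(labels):
    """Fraction of points labelled as outliers (1) in a 0/1 label sequence.

    Hypothetical helper for illustration; not part of PyThresh or PyOD.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for v in labels if v == 1) / len(labels)


# Made-up labels standing in for the output of thres.eval(...)
example_labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(contamination(example_labels))  # 0.2
```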

Installation

It is recommended to use pip for installation:

pip install pythresh            # normal install
pip install --upgrade pythresh  # or update if needed

Alternatively, you could clone the repository and install it from source:

git clone https://github.com/KulikDM/pythresh.git
cd pythresh
pip install .

Required Dependencies:

  • geomstats

  • matplotlib

  • numpy>=1.13

  • pyod

  • scipy>=1.3.1

  • scikit_learn>=0.20.0

  • six
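After installing, a quick way to see which of these packages are present is to query the installed distributions with the standard library. This is a rough sketch: it only reports versions and does not compare them against the minimums listed above, and the distribution names (for example `scikit-learn`) are assumed from the dependency list.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed from the dependency list above
REQUIRED = [
    "pythresh", "geomstats", "matplotlib", "numpy",
    "pyod", "scipy", "scikit-learn", "six",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED)
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```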

API Cheatsheet

  • eval(score): evaluate the outlier scores and return a binary array of labels (0 for inliers, 1 for outliers).

Key Attributes of threshold:

  • thresh_: The threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from normalized scores, not from the raw scores.
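The toy class below sketches how an `eval` method and a `thresh_` attribute can fit together, and why a threshold derived from normalized scores is not directly comparable to the raw scores. It is a hypothetical stand-in and not PyThresh code: the class name `ToyIQRThreshold`, the min-max normalization, and the 1.5 × IQR rule are all assumptions chosen for illustration.

```python
from statistics import quantiles


class ToyIQRThreshold:
    """Hypothetical stand-in mimicking the eval()/thresh_ pattern; not PyThresh code."""

    def __init__(self):
        self.thresh_ = None  # set by eval()

    def eval(self, scores):
        lo, hi = min(scores), max(scores)
        span = hi - lo
        # Assumed normalization: min-max scaling to [0, 1]
        norm = [(s - lo) / span if span else 0.0 for s in scores]
        # Quartiles of the normalized scores (needs at least two scores)
        q1, _, q3 = quantiles(norm, n=4, method="inclusive")
        # Threshold lives on the normalized scale, not the raw one
        self.thresh_ = q3 + 1.5 * (q3 - q1)
        return [1 if s > self.thresh_ else 0 for s in norm]


raw_scores = [1.0, 1.2, 1.1, 0.9, 1.3, 1.0, 1.2, 1.1, 1.0, 9.0]
toy = ToyIQRThreshold()
toy_labels = toy.eval(raw_scores)
print(toy_labels)
print(toy.thresh_)
```

In this sketch `thresh_` falls between 0 and 1 even though the raw scores range up to 9.0, so it should be compared against normalized scores only.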

Implemented Algorithms

(i) Individual Thresholding Algorithms:

| Abbr | Description | Parameters |
|------|-------------|------------|
| AUCP | Area Under Curve Percentage [1] | None |
| BOOT | Bootstrapping [2] | None |
| CHAU | Chauvenet's Criterion [3] | method: [default='mean', 'median', 'gmean'] |
| CLF | Trained Classifier [4] | None |
| DSN | Distance Shift from Normal [5] | metric: [default='JS': Jensen-Shannon, 'WS': Wasserstein, 'ENG': Energy, 'BHT': Bhattacharyya, 'HLL': Hellinger, 'HI': Histogram intersection, 'LK': Lukaszyk–Karmowski metric for normal distributions, 'LP': Levy-Prokhorov, 'MAH': Mahalanobis, 'TMT': Tanimoto, 'RES': Studentized residual distance, 'KS': Kolmogorov–Smirnov] |
| EB | Elliptical Boundary [6] | None |
| FGD | Fixed Gradient Descent [7] | None |
| FILTER | Filtering Based [8] | method: ['gaussian', 'savgol', 'hilbert', default='wiener', 'medfilt', 'decimate', 'detrend', 'resample']; sigma: int, default='native' |
| FWFM | Full Width at Full Minimum [9] | None |
| GESD | Generalized Extreme Studentized Deviate [10] | max_outliers: int, default='native'; alpha: float, default=0.05 |
| HIST | Histogram Based [11] | n_bins: int, default='native'; method: [default='otsu', 'yen', 'isodata', 'li', 'minimum', 'triangle'] |
| IQR | Inter-Quartile Region [12] | None |
| KARCH | Karcher mean (Riemannian Center of Mass) [13] | ndim: int, default=2; method: ['simple', default='complex'] |
| KMEANS | K-means Clustering [14] | None |
| MAD | Median Absolute Deviation [15] | None |
| MCST | Monte Carlo Shapiro Tests [16] | None |
| MOLL | Friedrichs' Mollifier [17] [18] | None |
| MTT | Modified Thompson Tau Test [19] | strictness: [1, 2, 3, default=4, 5] |
| QMCD | Quasi-Monte Carlo Discrepancy [20] | method: ['CD', default='WD', 'MD', 'L2-star']; lim: ['Q', default='P'] |
| REGR | Regression Based [21] | method: [default='siegel', 'theil'] |
| SHIFT | Mean Shift Clustering [22] | None |
| WIND | Topological Winding Number [23] | None |
| YJ | Yeo-Johnson Transformation [24] | None |
| ZSCORE | Z-score [25] | None |
| ALL | All Thresholders Combined | thresholders: list, default='all'; max_contam: float, default=0.5; method: [default='mean', 'median', 'gmean'] |
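As an illustration of one of the simpler statistical ideas listed above, the sketch below thresholds scores by their median absolute deviation using the common modified Z-score rule (0.6745 × deviation / MAD, flagged above 3.5). It is a generic textbook-style version and not PyThresh's MAD implementation; the function name `mad_labels` and the `cutoff` default are assumptions for this example.

```python
from statistics import median


def mad_labels(scores, cutoff=3.5):
    """Return 1 for scores whose modified Z-score exceeds `cutoff`, else 0.

    Illustrative only: `mad_labels` and `cutoff` are hypothetical names,
    not part of PyThresh.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # More than half the scores equal the median; no usable spread
        return [0 for _ in scores]
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data; only high scores are flagged
    return [1 if 0.6745 * (s - med) / mad > cutoff else 0 for s in scores]


mad_scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.10, 0.95]
print(mad_labels(mad_scores))
```

Using the median rather than the mean keeps the single extreme score from inflating the spread estimate, which is why this rule flags it comfortably.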

Implementations & Benchmarks

A comparison of the implemented models, along with examples of general implementation, is available below.

For Jupyter Notebooks, please navigate to notebooks.

A quick look at the performance of all the thresholders can be found at "/notebooks/Compare All Models.ipynb".

[Figure: Comparison of all thresholders]

References

Please note that the exact methods of the references have not all been employed in PyThresh. Rather, the references serve to demonstrate the validity of the threshold types available in PyThresh.

