A Python Toolbox for Outlier Detection Thresholding

Project description

PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection scores in univariate/multivariate data. It has been written to work in tandem with PyOD and has similar syntax and data structures. However, it is not limited to this single library. PyThresh is meant to threshold the scores generated by an outlier detector. It does so without the need to set a contamination level or to guess the number of outliers in the dataset beforehand. These non-parametric methods were written to reduce the user’s input and guesswork and to rely on statistics instead to threshold outlier scores. For thresholding to be applied correctly, the scores must follow this rule: the higher the score, the higher the probability that it is an outlier in the dataset. All threshold functions return a binary array where 0 values represent inliers and 1 values represent outliers.

PyThresh includes more than 30 thresholding algorithms. These algorithms range from simple statistical analysis, such as the Z-score, to more complex mathematical methods that involve graph theory and topology.
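To make the idea concrete, here is a minimal sketch of Z-score thresholding in plain Python. It is not PyThresh's implementation; the function name `zscore_labels` and its `cutoff` parameter are assumptions made for illustration. It only shows the contract described above: higher scores are more anomalous, and the result is a binary list of 0 (inlier) and 1 (outlier).

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=3.0):
    """Label scores whose z-score exceeds `cutoff` as outliers (1), else 0.

    `cutoff` is a hypothetical parameter chosen for this sketch.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so nothing stands out.
        return [0] * len(scores)
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


# Nine similar scores and one clearly larger one.
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.10, 0.95]
labels = zscore_labels(scores, cutoff=2.0)
```

On very small samples a single extreme value inflates the standard deviation, so a cutoff of 3 may never be reached; the example uses 2.0 for that reason.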

Outlier Detection Thresholding with 7 Lines of Code:

# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.dsn import DSN

clf = KNN()
clf.fit(X_train)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels
thres = DSN()
labels = thres.eval(decision_scores)
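The snippet above assumes that X_train already exists and that the detector exposes raw scores where larger means more anomalous. As a rough illustration of what such scores look like, the sketch below computes a distance-to-k-th-neighbour score in plain Python. It is a simplified stand-in, not PyOD's KNN implementation, and `kth_neighbor_scores` is a hypothetical helper.

```python
import math


def kth_neighbor_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour.

    Larger scores mean the point sits farther from its neighbours.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points in a tight cluster and one far away.
X_train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
decision_scores = kth_neighbor_scores(X_train, k=2)
```

Any score list of this kind, from PyOD or elsewhere, can then be passed to a thresholder as long as it keeps the higher-is-more-anomalous convention.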

Installation

It is recommended to use pip or conda for installation:

pip install pythresh            # normal install
pip install --upgrade pythresh  # or update if needed
conda install -c conda-forge pythresh

Alternatively, you can clone the repository and install from source:

git clone https://github.com/KulikDM/pythresh
cd pythresh
pip install .

Required Dependencies:

  • matplotlib

  • numpy>=1.13

  • scipy>=1.3.1

  • scikit_learn>=0.20.0

  • six

  • pyod

API Cheatsheet

  • eval(score): evaluate the outlier scores and return binary labels (0 for inliers, 1 for outliers).

Key Attributes of a fitted model:

  • thresh_: The threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from the normalized scores.
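Because the threshold lives on the normalized scale, it cannot be compared directly with raw detector scores. The sketch below shows one way to reason about this, assuming min-max scaling to [0, 1]; the text above only says "normalized", so the scaling method is an assumption, and all three helper names are hypothetical.

```python
def min_max_normalize(scores):
    """Scale scores linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def labels_from_threshold(scores, thresh):
    """Label as outliers (1) the scores whose normalized value exceeds thresh."""
    return [1 if s > thresh else 0 for s in min_max_normalize(scores)]


def to_raw_scale(thresh, scores):
    """Map a threshold on the normalized scale back to the raw score scale."""
    lo, hi = min(scores), max(scores)
    return lo + thresh * (hi - lo)


raw_scores = [2.0, 3.0, 4.0, 12.0]
labels = labels_from_threshold(raw_scores, thresh=0.5)
raw_cutoff = to_raw_scale(0.5, raw_scores)
```

Under that assumption, a normalized threshold of 0.5 on these scores corresponds to a raw cutoff halfway between the smallest and largest score.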

Implemented Algorithms

(i) Individual Thresholding Algorithms:

Abbr    Description                                          Parameters
------  ---------------------------------------------------  ----------
AUCP    Area Under Curve Percentage thresholder              None
BOOT    Bootstrapping thresholder                            None
CHAU    Chauvenet’s criterion thresholder                    method: ['mean', 'median', default='gmean']
CLF     Trained Classifier thresholder                       None
DSN     Distance Shift from Normal thresholder               metric: ['JS': Jensen-Shannon, 'WS': Wasserstein, 'ENG': Energy,
                                                             'BHT': Bhattacharyya, 'HLL': Hellinger, 'HI': Histogram intersection,
                                                             default='LK': Lukaszyk–Karmowski metric for normal distributions,
                                                             'LP': Levy-Prokhorov, 'MAH': Mahalanobis, 'TMT': Tanimoto,
                                                             'RES': Studentized residual distance]
EB      Elliptical Boundary thresholder                      None
FGD     Fixed Gradient Descent thresholder                   None
FWFM    Full Width at Full Minimum thresholder               None
GESD    Generalized Extreme Studentized Deviate thresholder  max_outliers: int, default=None; alpha: float, default=0.05
GF      Gaussian Filter thresholder                          None
HIST    Histogram based thresholders                         n_bins: int, default=None; method: [default='otsu', 'yen', 'isodata',
                                                             'li', 'minimum', 'triangle']
IQR     Inter-Quartile Region thresholder                    None
KMEANS  KMEANS clustering thresholder                        None
MAD     Median Absolute Deviation thresholder                None
MCST    Monte Carlo Shapiro Tests thresholder                None
MOLL    Friedrichs’ mollifier thresholder                    None
MTT     Modified Thompson Tau test thresholder               strictness: [1, 2, 3, default=4, 5]
QMCD    Quasi-Monte Carlo Discrepancy thresholder            method: ['CD', default='WD', 'MD', 'L2-star']; lim: ['Q', default='P']
REGR    Regression based thresholder                         method: [default='siegel', 'theil']
SHIFT   Mean Shift clustering thresholder                    None
WIND    Topological Winding number thresholder               None
YJ      Yeo-Johnson transformation thresholder               None
ZSCORE  ZSCORE thresholder                                   None
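Two of the simpler entries in the list, IQR and MAD, correspond to well-known statistical rules. The sketch below shows textbook versions of both in plain Python. These are not PyThresh's implementations: the function names, the 1.5 multiplier, the 3.0 cutoff, and the 1.4826 consistency constant are conventional choices assumed here for illustration.

```python
from statistics import median, quantiles


def iqr_labels(scores, k=1.5):
    """Flag scores above Q3 + k * IQR as outliers (1), else inliers (0)."""
    q1, _, q3 = quantiles(scores, n=4)
    upper = q3 + k * (q3 - q1)
    return [1 if s > upper else 0 for s in scores]


def mad_labels(scores, cutoff=3.0):
    """Flag scores whose robust z-score (median/MAD based) exceeds cutoff."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0:
        # At least half of the scores equal the median; no usable spread.
        return [0] * len(scores)
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the data are normally distributed.
    scale = 1.4826 * mad
    return [1 if (s - med) / scale > cutoff else 0 for s in scores]


scores = [1, 2, 3, 4, 5, 6, 7, 8, 50]
iqr_result = iqr_labels(scores)
mad_result = mad_labels(scores)
```

Both rules only look at the upper tail, which matches the convention that higher scores are more anomalous.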

Implementations & Benchmarks

A comparison of the implemented models and a general implementation example are made available below.

For Jupyter Notebooks, please navigate to “/notebooks/”.

Download files

Source Distribution

pythresh-0.1.0.tar.gz (20.5 kB)
