A Python Toolbox for Outlier Detection Thresholding

PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection scores in univariate/multivariate data. It has been written to work in tandem with PyOD and uses similar syntax and data structures, but it is not limited to that library. PyThresh thresholds the scores generated by an outlier detector without the need to set a contamination level or guess the number of outliers in the dataset beforehand. These non-parametric methods reduce the user's input and guesswork by relying on statistics to threshold the outlier scores. For thresholding to work correctly, the scores must follow one rule: the higher the score, the higher the probability that the point is an outlier. All threshold functions return a binary array in which 0 represents an inlier and 1 represents an outlier.
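To make the score convention concrete, here is a minimal pure-Python sketch. It is not PyThresh code: the helper names (orient_scores, min_max_normalize, label_scores) and the 0.5 cut-off are hypothetical and chosen only for illustration. It shows how scores from a detector where lower means more anomalous could be flipped so that higher means more likely an outlier, and what the resulting binary labels look like.

```python
def orient_scores(scores, higher_is_outlier=True):
    """Return scores so that larger values mean 'more likely an outlier'."""
    return list(scores) if higher_is_outlier else [-s for s in scores]


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def label_scores(scores, threshold):
    """Binary labels: 1 for scores above the threshold (outliers), else 0."""
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical scores from a detector where LOWER means more anomalous
raw = [-0.1, -0.2, -0.15, -3.0, -0.12]

# Flip the sign so the convention "higher = more outlying" holds
scores = min_max_normalize(orient_scores(raw, higher_is_outlier=False))

# 0.5 is an arbitrary cut-off used only for this illustration
labels = label_scores(scores, threshold=0.5)
```

In practice the threshold would come from a thresholding method rather than being set by hand; the sketch only illustrates the input and output conventions.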

PyThresh includes more than 30 thresholding algorithms. These algorithms range from simple statistical analysis, such as the Z-score, to more complex mathematical methods that involve graph theory and topology.
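As an illustration of the simple end of that range, a Z-score based thresholder could be sketched as follows. This is a standalone sketch using only the standard library, not the PyThresh implementation; the function name zscore_labels and the cutoff parameter are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=3.0):
    """Label a score as an outlier (1) when its z-score exceeds cutoff."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        # All scores are identical, so nothing stands out
        return [0] * len(scores)
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


# Hypothetical outlier scores: nine similar values and one large one
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12, 0.10, 0.95]

# With only ten points a z-score can never exceed 3, so a lower
# cutoff is used in this small example
labels = zscore_labels(scores, cutoff=2.0)
```

Only the upper tail is tested because, under the convention used here, only high scores indicate outliers.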

Outlier Detection Thresholding with 7 Lines of Code:

# train the KNN detector
# (X_train is assumed to be an existing array of shape (n_samples, n_features))
from pyod.models.knn import KNN
from pythresh.thresholds.dsn import DSN

clf = KNN()
clf.fit(X_train)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels
thres = DSN()
labels = thres.eval(decision_scores)
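Since the labels are a binary array, the number of detected outliers and the implied contamination level can be read off directly. The sketch below assumes the labels have been converted to a plain Python list; summarize_labels is a hypothetical helper, not part of PyThresh or PyOD.

```python
def summarize_labels(labels):
    """Return (number of outliers, implied contamination) for 0/1 labels."""
    n_total = len(labels)
    n_outliers = sum(1 for value in labels if value == 1)
    contamination = n_outliers / n_total if n_total else 0.0
    return n_outliers, contamination


# Hypothetical labels as returned by a thresholder (1 = outlier)
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
n_outliers, contamination = summarize_labels(labels)

# Indices of the points flagged as outliers
outlier_idx = [i for i, value in enumerate(labels) if value == 1]
```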

Installation

It is recommended to use pip or conda for installation:

pip install pythresh            # normal install
pip install --upgrade pythresh  # or update if needed
conda install -c conda-forge pythresh

Alternatively, you can clone the repository and install it from source:

git clone https://github.com/KulikDM/pythresh
cd pythresh
pip install .

Required Dependencies:

  • matplotlib

  • numpy>=1.13

  • scipy>=1.3.1

  • scikit_learn>=0.20.0

  • six

  • pyod

API Cheatsheet

  • eval(score): evaluate the outlier scores and return binary labels (0 for inliers, 1 for outliers).

Key Attributes of threshold:

  • thresh_: the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from the normalized scores, not the raw scores.
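Because thresh_ refers to normalized scores, it cannot be compared with raw detector scores directly. The sketch below assumes the normalization is a min-max rescaling to [0, 1]; this is an assumption made for illustration, and denormalize_threshold is a hypothetical helper rather than a PyThresh function.

```python
def denormalize_threshold(thresh_norm, raw_scores):
    """Map a threshold on min-max normalized scores back to the raw scale."""
    lo, hi = min(raw_scores), max(raw_scores)
    return lo + thresh_norm * (hi - lo)


# Hypothetical raw detector scores and a hypothetical thresh_ value
raw_scores = [2.0, 4.0, 6.0, 10.0, 42.0]
thresh_norm = 0.25

raw_cutoff = denormalize_threshold(thresh_norm, raw_scores)

# Raw scores above the mapped cut-off are the outliers
outliers = [s for s in raw_scores if s > raw_cutoff]
```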

Implemented Algorithms

(i) Individual Thresholding Algorithms:

| Abbr | Description | Parameters |
|------|-------------|------------|
| AUCP | Area Under Curve Percentage | None |
| BOOT | Bootstrapping | None |
| CHAU | Chauvenet's Criterion | method: ['mean', 'median', default='gmean'] |
| CLF | Trained Classifier | None |
| DSN | Distance Shift from Normal | metric: ['JS': Jensen-Shannon, 'WS': Wasserstein, 'ENG': Energy, 'BHT': Bhattacharyya, 'HLL': Hellinger, 'HI': Histogram intersection, default='LK': Lukaszyk–Karmowski metric for normal distributions, 'LP': Levy-Prokhorov, 'MAH': Mahalanobis, 'TMT': Tanimoto, 'RES': Studentized residual distance] |
| EB | Elliptical Boundary | None |
| FGD | Fixed Gradient Descent | None |
| FWFM | Full Width at Full Minimum | None |
| GESD | Generalized Extreme Studentized Deviate | max_outliers: int, default=None; alpha: float, default=0.05 |
| GF | Gaussian Filter | None |
| HIST | Histogram Based | n_bins: int, default=None; method: [default='otsu', 'yen', 'isodata', 'li', 'minimum', 'triangle'] |
| IQR | Inter-Quartile Region | None |
| KMEANS | K-means Clustering | None |
| MAD | Median Absolute Deviation | None |
| MCST | Monte Carlo Shapiro Tests | None |
| MOLL | Friedrichs' Mollifier | None |
| MTT | Modified Thompson Tau Test | strictness: [1, 2, 3, default=4, 5] |
| QMCD | Quasi-Monte Carlo Discrepancy | method: ['CD', default='WD', 'MD', 'L2-star']; lim: ['Q', default='P'] |
| REGR | Regression Based | method: [default='siegel', 'theil'] |
| SHIFT | Mean Shift Clustering | None |
| WIND | Topological Winding Number | None |
| YJ | Yeo-Johnson Transformation | None |
| ZSCORE | Z-score | None |
| ALL | All Thresholders Combined | thresholders: list, default='all'; max_contam: float, default=0.5; method: [default='mean', 'median', 'gmean'] |
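The parameters of the combined ALL thresholder suggest how several thresholders might be merged. The following sketch is one plausible reading, not the PyThresh implementation: each thresholder proposes a contamination level, proposals above max_contam are discarded, the remaining levels are combined with method, and the highest-scoring points are labeled as outliers. The functions iqr_labels, mad_labels and combine, along with the fence and cutoff constants, are assumptions made for this example.

```python
from statistics import mean, median, quantiles


def iqr_labels(scores):
    """Flag scores above Q3 + 1.5 * IQR (the usual Tukey fence)."""
    q1, _, q3 = quantiles(scores, n=4)
    upper = q3 + 1.5 * (q3 - q1)
    return [1 if s > upper else 0 for s in scores]


def mad_labels(scores, cutoff=3.5):
    """Flag scores whose MAD-based modified z-score exceeds cutoff."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        return [0] * len(scores)
    return [1 if 0.6745 * (s - med) / mad > cutoff else 0 for s in scores]


def combine(scores, thresholders, max_contam=0.5, method=mean):
    """Combine thresholders through their implied contamination levels."""
    n = len(scores)
    contams = []
    for thresholder in thresholders:
        contam = sum(thresholder(scores)) / n
        # Ignore thresholders that flag an implausibly large share
        if contam <= max_contam:
            contams.append(contam)
    if not contams:
        return [0] * n
    # Number of points to flag under the combined contamination level
    k = round(method(contams) * n)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:k])
    return [1 if i in flagged else 0 for i in range(n)]


# Hypothetical scores: eight similar values and two large ones
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12, 0.90, 0.95]
labels = combine(scores, [iqr_labels, mad_labels])
```

Passing method=median instead of the default mean changes only how the individual contamination levels are aggregated.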

Implementations & Benchmarks

A comparison of the implemented models and the general implementation is available below.

For Jupyter Notebooks, please navigate to notebooks.

A quick look at the performance of all the thresholders can be found at “/notebooks/Compare All Models.ipynb”.

![](https://raw.githubusercontent.com/KulikDM/pythresh/blob/ceccb69abbd76692f60b77570f381c11bc49d3a8/imgs/All.png)

Source distribution: pythresh-0.1.3.tar.gz (21.7 kB)
