Research repository on outlier detection and analysis of heavy-tailed distributions.

OrderStatistics

A repository focused on outlier detection and analysis of heavy-tailed distributions using order statistics. Here you can find:

  • Simulations of order statistics for outlier detection, including the statistic introduced in our paper "On a Notion of Outliers Based on Ratios of Order Statistics".
  • Single and double bootstrap methods (and more) for tail index estimation of heavy-tailed data.
  • Special kernel density estimation methods for heavy-tailed data.

Also:

  • Useful plotting functions for outlier detection.
  • Easy reporting tools for binary classification, such as get_classification_report(), which reports all classification metrics from the confusion matrix to AUC and can plot the ROC curve.

Installation:

You can install the repository with:

pip install orderstats

Alternatively, you can download a copy of the repository from this page. After downloading, run pip install -r requirements.txt to install the requirements.

General explanation about Random Variables:

We use the scipy.stats package to generate random variables. A random variable instance is created by passing the appropriate parameters. For example, for X ~ N(0, 1):
>>> from scipy import stats
>>> X = stats.norm(0, 1)
Once an instance is created, we can evaluate the pdf or cdf with the .pdf() and .cdf() methods, or draw a sample with .rvs():
>>> X.pdf(1.96)  # pdf of X ~ N(0, 1) at x = 1.96
>>> X.cdf(1.96)  # cdf of X ~ N(0, 1) at x = 1.96
>>> X.rvs(1000)  # i.i.d. sample of size 1000 from X ~ N(0, 1)

Some problems:

Not every random variable function in scipy.stats is intuitive.
For instance, the expon function creates an instance of an exponential distribution, but its arguments are loc and scale rather than the rate. To get an exponential distribution with lambda = 2 (i.e. with pdf f(x) = 2e^(-2x)), we need expon(0, 1/2), because scale = 1/lambda. In distributions.py, there are examples for some popular distributions to clarify such ambiguities.
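The parameterization above can be checked directly. The following is a small sketch using only scipy.stats; the variable names are illustrative and not part of orderstats.

```python
import math
from scipy import stats

# Exponential distribution with rate lambda = 2.
# scipy.stats.expon takes (loc, scale), where scale = 1 / lambda.
rate = 2.0
X = stats.expon(loc=0, scale=1 / rate)

# The pdf should match f(x) = 2 * exp(-2x).
x = 0.5
pdf_scipy = X.pdf(x)
pdf_formula = rate * math.exp(-rate * x)

# The mean of an exponential distribution is 1 / lambda.
mean_scipy = X.mean()
```

Passing the rate itself as the second argument, expon(0, 2), would instead give a distribution with mean 2, i.e. rate 1/2.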

Examples

Outlier Detection:

Here is an example use of our method for a 1D dataset X:

from orderstats import scoring
from orderstats.distributions import moving_average_unscaled_kappa
from orderstats.plot_utils import plot_anomalies
scores, scores_sorted = scoring.get_anomaly_scores(X, scoring_func=moving_average_unscaled_kappa)
threshold = scoring.get_kappa_threshold(scores_sorted)
predictions = scores > threshold
plot_anomalies(X, predictions=predictions)

Order Statistics Simulation:

In general, we use the OrderSimulation class in distributions for simulations. Any random variable from the scipy.stats package can be passed to this class as an argument. For example, to simulate the sum of the first m order statistics of an exponential sample of size n:

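from scipy import stats
from orderstats import OrderSimulation
# calculate_S_m, n and m are assumed to be in scope; the import of calculate_S_m is not shown here
simulate_expon_dist = OrderSimulation(stats.expon(0, 1), calculate_S_m)
simulation = simulate_expon_dist(10000, n, m)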
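To make the simulated quantity concrete, here is a minimal NumPy sketch of a Monte Carlo simulation of the sum of the first m order statistics. It is not the OrderSimulation implementation: the function name simulate_sum_of_first_m is hypothetical, and "first m" is assumed to mean the m smallest values of each sample.

```python
import numpy as np

def simulate_sum_of_first_m(sampler, n_sims, n, m):
    """Draw n_sims samples of size n and return, for each sample,
    the sum of its m smallest order statistics (assumed convention)."""
    samples = sampler(size=(n_sims, n))   # one row per simulated sample
    ordered = np.sort(samples, axis=1)    # ascending order statistics per row
    return ordered[:, :m].sum(axis=1)

rng = np.random.default_rng(0)
# Standard exponential (rate 1) samples of size n = 50, summing the first m = 5.
sums = simulate_sum_of_first_m(rng.exponential, n_sims=10000, n=50, m=5)
```

A frozen scipy.stats distribution could be used in place of rng.exponential by passing its .rvs method, which also accepts a size argument.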

For studying change point detection, we provide the MixSimulation class. To get a mixed sample with the corresponding ids:

from scipy import stats
from orderstats import MixSimulation
simulate_mixture = MixSimulation(dist1=stats.expon(0, 1), dist2=stats.pareto(2.))
mixed_array, idx = simulate_mixture(n1, n2)  # n1, n2: sample sizes drawn from dist1 and dist2

Tail Index Estimation:

Most of the methods for tail index estimation mentioned in [1] are implemented in tail_estimation. As an example, for the double bootstrap method:

import numpy as np
from orderstats.tail_estimation import DoubleBootstrap  # import path assumed from the module name above
N = 1000
pareto_sample = np.random.pareto(2, N)
sample_to_estimate_index = np.sort(pareto_sample)[::-1]  # Sort decreasing
double_bootstrap = DoubleBootstrap()
tail_index = double_bootstrap(N, sample_to_estimate_index)
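For orientation, the classical Hill estimator is one of the standard tail index estimators, and bootstrap procedures are commonly used to choose its number of upper order statistics k. The sketch below is a plain NumPy version with a fixed, hand-picked k. It is not the package's DoubleBootstrap, and the function name hill_tail_index is hypothetical.

```python
import numpy as np

def hill_tail_index(sample, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    x = np.sort(np.asarray(sample, dtype=float))[::-1]  # sort decreasing
    if not 0 < k < x.size:
        raise ValueError("k must satisfy 0 < k < len(sample)")
    if x[k] <= 0:
        raise ValueError("the (k+1)-th largest observation must be positive")
    # Mean log-excess of the top k values over the (k+1)-th largest value.
    gamma_hat = np.mean(np.log(x[:k] / x[k]))
    return 1.0 / gamma_hat  # alpha = 1 / gamma

rng = np.random.default_rng(0)
# Classical Pareto sample with tail index 2 and minimum 1
# (NumPy's pareto draws a Lomax variate, so 1 is added).
sample = rng.pareto(2.0, size=20000) + 1.0
alpha_hat = hill_tail_index(sample, k=2000)
```

For an exact Pareto sample like this one, the estimate should be close to 2 for a wide range of k. For real data the choice of k matters considerably, which is the problem the bootstrap methods address.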

References:

[1] Markovich, N. (2008). Nonparametric analysis of univariate heavy-tailed data: research and practice (Vol. 753). John Wiley & Sons.
