Skip to main content

Self-Supervised Learning for Outlier Detection.

Project description

Self-Supervised Learning for Outlier Detection

The detection of outliers can be very challenging, especially if the data has features that do not carry information about the outlyingness of a point.For supervised problems, there are many methods for selecting appropriate features. For unsupervised problems it can be challenging to select features that are meaningful for outlier detection. We propose a method to transform the unsupervised problem of outlier detection into a supervised problem to mitigate the problem of irrelevant features and the hiding of outliers in these features. We benchmark our model against common outlier detection models and have clear advantages in outlier detection when many irrelevant features are present.

This repository contains the code used for the experiments, as well as instructions to reproduce our results. For reproduction of our results, please switch to the "publication" branch or click here.

As soon as our paper will be published online, the link for interested readers will appear here.

Installation

The software can be installed by using pip. We recommend to use a virtual environment for installation, for example venv. See the official guide.

To install our software, run

pip install noisy_outlier

Usage

For outlier detection, you can use the NoisyOutlierDetector as follows. The methods follow the scikit-learn syntax:

import numpy as np
from noisy_outlier import NoisyOutlierDetector
X = np.random.randn(50, 2)  # some sample data
model = NoisyOutlierDetector()
model.fit(X)
model.predict(X)  # returns binary decisions, 1 for outlier, 0 for inlier
model.predict_outlier_probability(X)  # predicts probability for being an outlier, this is the recommended way   

The NoisyOutlierDetector has several hyperpararameters such as the number of estimators for the classification problem or the pruning parameter. To our experience, the default values for the NoisyOutlierDetector provide stable results. However, you also have the choice to run routines for optimizing hyperparameters based on a RandomSearch. Details can be found in the paper. Use the HyperparameterOptimizer as follows:

import numpy as np
from scipy.stats.distributions import uniform, randint
from sklearn import metrics

from noisy_outlier import HyperparameterOptimizer, PercentileScoring
from noisy_outlier import NoisyOutlierDetector

X = np.random.randn(50, 5)
grid = dict(n_estimators=randint(50, 150), ccp_alpha=uniform(0.01, 0.3), min_samples_leaf=randint(5, 10))
optimizer = HyperparameterOptimizer(
                estimator=NoisyOutlierDetector(),
                param_distributions=grid,
                scoring=metrics.make_scorer(PercentileScoring(0.05), needs_proba=True),
                n_jobs=None,
                n_iter=5,
                cv=3,
            )
optimizer.fit(X)
# The optimizer is itself a `NoisyOutlierDetector`, so you can use it in the same way:
outlier_probability = optimizer.predict_outlier_probability(X)

Details about the algorithms may be found in our publication. If you use this work for your publication, please cite as follows. To reproduce our results, please switch to the "publication" branch or click here.

BibTeX Entry coming soon

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

noisy_outlier-0.1.3.tar.gz (6.1 kB view hashes)

Uploaded Source

Built Distribution

noisy_outlier-0.1.3-py3-none-any.whl (7.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page