Self-Supervised Learning for Outlier Detection.
Outlier detection can be very challenging, especially when the data contains features that carry no information about the outlyingness of a point. For supervised problems, many methods exist for selecting appropriate features; for unsupervised problems, it is difficult to select features that are meaningful for outlier detection. We propose a method that transforms the unsupervised problem of outlier detection into a supervised one, mitigating the impact of irrelevant features and preventing outliers from hiding in them. We benchmark our model against common outlier detection models and show clear advantages in outlier detection when many irrelevant features are present.
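The paper's method itself is not reproduced here, but the general idea of turning unsupervised outlier detection into a supervised problem can be sketched with plain scikit-learn: train a classifier to separate the observed data from artificial uniform noise, and read off its class probabilities as outlier scores. Everything below (the classifier choice, the noise distribution, the toy data) is an illustrative assumption, not the library's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # toy data: mostly inliers
X[:5] += 6.0                   # shift a few points so they behave like outliers

# Sample artificial noise uniformly over the bounding box of the data.
noise = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

# Supervised surrogate problem: label 1 = real data, label 0 = noise.
X_train = np.vstack([X, noise])
y_train = np.hstack([np.ones(len(X)), np.zeros(len(noise))])

clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)

# Points the classifier finds hard to distinguish from noise get high scores.
outlier_score = 1.0 - clf.predict_proba(X)[:, 1]
```

Because the classifier is trained on the noise-vs-data task, standard supervised tools (feature importances, hyperparameter search) become available for what started as an unsupervised problem.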
This repository contains the code used for the experiments, as well as instructions to reproduce our results. To reproduce them, please switch to the "publication" branch.
As soon as our paper is published online, a link for interested readers will appear here.
Installation
The software can be installed using pip. We recommend installing into a virtual environment, for example with venv (see the official Python guide).
To install our software, run
pip install noisy_outlier
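For example, a typical venv-based setup might look like the following (the `.venv` directory name is just a common convention):

```shell
# Create and activate a virtual environment, then install the package.
python3 -m venv .venv
source .venv/bin/activate
pip install noisy_outlier
```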
Usage
For outlier detection, you can use the NoisyOutlierDetector as follows; its methods follow the scikit-learn syntax:
import numpy as np
from noisy_outlier import NoisyOutlierDetector
X = np.random.randn(50, 2) # some sample data
model = NoisyOutlierDetector()
model.fit(X)
model.predict(X) # returns binary decisions, 1 for outlier, 0 for inlier
model.predict_outlier_probability(X) # predicts the probability of being an outlier; this is the recommended way
The NoisyOutlierDetector has several hyperparameters, such as the number of estimators for the classification problem or the pruning parameter. In our experience, the default values of the NoisyOutlierDetector provide stable results. However, you can also run routines for optimizing the hyperparameters based on a random search; details can be found in the paper. Use the HyperparameterOptimizer as follows:
import numpy as np
from scipy.stats.distributions import uniform, randint
from sklearn import metrics
from noisy_outlier import HyperparameterOptimizer, PercentileScoring
from noisy_outlier import NoisyOutlierDetector
X = np.random.randn(50, 5)
grid = dict(n_estimators=randint(50, 150), ccp_alpha=uniform(0.01, 0.3), min_samples_leaf=randint(5, 10))
optimizer = HyperparameterOptimizer(
estimator=NoisyOutlierDetector(),
param_distributions=grid,
scoring=metrics.make_scorer(PercentileScoring(0.05), needs_proba=True),
n_jobs=None,
n_iter=5,
cv=3,
)
optimizer.fit(X)
# The optimizer is itself a `NoisyOutlierDetector`, so you can use it in the same way:
outlier_probability = optimizer.predict_outlier_probability(X)
Details about the algorithms can be found in our publication. If you use this work in your own publication, please cite it as follows. To reproduce our results, please switch to the "publication" branch.
BibTeX Entry coming soon
Hashes for noisy_outlier-0.1.3-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | ef5cca013b849d9f9c3cd3ee110d98ecf083924118a4ce1f0e202edbca5e7eed
MD5 | 6f2ac2cb80676cf70c0f10e393a17e25
BLAKE2b-256 | 5993272eef5893a9f11e1fc7b761745449e4b3d9689c822992621ca4d6d5b3d5