A scikit-learn compatible implementation of Stochastic Outlier Selection (SOS) for detecting outliers.

scikit-sos is a Python module for Stochastic Outlier Selection (SOS). It is compatible with scikit-learn. SOS is an unsupervised outlier selection algorithm. It uses the concept of affinity to compute an outlier probability for each data point.
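
To make the affinity idea concrete, here is a minimal NumPy sketch of how an outlier probability can be derived from affinities. It is an illustration under simplifying assumptions, not the code used by scikit-sos: the function name `sos_sketch` is hypothetical, and a single fixed precision `beta` replaces the per-point variance that SOS derives from the perplexity.

```python
import numpy as np

def sos_sketch(X, beta=1.0):
    """Simplified outlier probabilities (hypothetical helper, fixed beta)."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Affinity of point i towards point j; a point has no affinity with itself.
    A = np.exp(-beta * sq_dist)
    np.fill_diagonal(A, 0.0)
    # Binding probabilities: each row is normalised to sum to 1.
    B = A / A.sum(axis=1, keepdims=True)
    # A point is an outlier if no other point binds to it:
    # multiply (1 - b_ij) over all i for each column j.
    return np.prod(1.0 - B, axis=0)

# Four points in a tight square plus one point far away.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]])
scores = sos_sketch(X)
print(scores.argmax())  # index of the most outlying point
```

In this toy data set the isolated point receives almost no binding probability from the others, so its outlier probability is close to 1.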

SOS

For more information about SOS, see the technical report: J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, Tilburg, the Netherlands, 2012.

Install

pip install scikit-sos

Usage

>>> import pandas as pd
>>> from sksos import SOS
>>> iris = pd.read_csv("http://bit.ly/iris-csv")
>>> X = iris.drop("Name", axis=1).values
>>> detector = SOS()
>>> iris["score"] = detector.predict(X)
>>> iris.sort_values("score", ascending=False).head(10)
     SepalLength  SepalWidth  PetalLength  PetalWidth             Name     score
41           4.5         2.3          1.3         0.3      Iris-setosa  0.981898
106          4.9         2.5          4.5         1.7   Iris-virginica  0.964381
22           4.6         3.6          1.0         0.2      Iris-setosa  0.957945
134          6.1         2.6          5.6         1.4   Iris-virginica  0.897970
24           4.8         3.4          1.9         0.2      Iris-setosa  0.871733
114          5.8         2.8          5.1         2.4   Iris-virginica  0.831610
62           6.0         2.2          4.0         1.0  Iris-versicolor  0.821141
108          6.7         2.5          5.8         1.8   Iris-virginica  0.819842
44           5.1         3.8          1.9         0.4      Iris-setosa  0.773301
100          6.3         3.3          6.0         2.5   Iris-virginica  0.765657

Selecting outliers from the command line

This module also includes a command-line tool called sos. To illustrate, we apply SOS with a perplexity of 10 to the Iris dataset:

$ curl -sL http://bit.ly/iris-csv |
> tail -n +2 | cut -d, -f1-4 |
> sos -p 10 |
> sort -nr | head
0.98189840
0.96438132
0.95794492
0.89797043
0.87173299
0.83161045
0.82114072
0.81984209
0.77330148
0.76565738
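
In the SOS technical report, the perplexity acts as a smooth measure of the effective number of neighbours of each data point, and the width of each point's affinity function is tuned until that point's binding distribution reaches the requested perplexity. The sketch below illustrates this tuning for a single point with a bisection search. It is an illustration only: `row_perplexity` and `find_beta` are hypothetical names, and the exact form of the exponent and the stopping rule are assumptions rather than the behaviour of the `sos` tool.

```python
import numpy as np

def row_perplexity(d, beta):
    """Perplexity of the binding distribution for dissimilarities d at precision beta."""
    # Subtracting the minimum avoids underflow and leaves the normalised result unchanged.
    a = np.exp(-beta * (d - d.min()))
    b = a / a.sum()
    entropy = -np.sum(b * np.log(b + 1e-300))  # Shannon entropy in nats
    return np.exp(entropy)

def find_beta(d, perplexity, tol=1e-5, max_iter=200):
    """Bisection on beta so that row_perplexity(d, beta) approaches the target."""
    lo, hi = 0.0, 1.0
    # Perplexity decreases as beta grows, so enlarge the upper bound if needed.
    while row_perplexity(d, hi) > perplexity:
        hi *= 2.0
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        p = row_perplexity(d, mid)
        if abs(p - perplexity) < tol:
            break
        if p > perplexity:
            lo = mid  # distribution too flat: increase beta
        else:
            hi = mid  # distribution too peaked: decrease beta
    return mid

# Dissimilarities from one point to five others; target of 3 effective neighbours.
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta = find_beta(d, perplexity=3.0)
```

A small perplexity makes each point bind only to its closest neighbours, while a large one spreads the binding probability over many points.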

Adding a threshold causes SOS to output 0s and 1s instead of outlier probabilities. If we set the threshold to 0.8, 8 of the 150 data points are selected as outliers:

$ curl -sL http://bit.ly/iris-csv |
> tail -n +2 | cut -d, -f1-4 |
> sos -p 10 -t 0.8 |
> paste -sd+ | bc
8
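
The same count can be checked in Python by thresholding the probabilities directly. The sketch below uses the ten highest scores printed earlier. The helper `select_outliers` is a hypothetical name, and the strict greater-than comparison is an assumption about how the threshold is applied.

```python
import numpy as np

# The ten highest outlier probabilities reported for the Iris dataset above.
top_scores = np.array([
    0.98189840, 0.96438132, 0.95794492, 0.89797043, 0.87173299,
    0.83161045, 0.82114072, 0.81984209, 0.77330148, 0.76565738,
])

def select_outliers(scores, threshold):
    """Return 1 where a score exceeds the threshold and 0 otherwise."""
    return (np.asarray(scores) > threshold).astype(int)

flags = select_outliers(top_scores, 0.8)
print(flags.sum())  # 8
```

Because every remaining data point scores below the lowest value listed here, counting over these ten scores gives the same total as counting over all 150.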

License

All software in this repository is distributed under the terms of the BSD Simplified License. The full license is in the LICENSE file.
