
# CFOF (Concentration Free Outlier Factor)

🚧 Work in progress.

Python implementation of Concentration Free Outlier Factor (CFOF) [1].

## CFOF properties

• Concentration free
• Does not suffer from the hubness problem
• Semi-locality
• The fast-CFOF algorithm reliably computes CFOF scores at a cost that is linear in both the dataset size and the dimensionality

## Installation

To install the latest release:

$ pip install cfof


## Usage

Import CFOF and FastCFOF.

>>> from cfof import CFOF, FastCFOF
>>> import numpy as np


>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])


Instantiate CFOF or FastCFOF, then call .compute(X) to calculate the scores. The call returns an array sc, where sc[i, l] is the score of object i for ϱ_l (rhos[l]).

You can also calculate CFOF scores from a precomputed distance matrix using .compute_from_distance_matrix().
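To make the sc[i, l] indexing concrete, here is a minimal sketch that ranks objects by score for one chosen ϱ. The helper name rank_outliers is hypothetical and not part of cfof; the array literal is copied from the hard-CFOF example output in this README (rhos=[0.5, 0.6]), so the snippet runs with NumPy alone.

```python
import numpy as np

# Scores copied from the hard-CFOF example in this README (rhos=[0.5, 0.6]).
rhos = [0.5, 0.6]
sc = np.array([[0.5, 0.66666667],
               [0.33333333, 0.83333333],
               [0.5, 1.0],
               [0.5, 0.66666667],
               [0.33333333, 0.83333333],
               [0.5, 1.0]])


def rank_outliers(scores, l, top):
    """Hypothetical helper: indices of the `top` highest scores in column l.

    Column l holds the scores for rhos[l]; a higher score means the object
    looks more like an outlier. Ties keep their original order.
    """
    column = scores[:, l]
    order = np.argsort(-column, kind="stable")  # descending by score
    return order[:top]


top2 = rank_outliers(sc, l=1, top=2)  # column 1 corresponds to rhos[1] == 0.6
print(top2)
```

For this toy dataset the two objects at the far ends of the point cloud (indices 2 and 5) receive the highest score for ϱ = 0.6.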

### CFOF (hard-CFOF)

Use compute to compute CFOF scores directly from data.

>>> cfof_clf = CFOF(metric='euclidean', rhos=[0.5, 0.6], n_jobs=1)
>>> cfof_clf.compute(X)
array([[0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ],
       [0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ]])
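As a rough illustration of what these numbers mean, the sketch below computes hard-CFOF scores with plain NumPy. It assumes the definition from [1]: the score of an object is the smallest k/n such that at least a fraction ϱ of the n objects have it among their k nearest neighbours, with each object counted as its own first neighbour. The function name hard_cfof_sketch is hypothetical, and this is not the library's implementation; it builds the full n × n distance matrix and is only meant for small inputs.

```python
import numpy as np


def hard_cfof_sketch(X, rhos):
    """Illustrative hard-CFOF scores (assumed definition, Euclidean metric)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Full pairwise Euclidean distance matrix, D[j, i] = dist(x_j, x_i).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # ranks[j, i] = position of object i in j's neighbour list (1 = j itself).
    order = np.argsort(D, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]

    # Column i, sorted ascending: the ranks object i receives from all objects.
    sorted_ranks = np.sort(ranks, axis=0)

    scores = np.empty((n, len(rhos)))
    for l, rho in enumerate(rhos):
        # Number of objects that must have i among their k nearest neighbours.
        m = min(max(int(np.ceil(n * rho)), 1), n)
        # The m-th smallest rank is the smallest k that satisfies the condition.
        scores[:, l] = sorted_ranks[m - 1, :] / n
    return scores


X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
print(hard_cfof_sketch(X, rhos=[0.5, 0.6]))
```

On the six-point X used above, this sketch yields the same values as the CFOF output shown, which suggests the assumed definition matches the library on this example.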


Use compute_from_distance_matrix to compute CFOF scores from a precomputed distance matrix.

>>> from sklearn.metrics import pairwise_distances
>>> distance_matrix = pairwise_distances(X, metric='euclidean')
>>> cfof_clf.compute_from_distance_matrix(distance_matrix)
array([[0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ],
       [0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ]])


### FastCFOF (soft-CFOF)

Use compute to compute CFOF scores directly from data.

>>> np.random.seed(10)
>>> X = np.random.randint(0, 100, size=(1000, 3))
>>>
>>> fast_cfof_clf = FastCFOF(metric='euclidean',
...                          rhos=[0.001, 0.005, 0.01, 0.05, 0.1],
...                          epsilon=0.1, delta=0.1, n_bins=50, n_jobs=1)
>>> fast_cfof_clf.compute(X)
array([[0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.01930698, 0.06866488, 0.10481131],
       [0.00954095, 0.00954095, 0.02559548, 0.06866488, 0.10481131],
       ...,
       [0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.03393222, 0.15998587, 0.24420531],
       [0.00954095, 0.00954095, 0.02559548, 0.0390694 , 0.09102982]])
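The epsilon and delta arguments are not explained above. One plausible reading, stated here as an assumption rather than as documented FastCFOF behaviour, is that they are an absolute error bound and a failure probability that together determine how many objects a sampling-based estimate needs. The sketch below uses the standard Hoeffding bound for that purpose; the function name hoeffding_sample_size is hypothetical and the library may use a different rule.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Smallest s with 2 * exp(-2 * s * epsilon**2) <= delta (Hoeffding bound).

    Assumption: epsilon is an absolute error bound and delta the probability
    of exceeding it. This is a generic bound, not taken from the cfof source.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


s = hoeffding_sample_size(epsilon=0.1, delta=0.1)
print(s)  # 150
```

Under this reading, halving epsilon roughly quadruples the sample size, while tightening delta has only a logarithmic effect.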


Use compute_from_distance_matrix to compute CFOF scores from a precomputed distance matrix.

>>> from sklearn.metrics import pairwise_distances
>>> distance_matrix = pairwise_distances(X, metric='euclidean')
>>> fast_cfof_clf.compute_from_distance_matrix(distance_matrix)
array([[0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.01930698, 0.06866488, 0.10481131],
       [0.00954095, 0.00954095, 0.02559548, 0.06866488, 0.10481131],
       ...,
       [0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.03393222, 0.15998587, 0.24420531],
       [0.00954095, 0.00954095, 0.02559548, 0.0390694 , 0.09102982]])


### CFOFiSAX

This library provides a wrapper for pyCFOFiSAX [2].

>>> from cfof.cfof_isax import CFOFiSAXWrapper


Refer to the pyCFOFiSAX documentation for more details.

## References

[1] ANGIULLI, Fabrizio. CFOF: a concentration free measure for anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 2020, vol. 14, no. 1, p. 1-53.

[2] FOULON, Lucas, FENET, Serge, RIGOTTI, Christophe, et al. Scoring Message Stream Anomalies in Railway Communication Systems. In: 2019 International Conference on Data Mining Workshops (ICDMW). IEEE, 2019. p. 769-776.
