Module unavoids
Functions
getAllNCDFs(X, p=0.0625, ncpus=4)
: Calculate the NCDF for all samples in parallel using a
specified norm.
Parameters
----------
X : numpy array of shape (n_samples, n_features)
Data matrix where `n_samples` is the number of samples
and `n_features` is the number of features.
p : float or np.inf constant
The norm to use when calculating the distance between
samples in `X`. If np.inf is supplied, then Chebyshev
distance is used.
ncpus : int
The number of parallel processes.
Returns
----------
NCDFs : numpy array of shape (n_samples, n_samples)
The i-th row equals the NCDF for the i-th sample in `X`,
while the j-th column of the i-th row equals NCDF_xi(j)
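The `p` parameter described above selects the distance used between samples. As a minimal sketch (not the package's implementation), the hypothetical helper `minkowski_distances` below shows how a finite `p`, including a fractional value such as the default 0.0625, and `np.inf` (Chebyshev) could be handled with NumPy. The helper name and the toy data are assumptions made for illustration only.

```python
import numpy as np

def minkowski_distances(X, index, p):
    """Distances from X[index] to every row of X (hypothetical helper)."""
    diffs = np.abs(X - X[index])          # per-feature absolute differences
    if np.isinf(p):
        return diffs.max(axis=1)          # Chebyshev: largest coordinate gap
    return (diffs ** p).sum(axis=1) ** (1.0 / p)

# Toy data, already inside [0, 1]
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
d_inf = minkowski_distances(X, 0, np.inf)
d_2 = minkowski_distances(X, 0, 2)
d_half = minkowski_distances(X, 0, 0.5)
```

Note that for `p` < 1 the expression is not a true norm (the triangle inequality fails), although it can still be evaluated as a dissimilarity measure.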
getBetaFractions(NCDFs_L, BetaSorted, BetaRanks, fraction_WSS, index)
: Calculate the UNAVOIDS outlier score for a given sample using
the fractions of all gaps method.
Parameters
----------
NCDFs_L : numpy array of shape (n_samples, L_levels)
An array containing the intercepts for n NCDFs at L beta
levels, where `n_samples` is the number of samples and
`L_levels` is the number of beta levels.
BetaSorted : numpy array of shape (n_samples, L_levels)
The same as `NCDFs_L` but the intercepts are sorted along
the L beta levels (column-wise sort of `NCDFs_L`).
BetaRanks : numpy array of shape (n_samples, L_levels)
The same as `NCDFs_L` but the value at `NCDFs_L[i,j]` is
replaced with the rank of `NCDFs_L[i,j]` on a given beta
horizontal.
fraction_WSS : int
The number of nearest intercepts to be encompassed by the
gap whose size will be the score for a given beta level
and NCDF intercept. Assumed to be less than
`n_samples/2`.
index : int
The row index of the NCDF in `NCDFs_L` which we are
finding the outlier score of.
Returns
----------
score : numpy array of shape (1, 1)
The highest outlier score for `NCDFs_L[index,:]` across
all beta levels.
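The relationship between `NCDFs_L`, `BetaSorted` and `BetaRanks` described above can be sketched with NumPy. This is only an illustration under stated assumptions: the toy values are made up, ranks are taken to be zero-based, and ties are ignored. The package may build these arrays differently.

```python
import numpy as np

# Toy intercepts: 3 samples (rows) at 2 beta levels (columns)
NCDFs_L = np.array([[0.2, 0.9],
                    [0.1, 0.5],
                    [0.3, 0.7]])

# Column-wise sort: each beta level is sorted independently
BetaSorted = np.sort(NCDFs_L, axis=0)

# Zero-based rank of each intercept within its beta level
# (argsort of argsort; assumes no ties)
BetaRanks = np.argsort(np.argsort(NCDFs_L, axis=0), axis=0)

# Sanity check: looking up each rank in the sorted array recovers the input
recovered = np.take_along_axis(BetaSorted, BetaRanks, axis=0)
```

With these definitions, `BetaSorted[BetaRanks[i, j], j]` equals `NCDFs_L[i, j]`, which is what lets a per-sample routine locate its own intercept among the sorted ones.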
getBetaHist(NCDFs_L, BetaSorted, index)
: Calculate the UNAVOIDS outlier score for a given sample using
the histogram method.
Parameters
----------
NCDFs_L : numpy array of shape (n_samples, L_levels)
An array containing the intercepts for n NCDFs at L beta
levels, where `n_samples` is the number of samples and
`L_levels` is the number of beta levels.
BetaSorted : numpy array of shape (n_samples, L_levels)
The same as `NCDFs_L` but the intercepts are sorted along
the L beta levels (column-wise sort of NCDFs_L).
index : int
The row index of the NCDF in `NCDFs_L` which we are
finding the outlier score of.
Returns
----------
score : numpy array of shape (1, 1)
The highest outlier score for `NCDFs_L[index,:]` across
all beta levels.
getNCDF(X, p, index)
: Calculate the NCDF for a single sample using a specified
norm.
Parameters
----------
X : numpy array of shape (n_samples, n_features)
Data matrix, assumed to be min max scaled to [0,1], where
`n_samples` is the number of samples and `n_features` is
the number of features.
p : float or np.inf constant
The norm to use when calculating the distance between
samples in `X`. If np.inf is supplied, then Chebyshev
distance is used.
index : int
The index of the sample in `X` which we are finding the
NCDF of. Assumed to be less than `n_samples`.
Returns
----------
NCDFxi : numpy array of shape (1, n_samples)
The NCDF of `X[i,:]` where i = `index` and the j-th value equals
NCDF_xi(j)
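`getNCDF` assumes that `X` has already been min-max scaled to [0,1]. A minimal sketch of that preprocessing step follows. The helper name `min_max_scale` is hypothetical, and mapping constant columns to 0 is a choice made here, not something the package documents.

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid 0/0: constant columns map to 0
    return (X - col_min) / col_range

X_raw = np.array([[1.0, 10.0, 5.0],
                  [3.0, 20.0, 5.0],
                  [2.0, 40.0, 5.0]])
X_scaled = min_max_scale(X_raw)
```

The scaled array can then be passed wherever the documentation expects a data matrix in [0,1].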
unavoidsScore(X, precomputed=False, p=0.0625, returnNCDFs=True, method='fractions', r=0.01, L=100, ncpus=4)
: Calculate the UNAVOIDS outlier score for all samples in `X`.
Parameters
----------
X : numpy array of shape (n_samples, n_features)
Data matrix where `n_samples` is the number of samples
and `n_features` is the number of features.
precomputed : bool, default=False
If True, `X` is assumed to be an NCDF array in the same
format as that returned by `getAllNCDFs`.
p : float or np.inf constant
The norm to use when calculating the distance between
samples in `X`. If np.inf is supplied, then Chebyshev
distance is used.
returnNCDFs : bool, default=True
If True, NCDF array is returned along with outlier
scores.
method : {"fractions", "histogram"}, default="fractions"
Specifies which method to use for calculating outlier
scores; either "fractions" or "histogram".
r : float
Percentage of nearest intercepts to be encompassed by the
gap whose size will be the score for a given beta and
NCDF intercept in the "fractions" method. Ignored if
`method` == "histogram".
L : int
The number of beta levels to use.
ncpus : int
The number of parallel processes to use.
Returns
----------
scores : numpy array of shape (n_samples, 1)
The i-th element in scores is the UNAVOIDS outlier score
for the i-th sample(row) in `X`.
NCDFs : numpy array of shape (n_samples, n_samples)
The i-th row equals the NCDF for the i-th sample in `X`,
while the j-th column of the i-th row equals NCDF_xi(j).
Only returned if `returnNCDFs` == True.
References
----------
.. [1] W. A. Yousef, I. Traore and W. Briguglio, (2021)
"UN-AVOIDS: Unsupervised and Nonparametric Approach for
Visualizing Outliers and Invariant Detection Scoring",
IEEE Transactions on Information Forensics and Security,
vol. 16, pp. 5195-5210, [doi: 10.1109/TIFS.2021.3125608]
Examples
--------
>>> import numpy as np
>>> from joblib import load
>>> from unavoids import unavoids
>>> from sklearn import metrics
>>>
>>> X_all = load("simData.joblib")
>>> Y = np.zeros((X_all.shape[0],))
>>> Y[-3:] = 1 #last three samples are outliers
>>> X = X_all[:,:4] #grab first 4 features
>>>
>>> scores, NCDFs = unavoids.unavoidsScore(X, p=0.0625, returnNCDFs=True, method="fractions")
>>> fpr, tpr, thresholds = metrics.roc_curve(Y, scores)
>>> metrics.auc(fpr, tpr)
1.0
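The example above evaluates the scores against known labels. When no labels are available, one simple option is to flag the highest-scoring samples. The sketch below assumes that a larger score means a more outlying sample (consistent with the AUC computed above, where outliers are labelled 1) and uses made-up scores in the documented `(n_samples, 1)` shape. `top_k_outliers` is a hypothetical helper, not part of the package.

```python
import numpy as np

def top_k_outliers(scores, k):
    """Return the indices of the k largest scores, highest first."""
    flat = np.asarray(scores).ravel()   # (n_samples, 1) -> (n_samples,)
    order = np.argsort(flat)[::-1]      # indices sorted by descending score
    return order[:k]

# Made-up scores for five samples
scores = np.array([[0.10], [0.12], [0.95], [0.08], [0.90]])
flagged = top_k_outliers(scores, k=2)
```

Choosing `k` (or an equivalent score threshold) is application-specific and is not prescribed by the documentation above.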