neulab

A tool for data preprocessing in machine learning (ML).

Installation

pip install neulab

Usage

Algorithms

Mean

Calculates the arithmetic mean (average) of a vector.

import pandas as pd
from neulab.Algorithms import Mean

d = {'col1': [0, 1, 2]}
df = pd.DataFrame(data=d)

mean = Mean(vector=df.col1)

Output: 1

Median

Calculates the median of a vector.

import pandas as pd
from neulab.Algorithms import Median

d = {'col1': [0, 1, 2, 3]}
df = pd.DataFrame(data=d)

median = Median(vector=df.col1)

Output: 1.5

Mode

Calculates the mode (the most frequent value) of a vector.

import pandas as pd
from neulab.Algorithms import Mode

d = {'col1': [0, 1, 2, 3, 1]}
df = pd.DataFrame(data=d)

mode = Mode(vector=df.col1)

Output: 1
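As a quick cross-check, the three outputs above can be reproduced with Python's standard `statistics` module. This is only an independent sanity check on the same sample vectors and makes no claim about how neulab computes these values internally.

```python
from statistics import mean, median, mode

# Same sample vectors as in the Mean, Median and Mode examples above.
mean_value = mean([0, 1, 2])
median_value = median([0, 1, 2, 3])
mode_value = mode([0, 1, 2, 3, 1])

print(mean_value, median_value, mode_value)
```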

Standard Deviation

Calculates the standard deviation of a vector.

import pandas as pd
from neulab.Algorithms import StdDeviation

d = {'col1': [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71]}
df = pd.DataFrame(data=d)

std = StdDeviation(df.col1)

Output: 8.767464705525615
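The value above is consistent with the sample standard deviation, which divides by n - 1 rather than n. The sketch below is a plain-Python illustration of that formula on the same data; `sample_std` is a hypothetical helper written for this example, not part of neulab.

```python
from math import sqrt

def sample_std(values):
    """Sample standard deviation using the n - 1 (Bessel) denominator."""
    n = len(values)
    avg = sum(values) / n
    squared_deviations = sum((v - avg) ** 2 for v in values)
    return sqrt(squared_deviations / (n - 1))

data = [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81,
        7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71]
std = sample_std(data)
print(round(std, 4))
```

Dividing by n instead would give a noticeably smaller value (about 8.45), so the choice of denominator matters for small vectors like this one.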

Is Symmetric

Detects whether a vector is symmetric or asymmetric.

import pandas as pd
from neulab.Algorithms import IsSymmetric

d = {'col1': [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71]}
df = pd.DataFrame(data=d)

symmtr = IsSymmetric(vector=df.col1)

Output: True

Euclidean metric

Calculates the distance between two vectors using the Euclidean metric.

import pandas as pd
from neulab.Algorithms import EuclidMertic

d = {'col1': [0, 1, 2], 'col2': [2, 1, 0]}
df = pd.DataFrame(data=d)

euclid = EuclidMertic(vector1=df.col1, vector2=df.col2)

Output: 2.8284271247461903

Manhattan metric

Calculates the distance between two vectors using the Manhattan metric.

import pandas as pd
from neulab.Algorithms import ManhattanMetric

d = {'col1': [0, 1, 2], 'col2': [2, 1, 0]}
df = pd.DataFrame(data=d)

mnht = ManhattanMetric(vector1=df.col1, vector2=df.col2)

Output: 4.0

Max Metric

Calculates the distance between two vectors using the max (Chebyshev) metric, i.e. the largest absolute difference between corresponding elements.

import pandas as pd
from neulab.Algorithms import MaxMetric

d = {'col1': [0, 1, 2], 'col2': [2, 1, 0]}
df = pd.DataFrame(data=d)

mx = MaxMetric(vector1=df.col1, vector2=df.col2)

Output: 2
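For reference, the three distance metrics above follow the textbook definitions sketched below. The function names are hypothetical and written only for this illustration; they are not neulab's API.

```python
from math import sqrt

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute difference (the max metric)."""
    return max(abs(x - y) for x, y in zip(a, b))

col1, col2 = [0, 1, 2], [2, 1, 0]
print(euclidean(col1, col2))  # 2.8284271247461903
print(manhattan(col1, col2))  # 4
print(chebyshev(col1, col2))  # 2
```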

Correlation Coefficient

Calculates the correlation coefficient between two vectors.

import pandas as pd
from neulab.Algorithms import CorrelationCoefficient

d = {'col1': [99, 89, 91, 91, 86, 97], 'col2': [58, 48, 54, 54, 44, 56]}
df = pd.DataFrame(data=d)

cc = CorrelationCoefficient(vector1=df.col1, vector2=df.col2)

Output: 0.906843948104356
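The output above agrees with the Pearson correlation coefficient for the two columns. A minimal plain-Python sketch of that formula follows; `pearson` is a hypothetical helper for illustration and is not taken from neulab.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

col1 = [99, 89, 91, 91, 86, 97]
col2 = [58, 48, 54, 54, 44, 56]
r = pearson(col1, col2)
print(round(r, 4))  # 0.9068
```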

Restore value methods

Pass a DataFrame that contains a NaN value. It is important that the table contains exactly one NaN.

Example:

   P1  P2  P3  P4   P5
0   3   4   5   3  4.0
1   5   5   5   4  3.0
2   4   3   3   2  5.0
3   5   4   3   3  NaN

Recovery with metrics

import numpy as np
import pandas as pd
from neulab.RestoreValue import MetricRestore

# np.nan marks the single missing value (the np.NaN alias was removed in NumPy 2.0)
d = {'P1': [3, 5, 4, 5], 'P2': [4, 5, 3, 4], 'P3': [5, 5, 3, 3], 'P4': [3, 4, 2, 3], 'P5': [4, 3, 5, np.nan]}
df = pd.DataFrame(data=d)

# Euclid
euclid_m = MetricRestore(df, row_start=0, row_end=9, metric='euclid')
# Manhattan
mnht_m = MetricRestore(df, row_start=0, row_end=9, metric='manhattan')
# Max
mx_m = MetricRestore(df, row_start=0, row_end=9, metric='max')

Output: 
euclid_m = 4.13
mnht_m = 4.1
mx_m = 4.25

Recovery with correlation coefficient

import numpy as np
import pandas as pd
from neulab.RestoreValue import CorrCoefRestore

# np.nan marks the single missing value (the np.NaN alias was removed in NumPy 2.0)
d = {'G': [99, 89, 91, 91, 86, 97, np.nan], 'T': [56, 58, 64, 51, 56, 53, 51], 'B': [91, 89, 91, 91, 84, 86, 91], 'R': [160, 157, 165, 170, 157, 175, 165], 'W': [58, 48, 54, 54, 44, 56, 54]}
df = pd.DataFrame(data=d)

cc = CorrCoefRestore(df=df, row_start=0, row_end=9)

Output: 94.21

Outlier Detection

Simple Outlier Detection Algorithm

Detects rows that contain an outlier and, if autorm is True, removes them. Returns the cleaned DataFrame.

import pandas as pd
from neulab.OutlierDetection import SimpleOutDetect

d = {'col1': [1, 0, 342, 1, 1, 0, 1, 0, 1, 255, 1, 1, 1, 0]}
df = pd.DataFrame(data=d)

sd = SimpleOutDetect(dataframe=df, info=False, autorm=True)

Output: Detected outliers: {'col1': [342, 255]}

index  col1
0      1
1      0
3      1
4      1
5      0
6      1
7      0
8      1
10     1
11     1
12     1
13     0

Chauvenet Algorithm

The Chauvenet algorithm detects rows that contain an outlier and, if autorm is True, removes them. Returns the cleaned DataFrame.

import pandas as pd
from neulab.OutlierDetection import Chauvenet

d = {'col1': [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71], 'col2': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)

chvn = Chauvenet(dataframe=df, info=True, autorm=True)

Output: Detected outliers: {'col1': [29.87, 25.71, 20.46, 0.84, 0.81, 3.97, 4.46, 10.38, 7.74, 9.26]}

	col1	col2
0	8.02	1
1	8.16	1
3	8.64	1
8	8.78	1
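For background, the classic Chauvenet criterion rejects a value when the expected number of equally extreme observations, n times the two-sided normal tail probability, falls below 0.5. The sketch below applies that rule once to col1. It is a textbook single pass with a hypothetical helper name, not neulab's code. The output listed above removes many more values, which suggests the library applies the test repeatedly or uses a different threshold.

```python
from math import erfc, sqrt
from statistics import mean, stdev

def chauvenet_outliers(values):
    """Single pass of Chauvenet's criterion (textbook form)."""
    n = len(values)
    avg, sd = mean(values), stdev(values)
    flagged = []
    for v in values:
        z = abs(v - avg) / sd
        # Two-sided probability of a normal deviate at least this extreme.
        tail_probability = erfc(z / sqrt(2))
        if n * tail_probability < 0.5:
            flagged.append(v)
    return flagged

col1 = [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81,
        7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71]
flagged = chauvenet_outliers(col1)
print(flagged)  # [29.87]
```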

Quartile algorithm

The quartile algorithm does not use the standard deviation or the mean. It detects all outliers in each vector and returns the cleaned DataFrame if autorm is True.

import pandas as pd
from neulab.OutlierDetection import Quratile

d = {'col1': [-6, 0, 1, 2, 4, 5, 5, 6, 7, 100], 'col2': [-1, 0, 1, 2, 0, 0, 1, 0, 50, 13]}
df = pd.DataFrame(data=d)

qurtl = Quratile(dataframe=df, info=True, autorm=True)

Output: Detected outliers: {'col1': [-6, 100], 'col2': [50]}

index col1	col2
1	   0	   0
2	   1	   1
3	   2	   2
4	   4	   0
5	   5	   0
6	   5	   1
7	   6	   0
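The result above can be reproduced with the common interquartile-range rule (Tukey's fences at 1.5 x IQR), applied column by column with flagged rows dropped before the next column is examined. Both the 1.5 factor and the column-by-column order are assumptions made for this sketch, and `iqr_outlier_rows` is a hypothetical helper, not neulab's implementation.

```python
from statistics import quantiles

def iqr_outlier_rows(columns, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], one column at a time."""
    n_rows = len(next(iter(columns.values())))
    keep = list(range(n_rows))  # row indices that are still in the table
    detected = {}
    for name, values in columns.items():
        current = [values[i] for i in keep]
        q1, _, q3 = quantiles(current, n=4, method="inclusive")
        low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        flagged = [v for v in current if v < low or v > high]
        if flagged:
            detected[name] = flagged
        keep = [i for i in keep if low <= values[i] <= high]
    return detected, keep

data = {'col1': [-6, 0, 1, 2, 4, 5, 5, 6, 7, 100],
        'col2': [-1, 0, 1, 2, 0, 0, 1, 0, 50, 13]}
detected, kept_rows = iqr_outlier_rows(data)
print(detected)   # {'col1': [-6, 100], 'col2': [50]}
print(kept_rows)  # [1, 2, 3, 4, 5, 6, 7]
```

Processing the columns in sequence explains why 13 is not reported for col2: its row has already been dropped because of the value 100 in col1.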

Metric algorithm

An outlier search algorithm based on distance metrics. It calculates the distances between rows using the chosen metric and then filters them with the quantile algorithm. Returns the cleaned DataFrame if autorm is True.

import pandas as pd
from neulab.OutlierDetection import DistQuant

d = {'col1': [-6, 0, 1, 2, 4, 5, 5, 6, 7, 100], 'col2': [-1, 0, 1, 2, 0, 0, 1, 0, 50, 13]}
df = pd.DataFrame(data=d)

mdist = DistQuant(dataframe=df, metric='manhattan', filter='quantile', info=True, autorm=True)

Output: Distances: {0: 260.0, 1: 204.0, 2: 198.0, 3: 198.0, 4: 190.0, 5: 190.0, 6: 190.0, 7: 194.0, 8: 566.0, 9: 1014.0}

index  col1  col2
1      0     0
2      1     1
3      2     2
4      4     0
5      5     0
6      5     1
7      6     0
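The distances printed above match the sum of Manhattan distances from each row to every other row. The sketch below reproduces only that first step. The helper names are hypothetical, and the subsequent quantile filter is not reproduced here because its exact thresholds are not documented.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_distances(rows, metric):
    """For each row, sum its distance to all other rows."""
    return {
        i: float(sum(metric(row, other)
                     for j, other in enumerate(rows) if j != i))
        for i, row in enumerate(rows)
    }

col1 = [-6, 0, 1, 2, 4, 5, 5, 6, 7, 100]
col2 = [-1, 0, 1, 2, 0, 0, 1, 0, 50, 13]
distances = total_distances(list(zip(col1, col2)), manhattan)
print(distances)
```

Rows that sit far from the bulk of the data (here indices 8 and 9) accumulate much larger totals, which is what makes a quantile-based cut-off on the distances workable.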

Dixon’s Q Test

Dixon’s Q test, or just the “Q test”, is a way to find outliers in very small, normally distributed data sets. Small data sets are usually defined as somewhere between 3 and 7 items. It is commonly used in chemistry, where data sets sometimes include one suspect observation that is much lower or much higher than the other values. Keeping an outlier in the data affects calculations such as the mean and standard deviation, so true outliers should be removed. This test should be used sparingly and never more than once on a data set. Returns the cleaned DataFrame if autorm is True.

import pandas as pd
from neulab.OutlierDetection import DixonTest

d = {'col1': [2131, 180, 188, 177, 181, 185, 189], 'col2': [0, 0, 0, 0, 1, 13, 1]}
df = pd.DataFrame(data=d)

qtest = DixonTest(dataframe=df, q=95, info=True, autorm=True)

Output:
Detected outlier: 
   col1  col2
0  2131     0
Detected outlier: 
   col1  col2
5   185    13
index  col1  col2
1      180   0
2      188   0
3      177   0
4      181   1
6      189   1
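The Q statistic itself is simple: the gap between the suspect value and its nearest neighbour, divided by the full range of the data. The sketch below is a textbook illustration with a hypothetical `dixon_q` helper, not neulab's code. The critical values are the commonly tabulated two-sided 95% figures for n = 3 to 7 and should be checked against your own reference table.

```python
# Commonly tabulated two-sided critical values at 95% confidence (assumption).
Q95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568}

def dixon_q(values, critical=Q95):
    """Return the suspect value if it fails the Q test, otherwise None."""
    data = sorted(values)
    spread = data[-1] - data[0]
    if spread == 0:
        return None  # all values identical, nothing to test
    q_low = (data[1] - data[0]) / spread
    q_high = (data[-1] - data[-2]) / spread
    # Test whichever end of the sorted data has the larger gap.
    suspect, q = (data[0], q_low) if q_low > q_high else (data[-1], q_high)
    return suspect if q > critical[len(data)] else None

outlier_col1 = dixon_q([2131, 180, 188, 177, 181, 185, 189])
outlier_col2 = dixon_q([0, 0, 0, 0, 1, 13, 1])
no_outlier = dixon_q([180, 188, 177, 181, 185, 189])
print(outlier_col1, outlier_col2, no_outlier)  # 2131 13 None
```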

Built Distribution

neulab-0.0.22-py3-none-any.whl (11.0 kB)

Uploaded Python 3
