Python package for Detecting and Handling missing values with Visualizations
handlemissingvalues
Package Description :
Python package for Detecting and Handling missing values by visualizing and applying different algorithms.
Motivation :
This is part of Project III for UCS633 Data Analytics and Visualization at TIET.
@Author : Sourav Kumar
@Roll no. : 101883068
Knowledge of missing values :
Before handling missing values, it sometimes helps to consider why they are missing.
There are several possible reasons :
 Missingness completely at random
 Missingness at random
 Missingness that depends on unobserved predictors
 Missingness that depends on the missing value itself
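A quick first look at missingness can be sketched with pandas. The dataframe and its column names below are made up for illustration and are not part of this package. Comparing an observed column across rows with and without a missing value can hint that missingness depends on observed data, but it cannot prove any of the mechanisms above, and it says nothing about the two cases that depend on unobserved values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' has gaps, 'age' is fully observed.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [20.0, 24.0, np.nan, 35.0, 41.0, np.nan, np.nan, 58.0],
})

# Number and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Mean age for rows where income is missing (True) vs observed (False).
# A large gap suggests missingness may be related to age.
age_by_missing = df.groupby(df["income"].isna())["age"].mean()

print(missing_count)
print(missing_share)
print(age_by_missing)
```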
Algorithm :

Row removal / Column removal : Removes rows or columns (depending on the arguments) that contain missing values / NaN.
The pandas library provides the dropna() function, which removes rows or columns that contain missing values or NaN from a dataframe.
By default it removes every row that has any missing value. It does not modify the original dataframe; it returns a copy with the modified contents.
The default value of the 'how' argument in dropna() is 'any', and for the 'axis' argument it is 0. This means that calling dropna() without any arguments still deletes all rows that contain any NaN.
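The dropna() behaviour described above can be tried on a small dataframe. The data and column names are invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing cells.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Delhi", "Pune", None, "Agra"],
    "score": [88.0, 92.0, 79.0, 85.0],
})

# Defaults (how='any', axis=0): drop every row that has at least one NaN.
rows_dropped = df.dropna()

# axis=1: drop every column that has at least one NaN.
cols_dropped = df.dropna(axis=1)

# how='all': drop a row only if all of its values are missing.
all_missing_dropped = df.dropna(how="all")

print(rows_dropped)
print(cols_dropped)
print(df.isna().sum().sum())  # the original dataframe is unchanged
```

Row removal is simple but can discard a lot of data when missing values are spread over many rows.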
Statistical Imputation : Mean imputation : Replaces any missing value with the mean of that variable over all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed.
Median imputation : Similar to mean imputation, but the median is used to impute the missing values. Useful for numerical features.
Mode imputation : Most-frequent imputation is another statistical strategy. It also works with categorical features (strings or numerical representations), replacing missing data with the most frequent value within each column.
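A minimal sketch of the three statistical strategies using pandas fillna(). The columns are hypothetical, and this is not necessarily how the package implements them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0, 200.0],
    "color": ["red", "blue", "red", None, "red"],
})

# Mean imputation: the sample mean of the column stays the same.
mean_filled = df["height"].fillna(df["height"].mean())

# Median imputation: less affected by the outlier (200.0).
median_filled = df["height"].fillna(df["height"].median())

# Mode imputation: works for categorical columns as well.
mode_filled = df["color"].fillna(df["color"].mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```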
Interpolation imputation : Estimates missing values from other observations, within the range of a discrete set of known data points.
This method works well for a time series with some trend but is not suitable for seasonal data. 
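As an illustration, linear interpolation on a short series with a steady trend can be done with pandas Series.interpolate(). The values are made up, and linear is only one of the available interpolation methods.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a steady upward trend and two gaps.
series = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

# Linear interpolation fills each gap on the straight line between
# the nearest known points, treating the values as equally spaced.
filled = series.interpolate(method="linear")

print(filled.tolist())
```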
MICE imputation : This is one of the most efficient methods and has three steps :
> Imputation - Similar to single imputation, missing values are imputed. However, the imputed values are drawn m times from a distribution rather than just once. At the end of this step, there should be m completed datasets.
> Analysis - Each of the m datasets is analyzed. At the end of this step there should be m analyses.
> Pooling - The m results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern.
Multivariate imputation by chained equations (MICE), sometimes called 'fully conditional specification' or 'sequential regression multiple imputation', has emerged in the statistical literature as one principled method of addressing missing data. Creating multiple imputations, as opposed to single imputations, accounts for the statistical uncertainty in the imputations.
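The pooling step can be sketched with Rubin's rules, the standard way to combine results from m imputed datasets. The function name pool_estimates and the input numbers are hypothetical, and the confidence interval uses a normal approximation for simplicity (Rubin's rules proper use a t reference distribution). The package itself relies on statsmodels for multiple imputation, so this is only an illustration of the idea.

```python
import math
from statistics import mean, variance


def pool_estimates(estimates, variances, z=1.96):
    """Pool m estimates and their variances (hypothetical helper)."""
    m = len(estimates)
    pooled = mean(estimates)          # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    half_width = z * math.sqrt(total)  # normal approximation
    return pooled, total, (pooled - half_width, pooled + half_width)


# Made-up results from m = 5 analyses of 5 completed datasets.
estimates = [10.2, 9.8, 10.5, 10.1, 9.9]
variances = [0.40, 0.38, 0.42, 0.41, 0.39]

pooled, total_var, ci = pool_estimates(estimates, variances)
print(pooled, total_var, ci)
```

The between-imputation term is what carries the extra uncertainty that a single imputation would ignore.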
Random Forests imputation : Random forests can handle mixed types of missing data, adapt to interactions and nonlinearity, and have the potential to scale to big data settings.
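A simplified sketch of the idea with scikit-learn: train a random forest on the rows where a column is observed and predict it where it is missing. It assumes the predictor columns are complete and imputes a single numerical column in one pass, which is a simplification of full random-forest imputation schemes. The data is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with an interaction between the two predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1, "x2": x2, "y": x1 * x2 + 0.1 * rng.normal(size=200)})

# Remove 30 values of 'y' at random to simulate missing data.
df.loc[rng.choice(200, size=30, replace=False), "y"] = np.nan

features = ["x1", "x2"]
missing = df["y"].isna()

# Fit on rows where 'y' is observed, then predict the missing rows.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df.loc[~missing, features], df.loc[~missing, "y"])

imputed = df.copy()
imputed.loc[missing, "y"] = model.predict(df.loc[missing, features])

print(imputed["y"].isna().sum())
```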

KNN imputation : KNN is an algorithm that matches a point with its k closest neighbors in a multidimensional space. It can be used for data that are continuous, discrete, ordinal and categorical, which makes it particularly useful for dealing with all kinds of missing data.
The assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points closest to it, based on the other variables.
Other methods using deep learning can be built to predict the missing values.
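For numerical data, this can be sketched with scikit-learn's KNNImputer, which fills each missing entry with the mean of that feature over the k nearest rows, measuring distance on the features both rows have observed. The array is a toy example, and the package may implement KNN imputation differently.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numerical data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value becomes the average over its 2 nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

Categorical columns would need to be encoded as numbers before using this imputer.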
Getting started Locally :
Run On Terminal
python -m missing.missing <inputFilePath> <outputFilePath>
ex. python -m missing.missing C:/Users/DELL/Desktop/train.csv C:/Users/DELL/Desktop/output.csv
Run In IDLE
from missing import missing
m = missing.missing(inputFilePath, outputFilePath)
m.missing_main()
Run on Jupyter
Open terminal (cmd)
jupyter notebook
Create a new Python 3 notebook.
from missing import missing
m = missing.missing(inputFilePath, outputFilePath)
m.missing_main()
NOTE : Please make sure that you have [statsmodels](https://www.statsmodels.org/stable/install.html) installed. It is used in one of the algorithms for multiple imputations.
OUTPUT :
After every algorithm has been analysed and visualized against the metrics (accuracy, log_loss, recall, precision), the best algorithm is applied to impute the missing values in the original dataset.
The final dataframe is then written to the output file path you provided.
TESTING :
The package has been extensively tested on various datasets containing varied types of expected and unexpected input data, and any required preprocessing is taken care of.