AutoDataCleaner
Simple and automatic data cleaning in one line of code! It performs One Hot Encoding, Cleans Dirty/Empty Values, Normalizes values and Removes unwanted columns all in one line of code. Get your data ready for model training and fitting quickly.
Features
- Uses Pandas DataFrames [So, no need to learn new syntax]
- One-hot encoding: encodes non-numeric values to one-hot encoding columns
- Normalization: normalizes columns (binary [1/0] columns are excluded)
- Cleans Dirty/None/NA/Empty values: replaces None values with the mean or mode of a column, deletes rows that contain a None cell, or substitutes None values with a pre-defined value
- Delete Unwanted Columns: drops unwanted columns (usually this will be the 'id' column)
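For orientation, the four steps listed above roughly correspond to the following plain-pandas operations. This is a minimal sketch under assumptions, not AutoDataCleaner's actual implementation: the helper name `manual_clean`, the choice of z-score normalization, and the order of the steps are all illustrative.

```python
import pandas as pd


def manual_clean(df, remove_columns=()):
    """Hypothetical helper: approximates the listed cleaning steps with plain pandas."""
    # Drop unwanted columns (for example an 'id' column)
    out = df.drop(columns=list(remove_columns))

    # One-hot encode every non-numeric column
    non_numeric = out.select_dtypes(exclude="number").columns
    out = pd.get_dummies(out, columns=list(non_numeric), dtype=int)

    # Replace missing values with the mean of their column
    out = out.fillna(out.mean())

    # Z-score normalize every column that is not binary (0/1); assumed technique
    for col in out.columns:
        if not set(out[col].unique()) <= {0, 1}:
            out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out


df = pd.DataFrame(
    [[1, "Green", 3], [2, "Blue", 4], [3, "Green", 5], [4, "Green", None]],
    columns=["id", "color", "weight"],
)
cleaned = manual_clean(df, remove_columns=["id"])
print(cleaned)
```

The sketch is only meant to make the feature list concrete; the library may order or implement these steps differently.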
Installation
Using pip
pip install AutoDataCleaner
Cloning repo:
Clone repository and run pip install -e . inside the repository directory
Install from repository directly using pip install git+git://github.com/sinkingtitanic/AutoDataCleaner.git#egg=AutoDataCleaner
Quick One-line Usage:
AutoDataCleaner.clean_me(df,
                         one_hot=True,
                         na_cleaner_mode="mean",
                         normalize=True,
                         remove_columns=[],
                         verbose=True)
Example
import pandas as pd
import AutoDataCleaner
df = pd.DataFrame([
    [1, "Green", 3],
    [2, "Blue", 4],
    [3, "Green", 5],
    [4, "Green", None]
], columns=['id', 'color', 'weight'])
AutoDataCleaner.clean_me(df, remove_columns=['id']) # see 'Usage' section for more parameters
Example output:
+++++++++++++++ DATA CLEANING STARTED ++++++++++++++++
= DataCleaner: Performing One-Hot encoding...
= DataCleaner: Performing None/NA/Empty values cleaning...
= DataCleaner: Performing dataset normalization...
= DataCleaner: Performing removal of unwanted columns...
+++++++++++++++ DATA CLEANING FINISHED +++++++++++++++
     weight  color_Blue  color_Green
0 -0.855528           0            1
1 -0.475293           1            0
2 -0.095059           0            1
3  1.425880           0            0
Explaining Parameters
AutoDataCleaner.clean_me(df, one_hot=True, na_cleaner_mode="mean", normalize=True, remove_columns=[], verbose=True)
Parameters and what they mean:
- df: input Pandas DataFrame on which the cleaning will be performed
- one_hot: if True, all non-numeric columns will be encoded to one-hot columns
- na_cleaner_mode: what technique to use when dealing with None/NA/Empty values. Modes:
  - False: do not clean NA values
  - 'remove row': removes rows with a cell that has an NA value
  - 'mean': substitutes empty NA cells with the mean of that column
  - 'mode': substitutes empty NA cells with the mode of that column
  - '*': any other value will substitute empty NA cells with that particular value passed here
- normalize: if True, all non-binary columns will be normalized (columns with only the values 0 or 1 are excluded)
- remove_columns: list of columns to remove; these are usually unrelated features such as the ID column
- verbose: print progress in terminal/cmd
- returns: processed and clean Pandas DataFrame
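The na_cleaner_mode options can be pictured with the following plain-pandas equivalents. The function `apply_na_mode` is a hypothetical stand-in written for illustration; it reflects an assumption about the behavior described in the parameter list, not the library's own code.

```python
import pandas as pd


def apply_na_mode(df, na_cleaner_mode):
    """Hypothetical stand-in mirroring the documented na_cleaner_mode options."""
    if na_cleaner_mode is False:
        return df.copy()                         # leave NA values untouched
    if na_cleaner_mode == "remove row":
        return df.dropna()                       # drop rows containing any NA cell
    if na_cleaner_mode == "mean":
        return df.fillna(df.mean(numeric_only=True))
    if na_cleaner_mode == "mode":
        return df.fillna(df.mode().iloc[0])      # first mode of each column
    return df.fillna(na_cleaner_mode)            # any other value is the fill value


df = pd.DataFrame({"weight": [3.0, 4.0, 4.0, None]})
for mode in (False, "remove row", "mean", "mode", -1):
    print(repr(mode), apply_na_mode(df, mode)["weight"].tolist())
```

Running the loop prints one line per mode, which makes it easy to compare how each option treats the single missing weight.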
Prediction
In the prediction phase, put the examples to be predicted in a Pandas DataFrame and run them through the AutoDataCleaner.clean_me function with the same parameters you used during training.
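One way to make sure both phases use identical settings is to keep the parameters in a single dictionary and pass it to both calls. The sketch below is only a suggestion: `CLEAN_PARAMS` and `clean_both` are hypothetical names, and the cleaning function is passed in as an argument so the snippet does not depend on the package being installed.

```python
# Keep the cleaning parameters in one place so both phases use identical settings
CLEAN_PARAMS = dict(
    one_hot=True,
    na_cleaner_mode="mean",
    normalize=True,
    remove_columns=["id"],
    verbose=False,
)


def clean_both(clean_fn, train_df, predict_df, params=CLEAN_PARAMS):
    """Hypothetical helper: apply the same cleaning call to training and prediction data."""
    return clean_fn(train_df, **params), clean_fn(predict_df, **params)


# Intended use (requires the package to be installed):
# import AutoDataCleaner
# train_clean, predict_clean = clean_both(AutoDataCleaner.clean_me, train_df, predict_df)
```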
Contribution
Please feel free to send me feedback at "ofcourse7878@gmail.com", submit an issue, or make a pull request!