Python library that automatically cleans structured data stored in a pandas DataFrame with one line of code. Use it before training, e.g. in the data pre-processing phase of a machine learning project.
AutoDataCleaner
Simple and automatic data cleaning in one line of code! It performs one-hot encoding, cleans dirty/empty values, normalizes values and removes unwanted columns in a single call, so your data is quickly ready for model training and fitting.
Features
- Uses pandas DataFrames [so, no need to learn new syntax]
- One-hot encoding: encodes non-numeric values as one-hot columns
- Normalization: normalizes columns (excludes binary [1/0] columns)
- Cleans Dirty/None/NA/Empty values: replaces None values with the mean or mode of the column, deletes rows that contain a None cell, or substitutes None values with a pre-defined value
- Delete Unwanted Columns: drops unwanted columns (usually this will be the 'id' column)
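To make the four features concrete, here is a minimal sketch of roughly equivalent steps in plain pandas. It is not AutoDataCleaner's actual implementation: the function name clean_sketch is hypothetical, and the use of z-score normalization and mean filling are assumptions made for illustration.

```python
import pandas as pd

def clean_sketch(df, remove_columns=()):
    """Hypothetical stand-in for the listed features, not the library's code."""
    # Delete unwanted columns (e.g. 'id')
    out = df.drop(columns=list(remove_columns))

    # One-hot encode every non-numeric column
    non_numeric = list(out.select_dtypes(exclude="number").columns)
    if non_numeric:
        out = pd.get_dummies(out, columns=non_numeric, dtype=int)

    # Replace missing values with the column mean
    out = out.fillna(out.mean())

    # Normalize (z-score, an assumption) every column that is not purely 0/1
    for col in out.columns:
        if not out[col].isin([0, 1]).all():
            out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out
```

Calling clean_sketch(df, remove_columns=['id']) on a DataFrame with 'id', 'color' and 'weight' columns would return a frame with a normalized 'weight' column plus one 0/1 column per color.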
Installation
Using pip
pip install AutoDataCleaner
Cloning repo:
Clone the repository and run pip install -e . inside the repository directory
Install from repository directly using pip install git+git://github.com/sinkingtitanic/AutoDataCleaner.git#egg=AutoDataCleaner
Quick One-line Usage:
AutoDataCleaner.clean_me(df,
                         one_hot=True,
                         na_cleaner_mode="mean",
                         normalize=True,
                         remove_columns=[],
                         verbose=True)
Example
import pandas as pd
import AutoDataCleaner
df = pd.DataFrame([
    [1, "Green", 3],
    [2, "Blue", 4],
    [3, "Green", 5],
    [4, "Green", None]
], columns=['id', 'color', 'weight'])
AutoDataCleaner.clean_me(df, remove_columns=['id']) # see 'Usage' section for more parameters
Example output:
+++++++++++++++ DATA CLEANING STARTED ++++++++++++++++
= DataCleaner: Performing One-Hot encoding...
= DataCleaner: Performing None/NA/Empty values cleaning...
= DataCleaner: Performing dataset normalization...
= DataCleaner: Performing removal of unwanted columns...
+++++++++++++++ DATA CLEANING FINISHED +++++++++++++++
     weight  color_Blue  color_Green
0 -0.855528           0            1
1 -0.475293           1            0
2 -0.095059           0            1
3  1.425880           0            0
Explaining Parameters
AutoDataCleaner.clean_me(df, one_hot=True, na_cleaner_mode="mean", normalize=True, remove_columns=[], verbose=True)
Parameters and what they mean:
- df: input pandas DataFrame on which the cleaning will be performed
- one_hot: if True, all non-numeric columns will be encoded as one-hot columns
- na_cleaner_mode: technique to use when dealing with None/NA/Empty values. Modes:
  - False: do not clean NA values
  - 'remove row': removes rows that contain a cell with an NA value
  - 'mean': substitutes empty NA cells with the mean of that column
  - 'mode': substitutes empty NA cells with the mode of that column
  - '*': any other value passed here is substituted into empty NA cells
- normalize: if True, all non-binary columns will be normalized (columns containing only the values 0 or 1 are excluded)
- remove_columns: list of columns to remove; these are usually unrelated features such as the ID column
- verbose: print progress in terminal/cmd
- returns: processed and clean pandas DataFrame
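The na_cleaner_mode options can be pictured with the following sketch in plain pandas. The helper fill_na_sketch is hypothetical and only mirrors the documented modes; the library's own handling (for example of non-numeric columns under 'mean') may differ.

```python
import pandas as pd

def fill_na_sketch(df, mode):
    """Hypothetical helper mirroring the documented na_cleaner_mode values."""
    if mode is False:
        # Leave NA values untouched
        return df
    if mode == "remove row":
        # Drop every row that has at least one NA cell
        return df.dropna()
    if mode == "mean":
        # Fill numeric columns with their mean (assumption: others are left as-is)
        return df.fillna(df.mean(numeric_only=True))
    if mode == "mode":
        # Fill each column with its most frequent value
        return df.fillna(df.mode().iloc[0])
    # Any other value is used directly as the replacement
    return df.fillna(mode)
```

For example, fill_na_sketch(df, "remove row") shrinks the frame, while fill_na_sketch(df, -1) keeps every row and writes -1 into the empty cells.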
Prediction
In the prediction phase, put the examples to be predicted in a pandas DataFrame and run them through the AutoDataCleaner.clean_me function with the same parameters you used during training.
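One practical point when reusing the same parameters: if the one-hot step derives its columns from the values present in the data (an assumption, not something this description confirms), a small prediction batch may come back with fewer columns than the training set. The sketch below, using a hypothetical helper align_to_training and made-up example frames, shows one way to align the cleaned prediction data with the training columns in plain pandas.

```python
import pandas as pd

def align_to_training(pred_clean, train_columns):
    """Hypothetical helper: give prediction data the same columns, in the
    same order, as the cleaned training data. Missing columns are filled with 0."""
    return pred_clean.reindex(columns=list(train_columns), fill_value=0)

# Illustrative frames standing in for cleaned training and prediction data
train_clean = pd.DataFrame(
    {"weight": [0.5, -0.5], "color_Blue": [0, 1], "color_Green": [1, 0]}
)
pred_clean = pd.DataFrame({"weight": [0.1], "color_Green": [1]})

aligned = align_to_training(pred_clean, train_clean.columns)
```

Here aligned has the columns weight, color_Blue and color_Green, with color_Blue set to 0 for the single prediction row.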
Contribution
Please feel free to send me feedback at ofcourse7878@gmail.com, submit an issue, or make a pull request!