AutoDataCleaner

Simple and automatic data cleaning in one line of code! It performs One Hot Encoding, Cleans Dirty/Empty Values, Normalizes values and Removes unwanted columns all in one line of code. Get your data ready for model training and fitting quickly.

Features

  1. Uses Pandas DataFrames, so there is no new syntax to learn
  2. One-hot encoding: encodes non-numeric columns as one-hot columns
  3. Normalization: normalizes columns (binary [1/0] columns are excluded)
  4. Cleans Dirty/None/NA/Empty values: replaces None values with the mean or mode of the column, deletes rows that contain a None cell, or substitutes None values with a pre-defined value
  5. Delete Unwanted Columns: drops unwanted columns (usually the 'id' column)
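
For readers who want a feel for what these steps amount to, here is a rough sketch in plain pandas. It is not AutoDataCleaner's implementation: the helper `is_binary` is hypothetical, and the normalization formula (z-score with the sample standard deviation) is an assumption, since the exact formula is not stated here.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "color": ["Green", "Blue", "Green", "Green"],
    "weight": [3.0, 4.0, 5.0, 4.0],
})

# Drop unwanted columns (feature 5).
df = df.drop(columns=["id"])

# One-hot encode the non-numeric column (feature 2).
encoded = pd.get_dummies(df, columns=["color"], dtype=int)


def is_binary(col):
    """Hypothetical helper: True if the column only holds 0/1 values."""
    return set(col.dropna().unique()) <= {0, 1}


# Normalize every non-binary column (feature 3).
# Assumption: z-score, i.e. (x - mean) / sample standard deviation.
for name in encoded.columns:
    col = encoded[name]
    if not is_binary(col):
        encoded[name] = (col - col.mean()) / col.std()

print(encoded)
```

The one-hot columns are left untouched because they only contain 0 and 1, which mirrors the exclusion described in feature 3.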

Installation

Using pip

pip install AutoDataCleaner

Cloning repo:

Clone the repository and run pip install -e . inside the repository directory

Install from repository directly using pip install git+git://github.com/sinkingtitanic/AutoDataCleaner.git#egg=AutoDataCleaner

Quick One-line Usage:

    AutoDataCleaner.clean_me(df, 
                            one_hot=True, 
                            na_cleaner_mode="mean", 
                            normalize=True, 
                            remove_columns=[], 
                            verbose=True)

Example

import pandas as pd
import AutoDataCleaner

df = pd.DataFrame([
                    [1, "Green", 3], 
                    [2, "Blue", 4],
                    [3, "Green", 5], 
                    [4, "Green", None]
                ], columns=['id', 'color', 'weight'])

AutoDataCleaner.clean_me(df, remove_columns=['id']) # see 'Explaining Parameters' for more parameters

Example output:

 +++++++++++++++ DATA CLEANING STARTED ++++++++++++++++ 
 = DataCleaner: Performing One-Hot encoding... 
 = DataCleaner: Performing None/NA/Empty values cleaning... 
 = DataCleaner: Performing dataset normalization... 
 = DataCleaner: Performing removal of unwanted columns... 
 +++++++++++++++ DATA CLEANING FINISHED +++++++++++++++ 
	weight 	color_Blue 	color_Green
0 	-0.855528 	0 	1
1 	-0.475293 	1 	0
2 	-0.095059 	0 	1
3 	1.425880 	0 	0

Explaining Parameters

AutoDataCleaner.clean_me(df, one_hot=True, na_cleaner_mode="mean", normalize=True, remove_columns=[], verbose=True)

Parameters and what they mean:

  • df: input Pandas DataFrame on which the cleaning will be performed
  • one_hot: if True, all non-numeric columns will be encoded as one-hot columns
  • na_cleaner_mode: which technique to use when dealing with None/NA/Empty values. Modes:
    • False: skip NA cleaning
    • 'remove row': removes every row that has a cell with an NA value
    • 'mean': substitutes empty NA cells with the mean of that column
    • 'mode': substitutes empty NA cells with the mode of that column
    • '*': any other value will substitute empty NA cells with that particular value passed here
  • normalize: if True, all non-binary columns will be normalized (columns whose values are only 0 or 1 are excluded)
  • remove_columns: list of columns to remove; these are usually unrelated features such as the ID column
  • verbose: print progress in terminal/cmd
  • returns: processed and clean Pandas DataFrame
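
The na_cleaner_mode options map fairly naturally onto standard pandas calls. The sketch below shows one way to express them; the function `clean_na_sketch` is hypothetical and is not the code AutoDataCleaner runs internally.

```python
import pandas as pd


def clean_na_sketch(df, mode):
    """Hypothetical stand-in for the na_cleaner_mode options, pandas only."""
    if mode is False:
        return df  # leave NA values as they are
    if mode == "remove row":
        return df.dropna()  # drop every row containing an NA cell
    if mode == "mean":
        # Fill each numeric column with its own mean.
        return df.fillna(df.mean(numeric_only=True))
    if mode == "mode":
        # DataFrame.mode() can return several rows on ties; take the first.
        return df.fillna(df.mode().iloc[0])
    # Any other value is used directly as the replacement.
    return df.fillna(mode)


df = pd.DataFrame({"weight": [3.0, 4.0, 4.0, None]})

removed = clean_na_sketch(df, "remove row")
mean_filled = clean_na_sketch(df, "mean")
mode_filled = clean_na_sketch(df, "mode")
const_filled = clean_na_sketch(df, 0)
```

Note that 'mean' only makes sense for numeric columns, which is why the sketch restricts the mean to numeric data.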

Prediction

In the prediction phase, put the examples to be predicted in a Pandas DataFrame and run them through the AutoDataCleaner.clean_me function with the same parameters you used during training.
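
One thing worth checking at prediction time is that the cleaned frame has the same columns as the training frame. This README does not say whether clean_me remembers the categories seen during training. If it derives one-hot columns only from the frame it is given (an assumption), a small prediction batch can end up with fewer columns. The pandas-only sketch below illustrates the situation and one possible way to realign columns. The frames `train` and `new` are made-up data.

```python
import pandas as pd

train = pd.DataFrame({
    "color": ["Green", "Blue", "Green"],
    "weight": [3.0, 4.0, 5.0],
})
new = pd.DataFrame({"color": ["Green"], "weight": [4.5]})

train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)
new_encoded = pd.get_dummies(new, columns=["color"], dtype=int)

# new_encoded has no 'color_Blue' column because 'Blue' never appears in it.
# Reindex against the training columns and fill missing one-hot columns with 0.
aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(aligned)
```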

Contribution

Please feel free to send me feedback at ofcourse7878@gmail.com, submit an issue, or make a pull request!
