
AutoClean - Automated Data Preprocessing & Cleaning

AutoClean automates data preprocessing & cleaning for your next Data Science project in Python.

Read more on the AutoClean algorithm in my Medium article Automated Data Cleaning with Python.


Description

It is commonly known among Data Scientists that data cleaning and preprocessing make up a major part of a data science project. And, you will probably agree with me that it is not the most exciting part of the project. Wouldn't it be great if this part could be automated?

AutoClean helps you exactly with that: it performs preprocessing and cleaning of data in Python in an automated manner, so that you can save time when working on your next project.

AutoClean supports:

  • Various imputation methods for missing values
  • Handling of outliers
  • Encoding of categorical data (OneHot, Label)
  • Extraction of datetime values
  • and more!

Basic Usage

AutoClean takes a Pandas dataframe as input and has built-in logic for automatically cleaning and processing your data. You can let your dataset run through the default AutoClean pipeline by using:

from AutoClean import AutoClean
pipeline = AutoClean(dataset)

The resulting output dataframe can be accessed by using:

pipeline.output

> Output:
    col_1  col_2  ...  col_n
1   data   data   ...  data
2   data   data   ...  data
... ...    ...    ...  ...
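
For a quick end-to-end test, you can build a small dataframe with a missing numerical value, a missing category and a datetime column and run it through the default pipeline. This is a minimal sketch: the column names and values are placeholders, and it assumes the package has been installed, e.g. via pip install py-AutoClean.

import pandas as pd
from AutoClean import AutoClean

# small example dataframe with placeholder values
df = pd.DataFrame({
    'age':   [25, None, 47, 31],
    'city':  ['Berlin', 'Paris', None, 'Berlin'],
    'visit': ['2021-05-01 10:30:00', '2021-05-02 12:00:00',
              '2021-05-03 09:15:00', '2021-05-04 18:45:00']
})

pipeline = AutoClean(df)     # run the default AutoClean pipeline
cleaned = pipeline.output    # access the processed dataframe
print(cleaned.head())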

Adjustable Parameters

In some cases, the default settings of AutoClean might not fit your data optimally. It therefore also supports manual settings, so that you can adjust the pipeline to whatever processing steps your data needs.

It has the following adjustable parameters, for which the options and descriptions can be found below:

AutoClean(dataset, missing_num='auto', missing_categ='auto', encode_categ=['auto'],     
          extract_datetime='s', outliers='winz', outlier_param=1.5, logfile=True, verbose=False)
Parameter        | Type       | Default Value | Other Values
-----------------|------------|---------------|-------------------------------------------------------------
missing_num      | str        | 'auto'        | 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False
missing_categ    | str        | 'auto'        | 'logreg', 'knn', 'most_frequent', 'delete', False
encode_categ     | list       | ['auto']      | ['onehot'], ['label'], False; to encode only specific columns, add a list of column names or indexes: ['auto', ['col1', 2]]
extract_datetime | str        | 's'           | 'D', 'M', 'Y', 'h', 'm', False
outliers         | str        | 'winz'        | 'delete', False
outlier_param    | int, float | 1.5           | any int or float, False
logfile          | bool       | True          | False
verbose          | bool       | False         | True
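
For example, to label-encode only the column 'col1' and the column at index 2 (the list syntax described in the encode_categ row above), delete outliers instead of winsorizing them, and skip datetime extraction, a call might look like this. The column names are placeholders and dataset is your own Pandas dataframe.

from AutoClean import AutoClean
pipeline = AutoClean(dataset,
                     encode_categ=['auto', ['col1', 2]],  # encode only 'col1' and the column at index 2
                     outliers='delete',                   # drop outlier rows instead of winsorizing
                     extract_datetime=False)              # skip datetime extraction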

missing_num

Defines how numerical missing values in the data are handled. Missing values can be predicted, imputed or deleted. When set to auto, AutoClean first attempts to predict the missing values with Linear Regression, and the values that could not be predicted are imputed with K-NN.

You can specify the handling method by setting missing_num to: 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete' or to False if you want to skip this step.
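
To make the 'auto' behaviour more tangible, here is a simplified sketch of the general idea: predict the missing values of a numeric column from the other numeric columns with Linear Regression where possible, then fall back to K-NN imputation. This is an illustration using scikit-learn, not AutoClean's actual implementation, and impute_numeric is a hypothetical helper.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer

def impute_numeric(df, target_col):
    # numeric feature columns used to predict the target column
    features = [c for c in df.select_dtypes('number').columns if c != target_col]
    known = df[df[target_col].notna()].dropna(subset=features)
    to_predict = df[df[target_col].isna()].dropna(subset=features)
    if features and len(known) and len(to_predict):
        # predict missing values with Linear Regression where the features are complete
        model = LinearRegression().fit(known[features], known[target_col])
        df.loc[to_predict.index, target_col] = model.predict(to_predict[features])
    # values that could not be predicted are imputed with K-NN
    num_cols = df.select_dtypes('number').columns
    df[num_cols] = KNNImputer(n_neighbors=3).fit_transform(df[num_cols])
    return df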

missing_categ

Defines how categorical missing values in the data are handled. Missing values can be predicted, imputed or deleted. When set to auto, AutoClean first attempts to predict the missing values with Logistic Regression, and the values that could not be predicted are imputed with K-NN.

You can specify the handling method by setting missing_categ to: 'logreg', 'knn', 'most_frequent', 'delete' or to False if you want to skip this step.
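
As an illustration of what the 'most_frequent' option corresponds to conceptually, categorical columns can be mode-imputed, e.g. with scikit-learn's SimpleImputer. This is a sketch only, assuming dataset is a Pandas dataframe whose categorical columns have the object dtype; it is not AutoClean's internal code.

from sklearn.impute import SimpleImputer

cat_cols = dataset.select_dtypes(include='object').columns
if len(cat_cols):
    imputer = SimpleImputer(strategy='most_frequent')          # mode imputation per column
    dataset[cat_cols] = imputer.fit_transform(dataset[cat_cols])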

extract_datetime

AutoClean can search the data for datetime features and extract the values to separate columns. When set to 's', it extracts the datetime values down to the second, i.e. day, month, year, hour, minute and second.

You can set the granularity of the extraction manually by setting extract_datetime to 'D' for day, 'M' for month, 'Y' for year, 'h' for hour, 'm' for minutes, or to False if you want to skip this step.
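
Conceptually, the extraction corresponds to something like the following pandas sketch. The source column name 'timestamp' and the names of the new columns are placeholders; AutoClean detects datetime columns automatically.

import pandas as pd

ts = pd.to_datetime(dataset['timestamp'])   # 'timestamp' is a placeholder column name
dataset['Day']    = ts.dt.day
dataset['Month']  = ts.dt.month
dataset['Year']   = ts.dt.year
dataset['Hour']   = ts.dt.hour
dataset['Minute'] = ts.dt.minute
dataset['Second'] = ts.dt.second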

outliers

Defines how outliers in the data are handled. Outliers can be handled with two different methods: winsorization or deletion. You can specify the method by setting outliers to 'winz' for winsorization, 'delete' for deletion, or to False if you want to skip this step.

When are outliers considered to be outliers?
Observations are considered outliers if they are outside the following bounds:

[Q1 - 1.5*IQR , Q3 + 1.5*IQR]

where Q1 and Q3 are the first and third quartiles of the feature values, and IQR = Q3 - Q1 is the interquartile range.

As soon as a value falls below the lower bound or above the upper bound, the chosen outlier handling method is applied, i.e. the value is either winsorized, meaning it is replaced by the respective lower or upper bound, or the observation is deleted.

The 1.5 multiplier in these bounds corresponds to the outlier_param parameter and can be customized (see below).
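
For illustration, a minimal sketch of IQR-based winsorization for a single numeric column could look like this. The column name 'col_1' is a placeholder, and this is not AutoClean's exact implementation.

q1 = dataset['col_1'].quantile(0.25)
q3 = dataset['col_1'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# winsorization: clip values outside the bounds to the respective bound
dataset['col_1'] = dataset['col_1'].clip(lower=lower, upper=upper)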

outlier_param

! Recommended not to change default value

You can customize the outlier bounds by changing the default outlier_param value of 1.5 to any integer or float of your choice. The bounds then become [Q1 - outlier_param*IQR , Q3 + outlier_param*IQR].

logfile

Defines whether a logfile should be generated while the AutoClean process runs. If set to True, it will create an autoclean.log file in your current working directory.

You can view a sample logfile here.

verbose

Defines whether the logfile output should be shown on the console while the AutoClean process runs. Set to True if you want to follow the process logs in real-time.
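
For example, to run the pipeline without writing a logfile, or to follow the logs on the console instead:

from AutoClean import AutoClean
pipeline = AutoClean(dataset, logfile=False)   # do not write autoclean.log
pipeline = AutoClean(dataset, verbose=True)    # print log output to the console in real time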
