AutoClean - Python Package for Automated Preprocessing & Cleaning of Datasets

Project description

AutoClean - Automated Data Preprocessing & Cleaning

AutoClean automates data preprocessing & cleaning for your next Data Science project in Python.

Read more on the AutoClean algorithm in my Medium article Automated Data Cleaning with Python.

View the AutoClean project on GitHub.


Description

It is commonly known among Data Scientists that data cleaning and preprocessing make up a major part of a data science project. And, you will probably agree with me that it is not the most exciting part of the project. Wouldn't it be great if this part could be automated?

AutoClean helps you exactly with that: it performs preprocessing and cleaning of data in Python in an automated manner, so that you can save time when working on your next project.

AutoClean supports:

  • Handling of duplicates
  • Various imputation methods for missing values
  • Handling of outliers
  • Encoding of categorical data (OneHot, Label)
  • Extraction of datetime values
  • and more!

Basic Usage

AutoClean takes a Pandas dataframe as input and has built-in logic for how to automatically clean and process your data. You can run your dataset through the default AutoClean pipeline by using:

from AutoClean import AutoClean
pipeline = AutoClean(dataset)

The resulting output dataframe can be accessed by using:

pipeline.output

> Output:

     col_1  col_2  ...  col_n
1    data   data   ...  data
2    data   data   ...  data
...  ...    ...    ...  ...
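For intuition, here is a plain-pandas sketch of the kind of steps the default pipeline applies, shown here for duplicate removal and mean imputation. This mirrors AutoClean's behavior rather than calling the package, and the toy column names are made up:

```python
import pandas as pd

# Toy dataset with one exact duplicate row and one missing numeric value
df = pd.DataFrame({
    'age':  [25, 25, None, 40],
    'city': ['Rome', 'Rome', 'Oslo', 'Oslo'],
})

# Step 1: drop exact duplicates, as the duplicate-handling step does
df = df.drop_duplicates().reset_index(drop=True)

# Step 2: impute missing numeric values with the column mean
# (mean imputation is one of the missing_num options)
df['age'] = df['age'].fillna(df['age'].mean())

print(df)
```

After running, the duplicate row is gone and the missing age is replaced by the mean of the remaining values (32.5).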

By setting the mode parameter, you can define in which mode AutoClean will run:

  • Automated processing (mode='auto'): the data will be analyzed and cleaned automatically by being passed through all the steps in the pipeline. All the parameters are set to 'auto'.
  • Manual processing (mode='manual'): you can manually define the processing steps that AutoClean will perform. All the parameters are set to False, except the ones that you define individually.

For example, you can choose to only handle outliers in your data, and skip all other processing steps, by using:

pipeline = AutoClean(dataset, mode='manual', outliers='auto')
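For intuition, the interquartile-range rule behind the outliers='winz' (winsorization) option, with the default outlier_param=1.5, can be sketched in plain pandas. This illustrates the general technique, not AutoClean's exact implementation:

```python
import pandas as pd

def winsorize_iqr(s: pd.Series, factor: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - factor*IQR, Q3 + factor*IQR] to those bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return s.clip(lower=lower, upper=upper)

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
print(winsorize_iqr(s))           # 100 is clipped down to the upper bound
```

Here factor plays the role of outlier_param: a larger value widens the accepted range, so fewer points are treated as outliers.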

Adjustable Parameters

In some cases, the default settings of AutoClean might not optimally fit your data. Therefore, it also supports manual settings so that you can adjust the pipeline to whatever processing steps you need.

It has the following adjustable parameters, for which the options and descriptions can be found below:

AutoClean(dataset, mode='auto', missing_num=False, missing_categ=False, encode_categ=False,     
          extract_datetime=False, outliers=False, outlier_param=1.5, 
          logfile=True, verbose=False)
Parameter         Type        Default  Other Values
mode              str         'auto'   'manual'
missing_num       str         False    'auto', 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False
missing_categ     str         False    'auto', 'logreg', 'knn', 'most_frequent', 'delete', False
encode_categ      list        False    'auto', ['onehot'], ['label'], False; to encode only specific columns, add a list of column names or indexes: ['auto', ['col1', 2]]
extract_datetime  str         False    'auto', 'D', 'M', 'Y', 'h', 'm', 's'
outliers          str         False    'auto', 'winz', 'delete'
outlier_param     int, float  1.5      any int or float, False
logfile           bool        True     False
verbose           bool        False    True
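For intuition, the granular datetime extraction that extract_datetime controls (e.g. 's' extracts components down to seconds) can be sketched with pandas' .dt accessor. This mirrors the idea, not AutoClean's internals, and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'ts': ['2021-10-05 14:30:15', '2021-12-24 08:01:59']})
df['ts'] = pd.to_datetime(df['ts'])

# Extract components down to seconds, as extract_datetime='s' would;
# a coarser setting like 'D' would stop after Day/Month/Year
df['Day'] = df['ts'].dt.day
df['Month'] = df['ts'].dt.month
df['Year'] = df['ts'].dt.year
df['Hour'] = df['ts'].dt.hour
df['Minute'] = df['ts'].dt.minute
df['Sec'] = df['ts'].dt.second

print(df)
```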

Please see the AutoClean documentation on GitHub for detailed descriptions of the parameters.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py-AutoClean-1.1.0a2.tar.gz (11.0 kB)

Uploaded Source

File details

Details for the file py-AutoClean-1.1.0a2.tar.gz.

File metadata

  • Download URL: py-AutoClean-1.1.0a2.tar.gz
  • Upload date:
  • Size: 11.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.5.0 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.8

File hashes

Hashes for py-AutoClean-1.1.0a2.tar.gz

Algorithm    Hash digest
SHA256       04733bb94ba31c407ecf9e634f6aa008cee661a2ccfc5094e0f2ea50a9650b64
MD5          2e3ff248a3865c86474be1828ae181aa
BLAKE2b-256  920f1b16ff623b7583a8ac2ff9ff2f03dbc37720210e965cf659f04a31d1eed9
