AutoClean - Python Package for Automated Preprocessing & Cleaning of Datasets

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
Topic
- Software Development :: Build Tools

Project description

AutoClean - Automated Data Preprocessing & Cleaning

AutoClean automates data preprocessing & cleaning for your next Data Science project in Python.

Read more on the AutoClean algorithm in my Medium article Automated Data Cleaning with Python.

View the AutoClean project on GitHub.

Description

Data Scientists commonly acknowledge that data cleaning and preprocessing make up a major part of a data science project. You will probably also agree that it is not the most exciting part. Wouldn't it be great if this part could be automated?

AutoClean helps you exactly with that: it performs preprocessing and cleaning of data in Python in an automated manner, so that you can save time when working on your next project.

AutoClean supports:

- Handling of duplicates [ NEW with version v1.1.0 ]
- Various imputation methods for missing values
- Handling of outliers
- Encoding of categorical data (OneHot, Label)
- Extraction of datetime values
- and more!

Basic Usage

AutoClean takes a Pandas dataframe as input and has built-in logic for automatically cleaning and processing your data. You can run your dataset through the default AutoClean pipeline with:

from AutoClean import AutoClean
pipeline = AutoClean(dataset)

The resulting output dataframe can be accessed by using:

pipeline.output

> Output:
    col_1  col_2  ...  col_n
1   data   data   ...  data
2   data   data   ...  data
... ...    ...    ...  ...

Adjustable Parameters

In some cases, the default settings of AutoClean might not optimally fit your data. Therefore it also supports manual settings so that you can adjust it to whatever processing steps you might need.

It has the following adjustable parameters, for which the options and descriptions can be found below:

AutoClean(dataset, mode='auto', duplicates=False, missing_num=False, missing_categ=False, 
          encode_categ=False, extract_datetime=False, outliers=False, outlier_param=1.5, 
          logfile=True, verbose=False)

Parameter	Type	Default Value	Other Values
mode	`str`	`'auto'`	`'manual'`
duplicates	`str`	`False`	`'auto'`, `True`
missing_num	`str`	`False`	`'auto'`, `'linreg'`, `'knn'`, `'mean'`, `'median'`, `'most_frequent'`, `'delete'`, `False`
missing_categ	`str`	`False`	`'auto'`, `'logreg'`, `'knn'`, `'most_frequent'`, `'delete'`, `False`
encode_categ	`list`	`False`	`'auto'`, `['onehot']`, `['label']`, `False` ; to encode only specific columns add a list of column names or indexes: `['auto', ['col1', 2]]`
extract_datetime	`str`	`False`	`'auto'`, `'D'`, `'M'`, `'Y'`, `'h'`, `'m'`, `'s'`
outliers	`str`	`False`	`'auto'`, `'winz'`, `'delete'`
outlier_param	`int`, `float`	`1.5`	any int or float, `False`
logfile	`bool`	`True`	`False`
verbose	`bool`	`False`	`True`
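
The `encode_categ` row packs two things into one value: an encoding method and, optionally, a list of target columns given by name or by index. The sketch below shows one way such a value could be unpacked. `resolve_encode_spec` is a hypothetical helper, not part of AutoClean, and it assumes that indexes are zero-based column positions and that omitting the column list means "consider every column".

```python
def resolve_encode_spec(columns, encode_categ):
    """Split an encode_categ value into (method, target column names).

    `columns` is the ordered list of column names of the dataframe.
    Hypothetical helper for illustration; not AutoClean's own code.
    """
    if encode_categ is False:
        return None, []                      # encoding step is skipped
    if encode_categ == 'auto':
        return 'auto', list(columns)         # bare string form
    method = encode_categ[0]                 # e.g. 'auto', 'onehot', 'label'
    if len(encode_categ) == 1:
        return method, list(columns)         # no explicit column list given
    # Integers are treated as (assumed zero-based) positions, strings as names.
    targets = [columns[c] if isinstance(c, int) else c for c in encode_categ[1]]
    return method, targets


columns = ['col0', 'col1', 'col2']
print(resolve_encode_spec(columns, ['auto', ['col1', 2]]))
# ('auto', ['col1', 'col2'])
```

With this reading, `['auto', ['col1', 2]]` targets the column named `col1` and the column at position 2.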

By setting the mode parameter, you can define in which mode AutoClean will run:

- Automated processing (mode = 'auto'): the data is analyzed and cleaned automatically by passing it through all the steps in the pipeline. All the processing parameters are set to 'auto'.
- Manual processing (mode = 'manual'): you define the processing steps that AutoClean performs. All the processing parameters are set to False, except the ones that you set individually.
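
The two modes can be pictured as two ways of filling in the processing-step parameters. The following is a minimal sketch of that rule, not AutoClean's actual implementation: `resolve_steps` and `STEP_PARAMS` are hypothetical names, and the sketch assumes that individual settings are ignored in 'auto' mode, which the description implies but does not state outright.

```python
# Processing-step parameters taken from the signature above.
STEP_PARAMS = ('duplicates', 'missing_num', 'missing_categ',
               'encode_categ', 'extract_datetime', 'outliers')


def resolve_steps(mode='auto', **overrides):
    """Return the effective setting of each processing step (illustrative)."""
    unknown = set(overrides) - set(STEP_PARAMS)
    if unknown:
        raise ValueError(f"unknown step parameter(s): {sorted(unknown)}")
    if mode == 'auto':
        # Every step runs with its automatic setting (overrides assumed ignored).
        return {name: 'auto' for name in STEP_PARAMS}
    if mode == 'manual':
        # Every step is off unless it was set explicitly.
        steps = {name: False for name in STEP_PARAMS}
        steps.update(overrides)
        return steps
    raise ValueError("mode must be 'auto' or 'manual'")


print(resolve_steps('manual', outliers='auto'))
```

In this sketch, `resolve_steps('manual', outliers='auto')` leaves every step at `False` except `outliers`.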

For example, you can choose to only handle outliers in your data, and skip all other processing steps by using:

pipeline = AutoClean(dataset, mode='manual', outliers='auto')
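
To give a feel for what an outlier step with `outlier_param=1.5` might do, here is a standard-library sketch of the common 1.5×IQR rule. It is an illustration under assumptions, not AutoClean's code: it assumes `outlier_param` is the multiplier applied to the interquartile range, that `'winz'` clips values to the resulting bounds (winsorization), and that `'delete'` drops values outside them. The function names are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, outlier_param=1.5):
    """Lower and upper bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    spread = outlier_param * (q3 - q1)
    return q1 - spread, q3 + spread


def winsorize(values, outlier_param=1.5):
    """Clip values that fall outside the bounds to the nearest bound."""
    lower, upper = iqr_bounds(values, outlier_param)
    return [min(max(v, lower), upper) for v in values]


def delete_outliers(values, outlier_param=1.5):
    """Drop values that fall outside the bounds."""
    lower, upper = iqr_bounds(values, outlier_param)
    return [v for v in values if lower <= v <= upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(winsorize(data))        # [1, 2, 3, 4, 5, 6, 7, 8, 9, 14.5]
print(delete_outliers(data))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A larger `outlier_param` widens the bounds, so fewer values are treated as outliers.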

Please see the AutoClean documentation on GitHub for a detailed usage guide and descriptions of the parameters.

Release history

Version	Release date
1.1.3 (this version)	Aug 19, 2022
1.1.2	Jul 29, 2022
1.1.1	Jul 29, 2022
1.1.0	Jul 3, 2022
1.1.0b5 pre-release	Jul 3, 2022
1.1.0b4 pre-release	Jul 3, 2022
1.1.0b3 pre-release	Jul 3, 2022
1.1.0b2 pre-release	Jul 3, 2022
1.1.0b0 pre-release	Jul 3, 2022
1.1.0a2 pre-release	Jul 3, 2022
1.0.0	Mar 29, 2022
0.0.10a0 pre-release	Mar 29, 2022
0.0.8a0 pre-release	Mar 29, 2022
0.0.7a0 pre-release	Mar 29, 2022
0.0.6a0 pre-release	Mar 23, 2022
0.0.5a0 pre-release	Mar 23, 2022
0.0.4a0 pre-release	Mar 23, 2022
0.0.3a0 pre-release	Mar 23, 2022
0.0.2a0 pre-release	Mar 23, 2022
0.0.1b0 pre-release	Mar 29, 2022
0.0.1a0 pre-release	Mar 23, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py-AutoClean-1.1.3.tar.gz (9.5 kB)

Uploaded Aug 19, 2022 Source

Hashes for py-AutoClean-1.1.3.tar.gz

Algorithm	Hash digest
SHA256	`fdfaf7b59471036397bc24cb34e4e2f8ce532ffb2e76f562b93429a764f63b29`
MD5	`09ff8f0d8e657900ff59c31f35f62636`
BLAKE2b-256	`5d20c3cce583378ef2d416628f90d338705bb64d19f8213abc2179e0fac1b199`