A Python package for automated data cleaning.

Project description

This module automates the cleaning of datasets for machine learning and deep learning applications. The initial version comprises data cleaning operations for classification tasks. Further releases will incorporate methods for regression and other AI tasks.

Features

The module automatically handles the following problems in a dataset:

* Duplicate rows and columns

* Non-numerical datatypes

* Missing data

* Different ranges of data

The module also returns a summary description of the operations performed on the dataset during cleaning.

The module is designed to handle errors and exceptions: an error in one operation does not affect the other operations.

Handling duplicates

The module will automatically remove the duplicate rows and columns in the dataset.
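
The behaviour described above can be illustrated with a minimal pandas sketch. This is not the module's own implementation; the helper name `drop_duplicates` and the sample columns are assumptions made for illustration only.

```python
import pandas as pd


def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop repeated rows, then columns whose values repeat an earlier column.

    Hypothetical helper: illustrates the idea, not the package's code.
    """
    # Keep the first occurrence of each repeated row.
    deduped = df.drop_duplicates()
    # Transposing turns columns into rows, so duplicated() flags every
    # column whose values match a column that appeared earlier.
    duplicate_columns = deduped.T.duplicated()
    return deduped.loc[:, ~duplicate_columns]


df = pd.DataFrame(
    {
        "age": [25, 32, 25, 41],
        "age_copy": [25, 32, 25, 41],  # same values as "age"
        "city": ["Pune", "Delhi", "Pune", "Agra"],
    }
)
cleaned = drop_duplicates(df)
print(cleaned)
```

Columns are compared by value, so two columns with different names but identical contents count as duplicates in this sketch.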

Non-numerical datatypes

The module will automatically convert non-numerical datatypes to numerical ones using a label encoder.
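
As a rough illustration of label encoding, the sketch below maps each distinct label in a non-numeric column to an integer. It uses pandas categorical codes, which assign integers to the sorted unique labels in the same way a typical label encoder does; the helper name `encode_non_numeric` and the sample data are assumptions, and the module itself may encode differently.

```python
import pandas as pd


def encode_non_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Replace each non-numeric column with integer codes.

    Hypothetical helper: illustrates label encoding, not the package's code.
    """
    encoded = df.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # Sorted unique labels are mapped to 0..n-1.
            encoded[column] = encoded[column].astype("category").cat.codes
    return encoded


df = pd.DataFrame(
    {"colour": ["red", "blue", "red", "green"], "size": [1, 2, 3, 4]}
)
encoded = encode_non_numeric(df)
print(encoded)
```

The integer codes carry no real ordering, which is usually acceptable for a classification target but can mislead models when applied to nominal input features.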

Missing data

The module removes rows that have null values only if the dataset is very large. Otherwise, it performs imputation using statistical measures or machine learning models. If the user specifies a method, that method is used; otherwise the default imputation method (standard deviation) is applied.
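
To make the statistical imputation concrete, here is a small pandas sketch that fills missing values using the method names listed under "Options for the argument method". The function `impute`, its default, and the sample data are assumptions for illustration; the sketch assumes numeric columns and does not cover the row-dropping or model-based paths.

```python
import pandas as pd


def impute(df: pd.DataFrame, method: str = "std") -> pd.DataFrame:
    """Fill missing values column by column.

    Hypothetical helper: illustrates the idea, not the package's code.
    """
    if method == "ffill":
        return df.ffill()  # carry the previous value forward
    if method == "bfill":
        return df.bfill()  # carry the next value backward
    statistics = {
        "mean": lambda s: s.mean(),
        "median": lambda s: s.median(),
        "mode": lambda s: s.mode().iloc[0],
        "variance": lambda s: s.var(),
        "std": lambda s: s.std(),
    }
    if method not in statistics:
        raise ValueError(f"unknown imputation method: {method!r}")
    filled = df.copy()
    for column in filled.columns:
        # The statistic is computed from the non-missing values only.
        fill_value = statistics[method](filled[column])
        filled[column] = filled[column].fillna(fill_value)
    return filled


df = pd.DataFrame({"glucose": [90.0, None, 110.0], "bmi": [22.0, 30.0, None]})
mean_filled = impute(df, method="mean")
forward_filled = impute(df, method="ffill")
print(mean_filled)
```

Forward fill leaves a leading missing value untouched and backward fill leaves a trailing one, so those two methods may need a second pass on some datasets.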

Different ranges of data

The module handles this based on the type of classification or regression model the user intends to train. Generally, models that use gradient descent (such as linear regression and neural networks) and distance-based algorithms (such as KMeans and KNN) require standardisation, whereas algorithms such as SVM and naive Bayes require normalisation, and decision trees, random forests, bagging and boosting algorithms require no scaling at all.
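
The selection rule described above might be sketched as follows. The grouping of algorithm names into `STANDARDIZE` and `NORMALIZE`, the function `scale_features`, and the choice to leave unlisted algorithms unscaled are all assumptions made for illustration; the module's actual mapping may differ.

```python
import pandas as pd

# Assumed grouping, taken from the description above.
STANDARDIZE = {"ANN", "KNN", "KMeans"}
NORMALIZE = {"SVM", "Naive Bayes"}


def scale_features(features: pd.DataFrame, algo: str) -> pd.DataFrame:
    """Scale numeric feature columns according to the target algorithm.

    Hypothetical helper: illustrates the idea, not the package's code.
    """
    if algo in STANDARDIZE:
        # Standardisation (z-score): zero mean and unit variance per column.
        return (features - features.mean()) / features.std(ddof=0)
    if algo in NORMALIZE:
        # Normalisation (min-max): rescale each column to the range [0, 1].
        return (features - features.min()) / (features.max() - features.min())
    # Tree-based and unlisted algorithms: return the data unscaled.
    return features.copy()


features = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "income": [1000.0, 2000.0, 3000.0]}
)
standardized = scale_features(features, "KNN")
normalized = scale_features(features, "SVM")
unscaled = scale_features(features, "Random Forest")
print(normalized)
```

A constant column would cause a division by zero in both scaling branches, so a fuller version would need to skip or special-case such columns.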

SYNTAX

from AutoDataCleaner import AutoDataClean

data_cleaning = AutoDataClean.DataCleaner(data, algo=None, method=None, target_name='Outcome', imputer='knn', k_neighbors=5)

To get the data

data = data_cleaning.data

To get the description of the operations performed

print(data_cleaning.descriptions)

Options for the argument algo

'SVM', 'ANN', 'KNN', 'Naive Bayes', 'KMeans', 'LDA', 'QDA', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'xgboost', 'Gradient Boosting', 'adaboost'

Options for the argument method

'bfill', 'ffill', 'mean', 'median', 'mode', 'variance', 'std'

Options for the argument imputer

'simple', 'knn'

Download files

Source Distribution

AutoCleanPy-0.0.1.tar.gz (4.0 kB)

Built Distribution

AutoCleanPy-0.0.1-py3-none-any.whl (4.4 kB, Python 3)
