This is a Python package for automated data cleaning operations.
Project description
This module automates the data cleaning of datasets for machine learning and deep learning applications. The initial version
comprises data cleaning operations for the task of classification. Further releases will incorporate methods for
regression and other AI tasks.
Features
The module deals with the following errors in the dataset automatically:
* Duplicate rows and columns
* Non-numerical datatypes
* Missing data
* Different ranges of data
The module also returns a summary description of the operations performed on the dataset during data cleaning.
The module is designed to handle errors and exceptions: a failure in one operation does not affect the
execution of the others.
Handling duplicates
The module will automatically remove duplicate rows and columns from the dataset.
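The technique described above can be sketched with pandas; this is an illustrative example of dropping duplicate rows and columns, not the package's actual internals:

```python
import pandas as pd

# Toy dataset: row 2 duplicates row 1, and column "c" duplicates column "a".
df = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [4, 5, 5, 6],
    "c": [1, 2, 2, 3],
})

df = df.drop_duplicates()       # remove duplicate rows
df = df.T.drop_duplicates().T   # remove duplicate columns via a transpose

print(df.shape)  # (3, 2): one duplicate row and one duplicate column removed
```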
Non-numerical datatypes
The module will automatically convert non-numerical datatypes to numerical ones using a label encoder.
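A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, assuming that is the kind of encoder the module applies to each non-numerical column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy dataset with one non-numerical column.
df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris"], "temp": [15, 22, 17]})

# Encode every object-dtype column to integer labels.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df["city"].tolist())  # [0, 1, 0] — classes are sorted alphabetically
```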
Missing data
If the dataset is very large, the module removes the rows that have null values. Otherwise, it performs imputation using
statistical measures or machine learning models. If the user specifies an imputation method, that one is executed; otherwise, the
default imputation method (standard deviation) is used.
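The two imputer families the package exposes (`'simple'` and `'knn'`) can be sketched with scikit-learn; the package's exact defaults and internals may differ:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy dataset with missing values in both columns.
X = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 4.0],
                  "f2": [10.0, np.nan, 30.0, 40.0]})

# Statistical imputation: fill each NaN with the column mean.
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# Model-based imputation: fill each NaN from the nearest neighbours.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```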
Different ranges of data
The module will deal with this issue based on the type of classification/regression model the user wishes to fit. Generally, models
that use gradient descent (such as linear regression and neural networks) and distance-based algorithms (such as
KMeans and KNN) require standardisation; other algorithms such as SVM and naive Bayes require normalisation; and
algorithms such as decision trees, random forests, bagging, and boosting require no scaling at all.
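The two scaling choices described above can be illustrated with scikit-learn: standardisation rescales to zero mean and unit variance, while normalisation (min-max scaling) maps values into [0, 1]. This is a sketch of the techniques themselves, not the package's code:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardisation: (x - mean) / std, for gradient-descent and distance-based models.
X_std = StandardScaler().fit_transform(X)

# Normalisation: (x - min) / (max - min), mapping into [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

print(X_std.mean(), X_norm.min(), X_norm.max())
```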
SYNTAX

```python
from AutoDataCleaner import AutoDataClean

data_cleaning = AutoDataClean.DataCleaner(data, algo=None, method=None, target_name='Outcome', imputer='knn', k_neighbors=5)
```

To get the cleaned data:

```python
data = data_cleaning.data
```

To get the description of the operations performed:

```python
print(data_cleaning.descriptions)
```
Options for the argument algo
'SVM', 'ANN', 'KNN', 'Naive Bayes', 'KMeans', 'LDA' ,'QDA', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'xgboost', 'Gradient
Boosting', 'adaboost'
Options for the argument method
'bfill', 'ffill', 'mean', 'median', 'mode', 'variance', 'std'
Options for the argument imputer
'simple', 'knn'
Project details
Release history
Download files
Source Distribution
Built Distribution
File details
Details for the file AutoCleanPy-0.0.1.tar.gz
File metadata
- Download URL: AutoCleanPy-0.0.1.tar.gz
- Upload date:
- Size: 4.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3b41d57a89468baab00776c0b092dcaf3a0af9808f84ef0fd0405f95c0d6f6f7
MD5 | 57d845b74273dc228ab606673b60b96a
BLAKE2b-256 | 02dfe2273b7d3fd1aa21e3efce5df501718ae12114437b0a529ad4ddb3090aae
File details
Details for the file AutoCleanPy-0.0.1-py3-none-any.whl
File metadata
- Download URL: AutoCleanPy-0.0.1-py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | c93977470efbb1ac01061378baa7045327f3eb1e37696d045709300b2afa10b8
MD5 | 28796743c891ff65da9ede73d5698fc8
BLAKE2b-256 | 38bf0642da1e294c1f4a0748053f7bd44c189357d663f5518f2b266d4e756329