A utility that cleans your data and returns the cleaned result

Project description

DATA CLEANING

## Description

In any machine learning workflow, data preprocessing is the first step: raw, unclean data is transformed into clean data so that machine learning algorithms can be applied at a later stage. This Python package makes data preprocessing easy, in just two lines of code. All you have to do is provide raw data as a CSV file; the library will clean it and return a cleaned DataFrame, on which you can then apply feature engineering, feature selection, and modeling.

  • What does this do?
    • Cleans special characters
    • Removes duplicates
    • Fixes abnormalities in column names
    • Imputes missing data (categorical & numerical)

Data Cleaning

data-cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned DataFrame, handling imputation, duplicate removal, special-character replacement, and more.

How to use:

Step 1: Install the library

pip install data-cleaning

Step 2:

Import the library and specify the path of the CSV file.

from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv')
cleaned_df = dp.start_cleaning()
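
The returned object is the cleaned DataFrame mentioned above, so standard pandas operations apply to it. A minimal sketch for inspecting and saving the result (assuming pandas semantics for cleaned_df; the output filename is hypothetical):

# Inspect the first few rows of the cleaned data
print(cleaned_df.head())

# Save the cleaned data for the next pipeline stage
cleaned_df.to_csv('filename_cleaned.csv', index=False)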

There are some optional parameters that you can specify, as listed below.

Usage:

from datacleaning import DataCleaning

DataCleaning(file_uploaded='filename.csv', separator=",", row_threshold=None,
             col_threshold=None, special_character=None, action=None,
             ignore_columns=None, imputation_type="RDF")

Parameters


| Parameter | Default value | Limit | Example |
| --- | --- | --- | --- |
| file_uploaded | none | Provide a CSV file. | filename.csv |
| separator | , | Separator used in the CSV file | ; |
| row_threshold | none | 0 to 100 | 80 |
| col_threshold | none | 0 to 100 | 80 |
| special_character | default_list (see below) | Specify characters that are not in default_list | [ '$' , '?' ] |
| action | none | add or remove | add |
| ignore_columns | none | List of column names to exclude from the special-character operation | [ 'column1', 'column2' ] |
| imputation_type | RDF | RDF, KNN, mean, median, most_frequent, constant | KNN |
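
For example, the separator parameter handles files that are not comma-delimited. A minimal sketch for a semicolon-delimited file (the filename is hypothetical):

from datacleaning import DataCleaning

# 'data_semicolon.csv' is a hypothetical semicolon-delimited file
dp = DataCleaning(file_uploaded='data_semicolon.csv', separator=';')
cleaned_df = dp.start_cleaning()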

Examples of using parameters

- Appending extra special characters to the existing default_list

The DEFAULT SPECIAL CHARACTERS included in the package are shown below,

default_list = ["!", '"', "#", "%", "&", "'", "(", ")",
                  "*", "+", ",", "-", ".", "/", ":", ";", "<",
                  "=", ">", "?", "@", "[", "\\", "]", "^", "_",
                  "`", "{", "|", "}", "~", "–", "//", "%*", ":/", ".;", "Ø", "§",'$',"£"]

How to remove special characters, for example "?" and "%":

Note: do not forget to set action='remove'.

from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', special_character=['?', '%'], action='remove')
cleaned_df = dp.start_cleaning()

How to add a special character that is not in the default_list above, for example "é":

Note: do not forget to set action='add'.

from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', special_character=['é'], action='add')
cleaned_df = dp.start_cleaning()

- Ignoring particular columns while adding a special character

For example, to exclude the columns "timestamp" and "date" from the special-character operation while adding the character 'é' to the list:

from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', special_character=['é'],
                  action='add', ignore_columns=['timestamp', 'date'])
cleaned_df = dp.start_cleaning()

- Changing the thresholds to remove rows/columns whose percentage of null values exceeds the given value (for example, row_threshold=50 drops any row that is more than 50% null)

from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', row_threshold=50, col_threshold=90)
cleaned_df = dp.start_cleaning()

- Imputation methods available

  • RDF (RandomForest) -> (DEFAULT)
  • KNN (k-nearest neighbors)
  • mean
  • median
  • most_frequent
  • constant

# Example: KNN imputation.
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', imputation_type='KNN')
cleaned_df = dp.start_cleaning()
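
The simple statistical strategies follow the same pattern; a minimal sketch with median imputation:

# Example: median imputation.
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', imputation_type='median')
cleaned_df = dp.start_cleaning()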

>> THANK YOU <<

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data-cleaning-1.0.1.tar.gz (6.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

data_cleaning-1.0.1-py3-none-any.whl (6.3 kB)

Uploaded Python 3

File details

Details for the file data-cleaning-1.0.1.tar.gz.

File metadata

  • Download URL: data-cleaning-1.0.1.tar.gz
  • Upload date:
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.0

File hashes

Hashes for data-cleaning-1.0.1.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5a3aa5cf9aee687d1c4c689dc104b93d2a22bc879b942a3b7c3fa030158e04b7 |
| MD5 | a0bbe8e7163ecff06044dc9ed58d520b |
| BLAKE2b-256 | c953163dd000a569d0827b3734e857ed4e16ba08c0c9bf2027e96397a9851dfa |

See more details on using hashes here.

File details

Details for the file data_cleaning-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: data_cleaning-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 6.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.0

File hashes

Hashes for data_cleaning-1.0.1-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 4cbabc7660edb54b57fb098d5724e30b7890a1d1c7c9ffe24cc07813d4129afd |
| MD5 | d1815bf2977b6ba130cf102439351b99 |
| BLAKE2b-256 | f659f55f4294578d45a72b496b4e61593824fef924195537c28ab97187d8318a |

See more details on using hashes here.
