A data cleaning package and visualisation tool for data science projects

These details have not been verified by PyPI

Project description

Datascrubber Module Documentation

Overview

This documentation explains the functions available in the Datascrubber package, which is designed to assist with data cleaning and analytics tasks.

Creating an instance

First, create an instance of the Datacleaning class, which provides access to all the functions described below.

from Datascrubber import Datacleaning
data_cleaner = Datacleaning()

OR

from Datascrubber.datacleaning import Datacleaning
data_cleaner = Datacleaning()

The rest of this documentation uses the data_cleaner instance created above.

Table of Contents

read_data
columns
head
summary
missing_values
drop_missing_values
col_missing_value
remove_empty_columns
data_types
cat_cols
cont_cols
distributions
col_dist
cat_dist
col_cat_dist
remove_missingvalues
drop
outliers
outliers_single
remove_outliers
remove_outliers_single
corr_matrix
cont_corr
cont_to_cont
cat_to_cat
countplot
contingency_table
Chi_square
combined_boplot
singleAnova
cont_to_cat
lineplot
getdata
savedata
data_cleaning

read_data

Function Name: read_data

This function is used to read data from a file. It supports reading data from a CSV file, Excel file, and a JSON file. The function automatically detects the file type and reads the data accordingly.
Then it gives an explanation of the data that has been read. This includes the number of rows and columns in the data, the number of numeric and categorical columns, and the number of missing values in each column.

Parameters

file_path: The path to the file containing the data. This must be a string input.

Return Value

Returns a dataframe containing the data from the file.

Usage Example

data_cleaner.read_data("file_path") # replace file_path with the path to the file.
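
The automatic file type detection described above could work along the following lines. This is only a sketch: the helper name detect_file_type and the extension table LOADERS are hypothetical, and the package may detect file types differently.

```python
from pathlib import Path

# Hypothetical mapping from file extension to file type; the real
# package may recognise a different set of extensions.
LOADERS = {
    ".csv": "csv",
    ".xls": "excel",
    ".xlsx": "excel",
    ".json": "json",
}

def detect_file_type(file_path: str) -> str:
    """Return 'csv', 'excel' or 'json' based on the file extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return LOADERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None

print(detect_file_type("sales/2023.CSV"))
print(detect_file_type("report.xlsx"))
```

Lower-casing the extension means that names such as DATA.CSV are still recognised.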

columns

Function Name: columns

This function returns the columns of the loaded dataset.

Parameters

None

Return Value

Returns a list of column names in the dataset.

Usage Example

data_cleaner.columns()

OR

columns_list = data_cleaner.columns()
print("Columns:", columns_list)

head

Function Name: head

This function is used to display the first few rows of the data. It is useful to get a quick overview of the data.

Parameters

number: The number of rows to display.

Return Value

Returns a dataframe containing the first few rows of the data.

Usage Example

data_cleaner.head(number=5) # replace number with the number of rows to display.

The number keyword can also be omitted and the value passed positionally as an integer or float, for example data_cleaner.head(5).

summary

Function Name: summary

This function is used to generate summary statistics of the data. It provides valuable information about the distribution, central tendency, and spread of the data. It calculates statistics for each numeric column in the data.

The statistics provided by the summary function include:

Count: The number of non-null values in the column.
Mean: The arithmetic mean (average) of the values.
Standard Deviation: A measure of the spread or dispersion of the values.
Minimum: The minimum value in the column.
25th Percentile (Q1): The value below which 25% of the data falls.
50th Percentile (Median or Q2): The middle value of the data.
75th Percentile (Q3): The value below which 75% of the data falls.
Maximum: The maximum value in the column.

For categorical columns, the summary function generates statistics such as:

Count: The number of non-null values in the column.
Unique: The number of unique categories or levels in the column.
Top: The most frequent category in the column.
Freq: The frequency of the top category.

Parameters

None

Return Value

Returns a dataframe containing the summary statistics of the data.

Usage Example

data_cleaner.summary()
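
To make the statistics listed above concrete, here is a pure-Python sketch that computes them for a single column. The function names numeric_summary and categorical_summary are made up for this illustration, None stands in for a missing value, and the package itself may compute the figures differently.

```python
import statistics
from collections import Counter

def numeric_summary(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    data = sorted(v for v in values if v is not None)
    # The "inclusive" method interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": data[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": data[-1],
    }

def categorical_summary(values):
    """Count, unique, top and freq for one categorical column."""
    data = [v for v in values if v is not None]
    top, freq = Counter(data).most_common(1)[0]
    return {"count": len(data), "unique": len(set(data)), "top": top, "freq": freq}

print(numeric_summary([1, 2, 3, 4, 5, None]))
print(categorical_summary(["a", "b", "a", None, "c", "a"]))
```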

missing_values

Function Name: missing_values

This function is used to check for missing values in the data.

Parameters

None

Return Value

Returns a dataframe containing the number of missing values in each column.

Usage Example

data_cleaner.missing_values()
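
The idea of counting missing values per column can be sketched in plain Python as follows. The table here is a dictionary of lists rather than a dataframe, and the names is_missing and count_missing are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def count_missing(table):
    """Map each column name to its number of missing values."""
    return {col: sum(is_missing(v) for v in vals) for col, vals in table.items()}

table = {
    "age": [25, None, 31, float("nan")],
    "group": ["a", None, "b", "c"],
}
print(count_missing(table))
```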

col_missing_value

Function Name: col_missing_value

This function is used to check for missing values in a specific column.

Parameters

col_name: The name of the column to check for missing values. The column name must be entered as a string.

Return Value

Returns the number of missing values in the specified column.

Usage Example

data_cleaner.col_missing_value("col_name") #replace col_name with the column name.

remove_empty_columns

Function Name: remove_empty_columns

This function removes columns that contain no values, since such columns do not provide any useful information.

Parameters

None

Return Value

None

Usage Example

data_cleaner.remove_empty_columns()

Note: This function is also called automatically by the remove_missingvalues function. In predictive settings, a removed column can therefore be added back after cleaning.

data_types

Function Name: data_types

This function checks the data types of the columns and then creates subsets of the data based on those types: one subset containing only the categorical columns and another containing only the numeric columns.

Parameters

None

Return Value

None

Note: This function does not usually need to be called directly, since other functions call it in the background.

cat_cols

Function Name: cat_cols

This function is used to get the categorical columns in the data.

Parameters

None

Return Value

Returns a dataframe of the categorical columns in the data.

Usage Example

data_cleaner.cat_cols()

cont_cols

Function Name: cont_cols

This function is used to get the numeric columns in the data.

Parameters

None

Return Value

Returns a dataframe of the numeric columns in the data.

Usage Example

data_cleaner.cont_cols()

distributions

Function Name: distributions

This function is used to plot the distribution of the numeric columns in the data. It plots a histogram for each numeric column in the data.

It is useful to get an idea of the distribution of the data. It can be used to identify outliers and skewness in the data.

Parameters

None

Return Value

None

Usage Example

data_cleaner.distributions()

col_dist

Function Name: col_dist

This function is used to plot the distribution of a specific numeric column in the data.

Parameters

col: The name of the column to plot the distribution for.

Return Value

None

Usage Example

data_cleaner.col_dist("col") #replace col with the column name.

cat_dist

Function Name: cat_dist

This function is used to plot the distribution of all categorical columns in the data.

Parameters

None

Return Value

None

Usage Example

data_cleaner.cat_dist()

col_cat_dist

Function Name: col_cat_dist

This function is used to plot the distribution of a specific categorical column in the data.

Parameters

col: The name of the column to plot the distribution for.

Return Value

None

Usage Example

data_cleaner.col_cat_dist("col") #replace col with the column name.

remove_missingvalues

Function Name: remove_missingvalues

This function deals with missing values (NA) by filling them in. It first removes all duplicate rows from the data and automatically removes all empty columns.

For categorical columns, the missing values are then replaced with the mode of the column (the most frequently occurring value).

For numeric columns, the missing values are replaced with either the mean or the median of the column, depending on the skewness of the data.

Parameters

None

Return Value

None

Usage Example

data_cleaner.remove_missingvalues()
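
The filling rule described above can be sketched in pure Python. The cut-off SKEW_THRESHOLD and all function names are assumptions made for this illustration, because the exact rule the package uses to choose between the mean and the median is not documented here.

```python
import statistics
from collections import Counter

SKEW_THRESHOLD = 1.0  # assumed cut-off; the package's actual rule is not documented

def skewness(data):
    """Fisher-Pearson coefficient of skewness (population moments)."""
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / len(data)
    m3 = sum((x - mean) ** 3 for x in data) / len(data)
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def fill_numeric(values):
    """Fill None with the median if the column is strongly skewed, else the mean."""
    present = [v for v in values if v is not None]
    if abs(skewness(present)) > SKEW_THRESHOLD:
        fill = statistics.median(present)
    else:
        fill = statistics.fmean(present)
    return [fill if v is None else v for v in values]

def fill_categorical(values):
    """Fill None with the most frequent category (the mode)."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(fill_numeric([1, 2, 3, None]))          # symmetric column: mean is used
print(fill_numeric([1, 1, 2, 2, 100, None]))  # skewed column: median is used
print(fill_categorical(["x", None, "y", "x"]))
```

The median is preferred for skewed columns because a few extreme values pull the mean away from the typical value.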

drop_missing_values

Function Name: drop_missing_values

This function removes rows that have missing values (NA). It first removes all duplicate rows from the data and automatically removes all empty columns.

Rows that contain missing values are then dropped from the data entirely.

Parameters

None

Return Value

None

Usage Example

data_cleaner.drop_missing_values()
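
As a rough illustration of the row-level behaviour described above, the following pure-Python sketch drops duplicate rows and rows with missing values from a list of tuples. The name drop_incomplete_rows is hypothetical, and the removal of empty columns is left out for brevity.

```python
def drop_incomplete_rows(rows):
    """Keep the first copy of each row and skip rows containing None."""
    seen = set()
    cleaned = []
    for row in rows:
        if row in seen or any(v is None for v in row):
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

rows = [(1, "a"), (1, "a"), (2, None), (3, "b")]
print(drop_incomplete_rows(rows))
```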

drop

Function Name: drop

This function is used to drop columns from the data.

Parameters

column: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the column to drop. If it is a list of strings, it is a list of columns to drop.

Return Value

None

Usage Example

data_cleaner.drop("column") #replace column with the column name.

OR

data_cleaner.drop(["column_1","column_2"]) #replace column_1 and column_2 with the column names.

outliers

Function Name: outliers

This function is used to plot the outliers in the data. It plots a boxplot for each numeric column in the data.

It gives a quick view of where the outliers lie in each numeric column.

Parameters

None

Return Value

None

Usage Example

data_cleaner.outliers()

outliers_single

Function Name: outliers_single

This function is used to plot the outliers in a specific numeric column in the data.

Parameters

column: The name of the numeric column to plot the outliers for.

Return Value

None

Usage Example

data_cleaner.outliers_single("column") #replace column with the column name.

remove_outliers

Function Name: remove_outliers

This function is used to remove outliers from the data. It removes outliers from all the numeric columns in the data.

The concept of outliers is based on the interquartile range (IQR). The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR is used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is the value 1.5. This is the default value used by the function.

Parameters

None

Return Value

None

Usage Example

data_cleaner.remove_outliers()

Note: Depending on the data, the outliers may not be removed completely. Alternative treatments include imputation with the nearest logical values, transformation, and segmentation, among others.
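
The IQR rule described above can be sketched in pure Python as follows. The function names are hypothetical, and the quartile method chosen here (linear interpolation) is an assumption that may differ from the one the package uses.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that fall within the IQR limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

values = [10, 12, 11, 13, 12, 11, 95]
print(iqr_bounds(values))
print(drop_outliers(values))
```

Because the quartiles are recomputed on the filtered data, a second pass may flag new values as outliers, which is one reason outliers may not be removed completely in a single call.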

remove_outliers_single

Function Name: remove_outliers_single

This function is used to remove outliers from the data. It removes outliers from a specific numeric column in the data.

Parameters

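column: The name of the numeric column to remove outliers from. It can be either a string (a single column) or a list of strings (several columns).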

Return Value

None

Usage Example

data_cleaner.remove_outliers_single("column") #replace column with the column name.

data_cleaner.remove_outliers_single(["column_1","column_2"]) #replace column_1 and column_2 with the column names.

corr_matrix

Function Name: corr_matrix

This function is used to plot the correlation matrix of the data. It plots a heatmap of the correlation matrix of the data.

It is useful to get an idea of the correlation between the numeric columns in the data. It can be used to identify highly correlated columns in the data.

Parameters

None

Return Value

None

Usage Example

data_cleaner.corr_matrix()

cont_corr

Function Name: cont_corr

This function is used to plot a pairplot of the numeric columns in the data.

Parameters

None

Return Value

None

Usage Example

data_cleaner.cont_corr()

cont_to_cont

Function Name: cont_to_cont

This function shows the relationship between two numeric columns in the data by plotting a scatter plot of them. It also reports the correlation value between the two columns.

Parameters

col1: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the first column to plot. If it is a list of strings, it is a list of columns to plot.
col2: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the second column to plot. If it is a list of strings, it is a list of columns to plot.

Return Value

None

Usage Example

data_cleaner.cont_to_cont("col1","col2") #replace col1 and col2 with the column names.

OR

data_cleaner.cont_to_cont("col1",["col2","col3"]) #replace col1, col2 and col3 with the column names.

OR

data_cleaner.cont_to_cont(["col1","col2"],"col3") #replace col1, col2 and col3 with the column names.

OR

data_cleaner.cont_to_cont(["col1","col2"],["col3","col4"]) #replace col1, col2, col3 and col4 with the column names.
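
The correlation value mentioned above is commonly the Pearson correlation coefficient. The sketch below computes it in pure Python. That the package reports Pearson's r is an assumption, and pearson_r is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.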

cat_to_cat

Function Name: cat_to_cat

This function shows the relationship between two categorical columns in the data. It displays a contingency table of the two columns and also plots a comparative bar graph of them.

Parameters

col1: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the first column to plot. If it is a list of strings, it is a list of columns to plot.
col2: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the second column to plot. If it is a list of strings, it is a list of columns to plot.

Return Value

None

Usage Example

data_cleaner.cat_to_cat("col1","col2") #replace col1 and col2 with the column names.

OR

data_cleaner.cat_to_cat("col1",["col2","col3"]) #replace col1, col2 and col3 with the column names.

OR

data_cleaner.cat_to_cat(["col1","col2"],"col3") #replace col1, col2 and col3 with the column names.

OR

data_cleaner.cat_to_cat(["col1","col2"],["col3","col4"]) #replace col1, col2, col3 and col4 with the column names.

countplot

Function Name: countplot

This function plots a countplot of two categorical columns in the data. This is a way of showing the distribution of the two categorical columns.

Parameters

col1: This is a string. It is the name of the first column to plot.
col2: This is a string. It is the name of the second column to plot.

Return Value

None

Usage Example

data_cleaner.countplot("col1","col2") #replace col1 and col2 with the column names.

contingency_table

Function Name: contingency_table

The function displays a contingency table of two categorical columns in the data.

Parameters

col1: This is a string. It is the name of the first column to plot.
col2: This is a string. It is the name of the second column to plot.

Return Value

None

Usage Example

data_cleaner.contingency_table("col1","col2") #replace col1 and col2 with the column names.
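
A contingency table counts how often each pair of categories occurs together. The pure-Python sketch below builds one as a nested dictionary. The name cross_tabulate is hypothetical, and the package presumably returns or displays a dataframe instead.

```python
from collections import Counter

def cross_tabulate(col1, col2):
    """Return {row_category: {column_category: count}} for two columns."""
    counts = Counter(zip(col1, col2))
    row_labels = sorted(set(col1))
    col_labels = sorted(set(col2))
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

gender = ["F", "M", "F", "M", "F"]
smoker = ["no", "no", "yes", "no", "no"]
print(cross_tabulate(gender, smoker))
```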

Chi_square

Function Name: Chi_square

The function tests for a statistically significant relationship between two categorical (nominal or ordinal) variables. In other words, it tells us whether the two variables are independent of one another.

Parameters

col1: This is a string. It is the name of the first categorical column.
col2: This is a string. It is the name of the second categorical column.

Return Value

The chi-square statistic
The p-value
The degrees of freedom
A string indicating whether the two columns are independent or not.

Usage Example

data_cleaner.Chi_square("col1","col2") #replace col1 and col2 with the column names.
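
To show where the chi-square value and the degrees of freedom come from, here is a pure-Python sketch that computes both from a table of observed counts. The name chi_square_statistic is hypothetical. The p-value is left out because it needs the chi-square distribution, which a statistics library such as SciPy provides.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    `table` is a list of rows, each row a list of counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

print(chi_square_statistic([[10, 20], [20, 10]]))
```

A large statistic relative to the degrees of freedom suggests that the two columns are not independent.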

combined_boplot

Function Name: combined_boplot

The function is used to plot a set of side by side box plots, one for each of the categories.

Parameters

col1: This is a string. It is the name of the categorical column.
col2: This is a string. It is the name of the continuous column.

Return Value

None

Usage Example

data_cleaner.combined_boxplot("col1", "col2") #replace col1 and col2 with the column names.

singleAnova

Function Name: singleAnova

The function is used to test for a statistically significant difference between the means of two or more groups.

Parameters

col1: This is a string. It is the name of the continuous column.
col2: This is a string. It is the name of the categorical column.

Return Value

A string indicating whether the two columns are independent or not.

Usage Example

data_cleaner.singleAnova("col1", "col2") #replace col1 and col2 with the column names.
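
A one-way ANOVA compares the variation between group means with the variation within the groups. The pure-Python sketch below computes the resulting F statistic. The helper names group_by and one_way_anova_f are hypothetical, and the package's own implementation may differ.

```python
from collections import defaultdict

def group_by(categories, values):
    """Collect the continuous values belonging to each category."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return list(groups.values())

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

categories = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
values = [1, 2, 3, 2, 3, 4, 5, 6, 7]
print(one_way_anova_f(group_by(categories, values)))
```

The larger the F statistic, the stronger the evidence that at least one group mean differs from the others.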

cont_to_cat

Function Name: cont_to_cat

The function is used to show significant relationship or difference between a continuous and a categorical column in the data.
The function hence displays a side by side boxplot of the continuous column and a categorical column in the data.

Parameters

col1: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the continuous column. If it is a list of strings, it is a list of columns to plot. The arguments can also be swapped, in which case col1 names the categorical column or columns.
col2: This is a two way parameter. It can either be a string or a list of strings. If it is a string, it is the name of the categorical column. If it is a list of strings, it is a list of columns to plot. The arguments can also be swapped, in which case col2 names the continuous column or columns.

Return Value

A string indicating whether the two columns are independent or not.

Usage Example

data_cleaner.cont_to_cat("col1","col2") #replace col1 and col2 with the column names.

OR

data_cleaner.cont_to_cat("col1",["col2","col3"]) #replace col1, col2 and col3 with the column names.

OR

data_cleaner.cont_to_cat(["col1","col2"],"col3") #replace col1, col2 and col3 with the column names.

OR

data_cleaner.cont_to_cat(["col1","col2"],["col3","col4"]) #replace col1, col2, col3 and col4 with the column names.

lineplot

Function Name: lineplot

The function plots a lineplot of a continuous column against a categorical column in the data. This is a way of showing how the two columns relate.

Parameters

col1: This is a string. It is the name of the first column to plot.
col2: This is a string. It is the name of the second column to plot.

Return Value

None

Usage Example

data_cleaner.lineplot("col1","col2") #replace col1 and col2 with the column names.

data_cleaner.lineplot("col1",["col2","col3"]) #replace col1, col2 and col3 with the column names.

data_cleaner.lineplot(["col1","col2"],"col3") #replace col1, col2 and col3 with the column names.

data_cleaner.lineplot(["col1","col2"],["col3","col4"]) #replace col1, col2, col3 and col4 with the column names.

getdata

Function Name: getdata

The function returns the data that has been cleaned and preprocessed.

Parameters

None

Return Value

Returns a dataframe containing the cleaned data.

Usage Example

data = data_cleaner.getdata()
data.head()

Note : This method can be used to access the data at any step after achieving any required process.

savedata

Function Name: savedata

The function is used to download the preprocessed file with the same extension as the entered file.

Parameters

None

Return Value

None

Usage Example

data_cleaner.savedata()

data_cleaning

Function Name: data_cleaning

The function is used to clean the data. It performs the following operations:

Removes empty columns.
Removes duplicate rows.
Deals with missing values appropriately.
Removes outliers.

Parameters

None

Return Value

A dataframe containing the cleaned data.

Usage Example

data = data_cleaner.data_cleaning()
data.head()

Release history

Version	Release date
0.1.5 (this version)	Nov 10, 2023
0.1.4	Oct 29, 2023
0.1.3	Oct 29, 2023
0.1.2	Oct 29, 2023
0.1.1	Oct 29, 2023
0.1.0	Oct 20, 2023
0.0.9	Oct 20, 2023
0.0.8	Sep 15, 2023
0.0.7	Sep 9, 2023
0.0.6	Sep 7, 2023
0.0.5	Aug 20, 2023
0.0.4	Aug 20, 2023
0.0.3	Aug 20, 2023
0.0.1	Aug 20, 2023

Download files

Source Distribution: datascrubber-0.1.5.tar.gz (2.8 MB), uploaded Nov 10, 2023

Built Distribution: datascrubber-0.1.5-py3-none-any.whl (13.4 kB), uploaded Nov 10, 2023, Python 3

Hashes for datascrubber-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`e7be39d09314034ae2449fab50ff018c9cee9e5e63d4e8fc57d694e63d3f064d`
MD5	`9b3003e8efbd8df577bf62c47720ca0f`
BLAKE2b-256	`f69f75b90196e146a9ac36e8a552b82a85861c2370792600f125ef17160e8317`

Hashes for datascrubber-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e7d3474b66024241d4bf1754bcb759203502e9f154e89f3a669bd32acc286b47`
MD5	`5ab2c0eee32ed0864a5275caa474b2ce`
BLAKE2b-256	`0b919e5acdd4fde7f7b73d0adeac97e2814acec06d004de4cb475681dd0ccaff`

Datascrubber 0.1.5
