clean_assist is a simple library designed to help data scientists observe a descriptive summary of their DataFrame
Clean Assist
Clean Assist is a simple library designed to help data scientists observe a summary of any DataFrame they would like to clean.
This library also displays histograms that show how closely each of your variables approximates a normal distribution.
Clean Assist is composed of 2 functions:
- clean_assist.table(df, n_rows, n_round)
  Displays relevant features to help you with data cleaning and analysis.
  Parameters
  - df : DataFrame you would like to analyze
  - n_rows : Number of variables to display
  - n_round : Number of decimals to round calculations
- clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)
  Displays histograms to compare your variables to a normal distribution.
  Parameters
  - df : DataFrame you would like to analyze
  - list_var : Names of the columns to analyze, in a list format
  - print_img : input 'y' to print image or 'n' to not print
  - size_x : width of the image output
  - size_y : height of the image output
  - font_size : font size of the titles and headers
To import the library, paste the following code into your Python code:

    import requests
    url = 'https://raw.githubusercontent.com/juanduranc/Clean-Assist/master/library'
    exec(requests.get(url).text)
    help(clean_assist)
Example of library usage and interpretation:
1. The following table is a sample of the output from the function clean_assist.table(df, n_rows, n_round):

VARIABLES | NULLS | COUNT | TYPES | MEAN | MEDIAN | UNIQUES | SAMPLE | Outliers | pval(Norm) |
---|---|---|---|---|---|---|---|---|---|
AVG_CLICKS_PER_VISIT | 0 | 1946 | int64 | 13.5 | 13.0 | 15 | [11, 13, 12, 13, 13, 17, 10, 13, 12, 12] | [6,0] | 0.03 |
MEDIAN_MEAL_RATING | 47 | 1899 | int64 | 2.8 | 3.0 | 5 | [3, 3, 3, 3, 3, 2, 4, 3, 3, 3] | [0,13] | 3e-06 |
REVENUE | 0 | 1946 | float64 | 2107.3 | 1740.0 | 859 | [1880, 1495, 2572.5, 1647, 1923, 1250] | [0,82] | 1e-21 |
TOTAL_PHOTOS_VIEWED | 0 | 1946 | int64 | 106.4 | 0.0 | 371 | [0, 90, 0, 0, 253, 0, 705, 0, 0, 0] | [0,120] | 5e-90 |
CROSS_SELL_SUCCESS | 0 | 1946 | int64 | 0.7 | 1.0 | 2 | [1, 1, 1, 0, 1, 1, 0, 1, 1, 1] | | 1e-159 |
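To make the simpler columns of this table concrete, here is a minimal, dependency-free sketch of how values such as NULLS, COUNT, MEAN, MEDIAN and UNIQUES could be derived for a single column. It is not the library's implementation: the helper name `summarize_column` and the use of `None` to mark a null are assumptions made for illustration, and the input reuses the ten SAMPLE values shown for AVG_CLICKS_PER_VISIT.

```python
from statistics import mean, median


def summarize_column(values, n_round=1):
    """Hypothetical helper: summarize one column, treating None as a null."""
    present = [v for v in values if v is not None]
    return {
        "NULLS": len(values) - len(present),   # missing entries
        "COUNT": len(present),                 # non-null entries
        "MEAN": round(mean(present), n_round) if present else None,
        "MEDIAN": round(median(present), n_round) if present else None,
        "UNIQUES": len(set(present)),          # distinct non-null values
        "SAMPLE": present[:10],                # first few values for a quick look
    }


# The ten SAMPLE values of AVG_CLICKS_PER_VISIT, plus one artificial null.
clicks = [11, 13, 12, 13, 13, 17, 10, 13, 12, 12]
row = summarize_column(clicks + [None])
print(row)
```

Note that COUNT here excludes nulls, which matches the table above: MEDIAN_MEAL_RATING shows 47 nulls and a count of 1899, adding up to the 1946 rows of the other columns.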
Examples of findings:
- AVG_CLICKS_PER_VISIT has a similar mean and median, approximates a normal distribution, and has 6 lower outliers.
- MEDIAN_MEAL_RATING has 47 nulls, which need imputation.
- REVENUE is the only float variable; the rest are integers.
- TOTAL_PHOTOS_VIEWED has a median of 0 and 120 upper outliers. This means most people do not view photos.
- CROSS_SELL_SUCCESS has 2 unique values. From the column named sample you can see only ones and zeros. This is a binary or boolean column.
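The Outliers column appears to report a pair of counts, [lower, upper]. The source does not say which rule the library uses to flag outliers, so the sketch below assumes the common 1.5 × IQR fence rule purely for illustration; the function name `count_outliers` is hypothetical. The input is the ten-value SAMPLE shown for TOTAL_PHOTOS_VIEWED, not the full column, so the counts will not match the table.

```python
from statistics import quantiles


def count_outliers(values, k=1.5):
    """Hypothetical helper: count values below/above the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    lower = sum(1 for v in values if v < lower_fence)
    upper = sum(1 for v in values if v > upper_fence)
    return [lower, upper]                # same [lower, upper] layout as the table


photos = [0, 90, 0, 0, 253, 0, 705, 0, 0, 0]
print(count_outliers(photos))
```

A column dominated by zeros, like this one, has a low upper fence, which is one reason a variable can show many upper outliers and no lower ones.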
2. Next, a sample output from the function: clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)
Histograms' interpretation:
- MEDIAN_MEAL_RATING has integer values and mimics a normal distribution.
- AVG_CLICKS_PER_VISIT is the closest variable to a normal distribution, with a p-value of 0.03.
- REVENUE is right skewed with 82 upper outliers.
- TOTAL_PHOTOS_VIEWED has too many zero values. It is also right skewed and far from being a normal distribution.
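Right skew, as described for REVENUE and TOTAL_PHOTOS_VIEWED, can also be checked numerically. The following is a minimal sketch, not part of the library: it computes the population skewness (the mean of the cubed z-scores) with the standard library only, and the function name `skewness` is an assumption. A positive value, or a mean well above the median, suggests a right-skewed variable.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population skewness: the average cubed z-score of the values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0                       # a constant column has no skew
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# The ten SAMPLE values of TOTAL_PHOTOS_VIEWED from the table.
photos = [0, 90, 0, 0, 253, 0, 705, 0, 0, 0]
print(skewness(photos) > 0)              # positive skewness indicates right skew
print(mean(photos) > median(photos))     # mean pulled above the median by the long tail
```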