clean_assist is a simple library designed to help data scientists observe a descriptive summary of their DataFrame
Clean Assist
Clean Assist is a simple library designed to help data scientists observe a summary of any DataFrame they would like to clean.
The library also displays histograms that show how closely each variable approximates a normal distribution.
Clean Assist is composed of 2 functions:
- clean_assist.table(df, n_rows, n_round)
  Displays relevant features to help you with data cleaning and analysis.
  Parameters:
  df : DataFrame you would like to analyze
  n_rows : Number of variables to display
  n_round : Number of decimal places used to round calculations
- clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)
  Displays histograms to compare your variables to a normal distribution.
  Parameters:
  df : DataFrame you would like to analyze
  list_var : Names of the columns to analyze, as a list
  print_img : 'y' to print the image or 'n' to not print it
  size_x : Width of the image output
  size_y : Height of the image output
  font_size : Font size of the titles and headers
To import the library, copy and paste the following code into your Python code:
- Note: Delete the plus (+) signs after pasting the code.
+ import requests
+ url = 'https://raw.githubusercontent.com/juanduranc/Clean-Assist/master/library'
+ exec(requests.get(url).text)
+ help(clean_assist)
Example of library usage and interpretation:
1. The following table is a sample output from the function: clean_assist.table(df, n_rows, n_round)

VARIABLES | NULLS | COUNT | TYPES | MEAN | MEDIAN | UNIQUES | SAMPLE | Outliers | pval(Norm) |
---|---|---|---|---|---|---|---|---|---|
AVG_CLICKS_PER_VISIT | 0 | 1946 | int64 | 13.5 | 13.0 | 15 | [11, 13, 12, 13, 13, 17, 10, 13, 12, 12] | [6,0] | 0.03 |
MEDIAN_MEAL_RATING | 47 | 1899 | int64 | 2.8 | 3.0 | 5 | [3, 3, 3, 3, 3, 2, 4, 3, 3, 3] | [0,13] | 3e-06 |
REVENUE | 0 | 1946 | float64 | 2107.3 | 1740.0 | 859 | [1880, 1495, 2572.5, 1647, 1923, 1250] | [0,82] | 1e-21 |
TOTAL_PHOTOS_VIEWED | 0 | 1946 | int64 | 106.4 | 0.0 | 371 | [0, 90, 0, 0, 253, 0, 705, 0, 0, 0] | [0,120] | 5e-90 |
CROSS_SELL_SUCCESS | 0 | 1946 | int64 | 0.7 | 1.0 | 2 | [1, 1, 1, 0, 1, 1, 0, 1, 1, 1] | | 1e-159 |
Examples of findings:
- AVG_CLICKS_PER_VISIT has a similar mean and median, approximates a normal distribution, and has 6 lower outliers.
- MEDIAN_MEAL_RATING has 47 nulls, which need imputation.
- REVENUE is the only float variable; the rest are integers.
- TOTAL_PHOTOS_VIEWED has a median of 0 and 120 upper outliers. This means most people do not view photos.
- CROSS_SELL_SUCCESS has 2 unique values. The SAMPLE column shows only ones and zeros, so this is a binary (boolean) column.
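To make the columns of the table easier to interpret, here is a minimal sketch of how a comparable summary could be computed with pandas. It is not the library's implementation: the function name `summarize`, the toy data, and the outlier rule (values more than 1.5 times the interquartile range beyond the quartiles) are assumptions made for illustration.

```python
import pandas as pd


def summarize(df: pd.DataFrame, n_round: int = 1) -> pd.DataFrame:
    """Build a per-column summary loosely resembling the table above."""
    rows = []
    for name in df.columns:
        col = df[name]
        values = col.dropna()
        # Assumption: outliers lie more than 1.5 * IQR beyond the quartiles.
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower = int((values < q1 - 1.5 * iqr).sum())
        upper = int((values > q3 + 1.5 * iqr).sum())
        rows.append({
            "VARIABLES": name,
            "NULLS": int(col.isna().sum()),
            "COUNT": int(col.count()),
            "TYPES": str(col.dtype),
            "MEAN": round(float(values.mean()), n_round),
            "MEDIAN": round(float(values.median()), n_round),
            "UNIQUES": int(values.nunique()),
            "OUTLIERS": [lower, upper],  # [lower count, upper count]
        })
    return pd.DataFrame(rows).set_index("VARIABLES")


# Hypothetical toy data, not the dataset shown above.
toy = pd.DataFrame({
    "CLICKS": [11, 13, 12, 13, 13, 17, 10, 13, 12, 40],
    "RATING": [3, 3, None, 3, 4, 2, 4, 3, None, 3],
})
summary = summarize(toy)
print(summary)
```

In this toy example the RATING column contains two missing values, so its NULLS count is 2 and its COUNT is 8, which mirrors how MEDIAN_MEAL_RATING is read in the findings above.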
2. Next, a sample output from the function: clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)
Histograms' interpretation:
- MEDIAN_MEAL_RATING has integer values and mimics a normal distribution.
- AVG_CLICKS_PER_VISIT is the closest variable to a normal distribution, with a p-value of 0.03.
- REVENUE is right skewed with 82 upper outliers.
- TOTAL_PHOTOS_VIEWED has too many zero values. It is also right skewed and far from being a normal distribution.
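The same kind of reading (skew, share of zeros, distance from normality) can also be checked numerically. The sketch below uses NumPy and SciPy on synthetic samples. The choice of `scipy.stats.normaltest` (the D'Agostino-Pearson test) is an assumption, since the test behind the pval(Norm) column is not stated here, and the sample names and parameters are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples, not the dataset discussed above.
samples = {
    "roughly_normal": rng.normal(loc=13, scale=2, size=1000),
    "right_skewed": rng.exponential(scale=2000, size=1000),
    # Zero-inflated: about 60% zeros, the remaining values right skewed.
    "zero_inflated": np.where(
        rng.random(1000) < 0.6,
        0.0,
        rng.exponential(scale=250, size=1000),
    ),
}

results = {}
for name, values in samples.items():
    # Assumption: D'Agostino-Pearson test; the library may use a different one.
    _, p_value = stats.normaltest(values)
    results[name] = {
        "skew": float(stats.skew(values)),          # > 0 means right skewed
        "zero_share": float(np.mean(values == 0)),  # fraction of exact zeros
        "p_value": float(p_value),                  # small means far from normal
    }
    print(
        f"{name}: skew={results[name]['skew']:.2f}, "
        f"zeros={results[name]['zero_share']:.0%}, "
        f"p={results[name]['p_value']:.2g}"
    )
```

A positive skew together with a very small p-value corresponds to the REVENUE pattern above, while a large share of zeros corresponds to the TOTAL_PHOTOS_VIEWED pattern.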