
clean_assist is a simple library designed to help data scientists observe a descriptive summary of their DataFrame

Project description

Clean Assist

Clean Assist is a simple library designed to help data scientists observe a summary of any DataFrame they would like to clean. The library also displays charts that compare each variable against a normal distribution.

Clean Assist is composed of 2 functions:

  1. clean_assist.table(df, n_rows, n_round)

    Displays relevant features to help you with data cleaning and analysis.

    Parameters
    df        : DataFrame you would like to analyze
    n_rows    : Number of variables to display
    n_round   : Number of decimal places used to round calculations

  2. clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)

    Displays histograms to compare your variables to a normal distribution.

    Parameters
    df          : DataFrame you would like to analyze
    list_var    : Names of the columns to analyze, in a list format
    print_img   : Input 'y' to print the image or 'n' to not print it
    size_x      : Width of the image output
    size_y      : Height of the image output
    font_size   : Font size of the titles and headers
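To make the idea of comparing a variable to a normal distribution concrete, here is a minimal sketch using NumPy, with matplotlib only for the optional plot. It is not the code behind clean_assist.normality; the helper names `normal_overlay` and `plot_overlay` are hypothetical, and only the `size_x`, `size_y`, and `font_size` parameter names are borrowed from the description above.

```python
import numpy as np


def normal_overlay(values, bins=10):
    """Return bin centers, histogram density, and a fitted normal density.

    Hypothetical helper: the normal curve uses the sample mean and
    sample standard deviation of the non-missing values.
    """
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    density, edges = np.histogram(x, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    mu, sigma = x.mean(), x.std(ddof=1)
    normal_pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (
        sigma * np.sqrt(2 * np.pi)
    )
    return centers, density, normal_pdf


def plot_overlay(values, title, size_x=6, size_y=4, font_size=12, bins=10):
    """Draw the histogram with the normal curve on top (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    centers, density, normal_pdf = normal_overlay(values, bins=bins)
    fig, ax = plt.subplots(figsize=(size_x, size_y))
    ax.bar(centers, density, width=centers[1] - centers[0], alpha=0.5)
    ax.plot(centers, normal_pdf)
    ax.set_title(title, fontsize=font_size)
    return fig
```

The closer the bars follow the curve, the closer the variable is to a normal distribution; large gaps in one tail suggest skew.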

To import the library, copy and paste the following code into your Python code:

    import requests
    url = 'https://raw.githubusercontent.com/juanduranc/Clean-Assist/master/library'
    exec(requests.get(url).text)
    help(clean_assist)

Example of library usage and interpretation:

1. The following table is a sample of the output from the function: clean_assist.table(df, n_rows, n_round)

| VARIABLES | NULLS | COUNT | TYPES | MEAN | MEDIAN | UNIQUES | SAMPLE | Outliers | pval(Norm) |
|---|---|---|---|---|---|---|---|---|---|
| AVG_CLICKS_PER_VISIT | 0 | 1946 | int64 | 13.5 | 13.0 | 15 | [11, 13, 12, 13, 13, 17, 10, 13, 12, 12] | [6,0] | 0.03 |
| MEDIAN_MEAL_RATING | 47 | 1899 | int64 | 2.8 | 3.0 | 5 | [3, 3, 3, 3, 3, 2, 4, 3, 3, 3] | [0,13] | 3e-06 |
| REVENUE | 0 | 1946 | float64 | 2107.3 | 1740.0 | 859 | [1880, 1495, 2572.5, 1647, 1923, 1250] | [0,82] | 1e-21 |
| TOTAL_PHOTOS_VIEWED | 0 | 1946 | int64 | 106.4 | 0.0 | 371 | [0, 90, 0, 0, 253, 0, 705, 0, 0, 0] | [0,120] | 5e-90 |
| CROSS_SELL_SUCCESS | 0 | 1946 | int64 | 0.7 | 1.0 | 2 | [1, 1, 1, 0, 1, 1, 0, 1, 1, 1] | | 1e-159 |

Examples of findings:
  • AVG_CLICKS_PER_VISIT has a similar mean and median, approximates a normal distribution, and has 6 lower outliers.
  • MEDIAN_MEAL_RATING has 47 nulls, which need imputation.
  • REVENUE is the only float variable; the rest are integers.
  • TOTAL_PHOTOS_VIEWED has a median of 0 and 120 upper outliers. This means at least half of the people did not view any photos.
  • CROSS_SELL_SUCCESS has 2 unique values. In the SAMPLE column you can see only ones and zeros, so this is a binary or boolean column.
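As a rough illustration of where such columns can come from, the sketch below builds a similar per-variable summary with pandas. It is not the implementation of clean_assist.table: the function name `summarize` is hypothetical, and the 1.5 × IQR rule for counting lower and upper outliers is an assumption, since the rule the library uses is not stated here.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame, n_round: int = 1) -> pd.DataFrame:
    """Build a per-column summary similar in spirit to the table above."""
    rows = []
    for name in df.columns:
        col = df[name]
        row = {
            "VARIABLES": name,
            "NULLS": int(col.isna().sum()),
            "COUNT": int(col.count()),      # non-null values only
            "TYPES": str(col.dtype),
            "UNIQUES": int(col.nunique()),  # distinct non-null values
        }
        if pd.api.types.is_numeric_dtype(col):
            # Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
            q1, q3 = col.quantile([0.25, 0.75])
            iqr = q3 - q1
            lower = int((col < q1 - 1.5 * iqr).sum())
            upper = int((col > q3 + 1.5 * iqr).sum())
            row["MEAN"] = round(float(col.mean()), n_round)
            row["MEDIAN"] = round(float(col.median()), n_round)
            row["OUTLIERS"] = [lower, upper]
        rows.append(row)
    return pd.DataFrame(rows).set_index("VARIABLES")


# Small made-up DataFrame, only to show the call.
demo = pd.DataFrame({
    "clicks": [10, 11, 12, 13, 14, 100],
    "rating": [3, 3, np.nan, 4, 2, 3],
})
summary = summarize(demo, n_round=1)
print(summary)
```

Reading the result works the same way as in the findings above: a gap between MEAN and MEDIAN hints at skew, NULLS flags columns that need imputation, and a very small UNIQUES count suggests a categorical or binary column.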

2. Next, a sample output from the function: clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)


Histograms' interpretation:
  • MEDIAN_MEAL_RATING has integer values and mimics a normal distribution.
  • AVG_CLICKS_PER_VISIT is the closest variable to a normal distribution, with a p-value of 0.03.
  • REVENUE is right skewed with 82 upper outliers.
  • TOTAL_PHOTOS_VIEWED has too many zero values. It is also right skewed and far from being a normal distribution.
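The visual readings above can also be backed by a few numbers. The sketch below is one possible way to do that, assuming SciPy is available: it reports the sample skewness, the share of zero values, and a normality p-value from `scipy.stats.normaltest`. The helper name `shape_report` is hypothetical, and the choice of normality test is an assumption; the test behind the pval(Norm) column is not stated here.

```python
import pandas as pd
from scipy import stats


def shape_report(series: pd.Series) -> dict:
    """Summarize how far a numeric column is from a normal shape."""
    clean = series.dropna()
    # D'Agostino-Pearson test; a small p-value speaks against normality.
    _, p_value = stats.normaltest(clean)
    return {
        "skew": float(clean.skew()),               # > 0 means right skewed
        "zero_share": float((clean == 0).mean()),  # fraction of exact zeros
        "pval_norm": float(p_value),
    }


# Made-up column with many zeros and a long right tail.
photos = pd.Series([0] * 300 + [i ** 2 for i in range(1, 201)])
report = shape_report(photos)
print(report)
```

A positive skew together with a large share of zeros matches the pattern described for TOTAL_PHOTOS_VIEWED, while a p-value far below common thresholds indicates a clear departure from normality.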

Source Distribution

cleanassist-1.3.4.tar.gz (5.2 kB)
