
clean_assist is a simple library designed to help data scientists observe a descriptive summary of their DataFrame


Clean Assist

Clean Assist is a simple library designed to help data scientists observe a summary of any DataFrame they would like to clean. It also displays histograms that compare each variable to a normal distribution.

Clean Assist is composed of 2 functions:

  1. clean_assist.table(df, n_rows, n_round)

    Displays relevant features to help with data cleaning and analysis.

    Parameters
    df        : DataFrame you would like to analyze
    n_rows    : Number of variables to display
    n_round   : Number of decimals to round calculations to

  2. clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)

    Displays histograms to compare your variables to a normal distribution.

    Parameters
    df        : DataFrame you would like to analyze
    list_var  : Names of the columns to analyze, as a list
    print_img : 'y' to print the image, 'n' to skip printing
    size_x    : Width of the image output
    size_y    : Height of the image output
    font_size : Font size of the titles and headers

To import the library, copy the following code into your Python code. It downloads the library source from GitHub and executes it in the current session:

    import requests
    url = 'https://raw.githubusercontent.com/juanduranc/Clean-Assist/master/library'
    exec(requests.get(url).text)
    help(clean_assist)

Example of library usage and interpretation:

1. The following table is a sample output from the function: clean_assist.table(df, n_rows, n_round)

VARIABLES             NULLS  COUNT  TYPES    MEAN    MEDIAN  UNIQUES  SAMPLE                                    Outliers  pval(Norm)
AVG_CLICKS_PER_VISIT  0      1946   int64    13.5    13.0    15       [11, 13, 12, 13, 13, 17, 10, 13, 12, 12]  [6,0]     0.03
MEDIAN_MEAL_RATING    47     1899   int64    2.8     3.0     5        [3, 3, 3, 3, 3, 2, 4, 3, 3, 3]            [0,13]    3e-06
REVENUE               0      1946   float64  2107.3  1740.0  859      [1880, 1495, 2572.5, 1647, 1923, 1250]    [0,82]    1e-21
TOTAL_PHOTOS_VIEWED   0      1946   int64    106.4   0.0     371      [0, 90, 0, 0, 253, 0, 705, 0, 0, 0]       [0,120]   5e-90
CROSS_SELL_SUCCESS    0      1946   int64    0.7     1.0     2        [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]                      1e-159
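
Most columns of the table above can be approximated with plain pandas. The following is only a rough sketch, not the library's actual code: the helper name summarize, the toy data, and the 1.5 * IQR outlier rule are all assumptions made for illustration, and the SAMPLE and pval(Norm) columns are left out.

```python
import pandas as pd


def summarize(df: pd.DataFrame, n_round: int = 1) -> pd.DataFrame:
    """Hypothetical helper: per-column summary for numeric columns."""
    rows = []
    for name in df.columns:
        col = df[name]
        values = col.dropna()
        # Assumption: an outlier lies more than 1.5 * IQR outside the quartiles.
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower = int((values < q1 - 1.5 * iqr).sum())
        upper = int((values > q3 + 1.5 * iqr).sum())
        rows.append({
            "VARIABLES": name,
            "NULLS": int(col.isna().sum()),
            "COUNT": int(col.count()),
            "TYPES": str(col.dtype),
            "MEAN": round(float(values.mean()), n_round),
            "MEDIAN": round(float(values.median()), n_round),
            "UNIQUES": int(col.nunique()),
            "OUTLIERS": [lower, upper],  # [lower outliers, upper outliers]
        })
    return pd.DataFrame(rows).set_index("VARIABLES")


# Toy data, not the dataset shown in the table above.
df = pd.DataFrame({
    "CLICKS": [10, 11, 12, 12, 13, 13, 14, 40],
    "RATING": [3, 3, None, 2, 4, 3, 3, 3],
})
summary = summarize(df)
print(summary)
```

Reading the OUTLIERS pair as [lower, upper] matches how the findings below interpret [6,0] and [0,120].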

Examples of findings:
  • AVG_CLICKS_PER_VISIT has a similar mean and median, approximates a normal distribution, and has 6 lower outliers.
  • MEDIAN_MEAL_RATING has 47 nulls, which need imputation.
  • REVENUE is the only float variable; the rest are integers.
  • TOTAL_PHOTOS_VIEWED has a median of 0 and 120 upper outliers. This means most people do not view photos.
  • CROSS_SELL_SUCCESS has 2 unique values. From the column named sample you can see only ones and zeros. This is a binary or boolean column.
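
Two of the findings above translate directly into cleaning steps. The sketch below uses toy data and assumes that median imputation is acceptable for the rating column; that choice is an illustration only and is not something Clean Assist does for you.

```python
import pandas as pd

# Toy data that mimics two of the columns discussed above.
df = pd.DataFrame({
    "MEDIAN_MEAL_RATING": [3, 3, None, 2, 4, None, 3],
    "CROSS_SELL_SUCCESS": [1, 1, 0, 1, 0, 1, 1],
})

# Assumption: fill missing ratings with the column median.
rating_median = df["MEDIAN_MEAL_RATING"].median()
df["MEDIAN_MEAL_RATING"] = df["MEDIAN_MEAL_RATING"].fillna(rating_median)

# A column whose non-null values are only 0 and 1 can be treated as boolean.
binary_cols = [
    name for name in df.columns
    if set(df[name].dropna().unique()) <= {0, 1}
]
for name in binary_cols:
    df[name] = df[name].astype(bool)

print(df.dtypes)
```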

2. Next, a sample output from the function: clean_assist.normality(df, list_var, print_img, size_x, size_y, font_size)


Histograms' interpretation:
  • MEDIAN_MEAL_RATING has integer values and mimics a normal distribution.
  • AVG_CLICKS_PER_VISIT is the closest variable to a normal distribution, with a p-value of 0.03.
  • REVENUE is right skewed with 82 upper outliers.
  • TOTAL_PHOTOS_VIEWED has too many zero values. It is also right skewed and far from being a normal distribution.
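
The visual reading above can be backed by numbers. This sketch uses synthetic data and assumes the D'Agostino-Pearson test from scipy.stats.normaltest; the source does not say which normality test Clean Assist uses, so treat the p-values here as illustrative rather than comparable to the pval(Norm) column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: one roughly normal variable, one right-skewed variable.
samples = {
    "normal_like": rng.normal(loc=13.5, scale=2.0, size=2000),
    "right_skewed": rng.exponential(scale=2000.0, size=2000),
}

results = {}
for label, sample in samples.items():
    skew = float(stats.skew(sample))
    # Assumption: D'Agostino-Pearson test; the library may use a different one.
    p_value = float(stats.normaltest(sample).pvalue)
    results[label] = (skew, p_value)
    print(f"{label}: skew={skew:.2f}, pval(Norm)={p_value:.1e}")
```

A large positive skew together with a p-value near zero corresponds to the right-skewed shapes described for REVENUE and TOTAL_PHOTOS_VIEWED.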

