
Slim version of EDA processing Python package

Project description

slimeda


Exploratory Data Analysis (EDA) is an important preparatory step that helps data scientists understand and clean data sets before machine learning begins. However, this step also involves many repetitive tasks. slimeda helps data scientists quickly complete the initial EDA work and gain a preliminary understanding of the data.

Slimeda focuses on unique value counts and missing value counts, and on plots such as histograms and correlation maps. Results are returned as charts or images, so users can reference their EDA results more flexibly.

Function Specification

The package is under development and includes the following functions:

  • histogram : This function accepts a dataframe and builds histograms for all numeric columns which are returned as an array of chart objects.

  • corr_map : This function accepts a dataframe and builds a heat map for all numeric columns which is returned as a chart object.

  • cat_unique_counts : This function accepts a dataframe and returns a table of unique value counts for all categorical columns.

  • miss_counts : This function accepts a dataframe and returns a table of counts of missing values in all columns.

Limitations: We only consider numeric and categorical columns in our package.
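The split between numeric and categorical columns can be illustrated with plain pandas. The following is only a minimal sketch under stated assumptions, not slimeda's implementation: the example data and column names are made up, and every non-numeric column is treated as categorical.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 35],
        "height_cm": [170.5, 162.0, 180.2, 175.0],
        "hobby": ["chess", "golf", "chess", "tennis"],
        "city": ["Vancouver", "Vancouver", "Toronto", "Vancouver"],
    }
)

# Numeric columns: the ones a histogram or correlation map can use.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Assumption of this sketch: every non-numeric column counts as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Unique value counts per categorical column, as a small table.
unique_counts = (
    df[categorical_cols]
    .nunique()
    .rename("unique_count")
    .rename_axis("feature")
    .reset_index()
)
print(unique_counts)
```

The resulting table has one row per categorical column, which is roughly the kind of summary the unique value count function is described as returning.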

Installation

$ pip install git+https://github.com/UBC-MDS/slimeda

Usage

Slimeda has four basic functions to help you conduct basic EDA (Exploratory Data Analysis):

  • histogram : The histogram function accepts a data frame and a list of columns as input, and returns a list of charts. Each chart in the output is a histogram Altair object (mark_bar) with the given column on the x-axis and the count on the y-axis.

    • histogram(example_1, ["Age", "Hobby"])
    • OUTPUT: a list of Altair histogram objects
  • corr_map : Plots the correlation map for the provided dataframe and its columns. The (pairwise) correlation map uses Spearman's rank correlation coefficient.

    • Required parameters in this function:
      • df: a pd.DataFrame containing the data of interest
      • columns: the columns of interest
        • Notice that only numeric columns will be included in the final map
    • from vega_datasets import data
    • corr_map(data.cars(), data.cars().columns.to_list())

Output of corr_map
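For readers who want to see the numbers behind such a map, pairwise Spearman rank correlations can be computed with plain pandas. This is a sketch under stated assumptions, not slimeda's implementation: the data is invented, and the numeric columns are selected explicitly before calling `DataFrame.corr`.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, z falls monotonically.
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [1, 4, 9, 16, 25],
        "z": [5, 4, 3, 2, 1],
        "name": ["a", "b", "c", "d", "e"],  # non-numeric, left out below
    }
)

# Keep only numeric columns, mirroring the note that only numeric
# columns end up in the final map.
numeric = df.select_dtypes(include="number")

# Spearman's coefficient is computed on ranks, so any strictly monotonic
# relationship gives +1 or -1 even when it is not linear.
corr = numeric.corr(method="spearman")
print(corr)
```

Because y is a monotonic but non-linear function of x, the rank-based coefficient between them is 1, which is the main reason to prefer Spearman over Pearson for a first look at the data.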

  • cat_unique_counts : Returns the count of unique values in each categorical feature, along with the corresponding feature name.

    • Required parameters in this function:
      • df: a pd.DataFrame you want to analyze
    • cat_unique_counts(df)
  • miss_counts : Returns the missing value counts and the corresponding percentages for a pd.DataFrame.

    • There are four parameters in this function:
      • df: a pd.dataframe you want to analyze
      • keyword: Default is None, a single number or string that you want to define as NaN along with original NaNs
      • sparse: a boolean, default False, that controls whether columns without null values are hidden
      • ascending: a boolean, default False, that sorts the counts in ascending or descending order
    • miss_counts(example_1, keyword="miss", sparse=False, ascending=False)
    • OUTPUT:
      • a pd.dataframe as below:

The output of miss_counts
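The kind of table described above can be approximated with plain pandas. The helper below is hypothetical and is not slimeda's implementation: the name `missing_summary`, the output column names, and in particular the reading that `sparse=True` keeps only columns with at least one missing value are all assumptions made for this sketch.

```python
import pandas as pd


def missing_summary(df, keyword=None, sparse=False, ascending=False):
    """Hypothetical helper: missing value counts and percentages per column."""
    missing = df.isna()
    if keyword is not None:
        # Treat the keyword as missing, alongside the real NaN values.
        missing = missing | df.eq(keyword)
    counts = missing.sum()
    summary = pd.DataFrame(
        {
            "missing_count": counts,
            "missing_percent": counts / len(df) * 100,
        }
    )
    if sparse:
        # Assumption: sparse=True drops columns that have no missing values.
        summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=ascending)


# Made-up example data; "miss" is used as a placeholder for missing hobbies.
example = pd.DataFrame(
    {
        "Age": [25, None, 31, 40],
        "Hobby": ["chess", "miss", None, "golf"],
        "City": ["A", "B", "C", "D"],
    }
)

summary = missing_summary(example, keyword="miss")
sparse_summary = missing_summary(example, keyword="miss", sparse=True)
print(summary)
```

Reporting the percentage next to the raw count makes columns comparable across data sets of different sizes.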

Documentation

Please see the documentation for this package on Read the Docs.

Fitting in Python Ecosystem

  • Packages with similar functions:

  • Slimeda's innovation points:

    • We gather the functions needed for EDA, which otherwise require multiple packages, in one place and simplify the code. For example, for missing values we report not only the counts but also the corresponding percentages.
    • Compared with numpy, we optimize the output to be clearer.
    • Compared with pandas-profiling, we generate only the most commonly used graphs and support PNG output, which gives users more flexibility in reusing their EDA results.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

CONTRIBUTORS

Group 4 members:

  • Khalid Abdilahi (@khalidcawl)
  • Anthea Chen (@anthea98)
  • Simon Guo (@y248guo)
  • Taiwo Owoseni (@thayeylolu)

License

slimeda was created by Simon Guo. It is licensed under the terms of the MIT license.

Credits

slimeda was created with cookiecutter and the py-pkgs-cookiecutter template.
