Toolset to make EDA easier!

EDAhelper

Tools to make EDA easier!

About

This package aims to make exploratory data analysis (EDA) more efficient. We found that getting a first look at a data set involves a lot of repetitive work. To avoid repeating the same procedures, our team developed a toolkit that provides the following functions:

  1. Clean the data and replace missing values using a chosen imputation method.
  2. Describe the data, for example the distribution of each column.
  3. Automatically generate correlation plots between numeric columns.
  4. Combine the plots into a form suitable for a report.

Contributors

  • Rowan Sivanandam
  • Steven Leung
  • Vera Cui
  • Jennifer Hoang

Feature specifications

  1. preprocess(path, method=None, fill_value=None, read_func=pd.read_csv, **kwarg) :
    Preprocesses data from a txt or csv file by handling missing values. Five imputation methods are available (None, 'most_frequent', 'mean', 'median', 'constant'). Returns the processed data as a pandas.DataFrame.
  2. column_stats(data, column1, column2 = None, column3 = None, column4 = None) :
    Computes summary statistics for one or more columns: count, mean, median, mode, Q1, Q3, variance, standard deviation, and correlation. Returns a summary table with all statistics and the correlations between the chosen columns.
  3. plot_histogram(data, columns=["all"], num_bins=30) :
    Creates histograms for numerical features in a dataframe using Altair. Returns an Altair plot for each specified continuous feature.
  4. numeric_plots(df) :
    Takes a dataframe and plots each possible pair of numeric columns using Altair, producing a matrix of correlation plots.
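
To make the five imputation options concrete, here is a minimal sketch of what each strategy does to a single column with missing entries. It uses only the Python standard library and a hypothetical helper named impute; it is not EDAhelper's implementation. Treating method=None as "leave the data unchanged" is an assumption.

```python
from collections import Counter
from statistics import mean, median


def impute(values, method=None, fill_value=None):
    """Fill None entries in a list using one of the five strategies.

    Hypothetical helper for illustration; it is not part of EDAhelper.
    """
    present = [v for v in values if v is not None]
    if method is None:
        # Assumption: None means "no imputation", so return the data unchanged.
        return list(values)
    if method == "most_frequent":
        replacement = Counter(present).most_common(1)[0][0]
    elif method == "mean":
        replacement = mean(present)
    elif method == "median":
        replacement = median(present)
    elif method == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"unknown method: {method!r}")
    # Replace only the missing entries; observed values are kept as they are.
    return [replacement if v is None else v for v in values]


ages = [20, None, 30, 30, 40]
print(impute(ages, method="mean"))
print(impute(ages, method="constant", fill_value=0))
```

The replacement is always computed from the observed values only, so the missing entries do not affect the mean, median, or most frequent value.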

Related projects

EDA is not a new topic for data scientists, and quite a few packages on PyPI do similar work. However, most of them offer only limited functionality, such as descriptive statistics alone. Our package is intended as an all-in-one toolkit for EDA. Below is a list of related projects.

  • auto-eda : An automatic script that generates information about the dataset.
  • easy-eda : Exploratory Data Analysis.
  • quick-eda : Important dataframe statistics with a single command.
  • eda-report : A simple program to automate exploratory data analysis and reporting.

Installation

You can use Git to clone the repository from GitHub and install the latest development version:

$ git clone https://github.com/UBC-MDS/EDAhelper.git
$ cd EDAhelper/dist
$ pip install EDAhelper-3.0.0-py3-none-any.whl

or install from PyPI:

$ pip install edahelper

Usage

Example usage:

from EDAhelper.preprocess import preprocess
# Read the file and handle missing values; returns a pandas.DataFrame
df = preprocess('file_path')

from EDAhelper.column_stats import column_stats
# Column names are passed one by one (up to four), per the signature above
column_stats(df, 'Date', 'PctPopulation', 'CrimeRatePerPop')

from EDAhelper.plot_histogram import plot_histogram
plot_histogram(df, columns = ['A', 'B'])

from EDAhelper.numeric_plots import numeric_plots
numeric_plots(df)
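
To sanity-check the numbers that column_stats reports, the statistics named in the feature specifications can be reproduced for small inputs with the standard library. The sketch below uses two hypothetical helpers, summarize and pearson, that are not part of EDAhelper. The choice of sample variance and of inclusive quartiles is an assumption, so the package's own output may differ slightly.

```python
from statistics import mean, median, mode, quantiles, stdev, variance


def summarize(values):
    """Return the per-column statistics listed for column_stats.

    Hypothetical helper for illustration; it is not part of EDAhelper.
    """
    # Assumption: quartiles use linear interpolation over the observed range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "Q1": q1,
        "Q3": q3,
        "variance": variance(values),  # sample variance (n - 1 denominator)
        "std": stdev(values),          # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


a = [1, 2, 2, 3, 4]
b = [2, 4, 4, 6, 8]
print(summarize(a))
print(pearson(a, b))
```

Because b is an exact multiple of a, the correlation printed here should be 1 up to floating-point rounding, which makes the pair a convenient check.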

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

EDAhelper was created by Rowan Sivanandam, Steven Leung, Vera Cui, and Jennifer Hoang. It is licensed under the terms of the MIT license.

Credits

EDAhelper was created with cookiecutter and the py-pkgs-cookiecutter template.
