A preliminary EDA helper.
Project description
prelim_eda_helper
This package is a preliminary exploratory data analysis (EDA) tool to make useful feature EDA plots and provide relevant information to simplify an otherwise tedious EDA step of any data science project. Specifically this package allows users to target any two features, whether they are numeric or categorical, and create visualization plots supplemented with useful summary and test statistics.
This package provides a streamlined and easy to use solution for basic EDA tasks that would otherwise require significant amount of coding to achieve. Similar packages can be found published on PyPi such as the following:
prelim_eda_helper
enables user to write quick visualization queries. At the same time, as we understand visually strong effects on graphs are not necessarily statistically meaningful, prelim_eda_helper
is designed to combine graphic visualizations with preliminary statistical test results. We aim to create a helper package to really help researchers to get a quick sense of how our data look like, without making charts and doing tests separately in earlier stages of projects. We believe the combination of graphical and statistical output is what makes prelim_eda_helper
a unique yet handy helper package.
To achive this goal, prelim_eda_helper
creates charts with the visualization library altair
and conducts statistical tests with 'scipy'.
Usage
Installation
$ pip install prelim_eda_helper
initialize_helper
Enables plotting data sets with more than 5000 rows.
initialize_helper()
num_dist_by_cat
Creates a pair of plots showing the distribution of the numeric variable when grouped by the categorical variable. Output includes a histogram and boxplot. In addition, basic test statistics will be provided for user reference.
from prelim_eda_helper import num_dist_by_cat
num_dist_by_cat(num = 'x', cat = 'group', data = data, title_hist = 'Distribution of X', title_boxplot = 'X Seperated by Group', lab_num = 'X', lab_cat = 'Group', num_on_x = True, stat = True)
num_dist_scatter
Creates a scatter plot given two numerical variables. The plot can provide regression trendline and include confidence interval bands. Spearman and Pearson's correlation will also be returned to aid the user to determining feature relationship.
from prelim_eda_helper import num_dist_scatter
num_dist_scatter(num1 = 'x', num2 = 'y', data = data, title = 'Scatter plot with X and Y', stat = False, trend = None)
cat_dist_heatmap
Creates concatenated charts showing the heatmap of two categorical variables and a barchart for occurrance of these variables.
from prelim_eda_helper import cat_dist_heatmap
cat_dist_heatmap(cat_1 = 'group1', cat_2 = 'group2', data = data, title = 'How are Group1 and Group2 distributed?', lab_1 = 'group1', lab_2 = 'group2', heatmap = True, barchart = True)
num_dist_summary
Creates a distribution plot of the given numeric variable and provides a statistical summary of the feature. In addition, the correlation values of the variable with other numeric features will be provided based on a given threshold.
from prelim_eda_helper import num_dist_summary
num_dist_summary(num = 'x', data = data, title = 'Distribution of X', lab = 'X', thresh_corr = 0.0, stat = True )
Contributing
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
License
prelim_eda_helper
was created by Mehwish Nabi, Morris Chan, Xinry LU, Austin Shih. It is licensed under the terms of the MIT license.
Credits
prelim_eda_helper
was created with cookiecutter
and the py-pkgs-cookiecutter
template.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for prelim_eda_helper-0.1.8-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 52723c0cfbb27a2a91bd616faa1a28af0825d4debbcb860b0bd9c340c9a95a17 |
|
MD5 | f8c2b5f9ccf6879b9200f6a76de0696b |
|
BLAKE2b-256 | b57b87e81d1e71292aac20dc045ca8b277f0239bf84a129be70c91e9b223f171 |