eda_mds: Simplified Exploratory Data Analysis
Basic EDA functions implemented to improve on core Pandas DataFrame functions.
Summary
This package was created to kick-start the exploratory data analysis (EDA) stage of a machine learning or analytics project.
Its primary objective is to improve upon the popular EDA functions present in the pandas package.
There are four functions that deliver insights and identify potential problems in the dataset.
Function Descriptions
cor_eda
: This function accepts a dataset and isolates its continuous numerical variables. It calculates the correlation between each pair of continuous numerical variables from scratch and displays the results in a table.
info_na
: This function replicates and extends the behaviour of pandas.DataFrame.info. Additional information about null values in rows and columns is included.
cat_var_stats
: This function creates summary statistics about the categorical variables in the dataframe. The number of unique values, the frequency of values, and suggested category binning are included.
describe_outliers
: This function extends the functionality of pandas.DataFrame.describe for numeric data by providing a count of lower-tail and upper-tail outliers for a given threshold.
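To make two of these descriptions concrete, here is a minimal pandas-only sketch of the underlying ideas: a Pearson correlation table computed without DataFrame.corr, and a count of lower-tail and upper-tail outliers. It is not the eda_mds implementation. The names pearson_corr_table and count_tail_outliers are hypothetical, all numeric columns are treated as continuous, and the threshold is assumed to be measured in standard deviations from the column mean, which the descriptions above do not specify.

```python
import math

import pandas as pd


def pearson_corr_table(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between numeric columns, without DataFrame.corr."""
    numeric = df.select_dtypes(include="number")
    cols = list(numeric.columns)
    table = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for a in cols:
        for b in cols:
            x, y = numeric[a], numeric[b]
            # Keep only rows where both values are present.
            mask = x.notna() & y.notna()
            dx = x[mask] - x[mask].mean()
            dy = y[mask] - y[mask].mean()
            denom = math.sqrt((dx**2).sum() * (dy**2).sum())
            table.loc[a, b] = (dx * dy).sum() / denom if denom else float("nan")
    return table


def count_tail_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Count values more than `threshold` standard deviations from the mean."""
    numeric = df.select_dtypes(include="number")
    # Assumption: z-score rule with the sample standard deviation (ddof=1).
    z = (numeric - numeric.mean()) / numeric.std()
    return pd.DataFrame(
        {
            "lower_tail": (z < -threshold).sum(),
            "upper_tail": (z > threshold).sum(),
        }
    )


sample = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [2, 4, 6, 8, 10]})
print(pearson_corr_table(sample))
print(count_tail_outliers(sample, threshold=1.5))
```

Other outlier rules, such as a multiple of the interquartile range, would slot into count_tail_outliers in the same way.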
Python Ecosystem Integration
Our functions are heavily inspired by the pandas package for Python.
EDA functions such as pandas.DataFrame.info, pandas.DataFrame.describe and pandas.DataFrame.corr are recreated and improved upon in this package.
Our functions also depend on the pandas.DataFrame object.
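For reference, the sketch below runs the three pandas built-ins named above on a small, made-up DataFrame, followed by the kind of manual follow-up steps (null counts, category frequencies) that they leave to the user. The column names and values are illustrative assumptions only, and none of the code calls eda_mds.

```python
import pandas as pd

# Small illustrative dataset with missing values (made up for this sketch).
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, None, 35.0],
        "fare": [7.25, 71.28, 7.92, 53.10],
        "deck": ["C", None, None, "C"],
    }
)

df.info()  # prints dtypes and non-null counts per column
summary = df.describe()  # count, mean, std and quartiles for numeric columns
corr = df.select_dtypes(include="number").corr()  # pairwise Pearson correlation

# Manual follow-ups that the built-ins above do not report directly.
nulls_per_column = df.isna().sum()
rows_with_nulls = int(df.isna().any(axis=1).sum())
deck_counts = df["deck"].value_counts(dropna=False)

print(summary)
print(corr)
print(nulls_per_column)
print(rows_with_nulls)
print(deck_counts)
```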
Installation
User Setup
This package can be installed from PyPI by running the following command in your terminal.
$ pip install eda_mds
Developer Setup
Here's how to install eda_mds for local development.
- Clone a copy of eda_mds locally by running the following command in your terminal.
$ git clone https://github.com/UBC-MDS/eda_mds.git
- Create and activate a new conda environment and install poetry.
$ conda create -n eda_mds_dev python=3.9 poetry
$ conda activate eda_mds_dev
- Navigate to the root directory.
$ cd path/to/eda_mds
- Install eda_mds using poetry.
$ poetry install
Running the Tests and Coverage
- Navigate to the root directory.
$ cd path/to/eda_mds
- Run the tests.
$ pytest
- Generate the coverage report.
$ coverage report
Usage
Function Usage
The steps below show how to start using the functions in this package once you have completed the installation. Please see the vignette for detailed usage. Note: Each function takes in a pandas.DataFrame object.
- Import the functions and pandas.
from eda_mds import info_na, describe_outliers, cat_var_stats, cor_eda
import pandas as pd
- Load your dataset of choice.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
- Begin using the functions!
info_na(df)
describe_outliers(df)
cat_var_stats(df)
cor_eda(df)
Contributing
Package created by Koray Tecimer, Paolo De Lagrave-Codina, Nicole Bidwell, Simon Frew.
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
License
eda_mds was created by Koray Tecimer, Paolo De Lagrave-Codina, Nicole Bidwell, Simon Frew.
Code is licensed under the terms of the MIT license.
Non-code portions, specifically vignettes and related documentation, are licensed under the terms of the Creative Commons Zero v1.0 Universal license.
Credits
eda_mds was created with cookiecutter and the py-pkgs-cookiecutter template.