This package helps users perform exploratory data analysis (EDA) tasks, such as creating an EDA PDF report from a dataset.

Project description

EDA-assistant

Background

The goal of this project is to help data scientists and data analysts perform quick and easy exploratory data analysis (EDA) in Python. The usual EDA process in Python involves importing many packages and writing many lines of code. The EDA-assistant package simplifies this to a single import and two lines of code that produce a PDF report containing the standard EDA summary statistics and graphs. The report currently contains tables of data set and variable summary statistics, bar graphs visualizing data distributions, a correlation matrix heat map, and a scatter pair plot.
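For context, the manual process described above can be sketched with pandas alone. This is not the package's implementation; the column values are illustrative (not taken from IRIS.csv), and the variable names are assumptions chosen for this example. It covers only the tabular parts of the report; the plots would need further matplotlib or seaborn calls.

```python
import pandas as pd

# Illustrative data only; a real workflow would load a CSV file instead.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 6.0, 5.1],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Data set summary: size, missing values, and duplicate rows.
dataset_summary = pd.Series({
    "rows": len(df),
    "columns": df.shape[1],
    "missing_values": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
})

# Variable summary: one row per column, numeric or not.
variable_summary = df.describe(include="all").T

# Correlation matrix over the numeric columns only.
correlation = df.select_dtypes(include="number").corr()

print(dataset_summary)
print(variable_summary)
print(correlation)
```

Even this reduced version takes several steps, which is the overhead the package aims to remove.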

Data

The datasets used in this repository for testing and demonstration are listed along with their sources below:

  1. Iris Flower Dataset
  2. Wine Quality Dataset
  3. Cereal Dataset

Software

Programming Language(s):
Python

Python Packages:
cycler==0.11.0
fonttools==4.29.1
kiwisolver==1.3.2
matplotlib==3.5.1
numpy==1.22.2
packaging==21.3
pandas==1.4.1
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.8.0
seaborn==0.11.2
six==1.16.0

Package Structure

EDA-assistant/
  |- eda_assistant/
    |- __init__.py
    |- _calc_dataframe_statistics.py
    |- _calc_variable_statistics.py
    |- _create_graphs.py
    |- _create_tables.py
    |- _format_eda_report.py
    |- _format_graphs.py
    |- _format_tables.py
    |- eda_assistant.py
    |- tests/
      |- __init__.py
      |- test_calc_dataframe_statistics.py
      |- test_calc_variable_statistics.py
      |- test_create_tables.py
      |- test_eda_assistant.py
      |- test_format_graphs.py
      |- test_format_tables.py
      |- test_create_tables_results/
        |- test_create_df_summary_cereal_results.csv
        |- test_create_var_summary_cereal_results.csv
  |- data/
    |- IRIS.csv
    |- WineQT.csv
    |- cereal.csv
  |- docs/
    |- EDA_assistant_final_presentation.pdf
    |- EDA_assistant_written_report.pdf
  |- examples/
    |- demo_EDA_assistant.ipynb
    |- demo_iris_eda_report.pdf
    |- demo_iris_eda_report_cat_hist.png
    |- demo_iris_eda_report_corr.png
    |- demo_iris_eda_report_df_table.png
    |- demo_iris_eda_report_num_hist.png
    |- demo_iris_eda_report_pair.png
    |- demo_iris_eda_report_var_table.png
    |- demo_wine_eda_report.pdf
  |- LICENSE
  |- README.md
  |- requirements.txt
  |- setup.py

Installation

To install this package, simply enter the following command:

pip install EDA-assistant
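To confirm that the installation succeeded, one option is to query the installed distribution's version from Python. This is a minimal sketch using the standard library; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("EDA-assistant"))
```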

Assumptions and Dependencies

  • The dataset file used to create an EDA class must be in .csv format
  • The dataset file used to create an EDA class must be in the user's current working directory
  • Variable types are determined from the pandas dtype of each column, which may not always identify the correct variable type
  • A categorical bar plot is included in the EDA report only if the categorical column has 10 or fewer unique values. With more than 10 bars, the plot becomes compressed and harder to read
  • The scatter pair plot is included in the EDA report only if the dataset has 10 or fewer numeric variables. With more than 10 variables, the plot takes much longer to produce
  • The layout of the PDF report may vary; page titles may sometimes overlap graph titles or be separated from them by a large white-space gap
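The dtype-based type detection and the two plotting thresholds in the list above can be sketched as follows. This is an illustration under stated assumptions, not the package's actual code: the function `plan_plots` and the constants `MAX_CATEGORIES` and `MAX_PAIR_PLOT_VARS` are hypothetical names, and only the threshold value of 10 comes from the list.

```python
import pandas as pd

MAX_CATEGORIES = 10      # hypothetical name; threshold value from the list above
MAX_PAIR_PLOT_VARS = 10  # hypothetical name; threshold value from the list above


def plan_plots(df: pd.DataFrame) -> dict:
    """Decide which plots the rules above would allow for df."""
    # Split columns by pandas dtype: numeric versus everything else.
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

    # Keep only categorical columns with at most MAX_CATEGORIES unique values.
    bar_plot_cols = [
        col for col in categorical_cols
        if df[col].nunique() <= MAX_CATEGORIES
    ]

    return {
        "numeric": numeric_cols,
        "categorical": categorical_cols,
        "bar_plots": bar_plot_cols,
        "pair_plot": len(numeric_cols) <= MAX_PAIR_PLOT_VARS,
    }


# Illustrative data: 12 rows, one low-cardinality and one high-cardinality
# text column, plus a zip code that is stored as an integer.
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"] * 4,
    "sample_id": [f"s{i}" for i in range(12)],
    "sepal_length": [5.0 + 0.1 * i for i in range(12)],
    "zip_code": [98101 + i for i in range(12)],
})

plan = plan_plots(df)
print(plan)
```

Here `sample_id` has 12 unique values and is skipped for bar plots, while `zip_code` is treated as numeric because of its integer dtype even though it is conceptually categorical. That is the kind of misclassification the dtype caveat refers to.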

Usage

To see how to use the package to create an EDA report, refer to the example notebook, examples/demo_EDA_assistant.ipynb.

Output Preview

The screenshots below show sample output from an EDA report created with this package. The tables and graphs are for the data set IRIS.csv (source listed above), and the corresponding images are in the examples/ directory:

  • Data Set Summary Statistics
  • Variable Summary Statistics
  • Numerical Histogram Plots
  • Categorical Histogram Plots
  • Correlation Matrix
  • Scatter Pair Plot

License

MIT License
