
A simple program to automate exploratory data analysis and reporting.


Automated Exploratory Data Analysis


A Python program to help automate the exploratory data analysis and reporting process.

Input data is analysed using pandas' built-in methods, and graphs are plotted using matplotlib and seaborn. The results are then packaged as a Word (.docx) document using python-docx.

Installation

You can install the package from PyPI using:

pip install eda-report

Basic Usage

1. Graphical User Interface

The eda_report command launches a graphical window for selecting and analysing a CSV or Excel file:

eda_report


You will be prompted to set a report title, target variable (optional), graph color and output filename. The contents of the input file are then analysed, and the results are saved in a Word (.docx) document.

2. Interactive Mode

You can obtain a summary for a single feature (univariate) using the Variable class:

>>> from eda_report.univariate import Variable
>>> x = Variable(data=range(50), name='1 to 50')
>>> x
            Overview
            ========
Name: 1 to 50,
Type: numeric,
Unique Values: 50 -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, [...],
Missing Values: None

        Summary Statistics
        ==================
                         1 to 50
Number of observations  50.00000
Average                 24.50000
Standard Deviation      14.57738
Minimum                  0.00000
Lower Quartile          12.25000
Median                  24.50000
Upper Quartile          36.75000
Maximum                 49.00000
Skewness                 0.00000
Kurtosis                -1.20000

>>> x.show_graphs()
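
The summary statistics shown above can be cross-checked with plain pandas. The following is a minimal sketch, not the package's own implementation: it assumes the figures correspond to the standard pandas Series methods (sample standard deviation, linearly interpolated quartiles, bias-corrected skewness and excess kurtosis), and the names `values` and `summary` are made up for illustration.

```python
import pandas as pd

# The same data as in the Variable example above.
values = pd.Series(range(50), name="1 to 50")

# Assumption: each row of the summary maps onto one standard pandas method.
summary = pd.Series(
    {
        "Number of observations": values.count(),
        "Average": values.mean(),
        "Standard Deviation": values.std(),  # sample standard deviation (ddof=1)
        "Minimum": values.min(),
        "Lower Quartile": values.quantile(0.25),
        "Median": values.median(),
        "Upper Quartile": values.quantile(0.75),
        "Maximum": values.max(),
        "Skewness": values.skew(),
        "Kurtosis": values.kurt(),  # excess kurtosis, so a normal distribution scores 0
    },
    name=values.name,
)

print(summary.round(5))
```

A kurtosis near -1.2 is what you would expect for evenly spaced values such as these, since a uniform distribution has flatter tails than a normal one.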

You can obtain statistics for a set of features (multivariate) using the MultiVariable class:

>>> from eda_report.multivariate import MultiVariable
>>> # Get a dataset
>>> import seaborn as sns
>>> data = sns.load_dataset('iris')
>>> X = MultiVariable(data)
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00,  3.85it/s]
>>> X
        Overview
        ========
Numeric features: sepal_length, sepal_width, petal_length, petal_width
Categorical features: species

        Summary Statistics (Numeric features)
        =====================================
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

        Summary Statistics (Categorical features)
        =========================================
       species
count      150
unique       3
top     setosa
freq        50

        Bivariate Analysis (Correlation)
        ================================
sepal_length & petal_width --> strong positive correlation (0.82)
sepal_width & petal_width --> weak negative correlation (-0.37)
sepal_length & sepal_width --> very weak negative correlation (-0.12)
sepal_length & petal_length --> strong positive correlation (0.87)
sepal_width & petal_length --> weak negative correlation (-0.43)
petal_length & petal_width --> very strong positive correlation (0.96)

>>> X.show_correlation_heatmap()
>>> # Generate a report document
>>> from eda_report import get_word_report
>>> get_word_report(data)
[INFO 10:56:50.241] Assessing correlation in numeric variables...
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00,  3.89it/s]
[INFO 10:56:53.851] Done. Summarising each variable...
Univariate analysis: 100%|███████████████████████████████████████████| 5/5 [00:01<00:00,  2.52it/s]
[INFO 10:56:56.007] Done. Results saved as 'eda-report.docx' 

3. Command Line Interface

To analyse a file named input.csv, supply its path to the eda_cli command:

eda_cli input.csv

You can also set the output file, graph color and report title:

eda_cli input.csv -o output.docx -c cyan --title 'EDA Report'

For more details on the optional arguments, pass the -h or --help flag to view the help message:

eda_cli -h
usage: eda_cli [-h] [-o OUTFILE] [-t TITLE] [-c COLOR] [-T TARGET] infile

Get a basic EDA report in docx format.

positional arguments:
  infile                A .csv or .xlsx file to process.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        The output file (default: eda-report.docx)
  -t TITLE, --title TITLE
                        The top level heading in the report (default:
                        Exploratory Data Analysis Report)
  -c COLOR, --color COLOR
                        A valid matplotlib color specifier (default:
                        orangered)
  -T TARGET, --target TARGET
                        The target variable (dependent feature), used to
                        color-code plotted values. An integer value is treated
                        as a column index, whereas a string is treated as a
                        column label. (Default: None)
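
To make the option handling concrete, here is a rough argparse sketch that accepts the same arguments as the help message above. It is not the package's actual parser: `build_parser` and `int_or_str` are hypothetical names, and the integer-versus-string handling of the target is one possible reading of the help text.

```python
import argparse


def int_or_str(value: str):
    """Treat integer-like input as a column index, anything else as a label."""
    try:
        return int(value)
    except ValueError:
        return value


def build_parser() -> argparse.ArgumentParser:
    # Flags and defaults are copied from the help message; the rest is assumed.
    parser = argparse.ArgumentParser(
        prog="eda_cli", description="Get a basic EDA report in docx format."
    )
    parser.add_argument("infile", help="A .csv or .xlsx file to process.")
    parser.add_argument("-o", "--outfile", default="eda-report.docx")
    parser.add_argument(
        "-t", "--title", default="Exploratory Data Analysis Report"
    )
    parser.add_argument("-c", "--color", default="orangered")
    parser.add_argument("-T", "--target", type=int_or_str, default=None)
    return parser


# Equivalent to: eda_cli input.csv -o output.docx -c cyan --title 'EDA Report'
args = build_parser().parse_args(
    ["input.csv", "-o", "output.docx", "-c", "cyan", "--title", "EDA Report"]
)
print(args.infile, args.outfile, args.color, args.title, args.target)
```

Under this reading, passing `-T 4` would select the column at index 4, while `-T species` would select the column labelled species.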

Visit the official documentation for more details.

