A simple program to automate exploratory data analysis and reporting.

eda-report - Automated Exploratory Data Analysis

Badges: Binder · PyPI version · Python 3.7 | 3.9 · Documentation Status · codecov · Code style: black

A Python program to help automate the exploratory data analysis and reporting process.

Input data is processed and analysed using pandas' built-in methods, and graphs are plotted using matplotlib & seaborn. The results are then nicely packaged as a Word (.docx) document using python-docx.

Installation

You can install the package from PyPI using:

pip install eda-report
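
To confirm that the install succeeded, one option is to query the installed version from Python. This is a minimal sketch using only the standard library (importlib.metadata, Python 3.8+); the helper name installed_version is illustrative and not part of eda-report.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "eda-report") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; eda-report does not provide it.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "eda-report is not installed in this environment")
```

A None result usually means pip installed the package into a different environment from the one running the interpreter.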

Basic Usage

1. Graphical User Interface

The eda-report command launches a graphical window to help select and analyse a CSV or Excel file:

eda-report

You will be prompted to set a report title, target variable (optional), graph color and output filename, after which the contents of the input file will be analysed, and the results will be saved in a Word (.docx) document.

2. Command Line Interface

To analyse a file named input.csv, just supply its path to the eda-report command:

eda-report -i input.csv

You can also set the output filename, graph color and report title:

eda-report -i input.csv -o output.docx -c cyan --title 'EDA Report'
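
If you want to drive the command from a script, for example to process several files, the argument list can be assembled in Python. The sketch below is an assumption about how you might wrap the command, not something eda-report ships: build_command is a hypothetical helper, and it only uses flags that appear in the commands above and in the help message (-i, -o, -c, --title, -T).

```python
from typing import List, Optional


def build_command(
    infile: str,
    outfile: Optional[str] = None,
    title: Optional[str] = None,
    color: Optional[str] = None,
    target: Optional[str] = None,
) -> List[str]:
    """Assemble an eda-report argument list (hypothetical helper)."""
    command = ["eda-report", "-i", infile]
    # Optional flags are added only when a value is given, so the
    # command's own defaults apply otherwise.
    if outfile is not None:
        command += ["-o", outfile]
    if color is not None:
        command += ["-c", color]
    if title is not None:
        command += ["--title", title]
    if target is not None:
        command += ["-T", target]
    return command


if __name__ == "__main__":
    print(build_command("input.csv", outfile="output.docx", color="cyan", title="EDA Report"))
```

Passing the resulting list to subprocess.run(command, check=True) runs the report without going through a shell, so a title containing spaces needs no extra quoting.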

For more details on the optional arguments, pass the -h or --help flag to view the help message:

eda-report -h
usage: eda-report [-h] [-i INFILE] [-o OUTFILE] [-t TITLE] [-c COLOR]
                  [-T TARGET]

Automatically analyse data and generate reports. A graphical user interface
will be launched if none of the optional arguments is specified.

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        A .csv or .xlsx file to analyse.
  -o OUTFILE, --outfile OUTFILE
                        The output name for analysis results (default: eda-
                        report.docx)
  -t TITLE, --title TITLE
                        The top level heading for the report (default:
                        Exploratory Data Analysis Report)
  -c COLOR, --color COLOR
                        The color to apply to graphs (default: cyan)
  -T TARGET, --target TARGET
                        The target variable (dependent feature), used to
                        color-code plotted values. An integer value is treated
                        as a column index, whereas a string is treated as a
                        column label.
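
The help message above can be approximated with argparse from the standard library. This is a sketch reconstructed from the help text only, not the package's actual source: build_parser, parse_target and resolve_target are hypothetical names, and the index-or-label handling of -T is one plausible reading of "an integer value is treated as a column index, whereas a string is treated as a column label".

```python
import argparse
from typing import List, Union


def parse_target(value: str) -> Union[int, str]:
    """Treat an integer-like value as a column index, anything else as a label."""
    try:
        return int(value)
    except ValueError:
        return value


def resolve_target(columns: List[str], target: Union[int, str]) -> str:
    """Map a target given as an index or a label to a column label."""
    if isinstance(target, int):
        return columns[target]  # raises IndexError when out of range
    if target not in columns:
        raise KeyError(f"{target!r} is not a column label")
    return target


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="eda-report",
        description="Automatically analyse data and generate reports.",
    )
    parser.add_argument("-i", "--infile", help="A .csv or .xlsx file to analyse.")
    parser.add_argument("-o", "--outfile", default="eda-report.docx",
                        help="The output name for analysis results.")
    parser.add_argument("-t", "--title", default="Exploratory Data Analysis Report",
                        help="The top level heading for the report.")
    parser.add_argument("-c", "--color", default="cyan",
                        help="The color to apply to graphs.")
    parser.add_argument("-T", "--target", type=parse_target, default=None,
                        help="The target variable, as a column index or label.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "input.csv", "-T", "4"])
    iris_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
    print(args.outfile, resolve_target(iris_columns, args.target))
```

With the iris column order used later in this document, -T 4 and -T species would both select the species column under this reading.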

3. Interactive Mode

3.1 Analyse univariate data

>>> from eda_report.univariate import Variable
>>> Variable(range(20), name="0 to 19")
        Overview
        ========
Name: 0 to 19
Type: numeric
Unique Values: 20 -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, [...]
Missing Values: None
          ***
      Summary Statistics
                         0 to 19
Number of observations  20.00000
Average                  9.50000
Standard Deviation       5.91608
Minimum                  0.00000
Lower Quartile           4.75000
Median                   9.50000
Upper Quartile          14.25000
Maximum                 19.00000
Skewness                 0.00000
Kurtosis                -1.20000
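
To see where these figures come from, the same statistics can be recomputed with the standard library alone. This is an independent sketch, not how eda-report computes them (the package relies on pandas). It assumes linearly interpolated quartiles, the sample (n - 1) standard deviation, and bias-adjusted skewness and excess kurtosis. Under those assumptions it agrees with the summary above for range(20) to five decimal places. The function name summarise is hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def summarise(values: Sequence[float]) -> Dict[str, float]:
    """Recompute the summary statistics (needs n > 3 and non-constant data)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population-form central moments, used by the adjusted estimators below.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    # "inclusive" interpolates linearly between the minimum and maximum.
    lower, median, upper = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Number of observations": float(n),
        "Average": mean,
        "Standard Deviation": statistics.stdev(values),  # sample (n - 1) form
        "Minimum": float(min(values)),
        "Lower Quartile": lower,
        "Median": median,
        "Upper Quartile": upper,
        "Maximum": float(max(values)),
        # Bias-adjusted sample skewness and excess kurtosis.
        "Skewness": math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5,
        "Kurtosis": (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * (m4 / m2 ** 2 - 3) + 6),
    }


if __name__ == "__main__":
    for label, value in summarise(range(20)).items():
        print(f"{label:<24}{value:>10.5f}")
```

The negative kurtosis reflects the flat, evenly spaced input: a uniform spread has lighter tails than a normal distribution.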

3.2 Analyse multivariate data

>>> from eda_report.multivariate import MultiVariable
>>> from seaborn import load_dataset
>>> data = load_dataset("iris")
>>> MultiVariable(data)
                        OVERVIEW
                        ========
Numeric features: sepal_length, sepal_width, petal_length, petal_width
Categorical features: species
                          ***
          Summary Statistics (Numeric features)
          -------------------------------------
              count    mean     std  min  25%   50%  75%  max  skewness  kurtosis
sepal_length  150.0  5.8433  0.8281  4.3  5.1  5.80  6.4  7.9    0.3149   -0.5521
sepal_width   150.0  3.0573  0.4359  2.0  2.8  3.00  3.3  4.4    0.3190    0.2282
petal_length  150.0  3.7580  1.7653  1.0  1.6  4.35  5.1  6.9   -0.2749   -1.4021
petal_width   150.0  1.1993  0.7622  0.1  0.3  1.30  1.8  2.5   -0.1030   -1.3406
                          ***
          Summary Statistics (Categorical features)
          -----------------------------------------
        count unique     top freq relative freq
species   150      3  setosa   50        33.33%
                          ***
          Bivariate Analysis (Correlation)
          --------------------------------
petal_length & petal_width --> very strong positive correlation (0.96)
sepal_length & petal_length --> strong positive correlation (0.87)
sepal_length & petal_width --> strong positive correlation (0.82)
sepal_length & sepal_width --> very weak negative correlation (-0.12)
sepal_width & petal_length --> weak negative correlation (-0.43)
sepal_width & petal_width --> weak negative correlation (-0.37)
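
The bivariate section pairs up the numeric features and attaches a verbal label to each correlation coefficient. The sketch below shows one way such labels could be produced. The cut-offs (0.3, 0.5, 0.7, 0.9) are assumptions chosen only because they are consistent with the six lines above. They are not taken from the eda-report source, and describe_correlation is a hypothetical name.

```python
from itertools import combinations

# Assumed upper bounds for each strength band; anything at or above the
# last bound is labelled "very strong".
STRENGTH_BANDS = [(0.3, "very weak"), (0.5, "weak"), (0.7, "moderate"), (0.9, "strong")]


def describe_correlation(value: float) -> str:
    """Turn a correlation coefficient into a short verbal description."""
    if value == 0:
        return "no correlation (0.00)"
    magnitude = abs(value)
    strength = "very strong"
    for upper_bound, label in STRENGTH_BANDS:
        if magnitude < upper_bound:
            strength = label
            break
    direction = "positive" if value > 0 else "negative"
    return f"{strength} {direction} correlation ({value:.2f})"


if __name__ == "__main__":
    numeric = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    # Four numeric features give six unordered pairs.
    pairs = list(combinations(numeric, 2))
    print(len(pairs), "numeric pairs")
    print("petal_length & petal_width -->", describe_correlation(0.96))
```

Four numeric features yield six unordered pairs, which is why the iris output lists six correlations.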

3.3 Generate a report

>>> from eda_report import get_word_report
>>> from seaborn import load_dataset

>>> data = load_dataset("iris")
>>> get_word_report(data)
Bivariate analysis: 100%|███████████████████████████████████| 6/6 numeric pairs.
Univariate analysis: 100%|███████████████████████████████████| 5/5 features.
[INFO 17:31:37.880] Done. Results saved as 'eda-report.docx'
<eda_report.document.ReportDocument object at 0x7f3040c9bcd0>

Visit the official documentation for more details.

Download files

Source Distribution

eda_report-2.2.0.tar.gz (42.5 kB)

Uploaded Source

Built Distribution

eda_report-2.2.0-py3-none-any.whl (42.6 kB)

Uploaded Python 3
