A simple program to automate exploratory data analysis and reporting.
Automated Exploratory Data Analysis
A Python program to help automate the exploratory data analysis and reporting process.
Input data is processed and analysed using pandas' built-in methods, and graphs are plotted using matplotlib & seaborn. The results are then nicely packaged as a Word (.docx) document using python-docx.
Installation
You can install the package from PyPI using:
pip install eda-report
Basic Usage
1. Graphical User Interface
The eda_report command launches a graphical window to help select and analyse a CSV/Excel file:
eda_report
You will be prompted to set a report title, target variable (optional), graph color and output filename, after which the contents of the input file will be analysed, and the results will be saved in a Word (.docx) document.
2. Interactive Mode
You can obtain a summary for a single feature (univariate) using the Variable class:
>>> from eda_report.univariate import Variable
>>> x = Variable(data=range(50), name='1 to 50')
>>> x
Overview
========
Name: 1 to 50,
Type: numeric,
Unique Values: 50 -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, [...],
Missing Values: None
Summary Statistics
==================
1 to 50
Number of observations 50.00000
Average 24.50000
Standard Deviation 14.57738
Minimum 0.00000
Lower Quartile 12.25000
Median 24.50000
Upper Quartile 36.75000
Maximum 49.00000
Skewness 0.00000
Kurtosis -1.20000
>>> x.show_graphs()
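For readers who want to check the summary statistics by hand, here is a small sketch that recomputes them with plain pandas. It assumes the figures come from pandas' standard Series methods (sample standard deviation, linear-interpolated quartiles, bias-corrected skewness and excess kurtosis). It does not use the Variable class and may differ from how the package computes them internally.

```python
import pandas as pd

values = pd.Series(range(50), name="1 to 50")

summary = {
    "Number of observations": values.count(),
    "Average": values.mean(),
    "Standard Deviation": values.std(),  # sample standard deviation (ddof=1)
    "Minimum": values.min(),
    "Lower Quartile": values.quantile(0.25),
    "Median": values.median(),
    "Upper Quartile": values.quantile(0.75),
    "Maximum": values.max(),
    "Skewness": values.skew(),
    "Kurtosis": values.kurt(),  # excess kurtosis, so a normal distribution gives 0
}

for label, value in summary.items():
    print(f"{label:<24}{float(value):.5f}")
```

Under those assumptions the values agree with the statistics listed above, for example a standard deviation of 14.57738 and a kurtosis of -1.2.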
You can obtain statistics for a set of features (multivariate) using the MultiVariable class:
>>> from eda_report.multivariate import MultiVariable
>>> # Get a dataset
>>> import seaborn as sns
>>> data = sns.load_dataset('iris')
>>> X = MultiVariable(data)
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00, 3.85it/s]
>>> X
Overview
========
Numeric features: sepal_length, sepal_width, petal_length, petal_width
Categorical features: species
Summary Statistics (Numeric features)
=====================================
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Summary Statistics (Categorical features)
=========================================
species
count 150
unique 3
top setosa
freq 50
Bivariate Analysis (Correlation)
================================
sepal_length & petal_width --> strong positive correlation (0.82)
sepal_width & petal_width --> weak negative correlation (-0.37)
sepal_length & sepal_width --> very weak negative correlation (-0.12)
sepal_length & petal_length --> strong positive correlation (0.87)
sepal_width & petal_length --> weak negative correlation (-0.43)
petal_length & petal_width --> very strong positive correlation (0.96)
>>> X.show_correlation_heatmap()
>>> # Generate a report document
>>> from eda_report import get_word_report
>>> get_word_report(data)
[INFO 10:56:50.241] Assessing correlation in numeric variables...
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00, 3.89it/s]
[INFO 10:56:53.851] Done. Summarising each variable...
Univariate analysis: 100%|███████████████████████████████████████████| 5/5 [00:01<00:00, 2.52it/s]
[INFO 10:56:56.007] Done. Results saved as 'eda-report.docx'
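The "Bivariate Analysis (Correlation)" lines in the session above pair each coefficient with a verbal label. The sketch below shows one way such labels could be produced. The function describe_correlation and its cut-off values are hypothetical: the thresholds were picked only so that they agree with the six lines of sample output, and the package may use different ones.

```python
from itertools import combinations


def describe_correlation(coefficient: float) -> str:
    """Hypothetical helper: phrase a correlation coefficient in words."""
    magnitude = abs(coefficient)
    if magnitude == 0:
        return "no correlation (0.00)"
    # Assumed cut-offs, chosen to match the sample output only.
    if magnitude >= 0.9:
        strength = "very strong"
    elif magnitude >= 0.7:
        strength = "strong"
    elif magnitude >= 0.5:
        strength = "moderate"
    elif magnitude >= 0.2:
        strength = "weak"
    else:
        strength = "very weak"
    direction = "positive" if coefficient > 0 else "negative"
    return f"{strength} {direction} correlation ({coefficient:.2f})"


# Four numeric features yield six unordered pairs, matching the 6/6 progress bar.
numeric_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pairs = list(combinations(numeric_features, 2))

print(len(pairs))
print(describe_correlation(0.96))
print(describe_correlation(-0.37))
```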
3. Command Line Interface
To analyse a file named input.csv, just supply its path to the eda_cli command:
eda_cli input.csv
Or, with some of the optional arguments set:
eda_cli input.csv -o output.docx -c cyan --title 'EDA Report'
For more details on the optional arguments, pass the -h or --help flag to view the help message:
eda_cli -h
usage: eda_cli [-h] [-o OUTFILE] [-t TITLE] [-c COLOR] [-T TARGET] infile

Get a basic EDA report in docx format.

positional arguments:
  infile                A .csv or .xlsx file to process.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        The output file (default: eda-report.docx)
  -t TITLE, --title TITLE
                        The top level heading in the report (default:
                        Exploratory Data Analysis Report)
  -c COLOR, --color COLOR
                        A valid matplotlib color specifier (default:
                        orangered)
  -T TARGET, --target TARGET
                        The target variable (dependent feature), used to
                        color-code plotted values. An integer value is treated
                        as a column index, whereas a string is treated as a
                        column label. (Default: None)
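To make the option handling concrete, here is a sketch of an argparse parser that mirrors the interface shown in the help message. It is a reconstruction from that help text, not the package's source. The names build_parser and parse_target are hypothetical, and the integer-versus-string rule for the target is implemented in the simplest way that fits the description.

```python
import argparse


def parse_target(value: str):
    """Hypothetical helper: '4' becomes a column index, 'species' stays a label."""
    try:
        return int(value)
    except ValueError:
        return value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same options and defaults as the help message."""
    parser = argparse.ArgumentParser(
        prog="eda_cli", description="Get a basic EDA report in docx format."
    )
    parser.add_argument("infile", help="A .csv or .xlsx file to process.")
    parser.add_argument("-o", "--outfile", default="eda-report.docx",
                        help="The output file")
    parser.add_argument("-t", "--title",
                        default="Exploratory Data Analysis Report",
                        help="The top level heading in the report")
    parser.add_argument("-c", "--color", default="orangered",
                        help="A valid matplotlib color specifier")
    parser.add_argument("-T", "--target", type=parse_target, default=None,
                        help="The target variable (column index or label)")
    return parser


# Parse the example command line from this section without running any analysis.
args = build_parser().parse_args(
    ["input.csv", "-o", "output.docx", "-c", "cyan", "--title", "EDA Report"]
)
print(args.infile, args.outfile, args.color, args.title, args.target)
```

Passing -T 4 to this sketch would yield the integer 4, while -T species would yield the string 'species', which is the distinction the help text draws between a column index and a column label.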
Visit the official documentation for more details.