A simple program to get a basic EDA report in .docx format.
Automated Exploratory Data Analysis
A simple Python program to help automate the EDA process.
The data is analysed using pandas' built-in methods, and graphs are plotted using matplotlib & seaborn. The results are then packaged as a .docx file using python-docx.
Installation
You can install the package from PyPI using:
pip install eda-report
Basic Usage
1. Interactive Mode
You can obtain a summary for a single feature (univariate) using:
>>> from eda_report.univariate import Variable
>>> x = Variable(range(50), name='1 to 50')
>>> x
Overview
========
Name: 1 to 50,
Type: numeric,
Unique Values: 50 -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, [...],
Missing Values: None
Summary Statistics
==================
                         1 to 50
Number of observations  50.00000
Average                 24.50000
Standard Deviation      14.57738
Minimum                  0.00000
Lower Quartile          12.25000
Median                  24.50000
Upper Quartile          36.75000
Maximum                 49.00000
Skewness                 0.00000
Kurtosis                -1.20000
>>> x.show_graphs()
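The figures in the summary above can be cross-checked without installing the package. The sketch below is not how eda_report computes them (the project description says it relies on pandas); it is a minimal standard-library approximation, and the `summarize` helper is a hypothetical name introduced for illustration. It assumes the quartiles use linear interpolation over the inclusive data range, which is consistent with the values shown for `range(50)`.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a sequence of numbers.

    Hypothetical helper for illustration; not part of eda_report.
    """
    data = sorted(values)
    # method="inclusive" interpolates linearly between the min and max,
    # an assumption chosen because it matches the quartiles shown above.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Number of observations": len(data),
        "Average": statistics.mean(data),
        "Standard Deviation": statistics.stdev(data),  # sample std (n - 1)
        "Minimum": data[0],
        "Lower Quartile": q1,
        "Median": median,
        "Upper Quartile": q3,
        "Maximum": data[-1],
    }


summary = summarize(range(50))
for label, value in summary.items():
    print(f"{label:<24}{value:>10.5f}")
```

For `range(50)` this gives an average of 24.5, a sample standard deviation of about 14.57738, and quartiles of 12.25, 24.5 and 36.75, in line with the output above. Skewness and kurtosis are left out to keep the sketch short.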
You can obtain statistics for a set of features (multivariate) using:
>>> import seaborn as sns
>>> from eda_report.multivariate import MultiVariable
>>> data = sns.load_dataset('iris')
>>> X = MultiVariable(data)
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00, 3.85it/s]
>>> X
Overview
========
Numeric features: sepal_length, sepal_width, petal_length, petal_width
Categorical features: species
Summary Statistics (Numeric features)
=====================================
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Summary Statistics (Categorical features)
=========================================
        species
count       150
unique        3
top      setosa
freq         50
Bivariate Analysis (Correlation)
================================
sepal_length & petal_width --> strong positive correlation (0.82)
sepal_width & petal_width --> weak negative correlation (-0.37)
sepal_length & sepal_width --> very weak negative correlation (-0.12)
sepal_length & petal_length --> strong positive correlation (0.87)
sepal_width & petal_length --> weak negative correlation (-0.43)
petal_length & petal_width --> very strong positive correlation (0.96)
>>> X.show_correlation_heatmap()
>>> # Generate a report document
>>> from eda_report import get_word_report
>>> get_word_report(data)
[INFO 10:56:50.241] Assessing correlation in numeric variables...
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00, 3.89it/s]
[INFO 10:56:53.851] Done. Summarising each variable...
Univariate analysis: 100%|███████████████████████████████████████████| 5/5 [00:01<00:00, 2.52it/s]
[INFO 10:56:56.007] Done. Results saved as 'eda-report.docx'
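The "Bivariate Analysis (Correlation)" listing above pairs every two numeric features and attaches a verbal label to each coefficient. The sketch below shows one way such labels could be produced. The cutoffs (0.3, 0.5, 0.7, 0.9) are assumptions chosen because they reproduce the six sample lines; they are not taken from the package's source, and `describe_correlation` is a hypothetical name.

```python
from itertools import combinations


def describe_correlation(r):
    """Turn a correlation coefficient into a short verbal description.

    Hypothetical helper; the cutoffs below are assumed, not eda_report's.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if r == 0:
        return "no correlation (0.00)"
    magnitude = abs(r)
    if magnitude >= 0.9:
        strength = "very strong"
    elif magnitude >= 0.7:
        strength = "strong"
    elif magnitude >= 0.5:
        strength = "moderate"
    elif magnitude >= 0.3:
        strength = "weak"
    else:
        strength = "very weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation ({r:.2f})"


# Four numeric features yield six unordered pairs, which matches the
# "6/6" shown by the bivariate progress bar.
numeric_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pairs = list(combinations(numeric_features, 2))

print(len(pairs))
print(describe_correlation(0.82))
print(describe_correlation(-0.37))
```

Any other set of cutoffs that places 0.43 in the "weak" band and 0.82 in the "strong" band would fit the sample output equally well.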
2. Graphical User Interface
Use the eda_report command to launch a graphical window to help select and analyse a csv/excel file:
eda_report
You will be prompted to set a report title, graph color and output filename, after which the contents of the input file will be analysed, and the results will be saved in .docx format.
3. Command Line Interface
To analyse a file named input.csv, just supply its path to the eda_cli command:
eda_cli input.csv
Or even:
eda_cli input.csv -o output.docx -c cyan --title 'EDA Report'
For more details on the optional arguments, pass the -h or --help flag to view the help message:
$ eda_cli -h
usage: eda_cli [-h] [-o OUTFILE] [-t TITLE] [-c COLOR] infile

Get a basic EDA report in docx format.

positional arguments:
  infile                A .csv or .xlsx file to process.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        The output file (default: eda-report.docx)
  -t TITLE, --title TITLE
                        The top level heading in the report (default: Exploratory Data Analysis Report)
  -c COLOR, --color COLOR
                        A valid matplotlib color specifier (default: orangered)
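The help message above fully describes the command's interface, so it can be mirrored with `argparse` when scripting around it. The sketch below is a reconstruction from that help text only, not the package's actual source; `build_parser` is a hypothetical name, and the option names, defaults and help strings are copied from the message.

```python
import argparse


def build_parser():
    """Build a parser mirroring the eda_cli help text (a reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="eda_cli",
        description="Get a basic EDA report in docx format.",
        # Appends "(default: ...)" to each option's help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("infile", help="A .csv or .xlsx file to process.")
    parser.add_argument(
        "-o", "--outfile", default="eda-report.docx", help="The output file"
    )
    parser.add_argument(
        "-t",
        "--title",
        default="Exploratory Data Analysis Report",
        help="The top level heading in the report",
    )
    parser.add_argument(
        "-c",
        "--color",
        default="orangered",
        help="A valid matplotlib color specifier",
    )
    return parser


# Parse the same arguments as the longer example invocation above.
args = build_parser().parse_args(
    ["input.csv", "-o", "output.docx", "-c", "cyan", "--title", "EDA Report"]
)
defaults = build_parser().parse_args(["input.csv"])
print(args)
print(defaults)
```

Parsing only `input.csv` falls back to the three defaults listed in the help message, which is why the shortest invocation needs nothing but the input path.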
Visit the official documentation for more details.
Hashes for eda_report-1.3.0rc0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 39c3b8faa218f8427c40ef94a9513737a960aaafc03f4a5a23b9784b2c8fbfc0
MD5 | b757010f7a85c26ac20ca682b77b17fe
BLAKE2b-256 | ec3d3c274598edd8cdaa09ff161cc524b6543f62fd91cfd989d0b2b7e048aec4