Automated EDA reports with quality scores, human-readable insights, and actionable recommendations

AutoEDA+

Automated Exploratory Data Analysis with quality scores, human-readable insights, and actionable preprocessing recommendations.

What is AutoEDA+?

AutoEDA+ is a Python library for exploring Pandas DataFrames and CSV files. It goes beyond raw statistics to provide:

Dataset Quality Score (0-100) with weighted breakdown
Per-Column Health Scores with specific checks and flags
Human-readable insights (e.g., "Salary is heavily right-skewed")
Actionable recommendations (e.g., "Impute Age using Median")

Installation

pip install autoeda-plus

Quick Start

import autoeda as ae
import pandas as pd

# Load your dataset
df = pd.read_csv("titanic.csv")

# Run the complete analysis pipeline
result = ae.analyze(df)

# View the overall quality score
print(result.quality.score)

Features Overview

1. Dataset Overview

Provides rows, columns, memory usage, missing cells, duplicate rows, and column type breakdown.
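The figures listed above can all be derived with plain pandas calls. The sketch below is not AutoEDA+'s implementation; `overview_sketch` and its dictionary keys are hypothetical names chosen for illustration.

```python
import pandas as pd

def overview_sketch(df: pd.DataFrame) -> dict:
    """Collect the overview figures using plain pandas calls."""
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        # deep=True counts the actual bytes held by string/object columns
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "missing_cells": int(df.isna().sum().sum()),
        # duplicated() marks every repeat of an earlier, fully identical row
        "duplicate_rows": int(df.duplicated().sum()),
        # Number of columns per pandas dtype name
        "column_types": df.dtypes.astype(str).value_counts().to_dict(),
    }

# Small hand-made frame: one missing cell, one repeated row
df = pd.DataFrame({
    "age": [22.0, None, 35.0, 22.0],
    "sex": ["m", "f", "f", "m"],
    "fare": [7.25, 71.28, 8.05, 7.25],
})
summary = overview_sketch(df)
print(summary)
```

The memory figure is reported in bytes here; formatting it as a string such as '83.7 KB' is left out of the sketch.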

2. Dataset Quality Score

An overall health score from 0-100 computed from 5 weighted dimensions:

Completeness (30%): Fraction of non-missing cells
Duplicates (20%): Fraction of non-duplicate rows
Outliers (20%): Inverse average outlier rate
Consistency (15%): No mismatched data types
Balance (15%): Category frequency balance
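One plausible reading of the weighted breakdown is a linear weighted average of five sub-scores, each on a 0-100 scale. The sketch below assumes that reading; the weights come from the list above, while `weighted_quality_score` and the example sub-scores are hypothetical and not taken from the library.

```python
# Weights taken from the list above; they sum to 1.0.
WEIGHTS = {
    "completeness": 0.30,
    "duplicates": 0.20,
    "outliers": 0.20,
    "consistency": 0.15,
    "balance": 0.15,
}

def weighted_quality_score(sub_scores: dict) -> float:
    """Combine sub-scores (each on a 0-100 scale) into one 0-100 score."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return round(total, 2)

# Hypothetical sub-scores, chosen only to illustrate the arithmetic:
# 0.30*90 + 0.20*100 + 0.20*80 + 0.15*100 + 0.15*60 = 87
example = {
    "completeness": 90.0,
    "duplicates": 100.0,
    "outliers": 80.0,
    "consistency": 100.0,
    "balance": 60.0,
}
score = weighted_quality_score(example)
print(score)
```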

3. Column Health Score

Every column receives a score (0-100) with individual checks:

Correct data type
Missing value evaluation
Outlier detection
Cardinality checks
Skewness detection

4. Human-Readable Insights

Transforms statistical metrics into accessible natural language.

Instead of: Skewness = 2.4
AutoEDA+ says: 'Salary' is highly right-skewed (skewness = 2.40). Most values are concentrated at the lower end with a long upper tail.
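A translation like the one above can be sketched as a small function that maps a skewness value to a sentence. The cut-offs of 0.5 and 1.0 are a common rule of thumb and an assumption here; the thresholds AutoEDA+ actually uses are not stated, and `describe_skew` is a hypothetical name.

```python
def describe_skew(column: str, skewness: float) -> str:
    """Turn a skewness value into a one-sentence description.

    The cut-offs (0.5 and 1.0) are an assumed rule of thumb,
    not thresholds confirmed for AutoEDA+.
    """
    magnitude = abs(skewness)
    if magnitude < 0.5:
        return f"'{column}' is approximately symmetric (skewness = {skewness:.2f})."
    strength = "moderately" if magnitude < 1.0 else "highly"
    direction = "right" if skewness > 0 else "left"
    return f"'{column}' is {strength} {direction}-skewed (skewness = {skewness:.2f})."

message = describe_skew("Salary", 2.4)
print(message)
```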

5. Smart Recommendations

Provides prioritized rule-based suggestions for data preprocessing.

[HIGH]   Investigate outliers in 'Income'
[MEDIUM] Impute 'Age' using Median
[MEDIUM] Convert 'JoinDate' to datetime
[LOW]    One-Hot Encode 'Gender'
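The prioritized ordering shown above can be reproduced with a stable sort over (priority, message) pairs. This is a minimal sketch of the ordering step only; `PRIORITY_ORDER` and `format_recommendations` are hypothetical names, and the rules that generate each recommendation are not covered.

```python
PRIORITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def format_recommendations(items):
    """Sort (priority, message) pairs by priority and format them as lines."""
    # sorted() is stable, so items with equal priority keep their input order.
    ordered = sorted(items, key=lambda item: PRIORITY_ORDER[item[0]])
    return [f"[{priority}] {message}" for priority, message in ordered]

# Same messages as the listing above, deliberately given out of order
recs = [
    ("LOW", "One-Hot Encode 'Gender'"),
    ("MEDIUM", "Impute 'Age' using Median"),
    ("HIGH", "Investigate outliers in 'Income'"),
    ("MEDIUM", "Convert 'JoinDate' to datetime"),
]
lines = format_recommendations(recs)
for line in lines:
    print(line)
```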

6. Dataset Comparison (Train vs Test)

Compare two datasets to detect Data Drift and Schema Mismatches:

Detects missing columns (e.g., target missing from test set).
Calculates numeric distribution drift using the Kolmogorov-Smirnov (KS) test.
Calculates categorical frequency shifts.
Visualizes differences with overlaid histograms and grouped bar charts.
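The two core checks above, missing columns and numeric drift, can be sketched with a set difference and SciPy's two-sample KS test (`scipy.stats.ks_2samp`). The helper names, the 0.05 significance level, and the synthetic data are assumptions for illustration, not AutoEDA+'s internals.

```python
import numpy as np
from scipy.stats import ks_2samp

def missing_columns(train_cols, test_cols):
    """Return columns present in train but absent from test, sorted by name."""
    return sorted(set(train_cols) - set(test_cols))

def numeric_drift(train_values, test_values, alpha=0.05):
    """Run a two-sample KS test; flag drift when the p-value is below alpha."""
    result = ks_2samp(train_values, test_values)
    return {
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "drifted": bool(result.pvalue < alpha),
    }

# Synthetic data: the test sample's mean is shifted up by two standard deviations.
rng = np.random.default_rng(0)
train_fare = rng.normal(loc=30.0, scale=10.0, size=500)
test_fare = rng.normal(loc=50.0, scale=10.0, size=500)

schema_gap = missing_columns(["Age", "Fare", "Survived"], ["Age", "Fare"])
drift = numeric_drift(train_fare, test_fare)
same = numeric_drift(train_fare, train_fare)  # identical samples: no drift
print(schema_gap, drift["drifted"], same["drifted"])
```

A small p-value only says the two samples are unlikely to share one distribution. With large samples even minor shifts become significant, so the KS statistic itself is worth reading alongside the flag.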

Complete Public API Reference

The calls below cover the full public API. The argument names (dataframe, column_name, and so on) are placeholders for your own objects.

import autoeda as ae

# Full Pipeline Analysis
result = ae.analyze(dataframe)

# Dataset Comparison (Train vs Test)
comp = ae.compare(train_dataframe, test_dataframe)

# Dataset Profiling & Quality
ae.overview(dataframe)
ae.quality(dataframe)
ae.column_health(dataframe)[column_name]

# Statistical DataFrames & Metrics
ae.missing(dataframe)
ae.duplicates(dataframe)
ae.dtypes(dataframe)
ae.statistics(dataframe)
ae.outliers(dataframe)
ae.correlation(dataframe)
ae.distribution(dataframe)

# Insights & Recommendations
ae.insights(dataframe)
ae.recommend(dataframe)

# Visualizations (Standard EDA Plots)
ae.histogram(dataframe, column_name)
ae.boxplot(dataframe, column_name)
ae.countplot(dataframe, column_name)
ae.heatmap(dataframe)
ae.missing_heatmap(dataframe)
ae.scatter(dataframe, x_column_name, y_column_name)

# Visualizations (Dataset Comparison Plots)
ae.compare_histogram(train_dataframe, test_dataframe, column_name)
ae.compare_countplot(train_dataframe, test_dataframe, column_name)

API Examples & Outputs

Dataset Profiling & Quality

ae.overview()
Returns a high-level summary profile of the dataset including shape and memory footprint.

In [1]: ae.overview(df)
Out[1]: DatasetProfile(n_rows=891, n_cols=12, memory='83.7 KB', ...)

ae.quality()
Calculates an overall 0-100 quality score from the five weighted dimensions: completeness, duplicates, outliers, consistency, and balance.

In [2]: ae.quality(df)
Out[2]: QualityResult(score=78.5, grade='C+', sub_scores=[...])

ae.column_health()
Returns health metrics (0-100 score) and flagged issues for every column; index the result by column name to inspect a single column.

In [3]: ae.column_health(df)['Age']
Out[3]: ColumnHealth(score=80.0, missing_pct=19.87, outlier_pct=0.0)

Statistical DataFrames

ae.missing()
Generates a pandas DataFrame listing the exact count and percentage of missing values per column.

In [4]: ae.missing(df).head(2)
Out[4]: 
       missing_count  missing_pct
Cabin            687        77.10
Age              177        19.87
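A table of this shape can be built directly in pandas, which helps when cross-checking the reported numbers. The sketch below is an assumption about how such a table could be computed, not the library's code; `missing_table` is a hypothetical name, and percentages are taken relative to the number of rows.

```python
import pandas as pd

def missing_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    table = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return table.sort_values("missing_count", ascending=False)

# Tiny hand-made frame: Cabin misses 3 of 4 values, Age misses 1 of 4
df = pd.DataFrame({
    "Cabin": [None, "C85", None, None],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
table = missing_table(df)
print(table)
```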

ae.duplicates()
Evaluates the dataset for completely identical rows and returns the count and percentage.

In [5]: ae.duplicates(df)
Out[5]: {'count': 0, 'percentage': 0.0}

ae.dtypes()
Returns a pandas DataFrame listing the pandas dtype of each column.

In [6]: ae.dtypes(df).head(2)
Out[6]: 
             type
PassengerId  int64
Survived     int64

ae.statistics()
Computes core descriptive statistics (mean, standard deviation, min, max) for all numeric columns.

In [7]: ae.statistics(df).head(2)
Out[7]: 
             mean        std   min   max
PassengerId  446.00   257.35     1   891
Survived       0.38     0.48     0     1

ae.outliers()
Detects statistical outliers in numeric columns using the IQR method and returns their frequencies.

In [8]: ae.outliers(df).head(2)
Out[8]: 
       outlier_count  outlier_pct
Fare             116        13.02
SibSp             46         5.16
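For reference, the IQR method flags values that fall outside the fences Q1 - k*IQR and Q3 + k*IQR. The sketch below uses the conventional multiplier k = 1.5, which is an assumption; the multiplier AutoEDA+ uses is not stated, and `iqr_outlier_mask` is a hypothetical helper.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Seven ordinary fares and one extreme value
fare = pd.Series([7.25, 8.05, 7.9, 8.5, 9.0, 7.75, 8.2, 512.33])
mask = iqr_outlier_mask(fare)
outlier_count = int(mask.sum())
outlier_pct = round(100 * outlier_count / len(fare), 2)
print(outlier_count, outlier_pct)
```

Because the fences are built from quartiles, a single extreme value barely moves them, which is the usual argument for IQR over mean-and-standard-deviation rules on skewed columns such as fares.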

ae.correlation()
Returns the Pearson correlation matrix of the numeric columns as a pandas DataFrame.

In [9]: ae.correlation(df).iloc[:2, :2]
Out[9]: 
             PassengerId  Survived
PassengerId     1.000000 -0.005007
Survived       -0.005007  1.000000

Insights & Recommendations

ae.insights()
Analyzes statistical anomalies and formulates human-readable text insights regarding the dataset.

In [10]: ae.insights(df)[0]
Out[10]: "Dataset has 891 rows and 12 columns."

ae.recommend()
Generates a prioritized list of actionable data cleaning steps (e.g. dropping columns, imputing values).

In [11]: ae.recommend(df)[0]
Out[11]: "[HIGH] Drop 'Cabin' due to excessive missing values (77.10%)."

Dataset Comparison (Data Drift)

ae.compare()
Cross-references two datasets to detect schema mismatches and distribution drift (Kolmogorov-Smirnov test).

In [12]: comp = ae.compare(train_df, test_df)
         print(comp.insights[0])
         print([c.column for c in comp.drifted_columns])
Out[12]: 
Drift detected in 'Fare' (KS p-value = 0.0031).
['Fare', 'Age']

Visualizations

When called in a Jupyter Notebook, these visualization functions render interactive Plotly charts (graph_objects.Figure objects) inline.

Standard EDA Plots

ae.histogram()
Displays the distribution of a numeric column with a Kernel Density Estimate (KDE) overlay.

In [13]: ae.histogram(df, 'Age')
Out[13]:

Histogram Example

ae.boxplot()
Visualizes data dispersion and isolates statistical outliers for a specified numeric column.

In [14]: ae.boxplot(df, 'Fare')
Out[14]:

Boxplot Example

ae.heatmap()
Renders an interactive correlation matrix heatmap connecting all numeric variables.

In [15]: ae.heatmap(df)
Out[15]:

Heatmap Example

ae.missing_heatmap()
Generates a visual nullity matrix indicating precisely where missing values occur across rows.

In [16]: ae.missing_heatmap(df)
Out[16]:

Missing Heatmap Example

Dataset Comparison Plots

ae.compare_histogram()
Overlays two distributions to visually compare a numeric column between a training and testing set.

In [17]: ae.compare_histogram(train_df, test_df, 'Fare')
Out[17]:

Compare Histogram Example

ae.compare_countplot()
Aligns categorical frequencies side-by-side to visually compare classifications across two datasets.

In [18]: ae.compare_countplot(train_df, test_df, 'Pclass')
Out[18]:

Compare Countplot Example

Dependencies

Package	Version	Purpose
pandas	≥ 1.3	DataFrame operations
numpy	≥ 1.21	Numerical computations
scipy	≥ 1.7	KDE, skewness, kurtosis
plotly	≥ 5.0	Interactive Visualizations

License

MIT License — see LICENSE.

This version

1.0.0

Jun 26, 2026
