Comprehensive Exploratory Data Analysis Pipeline

EDAPipeline: Automated Exploratory Data Analysis Toolkit

EDAPipeline is a Python toolkit that automates the common steps of Exploratory Data Analysis (EDA). It helps data scientists, analysts, and engineers inspect data distributions, detect outliers, visualize relationships between features, and summarize a dataset with minimal code.

Key Features

  • Automatic Data Type Detection: Categorizes features into numerical, categorical, and datetime types.
  • Comprehensive Data Overview: Summarizes dataset shape, data types, missing values, and memory usage.
  • Missing Value Analysis: Visualizes and reports missing data, highlighting the columns that need attention.
  • Advanced Univariate Analysis:
    • Numerical Features: Statistical summaries, normality tests, histograms, KDE plots, box plots, and Q-Q plots.
    • Categorical Features: Counts, percentages, bar plots, and pie charts for clear categorical distribution analysis.
    • Datetime Features: Time-series component analysis, including trends across years, months, weekdays, and hourly distributions.
  • Correlation Analysis: Provides correlation heatmaps and pair plots to uncover relationships between numerical features.
  • Robust Bivariate Analysis: Detailed plots and analyses for numerical-numerical and numerical-categorical feature interactions.
  • Outlier Detection: Implements Z-score and Interquartile Range (IQR) methods to identify outliers effectively.
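
To make the first feature concrete, the snippet below shows one way columns can be grouped into numerical, categorical, and datetime buckets using plain pandas. It is a minimal sketch under stated assumptions, not EDAPipeline's own implementation: the helper name `split_columns_by_type` and the rule that every non-numeric, non-datetime column counts as categorical are illustrative choices.

```python
import pandas as pd


def split_columns_by_type(df: pd.DataFrame) -> dict[str, list[str]]:
    """Group column names into numerical, categorical, and datetime buckets.

    Hypothetical helper for illustration; EDAPipeline may use different rules.
    """
    numerical = df.select_dtypes(include="number").columns.tolist()
    datetime_cols = df.select_dtypes(include="datetime").columns.tolist()
    # Assumption: anything that is neither numeric nor datetime
    # (strings, categories, booleans) is treated as categorical.
    categorical = [
        col for col in df.columns
        if col not in numerical and col not in datetime_cols
    ]
    return {
        "numerical": numerical,
        "categorical": categorical,
        "datetime": datetime_cols,
    }


sample = pd.DataFrame({
    "price": [9.5, 12.0, 7.25],
    "units": [3, 1, 4],
    "region": ["north", "south", "north"],
    "sold_at": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"]),
})
column_types = split_columns_by_type(sample)
print(column_types)
```

Note that dates stored as plain strings would land in the categorical bucket here; they need to be parsed (for example with `pd.to_datetime`) before a dtype-based check can recognize them.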

Installation

Install EDAPipeline via pip:

pip install edapipeline

Usage

The following basic example sets up the pipeline and runs a complete EDA:

import pandas as pd
from edapipeline import EDAPipeline

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Initialize the pipeline
eda = EDAPipeline(df=df, target_col='target_variable')

# Run the complete EDA analysis
eda.run_complete_analysis(outlier_method='iqr')

Selective Analysis

You can also perform selective analyses based on your needs:

# Overview of dataset
eda.data_overview()

# Numerical feature analysis
eda.analyze_numerical_features()

# Categorical feature analysis
eda.analyze_categorical_features()

# Datetime analysis
eda.analyze_datetime_features()

# Correlation analysis
eda.correlation_analysis()
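
For orientation, the kind of per-column summary an overview step usually reports (data types and missing values) can be reproduced with plain pandas. This is a sketch for comparison only; `summarize_frame` is a hypothetical helper, and the actual output of `data_overview()` may differ.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage for each column.

    Hypothetical helper for illustration; not part of EDAPipeline.
    """
    missing = df.isna().sum()
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    })


sample = pd.DataFrame({
    "age": [34, None, 29, 41],
    "city": ["Oslo", "Lima", None, None],
})
summary = summarize_frame(sample)
print(sample.shape)   # rows and columns
print(summary)
```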

Advanced Customization

Thresholds and parameters can be adjusted on the pipeline instance before running an analysis:

eda.HIGH_CARDINALITY_THRESHOLD = 100
eda.TOP_N_CATEGORIES = 10
eda.detect_outliers(method='zscore', threshold=2.5)
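
To show what a Z-score threshold such as 2.5 means in practice, here is a minimal sketch of the two outlier rules named above, written with plain pandas. It is not EDAPipeline's implementation: the function names, the use of the population standard deviation (`ddof=0`), and the conventional IQR multiplier of 1.5 are assumptions made for illustration.

```python
import pandas as pd


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask that is True where |z-score| exceeds the threshold.

    Assumption: population standard deviation (ddof=0).
    """
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


readings = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40])

z_mask = zscore_outliers(readings, threshold=2.5)
iqr_mask = iqr_outliers(readings)

print(readings[z_mask].tolist())    # values flagged by the Z-score rule
print(readings[iqr_mask].tolist())  # values flagged by the IQR rule
```

Lowering the Z-score threshold flags more points. The IQR rule is based on quartiles rather than the mean and standard deviation, so it is less affected by the extreme values it is trying to find.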

Dependencies

  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn
  • SciPy

Install all dependencies with:

pip install numpy pandas matplotlib seaborn scipy

Contributions

Contributions and suggestions are welcome! Feel free to open issues or submit pull requests on GitHub.

License

EDAPipeline is open-source and available under the MIT License.


Explore your data with EDAPipeline and turn it into actionable insights.
