Library to help accelerate Exploratory Data Analysis (EDA)
Project description
edaSol
A Python library to accelerate Exploratory Data Analysis (EDA).
edaSol provides simple, intuitive functions to quickly understand your dataset's structure, quality, and distributions without writing repetitive boilerplate code.
Features
- Quick Data Summary - Get data types, null counts, and unique values at a glance
- Data Quality Reports - Comprehensive analysis of missing values, duplicates, and memory usage
- Outlier Detection - IQR-based outlier identification
- Categorical Analysis - Statistics for categorical columns
- Visualization Suite - Distribution plots, correlation heatmaps, boxplots, and more
Installation
pip install edaSol
Or install from source:
git clone https://github.com/SoloWPM23/edaSol.git
cd edaSol
pip install -e .
Quick Start
import pandas as pd
from edaSol import quick_summary, data_quality_report, plot_numerical_dist
# Load your data
df = pd.read_csv('your_data.csv')
# Get a quick summary of all columns
summary = quick_summary(df)
print(summary)
# Generate a comprehensive quality report
report = data_quality_report(df)
print(f"Shape: {report['shape']}")
print(f"Duplicates: {report['duplicates']}")
print(f"Memory: {report['memory_usage']} MB")
# Visualize numeric distributions
plot_numerical_dist(df)
API Reference
Core Functions
quick_summary(df, columns=None)
Returns a summary DataFrame with key statistics for each column.
from edaSol import quick_summary
summary = quick_summary(df)
# Returns: Data Type, Null Count, Null Percent, Unique Count, Sample Value
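To make the five output columns concrete, here is a minimal pandas sketch that builds a comparable table. It is an illustration under assumptions, not edaSol's actual implementation: the helper name `sketch_quick_summary`, the rounding, and the choice of the first non-null value as the sample are all hypothetical.

```python
import pandas as pd


def sketch_quick_summary(df, columns=None):
    """Hypothetical stand-in for quick_summary, built from plain pandas calls."""
    cols = list(df.columns) if columns is None else list(columns)
    rows = []
    for col in cols:
        series = df[col]
        non_null = series.dropna()
        rows.append({
            "Data Type": str(series.dtype),
            "Null Count": int(series.isna().sum()),
            "Null Percent": round(float(series.isna().mean()) * 100, 2),
            "Unique Count": int(series.nunique()),
            # Assumption: the sample is the first non-null value, if any.
            "Sample Value": non_null.iloc[0] if len(non_null) else None,
        })
    return pd.DataFrame(rows, index=cols)


demo = pd.DataFrame({
    "age": [31, None, 45, 31],
    "city": ["Oslo", "Lima", None, "Oslo"],
})
summary_sketch = sketch_quick_summary(demo)
print(summary_sketch)
```

Each row of the result describes one column of the input, which is why the table stays readable even for wide datasets.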
detect_outliers_iqr(df, column, return_bounds=False)
Detects outliers using the IQR (Interquartile Range) method.
from edaSol import detect_outliers_iqr
# Get outlier indices
outliers = detect_outliers_iqr(df, 'price')
# Get outliers with bounds
outliers, lower, upper = detect_outliers_iqr(df, 'price', return_bounds=True)
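For context, the IQR method conventionally flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The sketch below shows that rule in plain pandas. The 1.5 multiplier is the common Tukey convention and an assumption here, since the text above does not state which multiplier edaSol uses; the helper name `sketch_iqr_outliers` and its `k` parameter are hypothetical.

```python
import pandas as pd


def sketch_iqr_outliers(df, column, k=1.5):
    """Hypothetical helper: return outlier index labels plus the two fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    low = q1 - k * iqr   # lower fence
    high = q3 + k * iqr  # upper fence
    mask = (df[column] < low) | (df[column] > high)
    return df.index[mask.to_numpy()], low, high


prices = pd.DataFrame({"price": [10, 12, 11, 13, 12, 95]})
outlier_idx, low, high = sketch_iqr_outliers(prices, "price")
print(list(outlier_idx), low, high)
```

Because the fences are derived from quartiles rather than the mean and standard deviation, a single extreme value such as 95 does not widen them much.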
describe_categorical(df, columns=None)
Returns descriptive statistics for categorical columns.
from edaSol import describe_categorical
cat_stats = describe_categorical(df)
# Returns: Count, Unique, Top Value, Top Frequency, Top Percent
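The statistics listed above can be derived from `value_counts`. The following is a rough sketch, not the library's code: the helper name `sketch_describe_categorical` is hypothetical, and treating every non-numeric column as categorical is an assumption made for the default column selection.

```python
import pandas as pd


def sketch_describe_categorical(df, columns=None):
    """Hypothetical stand-in: count, cardinality, and mode statistics per column."""
    if columns is None:
        # Assumption: every non-numeric column counts as categorical.
        columns = df.select_dtypes(exclude="number").columns
    rows = {}
    for col in columns:
        counts = df[col].value_counts()  # NaN excluded, most frequent first
        total = int(counts.sum())
        rows[col] = {
            "Count": total,
            "Unique": int(counts.size),
            "Top Value": counts.index[0] if total else None,
            "Top Frequency": int(counts.iloc[0]) if total else 0,
            "Top Percent": round(float(counts.iloc[0]) / total * 100, 2) if total else 0.0,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


items = pd.DataFrame({
    "color": ["red", "blue", "red", None, "red"],
    "size": ["S", "M", "M", "L", "M"],
    "qty": [1, 2, 3, 4, 5],
})
cat_sketch = sketch_describe_categorical(items)
print(cat_sketch)
```

In this sketch the percentage is taken over non-null values only, so a column with many missing entries can still show a high top percent.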
detect_duplicates(df, subset=None, keep='first')
Finds and returns duplicate rows.
from edaSol import detect_duplicates
duplicates = detect_duplicates(df)
duplicates_by_cols = detect_duplicates(df, subset=['name', 'email'])
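The `subset` and `keep` parameters have the same names as those of pandas' own `DataFrame.duplicated`. Assuming they behave the same way (the text above does not confirm this), the pandas-only sketch below shows how each setting changes which rows are flagged. The sample data is made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Ana"],
    "email": ["ana@example.com", "ben@example.com", "ana@example.com", "ana@work.example"],
    "visits": [3, 1, 7, 2],
})

# keep="first" leaves the first occurrence unmarked and flags later repeats.
later_repeats = people[people.duplicated(subset=["name", "email"], keep="first")]

# keep=False flags every row that has at least one twin.
all_copies = people[people.duplicated(subset=["name", "email"], keep=False)]

# With no subset, all columns must match, so nothing is flagged here.
full_row_dupes = people[people.duplicated()]

print(later_repeats.index.tolist(), all_copies.index.tolist(), len(full_row_dupes))
```

Choosing a subset matters in practice: rows 0 and 2 share a name and email but differ in `visits`, so they only count as duplicates when the comparison is restricted to the identifying columns.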
data_quality_report(df)
Generates a comprehensive data quality report.
from edaSol import data_quality_report
report = data_quality_report(df)
# Returns dict with: shape, memory_usage, dtypes, missing, duplicates,
# numeric_summary, categorical_summary
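As a rough picture of what such a report contains, here is a reduced sketch covering five of the listed keys with plain pandas. It is not edaSol's implementation: the helper name `sketch_quality_report` is hypothetical, and reporting memory in MB is an assumption based on the Quick Start print statement.

```python
import pandas as pd


def sketch_quality_report(df):
    """Hypothetical, reduced version of a quality report using plain pandas."""
    return {
        "shape": df.shape,
        # Assumption: memory is reported in MB (bytes / 1024**2).
        "memory_usage": round(float(df.memory_usage(deep=True).sum()) / 1024 ** 2, 4),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }


orders = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "amount": [9.5, 4.0, 4.0, None],
})
report_sketch = sketch_quality_report(orders)
print(report_sketch)
```

Returning a plain dict keeps the report easy to log or serialize, at the cost of losing the tabular display a DataFrame would give.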
Visualization Functions
plot_numerical_dist(df, columns=None, figsize=(12, 4), show=True)
Creates a histogram with a KDE curve for each numeric column.
from edaSol import plot_numerical_dist
plot_numerical_dist(df)
plot_numerical_dist(df, columns=['age', 'salary'])
plot_correlation_heatmap(df, figsize=(10, 8), annot=True, mask_upper=True, show=True)
Creates a correlation heatmap for numeric columns.
from edaSol import plot_correlation_heatmap
plot_correlation_heatmap(df)
plot_correlation_heatmap(df, mask_upper=False) # Show full matrix
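A correlation matrix is symmetric, so hiding one triangle removes redundant cells. The sketch below shows one common way to build such a mask with NumPy; it is an assumption about what `mask_upper=True` does, not a description of edaSol's internals. Hiding the diagonal as well is also an assumption (passing `k=1` to `np.triu` would keep it visible).

```python
import numpy as np
import pandas as pd

metrics = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 58, 69, 80],
    "score": [9, 7, 4, 2],
})
corr = metrics.corr()

# True marks the cells to hide: the upper triangle, diagonal included.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool))

# seaborn.heatmap accepts a boolean array like this through its `mask`
# argument; here the hidden cells are simply blanked to NaN for inspection.
lower_triangle = corr.mask(upper_mask)
print(lower_triangle)
```

Only the three cells below the diagonal remain, each showing one unique pair of columns.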
plot_missing_matrix(df, figsize=(12, 6), show=True)
Visualizes missing values as a heatmap matrix.
from edaSol import plot_missing_matrix
plot_missing_matrix(df)
plot_categorical_dist(df, columns=None, figsize=(12, 4), top_n=10, show=True)
Creates bar plots for categorical columns.
from edaSol import plot_categorical_dist
plot_categorical_dist(df)
plot_categorical_dist(df, top_n=5) # Show only top 5 categories
plot_boxplots(df, columns=None, figsize=(12, 4), show=True)
Creates boxplots to visualize distributions and outliers.
from edaSol import plot_boxplots
plot_boxplots(df)
plot_boxplots(df, columns=['age', 'income'])
plot_pairplot(df, columns=None, hue=None, diag_kind='kde', show=True)
Creates pairwise scatter plots for numeric columns.
from edaSol import plot_pairplot
plot_pairplot(df)
plot_pairplot(df, hue='category') # Color by category
Requirements
- Python >= 3.8
- pandas >= 1.5.0
- numpy >= 1.20.0
- matplotlib >= 3.5.0
- seaborn >= 0.11.0
License
MIT License
Author
Solo Manurung (solowandika490@gmail.com)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file edasol-0.1.0.tar.gz.
File metadata
- Download URL: edasol-0.1.0.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 782f12ae17db45fdd605060bcf3cdb180761d33fe537b1deca7bd58e6b01a5f2 |
| MD5 | 12eb80a3da7f85d8957683851eb3d062 |
| BLAKE2b-256 | 46ee3eae050be4b4ec515a74d8ba478a0ce27efb907c37fe7eb680d4e0db4bbd |
File details
Details for the file edasol-0.1.0-py3-none-any.whl.
File metadata
- Download URL: edasol-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 164cecf238230095dd578c4eb8096752bdf985111b6cd58f879820413111dc01 |
| MD5 | 59fa418f17701695f031652758b606b7 |
| BLAKE2b-256 | 691447acf6507de3762d26713b6da53aa086f50848bd8ee997a30151516d0400 |