A Python package for exploratory data analysis workflows with universal dark mode compatibility
Project description
🎉 What's New in v0.18.0
⭐ NEW FEATURE: Automated Profiling Report!
Generate comprehensive EDA reports with a single function call - similar to ydata-profiling's ProfileReport!
import edaflow
# Generate complete automated profiling report
report = edaflow.profile_report(df)
# Creates 'eda_report_YYYYMMDD_HHMMSS.html' with:
# - Dataset overview (rows, columns, memory, duplicates)
# - Missing value analysis
# - Numerical statistics (mean, std, quartiles)
# - Categorical insights (frequency distributions)
# - Visualizations (histograms, correlation heatmap)
# Or get dictionary for programmatic access
report_dict = edaflow.profile_report(df, output_format="dict")
Key Features:
- 📊 Complete Dataset Overview: Rows, columns, memory usage, duplicates, missing cells
- 📈 Numerical Analysis: Mean, std, quartiles, min/max, missing values, zero counts
- 🏷️ Categorical Insights: Top N columns by unique count with frequency distributions
- 📊 Visualizations: Histograms for numeric columns, correlation heatmap
- 💾 Flexible Output: HTML file for reporting or dictionary for automation
- ⚡ Fast & Reliable: 91% test coverage, defensive programming, comprehensive validation
Why Upgrade? Get instant, comprehensive EDA reports without writing repetitive analysis code. Perfect for quick data assessment, automated reporting, and reproducible analysis workflows.
See the full documentation at edaflow.readthedocs.io
Previous Release: v0.17.1
Notebook Fixes & Beginner Experience:
- Fixed confusion matrix example in classification and advanced workflow notebooks to match API signature
- Audited all example notebooks for beginner-friendliness and error-free execution
- All notebooks now run without unnecessary errors for new users
Major Documentation Overhaul for Education:
- Added a dedicated Learning Path for new and aspiring data scientists
- Consolidated ML workflow steps into a single, copy-paste-safe guide
- Expanded examples: classification, regression, and computer vision
- Improved navigation: clear table of contents, user guide, API reference, and best practices
- Advanced features and troubleshooting tips for power users
Why Upgrade? This release makes edaflow best-in-class for educational value, with a structured progression for learners and educators. All documentation is now easier to follow, with practical code and hands-on exercises.
See the full documentation at edaflow.readthedocs.io
edaflow
Quick Navigation: 📚 Documentation | 📦 PyPI Package | 🚀 Quick Start | 📝 Changelog | 🐛 Issues
A Python package for streamlined exploratory data analysis workflows.
📦 Current Version: v0.18.0 - Latest Release adds automated profiling with profile_report() for instant comprehensive EDA reports. Updated: December 1, 2025
📋 Table of Contents
- Description
- 🚨 Critical Fixes in v0.15.0
- ✨ What's New
- Features
- 📈 Recent Updates
- 📚 Documentation
- Installation
- Quick Start
- 📝 Changelog
- Support
- Roadmap
Description
edaflow is designed to simplify and accelerate the exploratory data analysis (EDA) process by providing a collection of tools and utilities for data scientists and analysts. The package integrates popular data science libraries to create a cohesive workflow for data exploration, visualization, and preprocessing.
🚨 What's New in v0.15.1
NEW: setup_ml_experiment now supports a primary_metric argument, making metric selection robust and error-free for all ML workflows. All documentation, tests, and downstream code are updated for consistency. A new test ensures the metric is set and accessible throughout the workflow.
Upgrade recommended for all users who want reliable, copy-paste-safe ML workflows with dynamic metric selection.
🚨 Critical Fixes in v0.15.0
(Previous release)
🎯 Issues Resolved:
- ✅ FIXED: RandomForestClassifier instance is not fitted yet errors
- ✅ FIXED: TypeError: unexpected keyword argument errors
- ✅ FIXED: Missing imports and undefined variables in examples
- ✅ FIXED: Duplicate step numbering in documentation
- ✅ RESULT: All ML workflow examples now work perfectly!
🎉 What This Means For You:
- 📋 Copy-paste examples that work immediately
- 🎯 No more confusing error messages
- 📚 Complete, beginner-friendly documentation
- 🚀 Smooth learning experience for new users
Upgrade recommended for all users following ML workflow documentation.
✨ What's New
🚨 Critical ML Documentation Fixes (v0.15.0)
MAJOR DOCUMENTATION UPDATE: Fixed critical issues that were causing user errors when following ML workflow examples.
Problems Resolved:
- ✅ Model Fitting: Added missing model.fit() steps that were causing "not fitted" errors
- ✅ Function Parameters: Fixed incorrect parameter names in all examples
- ✅ Missing Context: Added imports and data preparation context
- ✅ Step Numbering: Corrected duplicate step numbers in documentation
- ✅ Enhanced Warnings: Added prominent warnings about critical requirements
Result: All ML workflow documentation now works perfectly out-of-the-box!
🎯 Enhanced rank_models Function (v0.14.x)
DUAL RETURN FORMAT SUPPORT: Major enhancement based on user requests.
# Both formats now supported:
df_results = ml.rank_models(results, 'accuracy') # DataFrame (default)
list_results = ml.rank_models(results, 'accuracy', return_format='list') # List of dicts
# User-requested pattern now works:
best_model = ml.rank_models(results, 'accuracy', return_format='list')[0]["model_name"]
🚀 ML Expansion (v0.13.0+)
COMPLETE MACHINE LEARNING SUBPACKAGE: Extended edaflow into full ML workflows.
New ML Modules Added:
- ml.config: ML experiment setup and data validation
- ml.leaderboard: Multi-model comparison and ranking
- ml.tuning: Advanced hyperparameter optimization
- ml.curves: Learning curves and performance visualization
- ml.artifacts: Model persistence and experiment tracking
Key ML Features:
# Complete ML workflow in one package
import edaflow.ml as ml
# Setup experiment with flexible parameter support
# Both calling patterns work:
experiment = ml.setup_ml_experiment(df, 'target') # DataFrame style
# OR
experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15) # sklearn style
# Compare multiple models
results = ml.compare_models(models, **experiment)
# Optimize hyperparameters with multiple strategies
best_model = ml.optimize_hyperparameters(model, params, **experiment)
# Generate comprehensive visualizations
ml.plot_learning_curves(model, **experiment)
Previous: API Improvement (v0.12.33)
NEW CLEAN APIs: Introduced consistent, user-friendly encoding functions that eliminate confusion and crashes.
Root Cause Solved: The inconsistent return type of apply_smart_encoding() (sometimes DataFrame, sometimes tuple) was causing AttributeError crashes and user confusion.
New Functions Added:
# ✅ NEW: Clean, consistent DataFrame return (RECOMMENDED)
df_encoded = edaflow.apply_encoding(df) # Always returns DataFrame
# ✅ NEW: Explicit tuple return when encoders needed
df_encoded, encoders = edaflow.apply_encoding_with_encoders(df) # Always returns tuple
# ⚠️ DEPRECATED: Inconsistent behavior (still works with warnings)
df_encoded = edaflow.apply_smart_encoding(df, return_encoders=True) # Sometimes tuple!
Benefits:
- 🎯 Zero Breaking Changes: All existing workflows continue working exactly the same
- 🛡️ Better Error Messages: Helpful guidance when mistakes are made
- 🔄 Migration Path: Multiple options for users who want cleaner APIs
- 📚 Clear Documentation: Explicit examples showing best practices
🐛 Critical Input Validation Fix (v0.12.32)
RESOLVED: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions when apply_smart_encoding(..., return_encoders=True) result is used incorrectly.
Problem Solved: Users who passed the tuple result from apply_smart_encoding directly to visualization functions without unpacking were experiencing crashes in step 14 of EDA workflows.
Enhanced Error Messages: Added intelligent input validation with helpful error messages guiding users to the correct usage pattern:
# ❌ WRONG - This causes the AttributeError:
df_encoded = edaflow.apply_smart_encoding(df, return_encoders=True) # Returns (df, encoders) tuple!
edaflow.visualize_scatter_matrix(df_encoded) # Crashes with AttributeError
# ✅ CORRECT - Unpack the tuple:
df_encoded, encoders = edaflow.apply_smart_encoding(df, return_encoders=True)
edaflow.visualize_scatter_matrix(df_encoded) # Works correctly
🎨 BREAKTHROUGH: Universal Dark Mode Compatibility (v0.12.30)
- NEW FUNCTION: optimize_display() - The FIRST EDA library with universal notebook compatibility!
- Universal Platform Support: Improved visibility across Google Colab, JupyterLab, VS Code, and Classic Jupyter
- Automatic Detection: Zero configuration needed - automatically detects your environment
- Accessibility Support: Built-in high contrast mode for improved accessibility
- One-Line Solution: edaflow.optimize_display() fixes all visibility issues instantly
🐛 Critical KeyError Hotfix (v0.12.31)
- Fixed KeyError: Resolved "KeyError: 'type'" in summarize_eda_insights() function
- Enhanced Error Handling: Added robust exception handling for target analysis edge cases
- Improved Stability: Function now handles missing or invalid target columns gracefully
🌍 Platform Benefits:
- ✅ Google Colab: Auto light/dark mode detection with improved text visibility
- ✅ JupyterLab: Dark theme compatibility with custom theme support
- ✅ VS Code: Native theme integration with seamless notebook experience
- ✅ Classic Jupyter: Full compatibility with enhanced readability options
import edaflow
# ⭐ NEW: Improved visibility everywhere!
edaflow.optimize_display() # Universal dark mode fix!
# All functions now display beautifully
edaflow.check_null_columns(df)
edaflow.visualize_histograms(df)
✨ NEW FUNCTION: summarize_eda_insights() (Added in v0.12.28)
- Comprehensive Analysis: Generate complete EDA insights and actionable recommendations after completing your analysis workflow
- Smart Recommendations: Provides intelligent next steps for modeling, preprocessing, and data quality improvements
- Target-Aware Analysis: Supports both classification and regression scenarios with specific insights
- Function Tracking: Knows which edaflow functions you've already used in your workflow
- Structured Output: Returns organized dictionary with dataset overview, data quality assessment, and recommendations
🎨 Display Formatting Excellence
- Enhanced Visual Experience: Refined Rich console styling with optimized panel borders and alignment
- Google Colab Optimized: Improved display formatting specifically tailored for notebook environments
- Consistent Design: Professional rounded borders, proper width constraints, and refined color schemes
- Universal Compatibility: Beautiful output rendering across all major Python environments and notebooks
🔧 Recent Fixes (v0.12.24-0.12.26)
- LBP Warning Resolution: Fixed scikit-image UserWarning in texture analysis functions
- Parameter Documentation: Corrected analyze_image_features documentation mismatches
- RTD Synchronization: Updated Read the Docs changelog with all recent improvements
🌈 Rich Styling (v0.12.20-0.12.21)
- Vibrant Output: ALL major EDA functions now feature professional, color-coded styling
- Smart Indicators: Color-coded severity levels (✅ CLEAN, ⚠️ WARNING, 🚨 CRITICAL)
- Professional Tables: Beautiful formatted output with rich library integration
- Actionable Insights: Context-aware recommendations and visual status indicators
Features
🔍 Exploratory Data Analysis
- Missing Data Analysis: Color-coded analysis of null values with customizable thresholds
- Categorical Data Insights: 🐛 FIXED in v0.12.29 Identify object columns that might be numeric, detect data type issues (now handles unhashable types)
- Automatic Data Type Conversion: Smart conversion of object columns to numeric when appropriate
- Categorical Values Visualization: Detailed exploration of categorical column values with insights
- Column Type Classification: Simple categorization of DataFrame columns into categorical and numerical types
- Data Type Detection: Smart analysis to flag potential data conversion needs
- EDA Insights Summary: ⭐ NEW in v0.12.28 Comprehensive EDA insights and actionable recommendations after completing analysis workflow
📊 Advanced Visualizations
- Numerical Distribution Visualization: Advanced boxplot analysis with outlier detection and statistical summaries
- Interactive Boxplot Visualization: Interactive Plotly Express boxplots with zoom, hover, and statistical tooltips
- Comprehensive Heatmap Visualizations: Correlation matrices, missing data patterns, values heatmaps, and cross-tabulations
- Statistical Histogram Analysis: Advanced histogram visualization with skewness detection, normality testing, and distribution analysis
- Scatter Matrix Analysis: Advanced pairwise relationship visualization with customizable matrix layouts, regression lines, and statistical insights
🤖 Machine Learning Preprocessing ⭐ Introduced in v0.12.0
- Intelligent Encoding Analysis: Automatic detection of optimal encoding strategies for categorical variables
- Smart Encoding Application: Automated categorical encoding with support for:
- One-Hot Encoding for low cardinality categories
- Target Encoding for high cardinality with target correlation
- Ordinal Encoding for ordinal relationships
- Binary Encoding for medium cardinality
- Text Vectorization (TF-IDF) for text features
- Leave Unchanged for numeric columns
- Memory-Efficient Processing: Intelligent handling of high-cardinality features to prevent memory issues
- Comprehensive Encoding Pipeline: End-to-end preprocessing solution for ML model preparation
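As a rough illustration of how cardinality-based strategy selection works, here is a standalone sketch. It is not edaflow's actual implementation; the thresholds, strategy names, and the suggest_encoding helper are made up for the example:

```python
import pandas as pd

def suggest_encoding(df, low_card_max=10, medium_card_max=50):
    """Toy strategy picker mirroring the rules above (thresholds are illustrative)."""
    plan = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            plan[col] = "leave_unchanged"
            continue
        n_unique = df[col].nunique()
        if n_unique <= low_card_max:
            plan[col] = "one_hot"          # low cardinality
        elif n_unique <= medium_card_max:
            plan[col] = "binary"           # medium cardinality
        else:
            plan[col] = "target_or_tfidf"  # high cardinality / text
    return plan

df = pd.DataFrame({
    "color": ["red", "blue", "red"],   # low cardinality -> one-hot
    "user_id": ["u1", "u2", "u3"],     # unique per row -> high cardinality
    "price": [9.99, 19.99, 4.99],      # numeric -> left unchanged
})
plan = suggest_encoding(df, low_card_max=2, medium_card_max=2)
# Apply one-hot encoding only where the plan selected it
encoded = pd.get_dummies(df, columns=[c for c, s in plan.items() if s == "one_hot"])
```

The real pipeline additionally considers target correlation and ordinal relationships, but the cardinality split above is the core idea.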
🤖 Machine Learning Workflows ⭐ NEW in v0.13.0
The powerful edaflow.ml subpackage provides comprehensive machine learning workflow capabilities:
ML Experiment Setup (ml.config)
- Smart Data Validation: Automatic data quality assessment and problem type detection
- Intelligent Data Splitting: Train/validation/test splits with stratification support
- ML Pipeline Configuration: Automated preprocessing pipeline setup for ML workflows
Model Comparison & Ranking (ml.leaderboard)
- Multi-Model Evaluation: Compare multiple models with comprehensive metrics
- Smart Leaderboards: Automatically rank models by performance with visual displays
- Export Capabilities: Save comparison results for reporting and analysis
Hyperparameter Optimization (ml.tuning)
- Multiple Search Strategies: Grid search, random search, and Bayesian optimization
- Cross-Validation Integration: Built-in CV with customizable scoring metrics
- Parallel Processing: Multi-core hyperparameter optimization for faster results
Learning & Performance Curves (ml.curves)
- Learning Curves: Visualize model performance vs training size
- Validation Curves: Analyze hyperparameter impact on model performance
- ROC & Precision-Recall Curves: Comprehensive classification performance analysis
- Feature Importance: Visual analysis of model feature contributions
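For reference, the computation underlying a learning curve can be reproduced with plain scikit-learn. This sketch is not how ml.plot_learning_curves is implemented internally; it only shows the data the plot is built from:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Small synthetic classification problem
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Score the model at four increasing training-set sizes with 3-fold CV
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=3, scoring="accuracy",
)

# Averaging across folds yields the two curves that would be plotted
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```

A widening gap between train_mean and val_mean as sizes grow is the classic overfitting signature these plots are used to spot.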
Model Persistence & Tracking (ml.artifacts)
- Complete Model Artifacts: Save models, configs, and metadata
- Experiment Tracking: Track multiple experiments with organized storage
- Model Reports: Generate comprehensive model performance reports
- Version Management: Organized model versioning and retrieval
Quick ML Example:
import edaflow.ml as ml
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Setup ML experiment - Multiple parameter patterns supported
# Method 1: DataFrame + target column (recommended)
experiment = ml.setup_ml_experiment(df, target_column='target')
# Method 2: sklearn-style (also supported)
X = df.drop('target', axis=1)
y = df['target']
experiment = ml.setup_ml_experiment(
X=X, y=y,
test_size=0.2,
val_size=0.15, # Alternative to validation_size
experiment_name="my_ml_project",
stratify=True,
random_state=42
)
# Compare multiple models
models = {
'RandomForest': RandomForestClassifier(),
'LogisticRegression': LogisticRegression()
}
comparison = ml.compare_models(models, **experiment)
# Rank models with flexible access patterns
# Method 1: Easy dictionary access (recommended for getting best model)
best_model_name = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']
# Method 2: Traditional DataFrame format
ranked_df = ml.rank_models(comparison, 'accuracy')
best_model_traditional = ranked_df.iloc[0]['model']
# Both methods give the same result
print(f"Best model: {best_model_name}") # Easy access
print(f"Best model: {best_model_traditional}") # Traditional access
# Optimize hyperparameters
# --- Copy-paste-safe hyperparameter optimization example ---
model_name = 'LogisticRegression'  # or 'RandomForest' or 'GradientBoosting'
if model_name == 'RandomForest':
    param_distributions = {
        'n_estimators': [100, 200, 300],
        'max_depth': [5, 10, 15, None],
        'min_samples_split': [2, 5, 10]
    }
    model = RandomForestClassifier()
    method = 'grid'
elif model_name == 'GradientBoosting':
    param_distributions = {
        'n_estimators': (50, 200),
        'learning_rate': (0.01, 0.3),
        'max_depth': (3, 8)
    }
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier()
    method = 'bayesian'
elif model_name == 'LogisticRegression':
    param_distributions = {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],  # limited to penalties both solvers below accept
        'solver': ['liblinear', 'saga']
    }
    model = LogisticRegression(max_iter=1000)
    method = 'grid'
else:
    raise ValueError(f"Unknown model_name: {model_name}")
results = ml.optimize_hyperparameters(
    model,
    param_distributions=param_distributions,
    **experiment
)
# Generate learning curves
ml.plot_learning_curves(results['best_model'], **experiment)
# Save complete artifacts
ml.save_model_artifacts(
model=results['best_model'],
model_name='optimized_rf',
experiment_config=experiment,
performance_metrics=results['cv_results']
)
🖼️ Computer Vision Support
- Computer Vision EDA: Class-wise image sample visualization and comprehensive quality assessment for image classification datasets
- Image Quality Assessment: Automated detection of corrupted images, quality issues, blur, artifacts, and dataset health metrics
Usage Examples
Basic Usage
import edaflow
# Verify installation
message = edaflow.hello()
print(message) # Output: "Hello from edaflow! Ready for exploratory data analysis."
📊 Automated Profiling Report with profile_report ⭐ NEW in v0.18.0
The profile_report function provides a comprehensive automated analysis of your dataset, similar to ydata-profiling's ProfileReport. It generates a complete overview with dataset statistics, missing values analysis, categorical insights, and visualizations in a single function call:
import pandas as pd
import edaflow
# Create sample data
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
'age': [25, 32, None, 45, 28, 35, 42, 29],
'income': [50000, 75000, 60000, None, 55000, 80000, 95000, 62000],
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA'],
'premium': [True, False, True, True, False, True, False, True]
})
# Generate automated HTML report (opens in browser)
report = edaflow.profile_report(df)
# This creates 'eda_report_YYYYMMDD_HHMMSS.html' with complete analysis
# Generate report as dictionary for programmatic access
report_dict = edaflow.profile_report(df, output_format="dict")
print(report_dict['overview']) # Dataset overview
print(report_dict['numerical_summary']) # Numeric statistics
print(report_dict['categorical_summary']) # Categorical analysis
# Customize categorical analysis (default shows top 5 by unique count)
report = edaflow.profile_report(df, top_n_categorical=3)
Report Contents:
- 📊 Dataset Overview: Rows, columns, memory usage, duplicates, missing cells
- 📋 Data Types Summary: Count of numeric, categorical, boolean columns
- 🔢 Numerical Analysis: Mean, std, quartiles, min/max, missing values, zero counts
- 🏷️ Categorical Analysis: Top N columns by unique count with frequency distributions
- 📈 Visualizations: Histograms for numeric columns, correlation heatmap
- 💾 Flexible Output: HTML file for reporting or dictionary for automation
Output Formats:
- "html" (default): Creates standalone HTML file with embedded visualizations
- "dict": Returns Python dictionary for integration into data pipelines
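The overview metrics listed above reduce to a few pandas calls. Here is a pandas-only sketch of the same quantities; the key names are illustrative, not edaflow's exact dictionary schema:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, None, 45],
    "city": ["NYC", "LA", "NYC", "Chicago"],
})

# The same headline numbers the overview section reports
overview = {
    "rows": len(df),
    "columns": df.shape[1],
    "memory_bytes": int(df.memory_usage(deep=True).sum()),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_cells": int(df.isnull().sum().sum()),
}
```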
Missing Data Analysis with check_null_columns
The check_null_columns function provides a color-coded analysis of missing data in your DataFrame:
import pandas as pd
import edaflow
# Create sample data with missing values
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
'age': [25, None, 35, None, 45],
'email': [None, None, None, None, None], # All missing
'purchase_amount': [100.5, 250.0, 75.25, None, 320.0]
})
# Analyze missing data with default threshold (10%)
styled_result = edaflow.check_null_columns(df)
styled_result # Display in Jupyter notebook for color-coded styling
# Use custom threshold (20%) to change color coding sensitivity
styled_result = edaflow.check_null_columns(df, threshold=20)
styled_result
# Access underlying data if needed
data = styled_result.data
print(data)
Color Coding:
- 🔴 Red: > 20% missing (high concern)
- 🟡 Yellow: 10-20% missing (medium concern)
- 🟨 Light Yellow: 1-10% missing (low concern)
- ⬜ Gray: 0% missing (no issues)
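The color buckets correspond to simple percentage thresholds. A pandas-only sketch of the same bucketing logic (illustrative only; edaflow's internal rules may differ):

```python
import pandas as pd

def null_severity(df, threshold=10):
    """Bucket each column's missing-value percentage the way the colors above do."""
    pct = df.isnull().mean() * 100

    def bucket(p):
        if p > 2 * threshold:
            return "red"           # high concern
        if p > threshold:
            return "yellow"        # medium concern
        if p > 0:
            return "light_yellow"  # low concern
        return "gray"              # no issues

    return pct.map(bucket)

df = pd.DataFrame({
    "email": [None, None, None, None, None],  # 100% missing
    "age": [25, None, 35, 40, 45],            # 20% missing
    "id": [1, 2, 3, 4, 5],                    # 0% missing
})
severity = null_severity(df, threshold=10)
```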
Categorical Data Analysis with analyze_categorical_columns
The analyze_categorical_columns function helps identify data type issues and provides insights into object-type columns:
import pandas as pd
import edaflow
# Create sample data with mixed categorical types
df = pd.DataFrame({
'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price_str': ['999', '25', '75', '450'], # Numbers stored as strings
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
'rating': [4.5, 3.8, 4.2, 4.7], # Already numeric
'mixed_ids': ['001', '002', 'ABC', '004'], # Mixed format
'status': ['active', 'inactive', 'active', 'pending']
})
# Analyze categorical columns with default threshold (35%)
edaflow.analyze_categorical_columns(df)
# Use custom threshold (50%) to be more lenient about mixed data
edaflow.analyze_categorical_columns(df, threshold=50)
Output Interpretation:
- 🔴🔵 Highlighted in Red/Blue: Potentially numeric columns that might need conversion
- 🟡⚫ Highlighted in Yellow/Black: Shows unique values for potential numeric columns
- Regular text: Truly categorical columns with statistics
- "not an object column": Already properly typed numeric columns
Data Type Conversion with convert_to_numeric
After analyzing your categorical columns, you can automatically convert appropriate columns to numeric:
import pandas as pd
import edaflow
# Create sample data with string numbers
df = pd.DataFrame({
'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price_str': ['999', '25', '75', '450'], # Should convert
'mixed_ids': ['001', '002', 'ABC', '004'], # Mixed data
'category': ['Electronics', 'Accessories', 'Electronics', 'Electronics']
})
# Convert appropriate columns to numeric (threshold=35% by default)
df_converted = edaflow.convert_to_numeric(df, threshold=35)
# Or modify the original DataFrame in place
edaflow.convert_to_numeric(df, threshold=35, inplace=True)
# Use a stricter threshold (only convert if <20% non-numeric values)
df_strict = edaflow.convert_to_numeric(df, threshold=20)
Function Features:
- ✅ Smart Detection: Only converts columns with few non-numeric values
- ✅ Customizable Threshold: Control conversion sensitivity
- ✅ Safe Conversion: Non-numeric values become NaN (not errors)
- ✅ Inplace Option: Modify original DataFrame or create new one
- ✅ Detailed Output: Shows exactly what was converted and why
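Conceptually, the conversion rule combines pd.to_numeric(errors="coerce") with a non-numeric-share threshold. A simplified standalone sketch (not edaflow's actual code; the helper name is made up):

```python
import pandas as pd

def convert_if_mostly_numeric(df, threshold=35):
    """Convert object columns whose non-numeric share is below `threshold` percent."""
    out = df.copy()
    for col in out.select_dtypes(include="object"):
        coerced = pd.to_numeric(out[col], errors="coerce")
        # Share of values that failed to parse as numbers
        non_numeric_pct = (coerced.isna() & out[col].notna()).mean() * 100
        if non_numeric_pct < threshold:
            out[col] = coerced  # non-numeric entries become NaN, not errors
    return out

df = pd.DataFrame({
    "price_str": ["999", "25", "75", "450"],    # 0% non-numeric -> converted
    "mixed_ids": ["001", "002", "ABC", "004"],  # 25% non-numeric -> converted at 35%
    "category": ["Electronics", "Accessories", "Electronics", "Electronics"],
})
converted = convert_if_mostly_numeric(df, threshold=35)
```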
Categorical Data Visualization with visualize_categorical_values
After cleaning your data, explore categorical columns in detail to understand value distributions:
import pandas as pd
import edaflow
# Example DataFrame with categorical data
df = pd.DataFrame({
'department': ['Sales', 'Marketing', 'Sales', 'HR', 'Marketing', 'Sales', 'IT'],
'status': ['Active', 'Inactive', 'Active', 'Pending', 'Active', 'Active', 'Inactive'],
'priority': ['High', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],
'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007], # Numeric (ignored)
'salary': [50000, 60000, 70000, 45000, 58000, 62000, 70000] # Numeric (ignored)
})
# Visualize all categorical columns
edaflow.visualize_categorical_values(df)
Advanced Usage Examples:
# Handle high-cardinality data (many unique values)
large_df = pd.DataFrame({
'product_id': [f'PROD_{i:04d}' for i in range(100)], # 100 unique values
'category': ['Electronics'] * 40 + ['Clothing'] * 35 + ['Books'] * 25,
'status': ['Available'] * 80 + ['Out of Stock'] * 15 + ['Discontinued'] * 5
})
# Limit display for high-cardinality columns
edaflow.visualize_categorical_values(large_df, max_unique_values=5)
# DataFrame with missing values for comprehensive analysis
df_with_nulls = pd.DataFrame({
'region': ['North', 'South', None, 'East', 'West', 'North', None],
'customer_type': ['Premium', 'Standard', 'Premium', None, 'Standard', 'Premium', 'Standard'],
'transaction_id': [f'TXN_{i}' for i in range(7)], # Mostly unique (ID-like)
})
# Get detailed insights including missing value analysis
edaflow.visualize_categorical_values(df_with_nulls)
Function Features:
- 📊 Value Distributions: Shows unique values and counts for every categorical column
- 🔢 High-Cardinality Handling: The max_unique_values parameter limits output for columns with many unique values
- 🛡️ Missing Value Insights: Reports null entries alongside category frequencies
- 🎯 ID-Like Column Detection: Flags mostly-unique columns that behave like identifiers
Column Type Classification with display_column_types
The display_column_types function provides a simple way to categorize DataFrame columns into categorical and numerical types:
import pandas as pd
import edaflow
# Create sample data with mixed types
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago'],
'salary': [50000, 60000, 70000],
'is_active': [True, False, True]
}
df = pd.DataFrame(data)
# Display column type classification
result = edaflow.display_column_types(df)
# Access the categorized column lists
categorical_cols = result['categorical'] # ['name', 'city']
numerical_cols = result['numerical'] # ['age', 'salary', 'is_active']
Example Output:
📊 Column Type Analysis
==================================================
📝 Categorical Columns (2 total):
1. name (unique values: 3)
2. city (unique values: 3)
🔢 Numerical Columns (3 total):
1. age (dtype: int64)
2. salary (dtype: int64)
3. is_active (dtype: bool)
📈 Summary:
Total columns: 5
Categorical: 2 (40.0%)
Numerical: 3 (60.0%)
Function Features:
- 📊 Simple Classification: Separates columns into categorical (object dtype) and numerical (all other dtypes)
- 📋 Detailed Information: Shows unique value counts for categorical columns and data types for numerical columns
- 📈 Summary Statistics: Provides percentage breakdown of column types
- 🎯 Return Values: Returns dictionary with categorized column lists for programmatic use
- ⚡ Fast Processing: Efficient classification based on pandas data types
- 🛡️ Error Handling: Validates input and handles edge cases like empty DataFrames
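The classification rule itself (object dtype versus everything else) is short enough to sketch in plain pandas. This is illustrative only; edaflow adds the formatted report on top:

```python
import pandas as pd

def classify_columns(df):
    """Split columns into categorical (object dtype) and numerical (everything else)."""
    categorical = [c for c in df.columns if df[c].dtype == object]
    numerical = [c for c in df.columns if df[c].dtype != object]
    return {"categorical": categorical, "numerical": numerical}

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NYC", "LA", "Chicago"],
    "is_active": [True, False, True],  # bool counts as numerical, as in the output above
})
result = classify_columns(df)
```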
Data Imputation with impute_numerical_median and impute_categorical_mode
After analyzing your data, you often need to handle missing values. The edaflow package provides two specialized imputation functions for this purpose:
Numerical Imputation with impute_numerical_median
The impute_numerical_median function fills missing values in numerical columns using the median value:
import pandas as pd
import edaflow
# Create sample data with missing numerical values
df = pd.DataFrame({
'age': [25, None, 35, None, 45],
'salary': [50000, 60000, None, 70000, None],
'score': [85.5, None, 92.0, 88.5, None],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
})
# Impute all numerical columns with median values
df_imputed = edaflow.impute_numerical_median(df)
# Impute specific columns only
df_imputed = edaflow.impute_numerical_median(df, columns=['age', 'salary'])
# Impute in place (modifies original DataFrame)
edaflow.impute_numerical_median(df, inplace=True)
Function Features:
- 🔢 Smart Detection: Automatically identifies numerical columns (int, float, etc.)
- 📊 Median Imputation: Uses median values which are robust to outliers
- 🎯 Selective Imputation: Option to specify which columns to impute
- 🔄 Inplace Option: Modify original DataFrame or create new one
- 🛡️ Safe Handling: Gracefully handles edge cases like all-missing columns
- 📋 Detailed Reporting: Shows exactly what was imputed and summary statistics
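Under the hood, median imputation amounts to fillna with per-column medians. A minimal pandas sketch of the idea (not edaflow's implementation, which also prints a detailed report):

```python
import pandas as pd

def impute_median(df, columns=None):
    """Fill NaNs in numeric columns with each column's median."""
    out = df.copy()
    targets = columns if columns is not None else out.select_dtypes(include="number").columns
    for col in targets:
        out[col] = out[col].fillna(out[col].median())
    return out

df = pd.DataFrame({
    "age": [25, None, 35, None, 45],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],  # non-numeric, untouched
})
imputed = impute_median(df)
# Median of the observed ages [25, 35, 45] is 35, so both NaNs become 35
```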
Categorical Imputation with impute_categorical_mode
The impute_categorical_mode function fills missing values in categorical columns using the mode (most frequent value):
import pandas as pd
import edaflow
# Create sample data with missing categorical values
df = pd.DataFrame({
'category': ['A', 'B', 'A', None, 'A'],
'status': ['Active', None, 'Active', 'Inactive', None],
'priority': ['High', 'Medium', None, 'Low', 'High'],
'age': [25, 30, 35, 40, 45]
})
# Impute all categorical columns with mode values
df_imputed = edaflow.impute_categorical_mode(df)
# Impute specific columns only
df_imputed = edaflow.impute_categorical_mode(df, columns=['category', 'status'])
# Impute in place (modifies original DataFrame)
edaflow.impute_categorical_mode(df, inplace=True)
Function Features:
- 📝 Smart Detection: Automatically identifies categorical (object) columns
- 🎯 Mode Imputation: Uses most frequent value for each column
- ⚖️ Tie Handling: Gracefully handles mode ties (multiple values with same frequency)
- 🔄 Inplace Option: Modify original DataFrame or create new one
- 🛡️ Safe Handling: Gracefully handles edge cases like all-missing columns
- 📋 Detailed Reporting: Shows exactly what was imputed and mode tie warnings
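Mode imputation can likewise be sketched with pandas. Series.mode() returns all tied values sorted, so taking the first is one deterministic tie-break; this is illustrative only, not edaflow's exact tie-handling:

```python
import pandas as pd

def impute_mode(df, columns=None):
    """Fill NaNs in object columns with each column's mode (first value on ties)."""
    out = df.copy()
    targets = columns if columns is not None else out.select_dtypes(include="object").columns
    for col in targets:
        modes = out[col].mode()  # sorted; all tied values appear here
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    return out

df = pd.DataFrame({
    "category": ["A", "B", "A", None, "A"],             # mode: 'A'
    "status": ["Active", None, "Active", "Inactive", None],  # mode: 'Active'
})
imputed = impute_mode(df)
```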
Complete Imputation Workflow Example
import pandas as pd
import edaflow
# Sample data with both numerical and categorical missing values
df = pd.DataFrame({
'age': [25, None, 35, None, 45],
'salary': [50000, None, 70000, 80000, None],
'category': ['A', 'B', None, 'A', None],
'status': ['Active', None, 'Active', 'Inactive', None],
'score': [85.5, 92.0, None, 88.5, None]
})
print("Original DataFrame:")
print(df)
print("\n" + "="*50)
# Step 1: Impute numerical columns
print("STEP 1: Numerical Imputation")
df_step1 = edaflow.impute_numerical_median(df)
# Step 2: Impute categorical columns
print("\nSTEP 2: Categorical Imputation")
df_final = edaflow.impute_categorical_mode(df_step1)
print("\nFinal DataFrame (all missing values imputed):")
print(df_final)
# Verify no missing values remain
print(f"\nMissing values remaining: {df_final.isnull().sum().sum()}")
Expected Output:
🔢 Numerical Missing Value Imputation (Median)
=======================================================
📊 age - Imputed 2 values with median: 35.0
📊 salary - Imputed 2 values with median: 70000.0
📊 score - Imputed 2 values with median: 88.5
📈 Imputation Summary:
Columns processed: 3
Columns imputed: 3
Total values imputed: 6
📝 Categorical Missing Value Imputation (Mode)
=======================================================
📊 category - Imputed 2 values with mode: 'A'
📊 status - Imputed 2 values with mode: 'Active'
📈 Imputation Summary:
Columns processed: 2
Columns imputed: 2
Total values imputed: 4
Numerical Distribution Analysis with visualize_numerical_boxplots
Analyze numerical columns to detect outliers, understand distributions, and assess skewness:
import pandas as pd
import edaflow
# Create sample dataset with outliers
df = pd.DataFrame({
'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 100], # 100 is an outlier
'salary': [50000, 60000, 75000, 80000, 90000, 55000, 65000, 70000, 85000, 250000], # 250000 is outlier
'experience': [2, 5, 8, 12, 15, 3, 6, 9, 13, 30], # 30 might be an outlier
'score': [85, 92, 78, 88, 95, 82, 89, 91, 86, 20], # 20 is an outlier
'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] # Non-numerical
})
# Basic boxplot analysis
edaflow.visualize_numerical_boxplots(
df,
title="Employee Data Analysis - Outlier Detection",
show_skewness=True
)
# Custom layout and specific columns
edaflow.visualize_numerical_boxplots(
df,
columns=['age', 'salary'],
rows=1,
cols=2,
title="Age vs Salary Analysis",
orientation='vertical',
color_palette='viridis'
)
Expected Output:
Creating boxplots for 4 numerical column(s): age, salary, experience, score
Summary Statistics:
==================================================
age:
Range: 25.00 to 100.00
Median: 36.50
IQR: 11.00 (Q1: 30.50, Q3: 41.50)
Skewness: 2.66 (highly skewed)
Outliers: 1 value outside [14.00, 58.00]
Outlier values: [100]
salary:
Range: 50000.00 to 250000.00
Median: 72500.00
IQR: 22500.00 (Q1: 61250.00, Q3: 83750.00)
Skewness: 2.88 (highly skewed)
Outliers: 1 value outside [27500.00, 117500.00]
Outlier values: [250000]
experience:
Range: 2.00 to 30.00
Median: 8.50
IQR: 7.50 (Q1: 5.25, Q3: 12.75)
Skewness: 1.69 (highly skewed)
Outliers: 1 value outside [-6.00, 24.00]
Outlier values: [30]
score:
Range: 20.00 to 95.00
Median: 87.00
IQR: 7.75 (Q1: 82.75, Q3: 90.50)
Skewness: -2.87 (highly skewed)
Outliers: 1 value outside [71.12, 102.12]
Outlier values: [20]
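The fences in the output above come from the standard 1.5 × IQR rule. A quick pandas sketch reproducing the `age` numbers (this mirrors the method, not edaflow's internal code):

```python
import pandas as pd

# Same 'age' column as the sample dataset above
age = pd.Series([25, 30, 35, 40, 45, 28, 32, 38, 42, 100])

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier fences
outliers = age[(age < lower) | (age > upper)]

print(f"IQR: {iqr} (Q1: {q1}, Q3: {q3})")   # IQR: 11.0 (Q1: 30.5, Q3: 41.5)
print(f"Fences: [{lower}, {upper}]")        # Fences: [14.0, 58.0]
print(f"Outliers: {list(outliers)}")        # Outliers: [100]
```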
Complete EDA Workflow Example
import edaflow
import pandas as pd
# Test the installation
print(edaflow.hello())
# Load your data
df = pd.read_csv('your_data.csv')
# Complete EDA workflow with all core functions:
# 1. Analyze missing data with styled output
null_analysis = edaflow.check_null_columns(df, threshold=10)
# 2. Analyze categorical columns to identify data type issues
edaflow.analyze_categorical_columns(df, threshold=35)
# 3. Convert appropriate object columns to numeric automatically
df_cleaned = edaflow.convert_to_numeric(df, threshold=35)
# 4. Visualize categorical column values
edaflow.visualize_categorical_values(df_cleaned)
# 5. Display column type classification
edaflow.display_column_types(df_cleaned)
# 6. Impute missing values
df_numeric_imputed = edaflow.impute_numerical_median(df_cleaned)
df_fully_imputed = edaflow.impute_categorical_mode(df_numeric_imputed)
# 7. Statistical distribution analysis with advanced insights
edaflow.visualize_histograms(df_fully_imputed, kde=True, show_normal_curve=True)
# 8. Comprehensive relationship analysis
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='correlation')
edaflow.visualize_scatter_matrix(df_fully_imputed, show_regression=True)
# 9. Generate comprehensive EDA insights and recommendations
insights = edaflow.summarize_eda_insights(df_fully_imputed, target_column='your_target_col')
print(insights) # View insights dictionary
# 10. Outlier detection and visualization
edaflow.visualize_numerical_boxplots(df_fully_imputed, show_skewness=True)
edaflow.visualize_interactive_boxplots(df_fully_imputed)
# 11. Advanced heatmap analysis
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='missing')
edaflow.visualize_heatmap(df_fully_imputed, heatmap_type='values')
# 12. Final data cleaning with outlier handling
df_final = edaflow.handle_outliers_median(df_fully_imputed, method='iqr', verbose=True)
# 13. Results verification
edaflow.visualize_scatter_matrix(df_final, title="Clean Data Relationships")
edaflow.visualize_numerical_boxplots(df_final, title="Final Clean Distribution")
Complete ML Workflow (Enhanced in v0.14.0)
import edaflow.ml as ml
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Continue from cleaned data above...
df_final['target'] = your_target_data # Add your target column
# 1. Setup ML experiment - NEW: enhanced parameters in v0.14.0
experiment = ml.setup_ml_experiment(
df_final, 'target',
test_size=0.2,  # Test set: 20%
val_size=0.15,  # NEW: validation set: 15%
experiment_name="production_ml_pipeline",  # NEW: experiment tracking
random_state=42,
stratify=True
)
# Alternative: sklearn-style calling (also enhanced)
# X = df_final.drop('target', axis=1)
# y = df_final['target']
# experiment = ml.setup_ml_experiment(X=X, y=y, val_size=0.15, experiment_name="sklearn_workflow")
print(f"Training: {len(experiment['X_train'])}, Validation: {len(experiment['X_val'])}, Test: {len(experiment['X_test'])}")
# 2. Compare multiple models - enhanced with validation set support
models = {
'RandomForest': RandomForestClassifier(random_state=42),
'GradientBoosting': GradientBoostingClassifier(random_state=42),
'LogisticRegression': LogisticRegression(random_state=42),
'SVM': SVC(random_state=42, probability=True)
}
# Fit all models
for name, model in models.items():
model.fit(experiment['X_train'], experiment['y_train'])
# Enhanced compare_models with experiment_config support
comparison = ml.compare_models(
models=models,
experiment_config=experiment,  # NEW: automatically uses the validation set
verbose=True
)
print(comparison)  # Professional styled output
# Enhanced rank_models with flexible return formats
# Quick access to best model (list format - NEW)
best_model = ml.rank_models(comparison, 'accuracy', return_format='list')[0]['model_name']
print(f"Best model: {best_model}")
# Detailed ranking analysis (DataFrame format - traditional)
ranked_models = ml.rank_models(comparison, 'accuracy')
print("Top 3 models:")
print(ranked_models.head(3)[['model', 'accuracy', 'f1', 'rank']])
# Advanced: Multi-metric weighted ranking
weighted_ranking = ml.rank_models(
comparison,
'accuracy',
weights={'accuracy': 0.4, 'f1': 0.3, 'precision': 0.3},
return_format='list'
)
print(f"Best by weighted score: {weighted_ranking[0]['model_name']}")
# 3. Hyperparameter optimization - enhanced with validation set
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10]
}
best_results = ml.optimize_hyperparameters(
RandomForestClassifier(random_state=42),
param_distributions=param_grid,
X_train=experiment['X_train'],
y_train=experiment['y_train'],
method='grid_search',
cv=5
)
# 4. Generate comprehensive performance visualizations
ml.plot_learning_curves(best_results['best_model'],
X_train=experiment['X_train'], y_train=experiment['y_train'])
ml.plot_roc_curves({'optimized_model': best_results['best_model']},
X_test=experiment['X_test'], y_test=experiment['y_test'])
ml.plot_feature_importance(best_results['best_model'],
feature_names=experiment['feature_names'])
# 5. Save complete model artifacts with experiment tracking
ml.save_model_artifacts(
model=best_results['best_model'],
model_name=f"{experiment['experiment_name']}_optimized_model",  # NEW: uses experiment name
experiment_config=experiment,
performance_metrics={
'cv_score': best_results['best_score'],
'test_score': best_results['best_model'].score(experiment['X_test'], experiment['y_test']),
'model_type': 'RandomForestClassifier'
},
metadata={
'experiment_name': experiment['experiment_name'],  # NEW: experiment tracking
'data_shape': df_final.shape,
'feature_count': len(experiment['feature_names'])
}
)
print(f"Complete ML pipeline finished! Experiment: {experiment['experiment_name']}")
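The split sizes printed earlier follow from `test_size` and `val_size` both being fractions of the full dataset. A plain scikit-learn sketch of the same 65/15/20 three-way split, assuming that interpretation of `val_size` (edaflow's `setup_ml_experiment` wraps this with stratification and experiment metadata):

```python
from sklearn.model_selection import train_test_split

# 100 rows -> 20 held out for test, then 15 of the remaining 80 for
# validation (i.e. 15% of the original total).
X, y = list(range(100)), [i % 2 for i in range(100)]
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.80, random_state=42  # 15/80 of the remainder
)
print(len(X_train), len(X_val), len(X_test))  # 65 15 20
```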
ML Preprocessing with Smart Encoding (Introduced in v0.12.0)
import edaflow
import pandas as pd
# Load your data
df = pd.read_csv('your_data.csv')
# Step 1: Analyze encoding needs (with or without target)
encoding_analysis = edaflow.analyze_encoding_needs(
df,
target_column=None, # Optional: specify target if you have one
max_cardinality_onehot=15, # Optional: max categories for one-hot encoding
max_cardinality_target=50, # Optional: max categories for target encoding
ordinal_columns=None # Optional: specify ordinal columns if known
)
# Step 2: Apply intelligent encoding transformations
df_encoded = edaflow.apply_smart_encoding(
df, # Use your full dataset (or df.drop('target_col', axis=1) if needed)
encoding_analysis=encoding_analysis, # Optional: use previous analysis
handle_unknown='ignore' # Optional: how to handle unknown categories
)
# The encoding pipeline automatically:
# - One-hot encodes low cardinality categoricals
# - Target encodes high cardinality with target correlation
# - Binary encodes medium cardinality features
# - TF-IDF vectorizes text columns
# - Preserves numeric columns unchanged
# - Handles memory efficiently for large datasets
print(f"Shape transformation: {df.shape} -> {df_encoded.shape}")
print(f"Encoding methods applied: {len(encoding_analysis['encoding_methods'])} different strategies")
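The cardinality rule of thumb behind this can be sketched with plain pandas; the thresholds and column names below are illustrative only (edaflow's analysis additionally covers target, binary, and TF-IDF encoding):

```python
import pandas as pd

# Toy frame: one low-cardinality categorical, one high-cardinality
# categorical, one numeric column.
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],   # low cardinality -> one-hot
    'city':  ['NY', 'LA', 'SF', 'Boston'],      # imagine thousands of cities
    'price': [10.0, 12.5, 9.9, 11.2]            # numeric -> left unchanged
})

max_cardinality_onehot = 3  # mirrors the max_cardinality_onehot parameter
onehot_cols = [c for c in df.select_dtypes(exclude='number')
               if df[c].nunique() <= max_cardinality_onehot]

# One-hot encode only the low-cardinality columns; 'city' would instead be
# a candidate for target or binary encoding.
df_encoded = pd.get_dummies(df, columns=onehot_cols)
print(df_encoded.columns.tolist())
```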
Project Structure
edaflow/
├── edaflow/
│   ├── __init__.py
│   ├── analysis/
│   ├── visualization/
│   └── preprocessing/
├── tests/
├── docs/
├── examples/
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/new-feature)
- Commit your changes (git commit -m 'Add new feature')
- Push to the branch (git push origin feature/new-feature)
- Open a Pull Request
Development
Setup Development Environment
# Clone the repository
git clone https://github.com/evanlow/edaflow.git
cd edaflow
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
flake8 edaflow/
black edaflow/
isort edaflow/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
Latest Updates: This changelog reflects the most current releases, including the v0.12.32 critical input validation fix, the v0.12.31 hotfix with KeyError resolution, and the v0.12.30 universal display optimization breakthrough.
v0.12.32 (2025-08-11) - Critical Input Validation Fix
- CRITICAL: Fixed AttributeError: 'tuple' object has no attribute 'empty' in visualization functions
- ROOT CAUSE: Users passing the tuple result from apply_smart_encoding(..., return_encoders=True) directly to visualization functions
- ENHANCED: Added intelligent input validation with helpful error messages for common usage mistakes
- IMPROVED: Better error handling in visualize_scatter_matrix and other visualization functions
- DOCUMENTED: Clear examples showing correct vs incorrect usage patterns for apply_smart_encoding
- STABILITY: Prevents crashes in step 14 of EDA workflows when encoding functions are misused
v0.12.31 (2025-01-05) - Critical KeyError Hotfix
- CRITICAL: Fixed KeyError: 'type' in the summarize_eda_insights() function during Google Colab usage
- RESOLVED: Exception handling when the target analysis dictionary is missing expected keys
- IMPROVED: Enhanced error handling with safe dictionary access using the .get() method
- MAINTAINED: All existing functionality preserved - pure stability fix
- TESTED: Verified the fix works across all notebook platforms (Colab, JupyterLab, VS Code)
v0.12.30 (2025-01-05) - Universal Display Optimization Breakthrough
- BREAKTHROUGH: Introduced the optimize_display() function for universal notebook compatibility
- REVOLUTIONARY: Automatic platform detection (Google Colab, JupyterLab, VS Code Notebooks, Classic Jupyter)
- ENHANCED: Dynamic CSS injection for perfect dark/light mode visibility across all platforms
- NEW FEATURE: Automatic matplotlib backend optimization for each notebook environment
- ACCESSIBILITY: Solves visibility issues in dark mode themes universally
- SEAMLESS: Zero configuration required - automatically detects and optimizes for your platform
- COMPATIBILITY: Works flawlessly across Google Colab, JupyterLab, VS Code, and Classic Jupyter
- EXAMPLE: Simple usage: from edaflow import optimize_display; optimize_display()
v0.12.3 (2025-08-06) - Complete Positional Argument Compatibility Fix
- CRITICAL: Fixed positional argument usage for the visualize_image_classes() function
- RESOLVED: TypeError when calling visualize_image_classes(image_paths, ...) with positional arguments
- ENHANCED: Comprehensive backward compatibility supporting all three usage patterns:
  - Positional: visualize_image_classes(path, ...) (shows warning)
  - Deprecated keyword: visualize_image_classes(image_paths=path, ...) (shows warning)
  - Recommended: visualize_image_classes(data_source=path, ...) (no warning)
- IMPROVED: Clear deprecation warnings guiding users toward the recommended syntax
- SECURE: Prevents using both parameters simultaneously to avoid confusion
- RESOLVED: TypeError for users calling with the image_paths= parameter from the v0.12.0 breaking change
- ENHANCED: Improved error messages for parameter validation in image visualization functions
- DOCUMENTATION: Added comprehensive parameter documentation including deprecation notices
v0.12.2 (2025-08-06) - Documentation Refresh
- IMPROVED: Enhanced README.md with updated timestamps and current version indicators
- FIXED: Ensured PyPI displays the most current changelog information including v0.12.1 fixes
- ENHANCED: Added latest updates indicator to changelog for better visibility
- DOCUMENTATION: Forced PyPI cache refresh to display current version information
What's New in v0.16.2
New Features:
- Faceted visualizations with display_facet_grid
- Feature scaling with scale_features
- Grouping rare categories with group_rare_categories
- Exporting figures with export_figure
Documentation Updates:
- User Guide, Advanced Features, and Best Practices now reference all new APIs
- Visualization Guide includes external library requirements and troubleshooting
- Changelog documents all new features and documentation changes
External Library Requirements: Some advanced features require additional libraries:
- matplotlib
- seaborn
- scikit-learn
- statsmodels
- pandas
See the Visualization Guide for installation instructions and troubleshooting tips.