Data analysis and reporting toolkit
Smart Datalyzer
Smart Datalyzer is an intelligent, automated toolkit for comprehensive data analysis, visualization, and reporting. It provides ML readiness scoring, advanced statistical diagnostics, and publication-quality visualizations with minimal effort.
Key Features
Data Quality & Profiling
- Smart Dataset Loading: Automatic detection of CSV/XLSX files with type inference
- Duplicate Detection: Identify and report duplicate rows
- Mixed Type Detection: Find columns with inconsistent data types
- Auto Type Conversion: Intelligent conversion of string columns to numeric
- Missing Value Analysis: Detection and imputation suggestions
- Constant Column Detection: Identify features with zero variance
- Scaling Issue Detection: Flag features with extreme value ranges
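As a rough illustration of the checks listed above, the following sketch uses plain pandas rather than Smart Datalyzer's own functions. The helper name `profile_quality` and the keys it returns are assumptions made for this example, not part of the package API.

```python
import pandas as pd


def profile_quality(df: pd.DataFrame) -> dict:
    """Hypothetical helper: a few basic data-quality checks in plain pandas."""
    # Rows that exactly repeat an earlier row
    duplicate_rows = int(df.duplicated().sum())

    # Columns with at most one distinct non-missing value (zero variance)
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

    # Object columns whose non-missing values have more than one Python type
    mixed_cols = [
        c for c in df.columns
        if df[c].dtype == object and df[c].dropna().map(type).nunique() > 1
    ]

    # Share of missing values per column, as a percentage
    missing_pct = (df.isna().mean() * 100).round(1).to_dict()

    return {
        "duplicate_rows": duplicate_rows,
        "constant_columns": constant_cols,
        "mixed_type_columns": mixed_cols,
        "missing_pct": missing_pct,
    }


df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "flag": ["x", "x", "x", "x"],
    "value": [1.5, "2", "2", None],
})
report = profile_quality(df)
print(report)
```

Here `flag` is reported as constant, `value` as mixed-type (it holds both a float and strings), and the third row as a duplicate of the second.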
Target-Aware Analysis (Multiple Targets Support)
- Target Leakage Detection: Identify features that leak target information (>95% accuracy)
- Class Imbalance Analysis: Compute imbalance ratios and distribution statistics
- Feature-Target Association: Statistical tests (ANOVA, Kruskal-Wallis, Chi-square)
- Sensitivity Analysis: Permutation importance for feature ranking
- Model Suggestion: Automatic recommendation (Regression vs Classification)
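To make the class imbalance check more concrete, here is a minimal sketch using only the standard library. It assumes the imbalance ratio is defined as majority class count divided by minority class count, which may differ from the package's own definition. The helper name `imbalance_ratio` is hypothetical. The cutoff of 10 unique values mirrors the v0.1.1 changelog entry.

```python
from collections import Counter


def imbalance_ratio(labels, max_classes=10):
    """Hypothetical helper: majority/minority class ratio, or None if skipped."""
    counts = Counter(labels)
    # Targets with many distinct values look continuous, so skip the check;
    # a single class has no minority to compare against.
    if len(counts) > max_classes or len(counts) < 2:
        return None
    return max(counts.values()) / min(counts.values())


churn = ["no"] * 90 + ["yes"] * 10
ratio = imbalance_ratio(churn)       # 90 / 10

revenue = list(range(50))            # 50 distinct values
skipped = imbalance_ratio(revenue)   # check is skipped

print(ratio, skipped)
```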
Statistical Diagnostics
- Normality Testing: Shapiro-Wilk, D'Agostino, Kolmogorov-Smirnov tests with QQ plots
- Outlier Detection: Z-score based detection with percentage reporting
- Correlation Analysis: Pearson, Spearman, Kendall correlation matrices
- VIF Computation: Variance Inflation Factor for multicollinearity detection
- Mutual Information: Feature importance via mutual information scores
- Covariance Matrix: Full covariance analysis with CSV export
- High Correlation Flagging: Automatic detection of correlated pairs (>0.9)
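Two of the diagnostics above, Z-score outlier detection and high correlation flagging, can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the package's implementation. The function names are hypothetical, the Z-score threshold of 3 is a common convention rather than a documented default, and the correlation cutoff is applied to absolute values.

```python
import numpy as np
import pandas as pd


def zscore_outlier_pct(series: pd.Series, threshold: float = 3.0) -> float:
    """Percentage of values whose absolute Z-score exceeds the threshold."""
    std = series.std(ddof=0)
    if std == 0 or np.isnan(std):
        return 0.0
    z = (series - series.mean()) / std
    return float((z.abs() > threshold).mean() * 100)


def high_corr_pairs(df: pd.DataFrame, cutoff: float = 0.9, method: str = "pearson"):
    """Column pairs whose absolute correlation exceeds the cutoff."""
    corr = df.corr(method=method).abs()
    cols = corr.columns
    return [
        (cols[i], cols[j], float(corr.iloc[i, j]))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))   # upper triangle only, no self-pairs
        if corr.iloc[i, j] > cutoff
    ]


rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + 1,            # perfectly correlated with x
    "noise": rng.normal(size=200),
})
df.loc[0, "noise"] = 25.0             # inject one extreme value

pct = zscore_outlier_pct(df["noise"])
pairs = high_corr_pairs(df)
print(f"{pct:.1f}% outliers in 'noise'; flagged pairs: {pairs}")
```

Passing `method="spearman"` or `method="kendall"` to `high_corr_pairs` switches the correlation measure, since `DataFrame.corr` supports all three.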
Visualization Suite
- Distribution Plots: Histograms with KDE overlays
- Box Plots: Outlier visualization with quartile analysis
- Violin Plots: Distribution density visualization
- Swarm Plots: Individual data point overlay on boxplots
- QQ Plots: Quantile-quantile plots for normality assessment
- Correlation Heatmaps: Multiple correlation methods with annotations
- Feature Importance Charts: RandomForest-based importance ranking
- PCA Variance Plots: Principal component analysis visualization
- t-SNE Scatter Plots: 2D dimensionality reduction visualization
Reporting & Export
- Interactive HTML Reports: Comprehensive analysis with embedded visualizations
- JSON Export: Machine-readable summary statistics
- PDF Generation: Publication-ready reports (optional)
- Plot Export: High-resolution PNG plots (300 DPI)
- Caching System: Smart caching for faster re-analysis
Smart Auto Mode
- Automatic feature engineering recommendations
- ML readiness scoring (0-100)
- Actionable improvement suggestions
- Complete pipeline execution with a single flag (--auto)
Installation
From Source (Recommended)
# Clone the repository
git clone https://github.com/mehmoodulhaq570/smart-datalyzer.git
cd smart-datalyzer
# Install build tools
pip install build
# Build the package
python -m build
# Install
pip install dist/smart_datalyzer-0.1.1-py3-none-any.whl
Development Install
pip install -e .
Usage
Basic Usage (Single Target)
python -m smart-datalyzer data.xlsx "target_column"
Or using the installed command:
smart-datalyzer data.xlsx "target_column"
Multiple Target Columns
python -m smart-datalyzer data.csv "target1" "target2" "target3"
Command Line Arguments
python -m smart-datalyzer <file> <target> [OPTIONS]
# or
smart-datalyzer <file> <target> [OPTIONS]
Arguments:
file Path to dataset (CSV or XLSX)
target Target column name(s) - space separated for multiple
Options:
--stats Run detailed statistical analysis
--outliers Detect and report outliers
--leakage Detect target leakage features
--imbalance Check class imbalance
--plots Generate all visualization plots
--report Generate interactive HTML/JSON report
--auto Run full automatic analysis (recommended)
--max_rows N Limit rows to read (default: 100000)
--output_dir DIR Output directory (default: "reports")
Examples
Quick Analysis:
python -m smart-datalyzer sales.xlsx "Revenue" --auto
Detailed Statistical Report:
smart-datalyzer customers.csv "Churn" --stats --plots --report
Multiple Targets with Custom Output:
smart-datalyzer experiment.xlsx "Outcome1" "Outcome2" --auto --output_dir results
Outlier & Leakage Detection:
python -m smart-datalyzer medical.csv "Disease" --outliers --leakage
Output Structure
reports/
├── plots/
│   ├── *_distribution.png       # Distribution histograms
│   ├── *_boxplot.png            # Box plots
│   ├── *_violinplot.png         # Violin plots
│   ├── *_swarmplot.png          # Swarm plots
│   ├── *_qqplot.png             # QQ plots
│   ├── correlation_*.png        # Correlation heatmaps
│   ├── feature_importance.png   # Feature importance chart
│   ├── pca_variance.png         # PCA analysis
│   └── tsne_scatter.png         # t-SNE visualization
├── report.html                  # Interactive HTML report
├── summary.json                 # JSON summary statistics
├── covariance_matrix.csv        # Covariance matrix
└── .cache/                      # Analysis cache
Python API Usage
from datalyzer.utils import load_dataset
from datalyzer.stats import feature_statistics, detect_outliers
from datalyzer.plots import plot_distributions, plot_correlation
# Load data
df = load_dataset("data.csv")
# Get statistics
stats, readiness, suggestions = feature_statistics(df)
print(f"ML Readiness Score: {readiness}/100")
# Detect outliers
outliers = detect_outliers(df, df.select_dtypes(include=['float64', 'int64']).columns)
# Generate plots
plot_paths = plot_distributions(df, plots_dir="reports/plots")
correlation_paths = plot_correlation(df, plots_dir="reports/plots")
Dependencies
Core Requirements
- pandas: Data manipulation
- numpy: Numerical computing
- scipy: Statistical functions
- statsmodels: Advanced statistics
- scikit-learn: Machine learning utilities
- matplotlib: Plotting backend
- seaborn: Statistical visualizations
- rich: Terminal formatting
See requirements.txt for complete list.
Features in Detail
ML Readiness Score
Smart Datalyzer computes an ML readiness score (0-100) based on:
- Missing value percentage
- Constant features
- Numeric vs categorical balance
- Duplicate rows
- Data quality issues
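The exact scoring formula is not documented here, so the following is only a sketch of how such a score could be assembled from three of the factors above. The function name `readiness_score` and the penalty weights (40/30/30) are assumptions chosen for illustration and will not reproduce the values returned by `feature_statistics`.

```python
import pandas as pd


def readiness_score(df: pd.DataFrame) -> float:
    """Illustrative 0-100 score; the penalty weights are assumptions."""
    n_rows, n_cols = df.shape
    if n_rows == 0 or n_cols == 0:
        return 0.0

    missing_frac = float(df.isna().to_numpy().mean())   # share of missing cells
    duplicate_frac = float(df.duplicated().mean())      # share of duplicate rows
    constant_frac = sum(
        df[c].nunique(dropna=True) <= 1 for c in df.columns
    ) / n_cols                                          # share of constant columns

    # Assumed weights: missing values 40, constant features 30, duplicates 30
    penalty = 40 * missing_frac + 30 * constant_frac + 30 * duplicate_frac
    return round(max(0.0, 100.0 - penalty), 1)


clean = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0.1, 0.2, 0.3, 0.4]})
messy = pd.DataFrame({"a": [1, 1, None, 4], "b": ["k", "k", "k", "k"]})

clean_score = readiness_score(clean)
messy_score = readiness_score(messy)
print(clean_score, messy_score)
```

A dataset with no missing cells, duplicates, or constant columns keeps the full 100 points, while each issue subtracts in proportion to how much of the data it affects.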
Caching System
Automatically caches analysis results using SHA256 hashing for:
- Faster re-analysis of same datasets
- Incremental updates
- Reduced computation time
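A content-hash cache of this kind can be sketched with the standard library alone. The example below assumes the cache key is the SHA256 digest of the dataset file's bytes and that results are stored as JSON. The actual key and on-disk layout used inside `.cache/` may differ, and the names `file_sha256` and `cached_analysis` are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_analysis(path: Path, cache_dir: Path, analyze):
    """Return (result, was_cached), running analyze(path) only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_sha256(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    result = analyze(path)
    cache_file.write_text(json.dumps(result))
    return result, False


def count_rows(path: Path) -> dict:
    """Stand-in for an expensive analysis step."""
    return {"rows": len(path.read_text().splitlines()) - 1}


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("a,b\n1,2\n3,4\n")
    cache = Path(tmp) / ".cache"
    first, hit1 = cached_analysis(data, cache, count_rows)
    second, hit2 = cached_analysis(data, cache, count_rows)

print(first, hit1, second, hit2)
```

Because the key depends on file contents rather than the file name, editing the dataset changes the digest and forces a fresh analysis.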
Smart Type Inference
Automatically detects and suggests:
- Numeric columns stored as strings
- Categorical features with high cardinality
- Date/time columns
- Mixed-type columns
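Detecting numeric columns stored as strings can be approximated with `pandas.to_numeric`. The sketch below is one possible approach and not the package's own logic. The helper name `numeric_like_columns` and the 95% parse-rate threshold are assumptions.

```python
import pandas as pd


def numeric_like_columns(df: pd.DataFrame, min_ratio: float = 0.95) -> list:
    """Hypothetical helper: non-numeric columns whose values parse as numbers."""
    candidates = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue                      # already numeric, nothing to suggest
        values = df[col].dropna()
        if values.empty:
            continue
        # Unparseable entries become NaN instead of raising
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= min_ratio:
            candidates.append(col)
    return candidates


df = pd.DataFrame({
    "price": ["10.5", "20", "7.25"],
    "city": ["Lahore", "Karachi", "Quetta"],
    "qty": [1, 2, 3],
})
cols = numeric_like_columns(df)
converted = df.assign(**{c: pd.to_numeric(df[c], errors="coerce") for c in cols})
print(cols, converted.dtypes.to_dict())
```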
Author
Mehmood Ul Haq
Email: mehmoodulhaq1040@gmail.com
GitHub: @mehmoodulhaq570
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please read CODE_OF_CONDUCT.md first.
Security
For security issues, please see SECURITY.md.
Changelog
v0.1.1 (Current)
- Fixed swarm plot performance issues with large datasets (added sampling limit of 2000 points)
- Fixed filename sanitization for plots with special characters
- Improved visualization generation speed
- Skip class imbalance check for targets with >10 unique values
v0.1.0
- Initial release
- Multiple target column support
- Comprehensive statistical analysis
- Advanced visualization suite
- Smart auto-analysis mode
- Caching system
- Interactive HTML reports
Acknowledgments
Built with the modern Python data science stack, following best practices for automated data analysis.