Data analysis and reporting toolkit

Smart Datalyzer

Smart Datalyzer is an intelligent, automated toolkit for comprehensive data analysis, visualization, and reporting. It provides ML readiness scoring, advanced statistical diagnostics, and publication-quality visualizations with minimal effort.

🚀 Key Features

📊 Data Quality & Profiling

  • Smart Dataset Loading: Automatic detection of CSV/XLSX files with type inference
  • Duplicate Detection: Identify and report duplicate rows
  • Mixed Type Detection: Find columns with inconsistent data types
  • Auto Type Conversion: Intelligent conversion of string columns to numeric
  • Missing Value Analysis: Detection and imputation suggestions
  • Constant Column Detection: Identify features with zero variance
  • Scaling Issue Detection: Flag features with extreme value ranges
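
Several of the checks above can be approximated in a few lines of pandas. The following is a minimal sketch, not Smart Datalyzer's own implementation; `quality_summary` and its result keys are hypothetical names chosen for illustration.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Collect a few basic data-quality signals (illustrative helper)."""
    # Rows that exactly repeat an earlier row
    n_duplicates = int(df.duplicated().sum())
    # Columns with at most one distinct non-null value (zero variance)
    constant_columns = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    # Object columns whose non-null values span more than one Python type
    mixed_type_columns = [
        c
        for c in df.columns
        if pd.api.types.is_object_dtype(df[c])
        and df[c].dropna().map(type).nunique() > 1
    ]
    # Share of missing values per column, between 0 and 1
    missing_share = df.isna().mean().to_dict()
    return {
        "n_duplicates": n_duplicates,
        "constant_columns": constant_columns,
        "mixed_type_columns": mixed_type_columns,
        "missing_share": missing_share,
    }


df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "flag": ["x", "x", "x", "x"],
        "mixed": [1, "two", "two", 4.0],
        "value": [0.5, 1.5, 1.5, None],
    }
)
summary = quality_summary(df)
print(summary)
```

In this toy frame the second and third rows are identical, `flag` is constant, and `mixed` holds integers, strings, and floats side by side.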

🎯 Target-Aware Analysis (Multiple Targets Support)

  • Target Leakage Detection: Flag features that predict the target almost perfectly on their own (>95% accuracy)
  • Class Imbalance Analysis: Compute imbalance ratios and distribution statistics
  • Feature-Target Association: Statistical tests (ANOVA, Kruskal-Wallis, Chi-square)
  • Sensitivity Analysis: Permutation importance for feature ranking
  • Model Suggestion: Automatic recommendation (Regression vs Classification)
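
As a rough illustration of the class imbalance and model suggestion items, the sketch below computes a majority-to-minority ratio and picks a task type from the target column. It is an assumption-laden example rather than the package's logic: the function names are hypothetical, and the cutoff of 10 unique values is borrowed from the v0.1.1 changelog entry about skipping the imbalance check.

```python
import pandas as pd


def imbalance_ratio(target: pd.Series, max_classes: int = 10):
    """Return majority/minority class ratio, or None if the check does not apply."""
    counts = target.value_counts(dropna=True)
    # Skip targets that look continuous or have fewer than two classes
    if len(counts) > max_classes or len(counts) < 2:
        return None
    return float(counts.max() / counts.min())


def suggest_task(target: pd.Series, max_classes: int = 10) -> str:
    """Suggest 'regression' for numeric targets with many distinct values."""
    if pd.api.types.is_numeric_dtype(target) and target.nunique() > max_classes:
        return "regression"
    return "classification"


churn = pd.Series(["no"] * 9 + ["yes"])
revenue = pd.Series(range(50))

print(imbalance_ratio(churn), suggest_task(churn))
print(imbalance_ratio(revenue), suggest_task(revenue))
```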

📈 Statistical Diagnostics

  • Normality Testing: Shapiro-Wilk, D'Agostino, Kolmogorov-Smirnov tests with QQ plots
  • Outlier Detection: Z-score based detection with percentage reporting
  • Correlation Analysis: Pearson, Spearman, Kendall correlation matrices
  • VIF Computation: Variance Inflation Factor for multicollinearity detection
  • Mutual Information: Feature importance via mutual information scores
  • Covariance Matrix: Full covariance analysis with CSV export
  • High Correlation Flagging: Automatic detection of correlated pairs (>0.9)
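
Two of these diagnostics, Z-score outlier percentages and high-correlation flagging, are simple enough to sketch directly. The code below assumes a Z-score threshold of 3 and compares the absolute Pearson correlation against 0.9; both choices, and the function names, are assumptions for illustration and may differ from what the package does internally.

```python
import pandas as pd


def zscore_outlier_share(df: pd.DataFrame, threshold: float = 3.0) -> dict:
    """Percentage of values per numeric column with |z| above the threshold."""
    numeric = df.select_dtypes(include="number")
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return ((z.abs() > threshold).mean() * 100).to_dict()


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """List (col_a, col_b, r) for column pairs with |Pearson r| above the threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


x = list(range(20)) + [1000]  # one extreme value
df = pd.DataFrame(
    {
        "x": x,
        "y": [2 * v + 1 for v in x],  # perfectly correlated with x
        "noise": [(-1) ** i for i in range(21)],
    }
)
shares = zscore_outlier_share(df)
pairs = high_correlation_pairs(df)
print(shares)
print(pairs)
```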

📉 Visualization Suite

  • Distribution Plots: Histograms with KDE overlays
  • Box Plots: Outlier visualization with quartile analysis
  • Violin Plots: Distribution density visualization
  • Swarm Plots: Individual data point overlay on boxplots
  • QQ Plots: Quantile-quantile plots for normality assessment
  • Correlation Heatmaps: Multiple correlation methods with annotations
  • Feature Importance Charts: RandomForest-based importance ranking
  • PCA Variance Plots: Principal component analysis visualization
  • t-SNE Scatter Plots: 2D dimensionality reduction visualization
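
The numbers behind a PCA variance plot are the explained variance ratios of the principal components. The sketch below derives them with NumPy's SVD on centered data; it is a generic illustration of the technique (without feature scaling), and `explained_variance_ratio` is a hypothetical helper rather than a function exported by the package.

```python
import numpy as np


def explained_variance_ratio(X) -> np.ndarray:
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (len(X) - 1)
    return variances / variances.sum()


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack(
    [
        base,
        2 * base + rng.normal(scale=0.1, size=(200, 1)),  # nearly redundant feature
        rng.normal(size=(200, 1)),  # independent feature
    ]
)
ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
print(ratio, cumulative)
```

Plotting `cumulative` against the component index gives the familiar curve used to decide how many components to keep.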

📝 Reporting & Export

  • Interactive HTML Reports: Comprehensive analysis with embedded visualizations
  • JSON Export: Machine-readable summary statistics
  • PDF Generation: Publication-ready reports (optional)
  • Plot Export: High-resolution PNG plots (300 DPI)
  • Caching System: Smart caching for faster re-analysis

🤖 Smart Auto Mode

  • Automatic feature engineering recommendations
  • ML readiness scoring (0-100)
  • Actionable improvement suggestions
  • Complete pipeline execution with a single flag (--auto)

📦 Installation

From Source (Recommended)

# Clone the repository
git clone https://github.com/mehmoodulhaq570/smart-datalyzer.git
cd smart-datalyzer

# Install build tools
pip install build

# Build the package
python -m build

# Install
pip install dist/smart_datalyzer-0.1.1-py3-none-any.whl

Development Install

pip install -e .

🎮 Usage

Basic Usage (Single Target)

python -m smart-datalyzer data.xlsx "target_column"

Or using the installed command:

smart-datalyzer data.xlsx "target_column"

Multiple Target Columns

python -m smart-datalyzer data.csv "target1" "target2" "target3"

Command Line Arguments

python -m smart-datalyzer <file> <target> [OPTIONS]
# or
smart-datalyzer <file> <target> [OPTIONS]

Arguments:
  file                    Path to dataset (CSV or XLSX)
  target                  Target column name(s) - space separated for multiple

Options:
  --stats                 Run detailed statistical analysis
  --outliers              Detect and report outliers
  --leakage               Detect target leakage features
  --imbalance             Check class imbalance
  --plots                 Generate all visualization plots
  --report                Generate interactive HTML/JSON report
  --auto                  Run full automatic analysis (recommended)
  --max_rows N            Limit rows to read (default: 100000)
  --output_dir DIR        Output directory (default: "reports")

Examples

Quick Analysis:

python -m smart-datalyzer sales.xlsx "Revenue" --auto

Detailed Statistical Report:

smart-datalyzer customers.csv "Churn" --stats --plots --report

Multiple Targets with Custom Output:

smart-datalyzer experiment.xlsx "Outcome1" "Outcome2" --auto --output_dir results

Outlier & Leakage Detection:

python -m smart-datalyzer medical.csv "Disease" --outliers --leakage
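
For batch runs it can be convenient to assemble these command lines from Python. The sketch below uses only the arguments and options documented above; `build_command` and `run_analysis` are hypothetical helpers, not part of the package, and actually running the command assumes `smart-datalyzer` is installed and on the PATH.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    data_file: str,
    targets: Sequence[str],
    flags: Sequence[str] = ("--auto",),
    output_dir: Optional[str] = None,
    max_rows: Optional[int] = None,
) -> List[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    cmd = ["smart-datalyzer", str(data_file), *targets, *flags]
    if max_rows is not None:
        cmd += ["--max_rows", str(max_rows)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    return cmd


def run_analysis(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the CLI and raise if it exits with a non-zero status."""
    return subprocess.run(cmd, check=True)


cmd = build_command("experiment.xlsx", ["Outcome1", "Outcome2"], output_dir="results")
print(" ".join(cmd))
```

Passing the arguments as a list avoids quoting problems when target column names contain spaces.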

📊 Output Structure

reports/
├── plots/
│   ├── *_distribution.png      # Distribution histograms
│   ├── *_boxplot.png           # Box plots
│   ├── *_violinplot.png        # Violin plots
│   ├── *_swarmplot.png         # Swarm plots
│   ├── *_qqplot.png            # QQ plots
│   ├── correlation_*.png       # Correlation heatmaps
│   ├── feature_importance.png  # Feature importance chart
│   ├── pca_variance.png        # PCA analysis
│   └── tsne_scatter.png        # t-SNE visualization
├── report.html                 # Interactive HTML report
├── summary.json                # JSON summary statistics
├── covariance_matrix.csv       # Covariance matrix
└── .cache/                     # Analysis cache
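
After a run, a short script can check which of the files in this layout were actually produced. The sketch below relies only on the file names shown in the tree; `collect_outputs` is a hypothetical helper, and the contents of `summary.json` are not inspected because its schema is not documented here.

```python
from pathlib import Path


def collect_outputs(output_dir="reports") -> dict:
    """Report which of the documented output files exist under output_dir."""
    root = Path(output_dir)
    # A missing plots/ directory simply yields an empty list
    plots = sorted(p.name for p in (root / "plots").glob("*.png"))
    return {
        "plots": plots,
        "report_html": (root / "report.html").is_file(),
        "summary_json": (root / "summary.json").is_file(),
        "covariance_csv": (root / "covariance_matrix.csv").is_file(),
    }


found = collect_outputs("reports")
print(f"{len(found['plots'])} plot(s) found; HTML report present: {found['report_html']}")
```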

🧰 Python API Usage

from datalyzer.utils import load_dataset
from datalyzer.stats import feature_statistics, detect_outliers
from datalyzer.plots import plot_distributions, plot_correlation

# Load data
df = load_dataset("data.csv")

# Get statistics
stats, readiness, suggestions = feature_statistics(df)
print(f"ML Readiness Score: {readiness}/100")

# Detect outliers in every numeric column (not only float64/int64)
numeric_cols = df.select_dtypes(include="number").columns
outliers = detect_outliers(df, numeric_cols)

# Generate plots
plot_paths = plot_distributions(df, plots_dir="reports/plots")
correlation_paths = plot_correlation(df, plots_dir="reports/plots")

🔧 Dependencies

Core Requirements

  • pandas - Data manipulation
  • numpy - Numerical computing
  • scipy - Statistical functions
  • statsmodels - Advanced statistics
  • scikit-learn - Machine learning utilities
  • matplotlib - Plotting backend
  • seaborn - Statistical visualizations
  • rich - Terminal formatting

See requirements.txt for the complete list.

🎨 Features in Detail

ML Readiness Score

Smart Datalyzer computes an ML readiness score (0-100) based on:

  • Missing value percentage
  • Constant features
  • Numeric vs categorical balance
  • Duplicate rows
  • Data quality issues
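
The exact formula is not given here, so the following is only a toy score that shows how penalties for such factors might be combined. Every weight and cap is an assumption made for illustration, the numeric versus categorical balance is left out, and `readiness_score` is a hypothetical name unrelated to the package's `feature_statistics`.

```python
import pandas as pd


def readiness_score(df: pd.DataFrame) -> float:
    """Toy 0-100 score; all weights and caps below are illustrative assumptions."""
    score = 100.0
    # Average share of missing cells in percent, capped at 40 points
    missing_pct = float(df.isna().mean().mean() * 100)
    score -= min(missing_pct, 40.0)
    # 5 points per constant column, capped at 20 points
    n_constant = sum(df[c].nunique(dropna=True) <= 1 for c in df.columns)
    score -= min(5.0 * n_constant, 20.0)
    # Share of duplicate rows in percent, capped at 20 points
    duplicate_pct = float(df.duplicated().mean() * 100)
    score -= min(duplicate_pct, 20.0)
    return max(0.0, round(score, 1))


clean = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4.0, 3.5, 2.0, 1.0]})
messy = pd.DataFrame({"a": [1, 1, None, 4], "b": [7, 7, 7, 7]})

print(readiness_score(clean))
print(readiness_score(messy))
```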

Caching System

Automatically caches analysis results using SHA256 hashing for:

  • Faster re-analysis of same datasets
  • Incremental updates
  • Reduced computation time
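
One common way to build such a cache is to key stored results on the SHA256 digest of the input file, so that any change to the data invalidates the entry. The sketch below shows that pattern with the standard library only; the JSON storage format and the names `file_sha256` and `cached_analysis` are assumptions, not a description of the package's cache layout.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_analysis(data_path, cache_dir, compute):
    """Return the cached result for data_path, computing and storing it if absent."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key changes whenever the file contents change
    cache_file = cache_dir / f"{file_sha256(data_path)}.json"
    if cache_file.is_file():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = compute(data_path)  # must return something JSON-serializable
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Calling `cached_analysis("data.csv", "reports/.cache", my_analysis)` twice on an unchanged file would run `my_analysis` (a placeholder for any analysis function) only once.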

Smart Type Inference

Automatically detects and suggests:

  • Numeric columns stored as strings
  • Categorical features with high cardinality
  • Date/time columns
  • Mixed-type columns
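
Detecting numeric columns stored as strings can be sketched with `pd.to_numeric`: try to parse each text column and see what share of values converts cleanly. The 95% cutoff and the helper name `numeric_like_columns` are assumptions for illustration, not the package's actual rule.

```python
import pandas as pd


def numeric_like_columns(df: pd.DataFrame, min_share: float = 0.95) -> list:
    """Text columns where at least min_share of non-null values parse as numbers."""
    candidates = []
    for col in df.columns:
        series = df[col]
        # Only consider text-like columns; numeric ones need no conversion
        if not (
            pd.api.types.is_object_dtype(series)
            or pd.api.types.is_string_dtype(series)
        ):
            continue
        values = series.dropna()
        if values.empty:
            continue
        # Unparseable entries become NaN instead of raising
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= min_share:
            candidates.append(col)
    return candidates


df = pd.DataFrame(
    {
        "price": ["1.5", "2", "3.25", "4"],
        "city": ["Lahore", "Karachi", "Quetta", "Multan"],
        "qty": [1, 2, 3, 4],
    }
)
candidates = numeric_like_columns(df)
print(candidates)
```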

👨‍💻 Author

Mehmood Ul Haq
Email: mehmoodulhaq1040@gmail.com
GitHub: @mehmoodulhaq570

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please read CODE_OF_CONDUCT.md first.

🔒 Security

For security issues, please see SECURITY.md.

📝 Changelog

v0.1.1 (Current)

  • Fixed swarm plot performance issues with large datasets (added sampling limit of 2000 points)
  • Fixed filename sanitization for plots with special characters
  • Improved visualization generation speed
  • Skip class imbalance check for targets with >10 unique values

v0.1.0

  • Initial release
  • Multiple target column support
  • Comprehensive statistical analysis
  • Advanced visualization suite
  • Smart auto-analysis mode
  • Caching system
  • Interactive HTML reports

🙏 Acknowledgments

Built with the modern Python data science stack and best practices for automated data analysis.
