Data analysis and reporting toolkit

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Smart Datalyzer

Smart Datalyzer is an automated toolkit for data analysis, visualization, and reporting. From a single command it produces an ML readiness score, statistical diagnostics, and high-resolution (300 DPI) plots.

🚀 Key Features

📊 Data Quality & Profiling

Smart Dataset Loading: Automatic detection of CSV/XLSX files with type inference
Duplicate Detection: Identify and report duplicate rows
Mixed Type Detection: Find columns with inconsistent data types
Auto Type Conversion: Intelligent conversion of string columns to numeric
Missing Value Analysis: Detection and imputation suggestions
Constant Column Detection: Identify features with zero variance
Scaling Issue Detection: Flag features with extreme value ranges
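
As a rough illustration of what several of these checks involve, the sketch below reimplements them with plain pandas. It is not Smart Datalyzer's own code: the function name `profile_quality`, the returned keys, and the `range_ratio` scaling rule are assumptions made for this example.

```python
import pandas as pd


def profile_quality(df: pd.DataFrame, range_ratio: float = 1000.0) -> dict:
    """Illustrative data-quality checks (not Smart Datalyzer's own code)."""
    # Rows that exactly repeat an earlier row.
    n_duplicates = int(df.duplicated().sum())
    # Columns with at most one distinct non-missing value have zero variance.
    constant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    # Percentage of missing cells per column.
    missing_pct = (df.isna().mean() * 100).round(2).to_dict()
    # Assumed scaling rule: compare each numeric column's value range with
    # the narrowest non-zero range and flag ratios at or above range_ratio.
    numeric = df.select_dtypes(include="number")
    ranges = numeric.max() - numeric.min()
    ranges = ranges[ranges > 0]
    wide = [c for c, r in ranges.items() if r / ranges.min() >= range_ratio]
    return {
        "duplicate_rows": n_duplicates,
        "constant_columns": constant,
        "missing_pct": missing_pct,
        "wide_range_columns": wide,
    }


sample = pd.DataFrame(
    {
        "a": [1, 2, 2, 4],
        "b": [7, 7, 7, 7],                # constant column
        "c": [0.1, 0.2, 0.2, None],       # one missing value
        "d": [10, 50000, 50000, 90000],   # much wider range than "c"
    }
)
report = profile_quality(sample)
print(report)
```

A ratio against the narrowest range is only one possible definition of a scaling issue; a fixed threshold on the standard deviation would work equally well for a sketch like this.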

🎯 Target-Aware Analysis (Multiple Targets Support)

Target Leakage Detection: Flag features that predict the target with more than 95% accuracy
Class Imbalance Analysis: Compute imbalance ratios and distribution statistics
Feature-Target Association: Statistical tests (ANOVA, Kruskal-Wallis, Chi-square)
Sensitivity Analysis: Permutation importance for feature ranking
Model Suggestion: Automatic recommendation (Regression vs Classification)
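
To make two of these items concrete, here is a minimal standard-library sketch of an imbalance ratio and a regression-vs-classification heuristic. Both functions are hypothetical and are not part of the package; in particular, reusing a cutoff of 10 distinct values for the task suggestion is an assumption of this example.

```python
from collections import Counter


def imbalance_ratio(labels) -> float:
    """Majority-class count divided by minority-class count (1.0 = balanced)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    return max(counts.values()) / min(counts.values())


def suggest_task(values, max_classes: int = 10) -> str:
    """Assumed heuristic: a numeric target with many distinct values
    suggests regression; anything else suggests classification."""
    distinct = set(values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "classification"


churn = ["yes"] * 10 + ["no"] * 90
revenue = [x * 0.5 for x in range(50)]

ratio = imbalance_ratio(churn)        # 90 "no" rows against 10 "yes" rows
churn_task = suggest_task(churn)
revenue_task = suggest_task(revenue)
print(ratio, churn_task, revenue_task)
```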

📈 Statistical Diagnostics

Normality Testing: Shapiro-Wilk, D'Agostino, Kolmogorov-Smirnov tests with QQ plots
Outlier Detection: Z-score-based detection with percentage reporting
Correlation Analysis: Pearson, Spearman, Kendall correlation matrices
VIF Computation: Variance Inflation Factor for multicollinearity detection
Mutual Information: Feature importance via mutual information scores
Covariance Matrix: Full covariance analysis with CSV export
High Correlation Flagging: Automatic detection of correlated pairs (>0.9)
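
The outlier and high-correlation items can be sketched with pandas as follows. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the Z-score cutoff of 3.0 is a common convention rather than a documented default, and the 0.9 threshold is applied here to the absolute correlation.

```python
import numpy as np
import pandas as pd


def zscore_outlier_pct(series: pd.Series, threshold: float = 3.0) -> float:
    """Percentage of non-missing values whose |z-score| exceeds threshold."""
    values = series.dropna().astype(float)
    if values.empty:
        return 0.0
    std = values.std(ddof=0)
    if std == 0:
        return 0.0  # a constant column has no outliers by this definition
    z = (values - values.mean()) / std
    return float((z.abs() > threshold).mean() * 100)


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9,
                    method: str = "pearson") -> list:
    """Column pairs whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr(method=method)
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


x = np.arange(100.0)
frame = pd.DataFrame(
    {
        "x": x,
        "y": 2 * x + 1,                      # perfectly correlated with x
        "noise": np.tile([1.0, -1.0], 50),   # nearly uncorrelated with x
    }
)
pairs = high_corr_pairs(frame)

readings = pd.Series([10.0] * 50 + [12.0] * 49 + [500.0])
outlier_pct = zscore_outlier_pct(readings)
print(pairs, outlier_pct)
```

`DataFrame.corr` also accepts `"spearman"` and `"kendall"` for `method`, which matches the three correlation types listed above.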

📉 Visualization Suite

Distribution Plots: Histograms with KDE overlays
Box Plots: Outlier visualization with quartile analysis
Violin Plots: Distribution density visualization
Swarm Plots: Individual data point overlay on boxplots
QQ Plots: Quantile-quantile plots for normality assessment
Correlation Heatmaps: Multiple correlation methods with annotations
Feature Importance Charts: RandomForest-based importance ranking
PCA Variance Plots: Principal component analysis visualization
t-SNE Scatter Plots: 2D dimensionality reduction visualization

📝 Reporting & Export

Interactive HTML Reports: Comprehensive analysis with embedded visualizations
JSON Export: Machine-readable summary statistics
PDF Generation: Publication-ready reports (optional)
Plot Export: High-resolution PNG plots (300 DPI)
Caching System: Smart caching for faster re-analysis

🤖 Smart Auto Mode

Automatic feature engineering recommendations
ML readiness scoring (0-100)
Actionable improvement suggestions
Complete pipeline execution with a single flag (--auto)

📦 Installation

From Source (Recommended)

# Clone the repository
git clone https://github.com/mehmoodulhaq570/smart-datalyzer.git
cd smart-datalyzer

# Install build tools
pip install build

# Build the package
python -m build

# Install
pip install dist/smart_datalyzer-0.1.1-py3-none-any.whl

Development Install

pip install -e .

🎮 Usage

Basic Usage (Single Target)

python -m smart-datalyzer data.xlsx "target_column"

Or using the installed command:

smart-datalyzer data.xlsx "target_column"

Multiple Target Columns

python -m smart-datalyzer data.csv "target1" "target2" "target3"

Command Line Arguments

python -m smart-datalyzer <file> <target> [OPTIONS]
# or
smart-datalyzer <file> <target> [OPTIONS]

Arguments:
  file                    Path to dataset (CSV or XLSX)
  target                  Target column name(s) - space separated for multiple

Options:
  --stats                 Run detailed statistical analysis
  --outliers              Detect and report outliers
  --leakage               Detect target leakage features
  --imbalance             Check class imbalance
  --plots                 Generate all visualization plots
  --report                Generate interactive HTML/JSON report
  --auto                  Run full automatic analysis (recommended)
  --max_rows N            Limit rows to read (default: 100000)
  --output_dir DIR        Output directory (default: "reports")
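
For readers who want to see how an interface like this is typically wired up, the following sketch mirrors the documented arguments with `argparse`. It is a hypothetical reconstruction from the help text above, not the package's actual parser; `build_parser` is a name invented for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="smart-datalyzer",
        description="Data analysis and reporting toolkit",
    )
    parser.add_argument("file", help="Path to dataset (CSV or XLSX)")
    # nargs="+" collects one or more space-separated target column names.
    parser.add_argument("target", nargs="+", help="Target column name(s)")
    switches = {
        "--stats": "Run detailed statistical analysis",
        "--outliers": "Detect and report outliers",
        "--leakage": "Detect target leakage features",
        "--imbalance": "Check class imbalance",
        "--plots": "Generate all visualization plots",
        "--report": "Generate interactive HTML/JSON report",
        "--auto": "Run full automatic analysis",
    }
    for flag, text in switches.items():
        parser.add_argument(flag, action="store_true", help=text)
    parser.add_argument("--max_rows", type=int, default=100000, metavar="N",
                        help="Limit rows to read")
    parser.add_argument("--output_dir", default="reports", metavar="DIR",
                        help="Output directory")
    return parser


args = build_parser().parse_args(
    ["experiment.xlsx", "Outcome1", "Outcome2", "--auto", "--output_dir", "results"]
)
print(args.file, args.target, args.auto, args.output_dir)
```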

Examples

Quick Analysis:

python -m smart-datalyzer sales.xlsx "Revenue" --auto

Detailed Statistical Report:

smart-datalyzer customers.csv "Churn" --stats --plots --report

Multiple Targets with Custom Output:

smart-datalyzer experiment.xlsx "Outcome1" "Outcome2" --auto --output_dir results

Outlier & Leakage Detection:

python -m smart-datalyzer medical.csv "Disease" --outliers --leakage

📊 Output Structure

reports/
├── plots/
│   ├── *_distribution.png      # Distribution histograms
│   ├── *_boxplot.png           # Box plots
│   ├── *_violinplot.png        # Violin plots
│   ├── *_swarmplot.png         # Swarm plots
│   ├── *_qqplot.png            # QQ plots
│   ├── correlation_*.png       # Correlation heatmaps
│   ├── feature_importance.png  # Feature importance chart
│   ├── pca_variance.png        # PCA analysis
│   └── tsne_scatter.png        # t-SNE visualization
├── report.html                 # Interactive HTML report
├── summary.json                # JSON summary statistics
├── covariance_matrix.csv       # Covariance matrix
└── .cache/                     # Analysis cache

🧰 Python API Usage

from datalyzer.utils import load_dataset
from datalyzer.stats import feature_statistics, detect_outliers
from datalyzer.plots import plot_distributions, plot_correlation

# Load data
df = load_dataset("data.csv")

# Get statistics
stats, readiness, suggestions = feature_statistics(df)
print(f"ML Readiness Score: {readiness}/100")

# Detect outliers in the numeric columns
numeric_cols = df.select_dtypes(include="number").columns
outliers = detect_outliers(df, numeric_cols)

# Generate plots
plot_paths = plot_distributions(df, plots_dir="reports/plots")
correlation_paths = plot_correlation(df, plots_dir="reports/plots")

🔧 Dependencies

Core Requirements

pandas - Data manipulation
numpy - Numerical computing
scipy - Statistical functions
statsmodels - Advanced statistics
scikit-learn - Machine learning utilities
matplotlib - Plotting backend
seaborn - Statistical visualizations
rich - Terminal formatting

See requirements.txt for the complete list.

🎨 Features in Detail

ML Readiness Score

Smart Datalyzer computes an ML readiness score (0-100) based on:

Missing value percentage
Constant features
Numeric vs categorical balance
Duplicate rows
Data quality issues
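
The exact formula is not documented here, so the sketch below only shows how a penalty-based score of this kind could be assembled. The weights, the caps, and the function name `readiness_score` are all assumptions, and only three of the listed factors (missing values, constant features, duplicate rows) are covered.

```python
import pandas as pd


def readiness_score(df: pd.DataFrame) -> tuple[int, list[str]]:
    """Assumed scoring scheme: start at 100 and subtract capped penalties."""
    if df.empty:
        return 0, ["Dataset is empty."]

    suggestions = []
    # Share of missing cells across the whole table, in percent.
    missing_pct = float(df.isna().to_numpy().mean() * 100)
    constant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    duplicate_pct = float(df.duplicated().mean() * 100)

    # Hypothetical weights: each penalty is capped so that no single
    # problem can push the score to zero on its own.
    penalty = (
        min(40.0, missing_pct)
        + min(20.0, 5.0 * len(constant))
        + min(20.0, duplicate_pct)
    )
    if missing_pct > 0:
        suggestions.append(f"Impute or drop missing values ({missing_pct:.1f}% of cells).")
    if constant:
        suggestions.append(f"Drop constant columns: {', '.join(map(str, constant))}.")
    if duplicate_pct > 0:
        suggestions.append(f"Remove duplicate rows ({duplicate_pct:.1f}% of rows).")

    score = int(round(max(0.0, 100.0 - penalty)))
    return score, suggestions


messy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [1.0, None, 3.0, 4.0], "c": [5, 5, 5, 5]}
)
clean = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

messy_score, messy_suggestions = readiness_score(messy)
clean_score, clean_suggestions = readiness_score(clean)
print(messy_score, messy_suggestions)
print(clean_score, clean_suggestions)
```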

Caching System

Analysis results are cached automatically using SHA-256 hashing, which provides:

Faster re-analysis of same datasets
Incremental updates
Reduced computation time
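
One common way to build such a cache is to hash the input file and store results under the digest. The sketch below follows that pattern with the standard library only. It assumes the hash is taken over the file contents and that results are JSON-serializable; the function names and the cache layout are hypothetical and may differ from what the package does.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_analysis(data_path, cache_dir, analyze):
    """Return the cached result for this file's contents, or compute and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_sha256(data_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = analyze(data_path)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def count_rows(path):
    """Stand-in for an expensive analysis step."""
    calls.append(str(path))
    with open(path, encoding="utf-8") as fh:
        return {"rows": sum(1 for _ in fh) - 1}  # minus the header line


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("a,b\n1,2\n3,4\n", encoding="utf-8")
    digest = file_sha256(data)
    first = cached_analysis(data, Path(tmp) / ".cache", count_rows)
    second = cached_analysis(data, Path(tmp) / ".cache", count_rows)  # cache hit

print(first, second, len(calls))
```

Keying on the file contents rather than the file name means that a renamed copy of the same dataset still hits the cache, while any edit to the data forces a fresh run.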

Smart Type Inference

Automatically detects and suggests fixes for:

Numeric columns stored as strings
Categorical features with high cardinality
Date/time columns
Mixed-type columns
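
A minimal sketch of this kind of inference with pandas is shown below. The function name, the labels it returns, and both thresholds are assumptions made for illustration. Date-like strings are not parsed in this sketch, so only columns that already have a datetime dtype are labeled as dates.

```python
import pandas as pd


def infer_column_kinds(df: pd.DataFrame, numeric_share: float = 0.95,
                       max_cardinality: int = 50) -> dict:
    """Label each column with a coarse kind (illustrative heuristic only)."""
    kinds = {}
    for name in df.columns:
        col = df[name].dropna()
        if col.empty:
            kinds[name] = "empty"
        elif pd.api.types.is_numeric_dtype(col):
            kinds[name] = "numeric"
        elif pd.api.types.is_datetime64_any_dtype(col):
            kinds[name] = "datetime"
        elif len({type(v).__name__ for v in col}) > 1:
            # More than one Python type in the same column.
            kinds[name] = "mixed"
        elif pd.to_numeric(col, errors="coerce").notna().mean() >= numeric_share:
            # Nearly every value parses as a number: numbers stored as text.
            kinds[name] = "numeric_as_string"
        elif col.nunique() > max_cardinality:
            kinds[name] = "high_cardinality"
        else:
            kinds[name] = "categorical"
    return kinds


table = pd.DataFrame(
    {
        "price": ["1", "2", "3.5"],
        "city": ["Oslo", "Rome", "Oslo"],
        "mixed": [1, "two", 3.0],
        "n": [1, 2, 3],
    }
)
kinds = infer_column_kinds(table)
print(kinds)
```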

👨‍💻 Author

Mehmood Ul Haq
Email: mehmoodulhaq1040@gmail.com
GitHub: @mehmoodulhaq570

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please read CODE_OF_CONDUCT.md first.

🔒 Security

For security issues, please see SECURITY.md.

📝 Changelog

v0.1.1 (Current)

Fixed swarm plot performance issues with large datasets (added a sampling limit of 2,000 points)
Fixed filename sanitization for plots with special characters
Improved visualization generation speed
Skipped the class imbalance check for targets with more than 10 unique values

v0.1.0

Initial release
Multiple target column support
Comprehensive statistical analysis
Advanced visualization suite
Smart auto-analysis mode
Caching system
Interactive HTML reports

🙏 Acknowledgments

Built with the modern Python data science stack and best practices for automated data analysis.

Release history

0.1.1 (this version): Nov 6, 2025
0.1.0: Oct 12, 2025

Download files

Source Distribution

smart_datalyzer-0.1.1.tar.gz (13.3 MB)

Uploaded Nov 6, 2025 Source

Built Distribution

smart_datalyzer-0.1.1-py3-none-any.whl (21.6 kB)

Uploaded Nov 6, 2025 Python 3

File details

Details for the file smart_datalyzer-0.1.1.tar.gz.

File metadata

Download URL: smart_datalyzer-0.1.1.tar.gz
Upload date: Nov 6, 2025
Size: 13.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.10

File hashes

Hashes for smart_datalyzer-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`bc1a3f52b502b8425725b71c33995fa9a3ad91477f2b92c62c48ab863da8ab1c`
MD5	`cac87570963c814d41424f8a62bad7cc`
BLAKE2b-256	`0c446658272b6632d1cc1dff8bae472bb44d532044c22318ba4982951cc337bd`

File details

Details for the file smart_datalyzer-0.1.1-py3-none-any.whl.

File metadata

Download URL: smart_datalyzer-0.1.1-py3-none-any.whl
Upload date: Nov 6, 2025
Size: 21.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.10

File hashes

Hashes for smart_datalyzer-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1e1359ea4d32ab9c005e4aff73801e3c1710297cc98626aa3a01252bcaeaf72b`
MD5	`e3c6f2f258f7944e8b8c68c0f4f6924d`
BLAKE2b-256	`784d0e2addc9d14d74c2d8c8064a42a39dc7ad856212b89d8fb02aab0f00156a`
