Advanced Machine Learning Problem Detection with CLI and GUI interfaces
ML Sniff 🕵️‍♂️
Advanced Machine Learning Problem Detection from CSV files and DataFrames
By Sherin Joseph Roy - Startup Founder & Hardware/IoT Enthusiast
ML Sniff is a comprehensive Python package that automatically analyzes your data to determine the most likely machine learning problem type, identifies the target column, suggests appropriate models, and provides advanced data analytics.
🚀 Features
- 🔍 Automatic Target Detection: Uses advanced heuristics to identify the most likely target column
- 🎯 Problem Type Classification: Determines whether your data poses a classification, regression, or clustering problem
- 🤖 Model Suggestions: Recommends appropriate algorithms with hyperparameters
- 📊 Comprehensive Analysis: Provides detailed statistics and visualizations
- 🏆 Feature Importance: Multiple methods (Random Forest, Mutual Information, Correlation)
- 🔍 Data Quality Assessment: Missing data, duplicates, outliers, and variance analysis
- 📈 Advanced Visualizations: Static plots and interactive Plotly dashboards
- 🖥️ CLI Support: Analyze files directly from the command line
- 🖥️ Web GUI: Beautiful Streamlit interface with interactive dashboards
- 📤 Export Capabilities: Export reports in JSON, CSV, or TXT formats
- 🛠️ Preprocessing Suggestions: Automated recommendations for data preparation
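The target detection and problem type features listed above can be pictured with a small heuristic. The following is only an illustrative sketch, not ML Sniff's actual logic: the function names `guess_target` and `guess_problem_type`, the list of common target names, and the `max_classes` cutoff are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical list of conventional target column names (an assumption).
COMMON_TARGET_NAMES = ("target", "label", "class", "y")


def guess_target(df: pd.DataFrame) -> str:
    """Pick a conventionally named column if present, else the last column."""
    for col in df.columns:
        if str(col).lower() in COMMON_TARGET_NAMES:
            return col
    return df.columns[-1]


def guess_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Guess the task from the target's dtype and number of distinct values."""
    values = target.dropna()
    if values.empty:
        return "Clustering"  # no usable labels at all
    if not pd.api.types.is_numeric_dtype(values):
        return "Classification"  # text or categorical labels
    if pd.api.types.is_integer_dtype(values) and values.nunique() <= max_classes:
        return "Classification"  # a few distinct integer codes
    return "Regression"  # continuous or high-cardinality numeric target


df = pd.DataFrame({
    "feature1": [0.1, 0.5, 0.9, 1.3],
    "target": [0, 1, 0, 2],
})
col = guess_target(df)
print(col, guess_problem_type(df[col]))  # target Classification
```

Heuristics like this are easy to fool (for example, an integer ID column), which is one reason a manual target override is useful.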
📦 Installation
From PyPI
pip install ml-sniff
From Source
git clone https://github.com/Sherin-SEF-AI/ml-sniffer.git
cd ml-sniffer
pip install .
🚀 Quick Start
Command Line Interface
Basic analysis:
ml-sniff your_data.csv
Show visualizations:
ml-sniff your_data.csv --visualize
Create interactive dashboard:
ml-sniff your_data.csv --interactive
Export detailed report:
ml-sniff your_data.csv --export report.json --format json
Show preprocessing suggestions:
ml-sniff your_data.csv --preprocessing
Show feature importance:
ml-sniff your_data.csv --feature-importance
Show data quality report:
ml-sniff your_data.csv --data-quality
Specify target column manually:
ml-sniff your_data.csv --target target_column
Web Interface (GUI)
Launch the Streamlit web interface in any of three ways:
# Method 1: Using the launcher script
python run_gui.py
# Method 2: Direct streamlit command
streamlit run streamlit_app.py
# Method 3: Using the command line entry point
ml-sniff-gui
The GUI will open in your browser at http://localhost:8501 and provides:
- 📁 File Upload: Drag and drop CSV files
- 🎯 Interactive Analysis: Real-time analysis with visual feedback
- 📊 Interactive Charts: Plotly visualizations with zoom, pan, and hover
- 🏆 Feature Analysis: Multiple importance methods with interactive charts
- 🔍 Data Quality: Comprehensive quality assessment with detailed reports
- 📈 Visualizations: Correlation matrices, distributions, and outlier analysis
- 📤 Export: Download reports in multiple formats
- ⚙️ Customization: Toggle features and analysis options
Python API
from ml_sniff import Sniffer
# Basic analysis
sniffer = Sniffer("your_data.csv")
sniffer.report()
# Advanced analysis with manual target
sniffer = Sniffer("your_data.csv", target_column="target")
sniffer.report()
# Get feature importance
top_features = sniffer.get_top_features(5, method='random_forest')
print(f"Top features: {top_features}")
# Get preprocessing suggestions
suggestions = sniffer.suggest_preprocessing()
print(suggestions)
# Create visualizations
sniffer.visualize_data()
sniffer.create_interactive_dashboard()
# Export report
sniffer.export_report("analysis.json", format="json")
🔧 Advanced Features
Feature Importance Analysis
ML Sniff provides multiple methods for feature importance:
# Random Forest importance
rf_importance = sniffer.get_feature_importance('random_forest')
# Mutual Information
mi_importance = sniffer.get_feature_importance('mutual_info')
# Correlation-based
corr_importance = sniffer.get_feature_importance('correlation')
# Get top features
top_features = sniffer.get_top_features(5, method='random_forest')
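For readers who want to see what the three importance methods compute, here is a standalone sketch using scikit-learn and pandas directly. It is not ML Sniff's internal code: the synthetic data, the column names `informative` and `noise`, and the helper `top_features` are assumptions for illustration. The Random Forest hyperparameters are taken from the example report in this README.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: one feature determines the label, the other is pure noise.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["informative"] > 0).astype(int)

# Random Forest: impurity-based importances, normalized to sum to 1.
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X, y)
rf_scores = dict(zip(X.columns, rf.feature_importances_))

# Mutual information between each feature and the class label.
mi_scores = dict(zip(X.columns, mutual_info_classif(X, y, random_state=42)))

# Correlation: absolute Pearson correlation with the target.
corr_scores = X.corrwith(y).abs().to_dict()


def top_features(scores, n=5):
    """Return the n feature names with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


for name, scores in [
    ("random_forest", rf_scores),
    ("mutual_info", mi_scores),
    ("correlation", corr_scores),
]:
    print(name, top_features(scores, 1))
```

The three methods can rank features differently: correlation only captures linear relationships, while mutual information and tree-based importances also pick up nonlinear ones.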
Data Quality Assessment
Comprehensive data quality analysis:
# Get data quality summary
quality_issues = sniffer.get_data_quality_summary()
# Access detailed quality metrics
quality_report = sniffer.data_quality_report
# Check for specific issues
missing_columns = quality_issues['high_missing']
outlier_columns = quality_issues['many_outliers']
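The kind of checks behind a quality summary can be sketched with plain pandas. This is a hedged illustration, not the package's implementation: the function `quality_summary`, both thresholds, and the `low_variance` and `duplicate_rows` keys are assumptions. Only `high_missing` and `many_outliers` appear in the snippet above.

```python
import numpy as np
import pandas as pd


def quality_summary(df, missing_threshold=0.2, outlier_threshold=0.05):
    """Flag columns by missing-value share and IQR-based outlier share."""
    issues = {"high_missing": [], "many_outliers": [], "low_variance": []}
    for col in df.columns:
        if df[col].isna().mean() > missing_threshold:
            issues["high_missing"].append(col)
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        if values.empty:
            continue
        if values.nunique() <= 1:
            issues["low_variance"].append(col)  # constant column
            continue
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        # Points beyond 1.5 * IQR from the quartiles count as outliers.
        outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        if outside.mean() > outlier_threshold:
            issues["many_outliers"].append(col)
    issues["duplicate_rows"] = int(df.duplicated().sum())
    return issues


df = pd.DataFrame({
    "sparse": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0, 8.0, 9.0, 10.0],
    "spiky": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 50.0],
    "constant": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
})
print(quality_summary(df))
```

Here `sparse` is flagged for missing data (30% missing), `spiky` for outliers (the value 50.0), and `constant` for having no variance.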
Preprocessing Suggestions
Automated recommendations for data preparation:
suggestions = sniffer.suggest_preprocessing()
# Missing data handling
missing_suggestions = suggestions['missing_data']
# Outlier handling
outlier_suggestions = suggestions['outliers']
# Feature scaling
scaling_suggestions = suggestions['scaling']
# Categorical encoding
encoding_suggestions = suggestions['encoding']
# Feature selection
selection_suggestions = suggestions['feature_selection']
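Suggestions in these categories still have to be applied by hand. The sketch below shows one plausible way to act on the missing data, outlier, scaling, and encoding categories with pandas alone. The function `apply_basic_preprocessing` and the specific choices (median imputation, IQR clipping, standardization, one-hot encoding) are assumptions for illustration, not what ML Sniff necessarily recommends for a given dataset.

```python
import pandas as pd


def apply_basic_preprocessing(df, target):
    """Impute, clip outliers, standardize, and one-hot encode the features."""
    out = df.copy()
    features = [c for c in out.columns if c != target]
    numeric = out[features].select_dtypes(include="number").columns
    categorical = [c for c in features if c not in numeric]
    out[numeric] = out[numeric].astype(float)

    # Missing data: median for numeric columns, most frequent value otherwise.
    for col in numeric:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    for col in numeric:
        # Outliers: clip to the 1.5 * IQR fences.
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        # Scaling: standardize to zero mean and unit variance.
        std = out[col].std(ddof=0)
        if std > 0:
            out[col] = (out[col] - out[col].mean()) / std

    # Encoding: one-hot encode the categorical features.
    return pd.get_dummies(out, columns=categorical)


df = pd.DataFrame({
    "size": [1.0, 2.0, None, 4.0, 100.0],
    "color": ["A", "B", None, "A", "A"],
    "target": [0, 1, 0, 1, 1],
})
result = apply_basic_preprocessing(df, target="target")
print(list(result.columns))  # ['size', 'target', 'color_A', 'color_B']
```

In a real project these statistics should be fitted on the training split only and then reused on validation and test data, to avoid leakage.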
Interactive Dashboard
Create interactive Plotly dashboards:
# Create interactive dashboard
sniffer.create_interactive_dashboard()
📊 Example Output
================================================================================
ML SNIFF - ADVANCED ML PROBLEM DETECTION
================================================================================
📊 BASIC STATISTICS:
• Rows: 1,000
• Columns: 10
• Missing Data: 2.50%
• Memory Usage: 0.78 MB
• Numeric Columns: 6
• Categorical Columns: 1
📋 DATA TYPES:
• float64: 6 columns
• int64: 3 columns
• object: 1 columns
🔍 DATA QUALITY ASSESSMENT:
• High Missing: feature3
• Many Outliers: feature1, feature2
🎯 TARGET COLUMN ANALYSIS:
• Identified Target: 'target'
• Problem Type: Classification
• Suggested Model: RandomForestClassifier
• Target Statistics:
- Data Type: int64
- Unique Values: 3
- Missing Values: 0
- Mean: 1.2000
- Std: 0.8165
- Min: 0.0000
- Max: 2.0000
- Skewness: 0.0000
- Kurtosis: -1.5000
- Label Distribution:
* 0: 400 (40.0%)
* 1: 350 (35.0%)
* 2: 250 (25.0%)
🏆 FEATURE IMPORTANCE:
1. feature1: 0.3800
2. feature3: 0.2628
3. feature4: 0.2000
4. feature2: 0.1572
💡 MODEL RECOMMENDATIONS:
• Primary Model: RandomForestClassifier
• Hyperparameters: {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
• Alternative Models: LogisticRegression, SVM, XGBClassifier
• Consider class imbalance if present
• Use metrics like accuracy, precision, recall, F1-score
================================================================================
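As a small illustration of how the label distribution lines in such a report can be derived, the following sketch counts labels and formats each line with its percentage. The function `label_distribution` is a hypothetical helper written for this example. It uses only the standard library and is not taken from ML Sniff.

```python
from collections import Counter


def label_distribution(labels):
    """Return one formatted line per label: count and share of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [
        f"* {label}: {count} ({count / total:.1%})"
        for label, count in sorted(counts.items())
    ]


# Same class sizes as in the example report above.
labels = [0] * 400 + [1] * 350 + [2] * 250
for line in label_distribution(labels):
    print(line)
# * 0: 400 (40.0%)
# * 1: 350 (35.0%)
# * 2: 250 (25.0%)
```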
🛠️ CLI Options
ml-sniff [OPTIONS] FILE
Options:
  --target, -t TEXT            Manually specify target column name
  --visualize, -v              Show data visualizations
  --interactive, -i            Create interactive Plotly dashboard
  --output, -o TEXT            Save report to file instead of printing to console
  --export, -e TEXT            Export detailed analysis report to file
  --format, -f [json|csv|txt]  Export format (default: json)
  --summary, -s                Show only summary information
  --preprocessing, -p          Show preprocessing suggestions
  --no-auto-analyze            Skip automatic analysis on initialization
  --feature-importance         Show feature importance analysis
  --data-quality               Show detailed data quality report
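To make the option set concrete, here is a sketch that mirrors the documented flags with the standard library's `argparse`. It is an assumption-laden stand-in: the package's real command-line parser may be built with a different library and may behave differently. `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented above."""
    parser = argparse.ArgumentParser(prog="ml-sniff")
    parser.add_argument("file", metavar="FILE", help="CSV file to analyze")
    parser.add_argument("--target", "-t", help="target column name")
    parser.add_argument("--visualize", "-v", action="store_true")
    parser.add_argument("--interactive", "-i", action="store_true")
    parser.add_argument("--output", "-o", help="save report to this file")
    parser.add_argument("--export", "-e", help="export detailed report to this file")
    parser.add_argument(
        "--format", "-f", choices=["json", "csv", "txt"], default="json"
    )
    parser.add_argument("--summary", "-s", action="store_true")
    parser.add_argument("--preprocessing", "-p", action="store_true")
    parser.add_argument("--no-auto-analyze", action="store_true")
    parser.add_argument("--feature-importance", action="store_true")
    parser.add_argument("--data-quality", action="store_true")
    return parser


args = build_parser().parse_args(
    ["data.csv", "--export", "report.json", "-f", "json", "--data-quality"]
)
print(args.export, args.format, args.data_quality)  # report.json json True
```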
📈 Sample Data
Create sample datasets to test the package:
import pandas as pd
import numpy as np
# Classification dataset
np.random.seed(42)
n_samples = 1000
classification_data = {
    'feature1': np.random.normal(0, 1, n_samples),
    'feature2': np.random.normal(0, 1, n_samples),
    'feature3': np.random.normal(0, 1, n_samples),
    'feature4': np.random.normal(0, 1, n_samples),
    'categorical_feature': np.random.choice(['A', 'B', 'C'], n_samples),
    'target': np.random.choice([0, 1, 2], n_samples, p=[0.4, 0.35, 0.25])
}
df = pd.DataFrame(classification_data)
df.to_csv('classification_sample.csv', index=False)
# Regression dataset
regression_data = {
    'feature1': np.random.normal(0, 1, n_samples),
    'feature2': np.random.normal(0, 1, n_samples),
    'feature3': np.random.normal(0, 1, n_samples),
    'target': np.random.normal(0, 1, n_samples)
}
df_reg = pd.DataFrame(regression_data)
df_reg.to_csv('regression_sample.csv', index=False)
🔬 API Reference
Sniffer Class
__init__(data, target_column=None, auto_analyze=True)
Initialize the Sniffer with data.
Parameters:
data: CSV file path (str/Path) or pandas DataFrame
target_column: Optional manual target column specification
auto_analyze: Whether to automatically analyze data on initialization
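A constructor that accepts either a path or a DataFrame typically normalizes its input first. The sketch below shows one way that could look. `load_data` is a hypothetical helper for illustration and is not part of the documented API.

```python
from pathlib import Path

import pandas as pd


def load_data(data):
    """Return a DataFrame from a CSV path (str/Path) or an existing DataFrame."""
    if isinstance(data, pd.DataFrame):
        return data.copy()  # avoid mutating the caller's frame
    if isinstance(data, (str, Path)):
        path = Path(data)
        if not path.is_file():
            raise FileNotFoundError(f"CSV file not found: {path}")
        return pd.read_csv(path)
    raise TypeError(f"Expected a file path or DataFrame, got {type(data).__name__}")


original = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
loaded = load_data(original)
print(loaded.equals(original), loaded is original)  # True False
```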
report()
Print a comprehensive analysis report to console.
get_summary()
Get analysis results as a dictionary.
Returns:
- Dictionary with keys:
target_column, problem_type, suggested_model, basic_stats, label_distribution, feature_importance, data_quality_report, outlier_info, clustering_analysis, quality_issues
get_feature_importance(method='random_forest')
Get feature importance scores.
Parameters:
method: 'random_forest', 'mutual_info', or 'correlation'
Returns:
- Dictionary of feature importance scores
get_top_features(n=5, method='random_forest')
Get top n most important features.
Parameters:
n: Number of top features to return
method: Feature importance method to use
Returns:
- List of top feature names
get_data_quality_summary()
Get a summary of data quality issues.
Returns:
- Dictionary with data quality summary
suggest_preprocessing()
Suggest preprocessing steps based on data analysis.
Returns:
- Dictionary with preprocessing suggestions
visualize_data(figsize=(15, 10))
Generate comprehensive data visualizations.
create_interactive_dashboard()
Create an interactive Plotly dashboard.
export_report(filename, format='json')
Export analysis report to file.
Parameters:
filename: Output filename
format: 'json', 'csv', or 'txt'
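The three export formats can be pictured with a standard-library sketch that writes a summary dictionary to disk. This is an illustration under assumptions: `export_summary` is a hypothetical function, and the exact file layout that `export_report` produces is not specified in this README.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_summary(summary, filename, format="json"):
    """Write a summary dict as JSON, as key/value CSV rows, or as plain text."""
    path = Path(filename)
    if format == "json":
        path.write_text(json.dumps(summary, indent=2, default=str), encoding="utf-8")
    elif format == "csv":
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["key", "value"])
            for key, value in summary.items():
                # Nested values are serialized as JSON so each fits in one cell.
                writer.writerow([key, json.dumps(value, default=str)])
    elif format == "txt":
        lines = [f"{key}: {value}" for key, value in summary.items()]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    else:
        raise ValueError(f"Unsupported format: {format!r}")
    return path


summary = {
    "target_column": "target",
    "problem_type": "Classification",
    "suggested_model": "RandomForestClassifier",
}
with tempfile.TemporaryDirectory() as tmp:
    path = export_summary(summary, Path(tmp) / "analysis.json", format="json")
    print(json.loads(path.read_text(encoding="utf-8"))["problem_type"])  # Classification
```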
🧪 Development
Setup Development Environment
git clone https://github.com/Sherin-SEF-AI/ml-sniffer.git
cd ml-sniffer
pip install -e ".[dev]"
Run Tests
pytest tests/
Code Formatting
black ml_sniff/
flake8 ml_sniff/
📋 Dependencies
- pandas >= 1.3.0
- numpy >= 1.20.0
- matplotlib >= 3.3.0
- seaborn >= 0.11.0
- scikit-learn >= 1.0.0
- scipy >= 1.7.0
- plotly >= 5.0.0
🚀 Roadmap
- Support for more file formats (Excel, JSON, etc.)
- Advanced feature engineering suggestions
- Model performance estimation
- Integration with popular ML libraries
- Web interface
- Batch processing capabilities
- Time series analysis
- Anomaly detection
- AutoML integration
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
If you encounter any issues or have questions, please:
- Check the documentation
- Search existing issues
- Create a new issue
Made with ❤️ for the ML community
File details
Details for the file ml_sniff-1.0.0.tar.gz.
File metadata
- Download URL: ml_sniff-1.0.0.tar.gz
- Upload date:
- Size: 32.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0e6b2ce7c8d02074cf4b83cbc77819a614c90d8ae0c66e72c7f47178fa6a3ad9 |
| MD5 | 02bdc5e6dade85334b9f0722b786efb2 |
| BLAKE2b-256 | d381bd06d2bc3480c5c6c585cc36bd5101113ee0ea1058b73809677c8bd64bfd |
File details
Details for the file ml_sniff-1.0.0-py3-none-any.whl.
File metadata
- Download URL: ml_sniff-1.0.0-py3-none-any.whl
- Upload date:
- Size: 26.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 40dfb8fd4d7b0abe27eab1f58aabb22489f87338321471da9bd4669a589c3026 |
| MD5 | 8c5f56cdf46de76b63be53663b9f4a2c |
| BLAKE2b-256 | f6889aa371a6bfb22e016bcb845bba0eb64c04d25a2aa1f55fbeeef320e78df9 |