PreMLCheck
An intelligent Python library that analyzes datasets before training machine learning models.
PreMLCheck acts as your pre-training ML advisor: it helps you understand your data, detect potential problems, and make informed machine learning decisions before you waste time on training.
One Line Summary: PreMLCheck analyzes your dataset and tells you everything you need to know before you start training machine learning models.
Project Structure

PreMLCheck-Library/
│
├── premlcheck/                     # Main package
│   ├── __init__.py                 # Package initialization & public API
│   ├── analyzer.py                 # Main PreMLCheck orchestrator class
│   ├── config.py                   # Configuration defaults & constants
│   ├── task_detector.py            # Module 1: Detect ML task type
│   ├── quality_checker.py          # Module 2: Dataset quality assessment
│   ├── overfitting_predictor.py    # Module 3: Overfitting risk prediction
│   ├── model_recommender.py        # Module 4: ML model recommendations
│   ├── performance_estimator.py    # Module 5: Performance estimation
│   ├── preprocessing_advisor.py    # Module 6: Preprocessing suggestions
│   ├── report_generator.py         # Module 7: Report generation (MD/HTML/JSON)
│   │
│   └── utils/                      # Utility helpers
│       ├── __init__.py             # Utils package exports
│       ├── metrics.py              # Metric calculations & data statistics
│       ├── validators.py           # Input validation functions
│       └── visualizers.py          # Visualization utilities (optional)
│
├── tests/                          # Test suite
│   ├── __init__.py
│   ├── test_task_detector.py
│   ├── test_quality_checker.py
│   ├── test_overfitting_predictor.py
│   ├── test_model_recommender.py
│   ├── test_performance_estimator.py
│   ├── test_preprocessing_advisor.py
│   ├── test_report_generator.py
│   └── test_integration.py         # End-to-end integration tests
│
├── examples/                       # Usage examples
│   ├── basic_usage.py
│   └── sample_datasets/
│       ├── classification_sample.csv
│       └── regression_sample.csv
│
├── docs/                           # Documentation
│   ├── API.md                      # Full API reference
│   ├── CHANGELOG.md
│   └── CONTRIBUTING.md
│
├── setup.py                        # Package setup (setuptools)
├── pyproject.toml                  # PEP 517/518 build configuration
├── requirements.txt                # Core dependencies
├── requirements-dev.txt            # Development dependencies
├── MANIFEST.in                     # Distribution manifest
├── LICENSE                         # MIT License
├── README.md                       # This file
├── BUILD_AND_PUBLISH.md            # PyPI publishing guide
├── PYPI_CHECKLIST.md               # Pre-publish checklist
├── verify_package.py               # Package verification script
└── .gitignore
Features
PreMLCheck runs 7 analysis modules on your dataset in a single call:
1. Detect ML Task Type
Automatically identifies whether your problem is Classification or Regression by analyzing the target variable's data type, number of unique values, and distribution. Returns a confidence score (0-1).
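To make the idea concrete, here is a minimal sketch of the kind of heuristic such a detector might use. This is an illustration only, not PreMLCheck's actual implementation; the function name, `max_classes` cutoff, and confidence formula are all assumptions:

```python
import pandas as pd

def detect_task_type(y: pd.Series, max_classes: int = 20):
    """Rough heuristic: non-numeric dtype or few unique values -> classification."""
    n_unique = y.nunique()
    if not pd.api.types.is_numeric_dtype(y) or n_unique <= max_classes:
        # Confidence grows as the number of distinct classes shrinks
        confidence = 1.0 - (n_unique - 1) / max(max_classes, n_unique)
        return "classification", round(max(confidence, 0.5), 2)
    return "regression", 0.9

task, conf = detect_task_type(pd.Series([0, 1, 0, 1, 1]))  # binary target
```

A real detector would also inspect the value distribution (e.g. whether numeric values are continuous), which this sketch omits.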
2. Check Dataset Quality
Calculates a Dataset Health Score (0-100) by examining:
- Missing values: percentage of null/NaN cells across all columns
- Class imbalance: ratio between majority and minority classes (classification only)
- Feature redundancy: highly correlated feature pairs (Pearson > 0.95)
- Sample-to-feature ratio: whether you have enough rows for the number of columns
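The first three of these checks can be sketched in a few lines of pandas. This is a simplified illustration of the underlying statistics, not the library's code; the function name and return format are made up:

```python
import pandas as pd

def quality_snapshot(df: pd.DataFrame, corr_threshold: float = 0.95):
    # Share of null/NaN cells across the whole frame, as a percentage
    missing_pct = df.isna().mean().mean() * 100

    # Highly correlated numeric feature pairs (upper triangle only)
    corr = df.select_dtypes("number").corr().abs()
    redundant = [
        (a, b)
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] > corr_threshold
    ]

    # Rows per column: a rough proxy for "enough data"
    ratio = len(df) / max(df.shape[1], 1)
    return {
        "missing_pct": missing_pct,
        "redundant_pairs": redundant,
        "sample_to_feature_ratio": ratio,
    }
```

Combining these signals into a single 0-100 health score is then a matter of weighting, which PreMLCheck handles via its configurable thresholds.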
3. Predict Overfitting Risk
Estimates overfitting risk as Low, Medium, or High based on:
- Sample-to-feature ratio
- Dataset size relative to complexity
- High-dimensional features
- Missing data patterns
- Feature correlation structure
Each risk factor is listed with a description and severity.
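The central signal here is the sample-to-feature ratio. As a hedged illustration of how a ratio-based risk tier could be assigned (the cutoffs below are invented for the example, not PreMLCheck's defaults):

```python
def overfitting_risk(n_samples: int, n_features: int, low_ratio: float = 10.0):
    """Toy rule: risk rises as the sample-to-feature ratio falls."""
    ratio = n_samples / max(n_features, 1)
    if ratio < low_ratio:
        return "High"      # fewer than ~10 samples per feature
    if ratio < 5 * low_ratio:
        return "Medium"
    return "Low"
```

The real predictor combines several such factors (dimensionality, missing-data patterns, correlation structure) rather than a single ratio.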
4. Recommend Best ML Models
Suggests the most suitable algorithms based on your dataset's characteristics:
- Dataset size (small / medium / large)
- Dimensionality (few features vs. high-dimensional)
- Task type (classification or regression)
- Class imbalance level
Models are scored and ranked by suitability with reasons for each recommendation.
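A toy rule-based ranker shows the shape of such a recommendation: score each candidate from dataset traits, then sort. The specific models, scores, and rules below are illustrative assumptions, not the library's actual tables:

```python
def recommend_models(n_samples: int, task_type: str, imbalanced: bool = False):
    """Score candidate models from coarse dataset traits and rank them."""
    scores = {}
    if task_type == "classification":
        scores["Random Forest"] = 80 + (10 if imbalanced else 0)
        scores["Logistic Regression"] = 85 if n_samples < 1_000 else 70
        scores["Gradient Boosting"] = 75 + (10 if n_samples >= 1_000 else 0)
    else:
        scores["Random Forest Regressor"] = 80
        scores["Linear Regression"] = 85 if n_samples < 1_000 else 70
        scores["Gradient Boosting Regressor"] = 75 + (10 if n_samples >= 1_000 else 0)
    # Highest-scoring model first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

PreMLCheck attaches a human-readable reason to each ranked entry, which a production version of this sketch would carry alongside the score.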
5. Estimate Expected Performance
Predicts approximate accuracy or error range before full training by:
- Training lightweight baseline models (Decision Tree)
- Running cross-validation (5-fold by default)
- Computing confidence intervals and bounds
- Classification: accuracy, precision, recall, F1-score
- Regression: MSE, RMSE, MAE, R²
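The baseline-plus-cross-validation approach can be reproduced directly with scikit-learn. This sketch uses a synthetic dataset and a fold-variance interval as a stand-in for the library's estimator:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a user's dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Lightweight baseline model, 5-fold cross-validation
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
mean, std = scores.mean(), scores.std()

# Rough 95% band from fold-to-fold variation
interval = (mean - 2 * std, mean + 2 * std)
```

The point of such an estimate is a cheap sanity check: if even a shallow tree scores near chance, expensive tuning is unlikely to rescue the dataset as-is.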
6. Give Preprocessing Suggestions
Recommends specific preprocessing steps with priority levels (High / Medium / Low) and ready-to-use code examples:
- Missing value imputation strategies
- Feature scaling (StandardScaler, MinMaxScaler)
- Feature selection for high-dimensional data
- Outlier detection and handling
- Class imbalance techniques (SMOTE, class weights)
- Categorical encoding (One-Hot, Label Encoding)
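As an example of the kind of ready-to-use snippet such a suggestion might carry, here is a median-imputation plus scaling pipeline in scikit-learn (the data is made up; the exact snippet PreMLCheck emits may differ):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two numeric features with scattered missing values
X = np.array([[1.0, 200.0],
              [np.nan, 240.0],
              [3.0, np.nan],
              [4.0, 260.0]])

# Fill NaNs with the column median, then standardize each column
pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_clean = pipe.fit_transform(X)
```

Chaining the steps in a Pipeline (rather than transforming arrays by hand) keeps the same imputation and scaling parameters applied consistently at train and inference time.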
7. Generate Comprehensive Reports
Exports the full analysis as a formatted report in:
- Markdown (.md): for GitHub/documentation
- HTML (.html): for sharing/viewing in browsers
- JSON (.json): for programmatic consumption
How It Works: Analysis Flow

┌─────────────────────────────────┐
│ Your Dataset (pandas DataFrame) │
│ + Target Column Name            │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 1: TaskDetector            │
│ → Classification or Regression? │
│ → Confidence Score              │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 2: QualityChecker          │
│ → Health Score (0-100)          │
│ → Missing values, imbalance,    │
│   redundancy, ratio details     │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 3: OverfittingPredictor    │
│ → Risk Level (Low/Medium/High)  │
│ → Contributing factors list     │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 4: ModelRecommender        │
│ → Ranked list of suitable models│
│ → Suitability scores & reasons  │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 5: PerformanceEstimator    │
│ → Baseline performance metrics  │
│ → Confidence intervals          │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 6: PreprocessingAdvisor    │
│ → Prioritized suggestions       │
│ → Code examples for each step   │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 7: ReportGenerator         │
│ → Markdown / HTML / JSON output │
└─────────────────────────────────┘
Installation
From PyPI (when published)
pip install premlcheck
From Source
git clone https://github.com/MudassarGill/PreMLCheck-Library.git
cd PreMLCheck-Library
pip install -e .
With Visualization Support
pip install premlcheck[viz]
This installs optional dependencies (matplotlib, seaborn) for charts and plots.
Quick Start
import pandas as pd
from premlcheck import PreMLCheck
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# Initialize the analyzer
analyzer = PreMLCheck()
# Run the full analysis
results = analyzer.analyze(df, target_column='target')
# Print a human-readable summary
print(results.summary())
Example Output:
========== PreMLCheck Analysis Summary ==========
Task Type: classification (confidence: 0.95)
Dataset Quality Score: 78.5/100
Overfitting Risk: Medium
Top Model Recommendations:
1. Random Forest (score: 92)
2. Gradient Boosting (score: 88)
3. Logistic Regression (score: 75)
Preprocessing Suggestions: 4 suggestions
- [HIGH] Handle missing values using median imputation
- [HIGH] Apply StandardScaler to numeric features
- [MEDIUM] Address class imbalance with SMOTE
- [LOW] Consider feature selection (high dimensionality)
================================================
Generating Reports
# Generate a Markdown report
analyzer.generate_report(results, 'analysis_report.md', format='markdown')
# Generate an HTML report
analyzer.generate_report(results, 'analysis_report.html', format='html')
# Generate a JSON report
analyzer.generate_report(results, 'analysis_report.json', format='json')
Custom Configuration
You can override default thresholds to suit your needs:
config = {
'quality_thresholds': {
'missing_values_max': 0.2, # Flag if >20% missing
'imbalance_ratio_max': 5.0, # Flag if ratio >5:1
'correlation_threshold': 0.90, # Flag if correlation >0.90
},
'overfitting_thresholds': {
'sample_to_feature_ratio_low': 10, # Flag if <10 samples per feature
}
}
analyzer = PreMLCheck(config=config)
results = analyzer.analyze(df, target_column='target')
See premlcheck/config.py for all available configuration options.
Utility Functions
PreMLCheck also exposes utility functions you can use independently:
Validators
from premlcheck.utils import validate_dataframe, validate_target_column
validate_dataframe(df, min_rows=10) # Raises if invalid
validate_target_column(df, 'target') # Raises if column missing
Metrics
from premlcheck.utils import (
calculate_metrics,
calculate_class_balance_score,
calculate_feature_correlation_stats,
calculate_missing_value_profile,
calculate_outlier_stats,
)
# Classification/regression metrics
metrics = calculate_metrics(y_true, y_pred, task_type='classification')
# Class balance analysis
balance = calculate_class_balance_score(y)
# Outlier detection stats
outliers = calculate_outlier_stats(X)
Visualizations (requires pip install premlcheck[viz])
from premlcheck.utils import (
plot_feature_importance,
plot_correlation_matrix,
plot_target_distribution,
plot_missing_values,
plot_quality_radar,
plot_model_comparison,
)
fig, ax = plot_correlation_matrix(df)
fig, ax = plot_missing_values(df)
fig, ax = plot_quality_radar(results.quality_details)
fig, ax = plot_model_comparison(results.model_recommendations)
Running Tests
Run the full test suite (36 unit + integration tests):
python -m pytest tests/ -v --tb=short -o addopts=""
Expected result:
36 passed in ~4s
Documentation
| Document | Description |
|---|---|
| API Reference | Full API documentation for all classes and functions |
| Contributing | Guidelines for contributing to PreMLCheck |
| Changelog | Version history and release notes |
| Build & Publish | Guide for building and publishing to PyPI |
| Examples | Working code examples |
Tech Stack

| Dependency | Purpose |
|---|---|
| pandas | DataFrame handling and data manipulation |
| numpy | Numerical computations |
| scikit-learn | ML models, metrics, and cross-validation |
| scipy | Statistical analysis |
| matplotlib (optional) | Plotting and charts |
| seaborn (optional) | Statistical visualizations |
License

MIT License; see the LICENSE file for details.
Author

Mudassar Hussain

- Email: mudassarhussain6533@gmail.com
- GitHub: @MudassarGill
- LinkedIn: mudassar65
If you find PreMLCheck useful, please star the repository!
File details
Details for the file premlcheck-0.1.0.tar.gz.
File metadata
- Download URL: premlcheck-0.1.0.tar.gz
- Upload date:
- Size: 610.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0bc5dde12dde5bcb6014f1c15427e76e4300721d5d2a825fac1b77375454378e |
| MD5 | 8c9a1b7067165cbfffd59b7d0e339b80 |
| BLAKE2b-256 | f2be67be2bb6c9b4795c9ba837089d9c30c85b91740566d1694dd411b3f89247 |
File details
Details for the file premlcheck-0.1.0-py3-none-any.whl.
File metadata
- Download URL: premlcheck-0.1.0-py3-none-any.whl
- Upload date:
- Size: 33.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | a1d1fd30f29517b4810153e68762abd0ba22535c3a7810c845d8a0b60cfdf1fd |
| MD5 | f957150529cb5f1d3e073ec6a1e58a5b |
| BLAKE2b-256 | 179dbcbc468186b8900e396950e38cf6cad7c6f50744f075300abfa06c19eaa4 |