DissectML
The missing middle layer between EDA and AutoML.
Deep data understanding meets model comparison -- the full journey from "What is my data?" to "Which model is best and WHY?", in as few as 3 function calls.
Why DissectML?
Most data science workflows look the same: run YData Profiling (formerly pandas-profiling) for a quick summary, switch to scikit-learn for preprocessing, try a handful of models with PyCaret or LazyPredict, then stitch SHAP plots together in a notebook. By the time you have answers, you have imported three to five separate libraries, written hundreds of lines of glue code, and lost the thread that connects your data findings to your modelling decisions.
DissectML (dissectml) closes that gap. It is a single, unified pipeline that
runs deep exploratory data analysis, pre-model intelligence checks (leakage
detection, readiness scoring, algorithm recommendations), a multi-model battle
arena, cross-model statistical comparison, and publication-ready HTML report
generation -- all driven by a consistent API. Three function calls replace three
notebooks.
Key Features
Exploratory Data Analysis
- Unified correlation matrix -- Pearson, Cramér's V, and point-biserial correlation computed together and rendered in a single heatmap, regardless of column types.
- Missing data intelligence -- Little's MCAR test plus MAR/MNAR classification, with automatic imputation strategy recommendations tailored to each column.
- Statistical test battery -- Normality, independence, and variance tests auto-selected based on data type and sample size. No manual test selection required.
- Auto cluster discovery -- K-Means and DBSCAN with automatically tuned parameters (elbow method, silhouette scoring) to surface natural groupings in your data.
- Feature interaction and non-linearity detection -- Identifies non-linear relationships and interaction effects that linear models would miss.
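To make the categorical half of the unified correlation matrix concrete, here is a minimal standard-library sketch of Cramér's V for two categorical columns. It is an illustration of the statistic only, not DissectML's implementation; the function name `cramers_v` and the choice to return 0.0 for a constant column are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; uncorrected (no bias correction).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        return 0.0  # assumption: report no association for a constant column

    # Pearson chi-square statistic over the full contingency table.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * (k - 1)))


# Two columns that determine each other exactly, and two that are independent.
perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

Because V is bounded between 0 and 1 like the absolute value of Pearson's r, it can share a colour scale with numeric correlations in one heatmap.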
Pre-Model Intelligence
- Target leakage detection -- Four-pronged analysis covering correlation leakage, mutual information leakage, temporal leakage, and derived-feature leakage.
- Data readiness score -- A 0-100 composite score with waterfall breakdown showing exactly what is dragging your data quality down (missing values, cardinality, class balance, outliers, and more).
- Algorithm recommendations -- A rules engine that maps your EDA findings (data size, feature types, non-linearity, multicollinearity) to a ranked list of recommended model families.
Model Comparison
- 36-model battle arena -- 19 classifiers and 17 regressors (plus optional XGBoost, LightGBM, and CatBoost) trained and evaluated with parallel cross-validation in a single call.
- Cross-model error analysis -- Identifies the hardest samples, builds a model complementarity matrix, and highlights where ensemble strategies could improve performance.
- Statistical significance testing -- McNemar's test for classifiers and corrected repeated k-fold paired t-test for regressors, so you know which performance differences are real.
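As a rough illustration of what McNemar's test measures, the sketch below compares two classifiers using only the samples on which they disagree. It uses the exact binomial form of the test; which variant DissectML applies is not stated here, and the function name `mcnemar_exact` and its return shape are assumptions for this example.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on the same samples.

    Returns (b, c, p_value), where b counts samples only model A got right
    and c counts samples only model B got right. Hypothetical helper.
    """
    b = sum(a == t and o != t for t, a, o in zip(y_true, pred_a, pred_b))
    c = sum(a != t and o == t for t, a, o in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree: no evidence either way

    # Under the null hypothesis each disagreement favours A or B with
    # probability 1/2, so min(b, c) follows a Binomial(n, 0.5) tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


y_true = [1] * 10
model_a = [1] * 10            # correct on every sample
model_b = [0] * 8 + [1] * 2   # wrong on the first eight samples
b, c, p_value = mcnemar_exact(y_true, model_a, model_b)
```

The same b and c counts are the off-diagonal cells a model complementarity matrix would be built from: large values in both directions suggest the two models fail on different samples and may ensemble well.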
Reporting
- Publication-ready HTML reports -- Interactive Plotly charts, narrative summaries, and structured sections covering every stage of the pipeline, exportable as a single self-contained HTML file.
Quick Start
import dissectml as dml
# Load a built-in dataset
df = dml.load_titanic()
1. Deep Exploratory Data Analysis
eda = dml.explore(df)
eda.overview.show() # Shape, dtypes, memory usage
eda.correlations.heatmap() # Unified correlation matrix
eda.missing.patterns() # Missing data analysis with MCAR test
eda.outliers.plot() # Outlier detection across numeric columns
eda.clusters.summary() # Auto-discovered clusters
2. Model Battle Arena
models = dml.battle(df, target="survived")
models.leaderboard() # Ranked models with CV scores
models.timing() # Training time comparison
3. Full Pipeline (EDA + Intelligence + Battle + Compare + Report)
report = dml.analyze(df, target="survived", task="classification")
report.summary() # High-level findings
report.export("report.html") # Self-contained interactive report
The analyze function runs all five stages end-to-end: EDA, intelligence
checks, model training, cross-model comparison, and report generation. For
fine-grained control, call each stage individually.
Installation
Core package
pip install dissectml
Optional extras
pip install "dissectml[boost]" # XGBoost, LightGBM, CatBoost
pip install "dissectml[explain]" # SHAP explainability
pip install "dissectml[report]" # PDF export (WeasyPrint + Kaleido)
pip install "dissectml[scale]" # Polars backend + Optuna tuning
pip install "dissectml[full]" # Everything above
Development
git clone https://github.com/rupeshbharambe24/dissectML.git
cd dissectML
pip install -e ".[dev]"
Requirements: Python 3.10 or later.
Comparison with Alternatives
| Feature | DissectML | PyCaret | LazyPredict | YData Profiling |
|---|---|---|---|---|
| Deep EDA | Yes | -- | -- | Yes |
| Statistical Tests | Yes | -- | -- | Partial |
| Model Training | Yes | Yes | Yes | -- |
| Model Comparison | Yes | Yes | Partial | -- |
| SHAP Analysis | Yes | Yes | -- | -- |
| Interactive Reports | Yes | -- | -- | Yes |
| Target Leakage Detection | Yes | -- | -- | -- |
| Data Readiness Score | Yes | -- | -- | -- |
Among the tools compared above, DissectML is the only one that covers the full spectrum from statistical data profiling through model comparison with a single, coherent API. The others excel at individual stages but leave you to bridge the gaps yourself.
Architecture
DissectML is organized into five pipeline stages, each backed by a dedicated subpackage:
| Stage | Subpackage | Contents |
|---|---|---|
| Stage 1: EDA | dissectml.eda | 9 sub-modules (overview, correlations, missing, outliers, univariate, bivariate, clusters, interactions, statistical_tests) |
| Stage 2: Intelligence | dissectml.intelligence | Leakage detection, multicollinearity, feature importance, readiness scoring, algorithm recommendations |
| Stage 3: Battle | dissectml.battle | Model catalog, preprocessing pipeline, parallel CV runner, hyperparameter tuner |
| Stage 4: Compare | dissectml.compare | Metrics tables, significance tests, error analysis, Pareto frontiers, ROC/PR curves, SHAP comparison |
| Stage 5: Report | dissectml.report | Jinja2 HTML builder, narrative generator, section renderers, PDF export |
Configuration
DissectML uses a global configuration object for controlling default behavior:
import dissectml as dml
# View current config
print(dml.get_config())
# Temporarily override settings
with dml.config_context(n_jobs=4, cv_folds=10):
    report = dml.analyze(df, target="price")
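For readers curious how a temporary override like this typically behaves, the following is a generic sketch of the pattern using `contextlib`. It is not DissectML's source; the `_CONFIG` dictionary, its default values, and the rejection of unknown keys are assumptions made purely for illustration.

```python
from contextlib import contextmanager

# Hypothetical global defaults; the real keys and values may differ.
_CONFIG = {"n_jobs": 1, "cv_folds": 5}


def get_config():
    """Return a copy so callers cannot mutate the global state by accident."""
    return dict(_CONFIG)


@contextmanager
def config_context(**overrides):
    """Apply overrides inside the `with` block, then restore the old values."""
    unknown = set(overrides) - set(_CONFIG)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    previous = dict(_CONFIG)
    _CONFIG.update(overrides)
    try:
        yield
    finally:
        # Runs even if the body raises, so settings never leak out.
        _CONFIG.clear()
        _CONFIG.update(previous)


with config_context(n_jobs=4, cv_folds=10):
    inside = get_config()
outside = get_config()
```

The `try`/`finally` is the important design choice: a failed model run inside the block still leaves the global defaults untouched for the next call.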
Built-in Datasets
Two datasets are bundled for quick experimentation:
df_titanic = dml.load_titanic() # Binary classification (survival)
df_housing = dml.load_housing() # Regression (house prices)
Documentation
Full documentation, API reference, and tutorials are available at:
https://dissectml.readthedocs.io
Contributing
Contributions are welcome. Please see CONTRIBUTING.md for guidelines on setting up a development environment, running the test suite, and submitting pull requests.
If you find a bug or have a feature request, please open an issue on the GitHub issue tracker.
License
DissectML is released under the MIT License.
Built by Rupesh Bharambe