Static CLI linter for silent methodological errors in scikit-learn workflows
MLGuard
A lightweight static analyzer for silent methodological errors in scikit-learn ML workflows — data leakage, invalid evaluation, and unsound train/test splits that run without raising an error but quietly inflate your results.
Zero runtime dependencies (Python standard library only). MLGuard is a heuristic surfacing tool,
not a prover: every diagnostic carries a confidence (high / medium / low) and is meant as a
code-review prompt, not a proof of a bug.
Install
pip install mlguard-lint
That's it — same command on Linux, macOS, and Windows. Requires Python 3.9+.
Use it
Point it at a single file or a whole folder:
mlguard-lint notebook.ipynb # scan one notebook or .py file
mlguard-lint src/ # scan a directory (recursive)
By default you get one clean line per issue:
mlguard-lint — notebook.ipynb
notebook.ipynb
✗ line 5 MLG001 Transformer fitted before split
⚠ line 6 MLG006 Missing random_state
⚠ line 6 MLG005 Classification split without stratify
3 issue(s) in 1 of 1 file(s) · 1 critical, 2 warnings
Tip: add --explain for why each matters and how to fix.
Options
mlguard-lint notebook.ipynb --explain # add the code, why it matters, and how to fix it
mlguard-lint src/ --summary # one line per file (handy for large folders)
mlguard-lint . --fail-on critical # exit code 2 on any critical finding (CI gate)
mlguard-lint . --json out.json # machine-readable output
mlguard-lint notebook.ipynb --no-color # plain text (colors auto-off when piped)
If the mlguard-lint command isn't on your PATH, the module form always works:
python -m mlguard_lint notebook.ipynb
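For CI, the documented exit code can be checked from a small wrapper. The following is a minimal sketch, not part of MLGuard: `build_command`, `run_gate`, and `main` are hypothetical helper names, and only the flags and the exit code 2 described above are relied on. How the tool reports other failures is not documented here, so the wrapper simply passes the return code through.

```python
import subprocess
import sys

# Documented above: --fail-on critical exits with code 2 on any critical finding.
CRITICAL_EXIT_CODE = 2


def build_command(target):
    """Use the module form so the call works even if the script is not on PATH."""
    return [sys.executable, "-m", "mlguard_lint", target,
            "--fail-on", "critical", "--no-color"]


def run_gate(target, runner=subprocess.run):
    """Run the linter on `target` and return its exit code.

    `runner` is injectable so the wrapper can be exercised without the tool installed.
    """
    return runner(build_command(target)).returncode


def main(argv, runner=subprocess.run):
    """Scan argv[0] (default: current directory) and pass the exit code through."""
    target = argv[0] if argv else "."
    code = run_gate(target, runner=runner)
    if code == CRITICAL_EXIT_CODE:
        print("mlguard-lint reported at least one critical finding", file=sys.stderr)
    return code
```

A CI step could call `sys.exit(main(sys.argv[1:]))` at the bottom of such a script so that the job fails whenever the linter does.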
Windows: use Windows Terminal or PowerShell so the ✗ ⚠ symbols and colors render correctly. On the legacy cmd.exe console, pass --no-color (or run chcp 65001 once for UTF-8).
Rules
- MLG001 — Transformer fitted before split
- MLG002 — Preprocessing outside cross-validation
- MLG003 — Model evaluated on training data
- MLG004 — Resampling before split/CV or outside an imblearn Pipeline
- MLG005 — Classification split without stratify
- MLG006 — Missing random_state
- MLG007 — Possible group/entity leakage
- MLG008 — GridSearchCV best_score_ reported as final performance
- MLG009 — Ordinal/label encoding of a nominal feature
- MLG010 — Test set reused multiple times
- MLG011 — Possible target/mean encoding without cross-fitting
- MLG012 — No independent test set
- MLG013 — Resampling applied to the test/validation set
- MLG014 — Random split/CV on time-ordered data
- MLG015 — Target column included in features
- MLG016 — Transformer re-fit on the test set
- MLG017 — Probability metric given hard predictions
- MLG018 — Feature built from dataset-wide statistics before split
- MLG019 — Misleading micro-average on imbalanced multiclass
- MLG020 — ID/source column used as a feature
- MLG021 — Rows duplicated/upsampled before split
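To make the first rule concrete, here is a minimal sketch of why fitting a transformer before the split leaks information. It uses a hand-written standardizer in plain Python as a stand-in for a scikit-learn scaler; the helper names `fit_scaler` and `transform` and the toy data are illustrative assumptions, not part of MLGuard or scikit-learn.

```python
import statistics


def fit_scaler(values):
    """Return (mean, std) learned from `values`, like a scaler's fit step."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, params):
    """Standardize `values` with previously fitted (mean, std)."""
    mean, std = params
    return [(v - mean) / std for v in values]


# Toy feature column: the last two rows play the role of the held-out test set
# and include an extreme value that the training rows never see.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
train, test = data[:6], data[6:]

# The order MLG001 describes: statistics fitted on all rows, split afterwards.
leaky_params = fit_scaler(data)

# Leak-free order: split first, then fit on the training rows only.
clean_params = fit_scaler(train)

print("fitted on all rows:   mean=%.2f std=%.2f" % leaky_params)
print("fitted on train only: mean=%.2f std=%.2f" % clean_params)
print("test rows, clean scaling:", transform(test, clean_params))
```

The two fitted means differ (16.0 versus 3.5) only because the test rows contributed to the first one. Nothing raises an error in either order, which is why this kind of mistake needs a linter or a reviewer to catch it.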
How it works
The scanner concatenates a notebook's code cells into a single module, parses it with the standard
ast module, performs one walk to collect calls, assignments, model fit arguments, and id-like
string constants, then runs ordered rule blocks that emit diagnostics. Because the whole notebook is
analyzed as one program, with no per-cell execution semantics, cross-cell dataflow is approximate:
cells executed out of order or more than once are not modeled. Line numbers are notebook-global,
meaning they count across the concatenated code rather than restarting in each cell.
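The pipeline described above can be sketched with the standard library alone. This is an illustrative approximation, not MLGuard's actual code: the function names, the handling of IPython magics (replaced with `pass` so line numbers stay stable), and the toy "fit before split" rule are all assumptions made for the example.

```python
import ast
import json


def notebook_to_module(nb_json):
    """Join the code cells of a notebook (nbformat v4 JSON text) into one source string."""
    chunks = []
    for cell in json.loads(nb_json).get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        text = "".join(src) if isinstance(src, list) else src
        lines = []
        for line in text.split("\n"):
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")):
                # Assumption: neutralize magics/shell escapes but keep the line count.
                line = line[: len(line) - len(stripped)] + "pass"
            lines.append(line)
        chunks.append("\n".join(lines))
    return "\n".join(chunks)


def dotted_name(node):
    """Best-effort dotted name of a call target, e.g. 'scaler.fit_transform'."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else node.attr
    return ""


def collect(source):
    """One walk over the tree, recording calls and assigned names with line numbers."""
    calls, assigns = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            calls.append((node.lineno, dotted_name(node.func)))
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                for sub in ast.walk(target):
                    if isinstance(sub, ast.Name):
                        assigns.append((node.lineno, sub.id))
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(calls), sorted(assigns)


def fit_before_split(calls):
    """Toy rule: lines with fit/fit_transform before the first train_test_split."""
    splits = [ln for ln, name in calls if name.split(".")[-1] == "train_test_split"]
    if not splits:
        return []
    first = min(splits)
    return [ln for ln, name in calls
            if name.split(".")[-1] in {"fit", "fit_transform"} and ln < first]


demo = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Demo"]},
    {"cell_type": "code", "source": [
        "%matplotlib inline\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "from sklearn.model_selection import train_test_split"]},
    {"cell_type": "code", "source": [
        "X_scaled = StandardScaler().fit_transform(X)\n",
        "X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y)"]},
]})

source = notebook_to_module(demo)
calls, assigns = collect(source)
print(fit_before_split(calls))  # -> [4]
```

The reported line 4 is a position in the concatenated source, not in the second cell, which mirrors the notebook-global numbering mentioned above. The code is only parsed, never executed, so the undefined `X` and `y` and the scikit-learn imports do not matter.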
Limitation
This is a heuristic static analyzer. It is useful for surfacing risks, not for proving that every warning is a real bug. Treat diagnostics as prompts for a closer look during code review.
Development
git clone <repo> && cd mlguard
pip install -e ".[dev]" # editable install + pytest/build/twine
python -m pytest tests/test_rules.py -q
Each rule has a synthetic fixture under tests/notebooks/ plus clean controls that must stay silent.
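The fixture-plus-clean-control pattern can be sketched as a pair of pytest-style tests. The check below is a toy stand-in written for this example (a simplified MLG006-like test for a missing random_state); the function name and the inline snippets are hypothetical and do not reflect the layout of the real test suite.

```python
import ast


def missing_random_state(source):
    """Toy check: line numbers of train_test_split calls without random_state."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
        has_seed = any(kw.arg == "random_state" for kw in node.keywords)
        if name == "train_test_split" and not has_seed:
            hits.append(node.lineno)
    return sorted(hits)


# Synthetic fixture that should trigger the check.
FLAGGED = "a, b = train_test_split(X, y, test_size=0.2)\n"
# Clean control that must stay silent.
CLEAN = "a, b = train_test_split(X, y, test_size=0.2, random_state=0)\n"


def test_fixture_is_flagged():
    assert missing_random_state(FLAGGED) == [1]


def test_clean_control_stays_silent():
    assert missing_random_state(CLEAN) == []
```

Pairing every positive fixture with a silent control guards against a heuristic rule becoming noisier over time as well as against it missing cases.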
License
MIT — see LICENSE.
The methodology behind the rules is documented in docs/Silent-Methodological-Errors-in-scikit-learn-Workflows.pdf.