Model Evaluation Toolkit — bias-variance analysis, ROC curves, model summary & multi-model comparison
ml-evaluator · Model Evaluation Toolkit
Stop copy-pasting evaluation code. One import, one call.
pip install ml-evaluator
What is it?
ml-evaluator turns model evaluation from a chore into a single function call.
Every function is standalone: pass your model with X_test and y_test (or X_train and y_train for the bias-variance functions) and you're done. Binary and multiclass problems are detected automatically.
Built for people who:
- know how to train a model but want clean, reproducible evaluation
- are learning ML and want plots + plain-English interpretation, not just numbers
- are tired of writing the same confusion-matrix / ROC / bias-variance code every project
Install
pip install ml-evaluator
import ml_evaluator as ev
Requirements: Python 3.8+ · numpy · pandas · matplotlib · scikit-learn
API
Every piece of evaluation is its own function — use exactly what you need.
🔹 Single model — text output
ev.metrics(model, X_test, y_test) # numbers only, no plot
ev.interpret(model, X_test, y_test) # plain-English interpretation, no plot
ev.classification_report(model, X_test, y_test) # classification report table, no plot
Example output — ev.metrics()
=======================================================
Model Summary — Random Forest
=======================================================
Accuracy : 0.9625
F1 : 0.9634
Precision : 0.9518
Recall : 0.9753
ROC-AUC : 0.9930
Example output — multiclass ev.metrics()
=======================================================
Model Summary — RF (4-class)
Task: Multiclass (4 classes: 0, 1, 2, 3)
=======================================================
Accuracy : 0.8708
F1 : 0.8524 (macro avg)
Precision : 0.8601 (macro avg)
Recall : 0.8490 (macro avg)
ROC-AUC : 0.9712 (macro OvR)
🔹 Single model — individual plots
ev.plot_confusion_matrix(model, X_test, y_test) # confusion matrix only
ev.plot_roc_curve(model, X_test, y_test) # ROC curve only
ev.plot_metrics_bar(model, X_test, y_test) # metrics bar chart only
🔹 Single model — all-in-one
ev.model_summary(model, X_test, y_test)
Produces a 2×2 dashboard in one figure:
┌─────────────────────┬─────────────────────┐
│ A · Confusion       │ B · ROC Curve       │
│ Matrix              │ (OvR if multiclass) │
├─────────────────────┼─────────────────────┤
│ C · Metrics         │ D · Classification  │
│ Bar Chart           │ Report              │
└─────────────────────┴─────────────────────┘
Also prints metrics + interpretation to the terminal.
🔸 Bias–Variance — single model
ev.bv_stats(model, X_train, y_train) # stats + diagnosis only, no plot
ev.plot_learning_curve(model, X_train, y_train) # learning curve plot only
ev.bias_variance(model, X_train, y_train) # stats + plot together
Example output — ev.bv_stats()
✅ Random Forest
Train Acc : 1.0000
Val Acc : 0.9400
Gap : 0.0600 (threshold: 0.10)
Val Std : 0.0151 (variance proxy)
Diagnosis : Good Fit
Train–val gap is 0.060 and val error is 0.060.
The model generalises well.
→ Next step: evaluate on the held-out test set.
Diagnosis logic:
| Condition | Diagnosis | What it means |
|---|---|---|
| train − val gap > 0.10 | 🔴 Overfit | Model memorises training data, fails to generalise |
| 1 − val_accuracy > 0.15 | 🟡 Underfit | Model too simple to capture the pattern |
| otherwise | ✅ Good Fit | Model generalises well |
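The table can be read as a small decision rule. The sketch below is an illustration only, not the library's source: the function name `diagnose` is hypothetical, the parameter names are borrowed from the `bv_stats` signature, and it assumes the conditions are checked in the order the table lists them.

```python
def diagnose(train_acc, val_acc, overfit_threshold=0.10, high_bias_threshold=0.15):
    """Map mean train/validation accuracy to a diagnosis label.

    Assumption: checks run in table order, so a large train-val gap
    is reported as Overfit before the validation error is considered.
    """
    gap = train_acc - val_acc      # variance signal
    val_error = 1.0 - val_acc      # bias signal
    if gap > overfit_threshold:
        return "Overfit"
    if val_error > high_bias_threshold:
        return "Underfit"
    return "Good Fit"


# Values from the bv_stats() example: gap 0.06, val error 0.06
print(diagnose(1.00, 0.94))
# Large gap between train and validation accuracy
print(diagnose(1.00, 0.80))
# Small gap, but validation error of 0.30
print(diagnose(0.72, 0.70))
```

With the default thresholds, the three calls fall into the Good Fit, Overfit, and Underfit rows respectively.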
🔸 Multiple models — text output
ev.compare_metrics(models, X_test, y_test) # metrics table + winner per metric
ev.compare_interpret(models, X_test, y_test) # interpretation per model
Example output — ev.compare_metrics()
=================================================================
Model Comparison
=================================================================
Model Accuracy F1 Precision Recall ROC-AUC
──────────────────────────────────────────────────────────────
Random Forest 0.9625 0.9634 0.9518 0.9753 0.9930
Logistic Regression 0.9375 0.9412 0.8989 0.9877 0.9833
Winners by metric:
Accuracy → Random Forest (0.9625)
Recall → Logistic Regression (0.9877)
✅ Overall recommendation: Random Forest
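For readers curious how a "winner per metric" table like this can be derived, here is a minimal standard-library sketch using the scores from the example output. The helper names `winners_by_metric` and `recommend` are hypothetical, and the rule for the overall recommendation (most metric wins) is an assumption; the README does not state how ml-evaluator picks it.

```python
from collections import Counter

# Scores copied from the example output above.
scores = {
    "Random Forest": {
        "Accuracy": 0.9625, "F1": 0.9634, "Precision": 0.9518,
        "Recall": 0.9753, "ROC-AUC": 0.9930,
    },
    "Logistic Regression": {
        "Accuracy": 0.9375, "F1": 0.9412, "Precision": 0.8989,
        "Recall": 0.9877, "ROC-AUC": 0.9833,
    },
}


def winners_by_metric(scores):
    """Return {metric: name of the model with the highest value}."""
    metric_names = next(iter(scores.values())).keys()
    return {m: max(scores, key=lambda name: scores[name][m]) for m in metric_names}


def recommend(scores):
    """Assumed rule: recommend the model that wins the most metrics."""
    wins = Counter(winners_by_metric(scores).values())
    return wins.most_common(1)[0][0]


for metric, name in winners_by_metric(scores).items():
    print(f"{metric:<10} -> {name} ({scores[name][metric]:.4f})")
print("Overall recommendation:", recommend(scores))
```

Under this assumed rule the sketch agrees with the example: Logistic Regression wins on Recall, Random Forest wins the other four metrics.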
🔸 Multiple models — individual plots
ev.plot_confusion_matrices(models, X_test, y_test) # one matrix per model
ev.compare_roc_curves(models, X_test, y_test) # overlaid ROC curves
ev.plot_metrics_comparison(models, X_test, y_test) # grouped bar chart
🔸 Multiple models — all-in-one shortcuts
ev.comparison_dashboard(models, X_test, y_test)
ev.compare_bias_variance(models, X_train, y_train)
🔺 Data utilities
ev.is_imbalanced(y_train)
ev.is_imbalanced(y_train, threshold=0.15, class_labels=["Not Churned", "Churned"])
Detects class imbalance, prints a distribution table with a diagnosis, and draws a bar chart + pie chart.
Example output — ev.is_imbalanced()
=======================================================
Class Distribution
=======================================================
Class Count %
──────────────────────────────────────
0 850 85.0% █████████████████████
1 150 15.0% ███
Min / Max ratio : 0.176 (threshold: 0.20)
🔴 Dataset is IMBALANCED
⚠️ Accuracy can be misleading on imbalanced data.
→ Use F1, Precision, Recall, or ROC-AUC instead.
→ Consider: class_weight='balanced', SMOTE, or threshold tuning.
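The "Min / Max ratio" in the output is the smallest class count divided by the largest. The sketch below reproduces that arithmetic with the standard library; `imbalance_ratio` and `is_imbalanced_sketch` are hypothetical names for illustration, not functions in ml-evaluator, and the strict `<` comparison follows the "ratio below which" wording of the `threshold` parameter.

```python
from collections import Counter


def imbalance_ratio(y):
    """Smallest class count divided by largest class count."""
    counts = Counter(y)
    return min(counts.values()) / max(counts.values())


def is_imbalanced_sketch(y, threshold=0.20):
    """True when the minority/majority ratio falls below the threshold."""
    return imbalance_ratio(y) < threshold


# Same distribution as the example output: 850 vs 150 samples.
y = [0] * 850 + [1] * 150
print(f"Min / Max ratio: {imbalance_ratio(y):.3f}")        # 150 / 850
print("Imbalanced at 0.20:", is_imbalanced_sketch(y))
print("Imbalanced at 0.15:", is_imbalanced_sketch(y, threshold=0.15))
```

Note that the same data counts as imbalanced at the default threshold of 0.20 but not at 0.15, since 150 / 850 is roughly 0.176.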
Multiclass support
All functions automatically detect binary vs. multiclass — no extra parameters needed.
| What changes | Binary | Multiclass |
|---|---|---|
| F1 / Precision / Recall | average="binary" | average="macro" |
| ROC-AUC | from predict_proba[:, 1] | One-vs-Rest, macro |
| ROC Curve | single curve | one per class + macro avg |
| Confusion Matrix | 2×2 | N×N, auto-scaled |
| Interpretation | precision vs recall balance | flags classes with low F1 |
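To make the binary versus macro distinction concrete, here is a from-scratch sketch of F1 that switches averaging based on the number of classes. It is an illustration of the idea, not ml-evaluator's code: the names `per_class_f1` and `f1_auto` are hypothetical, and classes are taken from `y_true` only, which is a simplification. In scikit-learn the two branches correspond to `f1_score(..., average="binary")` and `f1_score(..., average="macro")`.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_auto(y_true, y_pred, positive_label=1):
    """Binary F1 for two classes, macro-averaged F1 otherwise."""
    classes = sorted(set(y_true))
    if len(classes) == 2:
        # "binary": score the positive class only
        return per_class_f1(y_true, y_pred, positive_label)
    # "macro": unweighted mean of the per-class scores
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


binary_f1 = f1_auto([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
macro_f1 = f1_auto([0, 1, 2, 0, 1, 2], [0, 2, 2, 0, 1, 1])
print(f"binary F1: {binary_f1:.4f}")
print(f"macro F1:  {macro_f1:.4f}")
```

Macro averaging gives every class equal weight regardless of its size, which is why a single poorly predicted class pulls the multiclass score down noticeably.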
Controlling interpretation output
Every function that prints results accepts show_interpretation=True/False:
# Print interpretation alongside metrics
ev.metrics(rf, X_test, y_test, show_interpretation=True)
# Skip interpretation — just the numbers
ev.bv_stats(rf, X_train, y_train, show_interpretation=False)
# model_summary without the interpretation block
ev.model_summary(rf, X_test, y_test, show_interpretation=False)
Defaults:
| Function | show_interpretation default |
|---|---|
| model_summary, bv_stats, bias_variance, plot_roc_curve | True |
| metrics, plot_confusion_matrix, plot_metrics_bar, compare_metrics, comparison_dashboard | False |
All parameters
Single model
ev.metrics(model, X_test, y_test,
model_name="Model",
verbose=True,
show_interpretation=False,
)
ev.plot_confusion_matrix(model, X_test, y_test,
model_name="Model",
class_labels=None, # e.g. ["Not Churned", "Churned"]
color="#2E75B6",
figsize=None, # auto-scaled based on number of classes
show_interpretation=False,
save_path=None,
)
ev.plot_roc_curve(model, X_test, y_test,
model_name="Model",
color="#2E75B6",
figsize=(6, 5),
show_interpretation=True,
save_path=None,
)
ev.plot_metrics_bar(model, X_test, y_test,
model_name="Model",
color="#2E75B6",
figsize=(7, 4),
show_interpretation=False,
save_path=None,
)
ev.model_summary(model, X_test, y_test,
model_name="Model",
class_labels=None,
color="#2E75B6",
figsize=(14, 10),
show_interpretation=True,
save_path=None,
)
Bias–Variance
ev.bv_stats(model, X_train, y_train,
model_name="Model",
n_splits=5,
random_state=42,
train_sizes=None, # default: linspace(0.1, 1.0, 8)
scoring="accuracy",
overfit_threshold=0.10,
high_bias_threshold=0.15,
verbose=True,
show_interpretation=True,
)
# bias_variance() and plot_learning_curve() accept the same parameters
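As a rough guide to what these parameters control, the sketch below computes comparable statistics directly with scikit-learn's `learning_curve`. It is not ml-evaluator's implementation. The function name `bv_stats_sketch`, the use of shuffled `StratifiedKFold`, and the choice to read the summary statistics at the largest training size are all assumptions; the dictionary keys mirror the ones listed under "Return values".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve


def bv_stats_sketch(model, X, y, n_splits=5, random_state=42,
                    train_sizes=None, scoring="accuracy"):
    """Cross-validated learning curve plus simple bias/variance proxies."""
    if train_sizes is None:
        train_sizes = np.linspace(0.1, 1.0, 8)
    # Assumption: stratified, shuffled folds seeded by random_state.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=train_sizes, cv=cv, scoring=scoring
    )
    # Assumption: summary statistics are taken at the largest training size.
    mean_train = float(train_scores[-1].mean())
    mean_val = float(val_scores[-1].mean())
    return {
        "mean_train": mean_train,
        "mean_val": mean_val,
        "gap": mean_train - mean_val,             # variance signal
        "bias_proxy": 1.0 - mean_val,             # validation error
        "val_std": float(val_scores[-1].std()),   # spread across folds
        "lc_sizes": sizes,
        "lc_train": train_scores.mean(axis=1),
        "lc_val": val_scores.mean(axis=1),
    }


X, y = make_classification(n_samples=300, n_features=10, random_state=0)
stats = bv_stats_sketch(LogisticRegression(max_iter=1000), X, y)
print({k: round(v, 4) for k, v in stats.items() if isinstance(v, float)})
```

The `gap` and `bias_proxy` values are what the `overfit_threshold` and `high_bias_threshold` parameters would be compared against.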
Multi-model functions
# All multi-model functions accept:
models = {"name": fitted_estimator, ...} # required
colors = None # list of colours, one per model
figsize = None # auto-sized if not given
save_path = None
# comparison_dashboard and plot_confusion_matrices also accept:
class_labels = None
show_interpretation = False # comparison_dashboard only
# compare_metrics also accepts:
show_interpretation = False
Data utilities
ev.is_imbalanced(y,
threshold=0.20, # minority/majority ratio below which → imbalanced
class_labels=None,
show_interpretation=True,
figsize=(10, 4),
save_path=None,
)
Return values
| Function | Returns |
|---|---|
| metrics() | Result dict: accuracy, f1, precision, recall, roc_auc, y_pred, y_prob, y_prob_multi, classes, multiclass, averaging, report |
| interpret() | Result: {"text": ...} |
| bv_stats() / bias_variance() | Result dict: mean_train, mean_val, gap, bias_proxy, val_std, diagnosis, explanation, lc_sizes, lc_train, lc_val |
| compare_metrics() | Result (DataFrame-like): df["F1"], df["F1"].idxmax(), etc. |
| compare_interpret() | Result: {model_name: text} |
| compare_bias_variance() | Result of Results: cb["RF"]["diagnosis"], cb["LR"]["gap"], etc. |
| all plot_* and *_dashboard functions | None |
| is_imbalanced() | None |
Note: Result objects are silent in Jupyter; they never auto-display. Access data with m["accuracy"], bv["gap"], etc.
Roadmap
- v1.6 — Threshold optimiser (best decision threshold for F1 / Recall)
- v1.7 — Probability calibration plot (reliability diagram)
- v1.8 — Feature importance comparison across models
- v2.0 — Auto PDF report
Contributing
Issues and pull requests are welcome.
License
MIT