Model Evaluation Toolkit — bias-variance analysis, ROC curves, model summary & multi-model comparison
What is it?
ml-evaluator turns model evaluation from a chore into a single function call.
Every function is standalone: pass a fitted model with X_test and y_test (or X_train and y_train for the bias-variance functions) and you're done. No pipelines to build, no intermediate objects to store, no boilerplate to copy.
Built for people who:
- know how to train a model but want clean, reproducible evaluation
- are learning ML and want plots + plain-English interpretation, not just numbers
- are tired of writing the same confusion-matrix / ROC / bias-variance code every project
Install
pip install ml-evaluator
import ml_evaluator as ev
Requirements: Python 3.8+ · numpy · pandas · matplotlib · scikit-learn
API
Every piece of evaluation is its own function — use exactly what you need.
🔹 Single model — text output
ev.metrics(model, X_test, y_test) # numbers only, no plot
ev.interpret(model, X_test, y_test) # plain-English interpretation, no plot
ev.classification_report(model, X_test, y_test) # classification report table, no plot
Example output — ev.metrics()
=======================================================
Model Summary — Random Forest
=======================================================
Accuracy : 0.9625
F1 : 0.9634
Precision : 0.9518
Recall : 0.9753
ROC-AUC : 0.9930
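For readers who want to see where these numbers come from, the following is a minimal sketch of the standard formulas, computed from confusion-matrix counts. The counts and the function name `summarise` are hypothetical: they were chosen so the results agree with the rounded figures in the example output above, and they are not taken from ml-evaluator itself. ROC-AUC is left out because it needs predicted probabilities rather than counts.

```python
def summarise(tp, fp, fn, tn):
    """Compute the four count-based metrics for a binary classifier."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    return {
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical counts for a 160-sample test set (an assumption, not library output).
scores = summarise(tp=79, fp=4, fn=2, tn=75)
for name, value in scores.items():
    print(f"{name:<10}: {value:.4f}")
```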
Example output — ev.interpret()
Interpretation — Random Forest
─────────────────────────────────────────────────────
✅ Accuracy 0.963 — high overall correctness.
✅ F1 0.963 — strong balance between precision and recall.
✅ Precision (0.952) ≈ Recall (0.975) — well balanced.
AUC = 0.993 — Excellent. The model clearly separates the two classes.
Example output — ev.classification_report()
Classification Report — Random Forest
────────────────────────────────────────────────────────
Class Precision Recall F1-Score Support
────────────────────────────────────────────────────────
0 0.970 0.952 0.961 100
1 0.953 0.971 0.962 100
────────────────────────────────────────────────────────
Accuracy 0.963 200
macro avg 0.961 0.962 0.961
weighted avg 0.961 0.962 0.961
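The two average rows differ only in how the per-class scores are combined. Here is a minimal sketch using the rounded per-class values from the report above. The helper names `macro_avg` and `weighted_avg` are illustrative and not part of ml-evaluator.

```python
# Per-class rows copied from the example report: (precision, recall, f1, support)
rows = {
    0: (0.970, 0.952, 0.961, 100),
    1: (0.953, 0.971, 0.962, 100),
}


def macro_avg(rows, column):
    """Unweighted mean: every class counts equally."""
    return sum(r[column] for r in rows.values()) / len(rows)


def weighted_avg(rows, column):
    """Support-weighted mean: larger classes count more."""
    total = sum(r[3] for r in rows.values())
    return sum(r[column] * r[3] for r in rows.values()) / total


macro_precision = macro_avg(rows, 0)
weighted_precision = weighted_avg(rows, 0)
print(macro_precision, weighted_precision)
```

With equal support per class, as here, the two averages coincide; they diverge on imbalanced data. Values computed from rounded inputs can differ from the report in the last digit.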
🔹 Single model — individual plots
ev.plot_confusion_matrix(model, X_test, y_test) # confusion matrix only
ev.plot_roc_curve(model, X_test, y_test) # ROC curve only
ev.plot_metrics_bar(model, X_test, y_test) # metrics bar chart only
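As background on what the ROC plot shows, here is a minimal pure-Python sketch of how ROC points and the area under the curve can be computed from true labels and predicted scores. It illustrates the general technique only; ml-evaluator's internals are not shown in this README and may differ. The function names `roc_points` and `auc` and the toy data are hypothetical.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs obtained by lowering the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Each distinct score is tried as a threshold, from strictest to loosest.
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: two negatives, two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, y_score)
area = auc(curve)
print(curve, area)
```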
🔹 Single model — all-in-one
ev.model_summary(model, X_test, y_test)
Produces a 2×2 dashboard in one figure:
┌─────────────────────┬─────────────────────┐
│ A · Confusion │ B · ROC Curve │
│ Matrix │ │
├─────────────────────┼─────────────────────┤
│ C · Metrics │ D · Classification │
│ Bar Chart │ Report │
└─────────────────────┴─────────────────────┘
Also prints metrics + interpretation to the terminal.
🔸 Bias–Variance — single model
ev.bv_stats(model, X_train, y_train) # stats + diagnosis only, no plot
ev.plot_learning_curve(model, X_train, y_train) # learning curve plot only
ev.bias_variance(model, X_train, y_train) # stats + plot together
Example output — ev.bv_stats()
✅ Random Forest
Train Acc : 1.0000
Val Acc : 0.9359
Gap : 0.0641 (threshold: 0.10)
Val Std : 0.0167 (variance proxy)
Diagnosis : Good Fit
Train–val gap is 0.064 and val error is 0.064.
The model generalises well — no strong signs of overfit or underfit.
→ Next step: evaluate on the held-out test set.
Diagnosis logic:
| Condition | Diagnosis | What it means |
|---|---|---|
| `train − val gap > 0.10` | 🔴 Overfit | Model memorises training data, fails to generalise |
| `1 − val_accuracy > 0.15` | 🟡 Underfit | Model too simple to capture the pattern |
| otherwise | ✅ Good Fit | Model generalises well |
Thresholds are configurable:
ev.bias_variance(
model, X_train, y_train,
overfit_threshold=0.05, # stricter
high_bias_threshold=0.10,
)
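The diagnosis rules and their thresholds can be sketched as a small function. This is an illustration under stated assumptions, not the library's code: the function name `diagnose` is hypothetical, and checking the overfit rule before the underfit rule simply follows the order of the table.

```python
def diagnose(train_acc, val_acc, overfit_threshold=0.10, high_bias_threshold=0.15):
    """Apply the documented rules; testing overfit first is an assumption."""
    gap = train_acc - val_acc      # train-val gap
    val_error = 1.0 - val_acc      # bias proxy
    if gap > overfit_threshold:
        return "Overfit"
    if val_error > high_bias_threshold:
        return "Underfit"
    return "Good Fit"


# Scores from the example bv_stats() output, with default thresholds.
default_result = diagnose(1.0000, 0.9359)

# Same scores with the stricter thresholds from the configuration example.
strict_result = diagnose(1.0000, 0.9359,
                         overfit_threshold=0.05, high_bias_threshold=0.10)

print(default_result, strict_result)
```

Under the defaults the example model is a good fit; with `overfit_threshold=0.05` the same 0.064 gap is flagged as overfitting, which is what "stricter" means in practice.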
🔸 Multiple models — text output
ev.compare_metrics(models, X_test, y_test) # metrics table + winner per metric, no plot
ev.compare_interpret(models, X_test, y_test) # interpretation per model, no plot
Example output — ev.compare_metrics()
=================================================================
Model Comparison
=================================================================
Model Accuracy F1 Precision Recall ROC-AUC
──────────────────────────────────────────────────────────────
Random Forest 0.9625 0.9634 0.9518 0.9753 0.9930
Logistic Regression 0.9375 0.9412 0.8989 0.9877 0.9833
Winners by metric:
Accuracy → Random Forest (0.9625)
F1 → Random Forest (0.9634)
Precision → Random Forest (0.9518)
Recall → Logistic Regression (0.9877)
ROC-AUC → Random Forest (0.9930)
✅ Overall recommendation: Random Forest
(weighted score — F1 & Recall weighted highest)
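The example output says the overall recommendation comes from a weighted score in which F1 and Recall count most, but the README does not give the weights. The sketch below shows one way such a score could work; the `WEIGHTS` values are purely hypothetical, and only the metric values are taken from the table above.

```python
# Metric values copied from the example table above.
results = {
    "Random Forest": {"accuracy": 0.9625, "f1": 0.9634, "precision": 0.9518,
                      "recall": 0.9753, "roc_auc": 0.9930},
    "Logistic Regression": {"accuracy": 0.9375, "f1": 0.9412, "precision": 0.8989,
                            "recall": 0.9877, "roc_auc": 0.9833},
}

# Hypothetical weights (an assumption): F1 and recall weighted highest.
WEIGHTS = {"accuracy": 0.15, "f1": 0.30, "precision": 0.10,
           "recall": 0.30, "roc_auc": 0.15}

# Winner per metric: the model with the highest value in that column.
winners = {m: max(results, key=lambda name: results[name][m]) for m in WEIGHTS}

# Overall recommendation: the highest weighted sum across all metrics.
weighted = {name: sum(WEIGHTS[m] * vals[m] for m in WEIGHTS)
            for name, vals in results.items()}
recommended = max(weighted, key=weighted.get)

print(winners)
print(recommended)
```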
🔸 Multiple models — individual plots
ev.plot_confusion_matrices(models, X_test, y_test) # one matrix per model
ev.compare_roc_curves(models, X_test, y_test) # overlaid ROC curves
ev.plot_metrics_comparison(models, X_test, y_test) # grouped bar chart
🔸 Multiple models — all-in-one shortcuts
ev.comparison_dashboard(models, X_test, y_test)
ev.compare_bias_variance(models, X_train, y_train)
comparison_dashboard — full 3-row dashboard:
Row 1 — Confusion matrix per model
Row 2 — Overlaid ROC curves + Metrics bar chart
Row 3 — Colour-coded summary table (green = top, red = bottom)
compare_bias_variance — two-row B-V dashboard:
Row 1 — Learning curve per model with diagnosis label
Row 2 — Bias proxy · Variance · Overfitting gap comparison
All parameters
Single model functions
ev.metrics(model, X_test, y_test, model_name="Model", verbose=True)
ev.interpret(model, X_test, y_test, model_name="Model", verbose=True)
ev.classification_report(model, X_test, y_test, model_name="Model", class_labels=None)
ev.plot_confusion_matrix(model, X_test, y_test,
model_name="Model",
class_labels=None, # e.g. ["Not Churned", "Churned"]
color="#2E75B6",
figsize=(5, 4.5),
save_path=None,
)
ev.plot_roc_curve(model, X_test, y_test,
model_name="Model",
color="#2E75B6",
figsize=(6, 5),
save_path=None,
)
ev.plot_metrics_bar(model, X_test, y_test,
model_name="Model",
color="#2E75B6",
figsize=(7, 4),
save_path=None,
)
ev.model_summary(model, X_test, y_test,
model_name="Model",
class_labels=None,
color="#2E75B6",
figsize=(14, 10),
save_path=None,
)
Bias–Variance functions
ev.bv_stats(model, X_train, y_train,
model_name="Model",
n_splits=5,
random_state=42,
train_sizes=None, # default: linspace(0.1, 1.0, 8)
scoring="accuracy",
overfit_threshold=0.10,
high_bias_threshold=0.15,
verbose=True,
)
ev.plot_learning_curve(model, X_train, y_train,
model_name="Model",
n_splits=5,
random_state=42,
train_sizes=None,
scoring="accuracy",
overfit_threshold=0.10,
high_bias_threshold=0.15,
color="#2E75B6",
figsize=(8, 5),
save_path=None,
)
# bias_variance() accepts all parameters above
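The parameters above (`n_splits`, `train_sizes`, `scoring`, `random_state`) resemble those of scikit-learn's `learning_curve`. The sketch below shows how comparable statistics could be computed directly with scikit-learn. It assumes, without confirmation from this README, that the library works along similar lines. The synthetic dataset, the choice of `LogisticRegression`, and reading the statistics from the largest training size are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic data stands in for X_train / y_train.
X_train, y_train = make_classification(n_samples=400, random_state=42)

# Mirrors n_splits=5 and random_state=42 from the defaults listed above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 8),  # the documented default grid
    cv=cv,
    scoring="accuracy",
)

# Summary statistics at the largest training size (last row); an assumption.
mean_train = train_scores[-1].mean()
mean_val = val_scores[-1].mean()
gap = mean_train - mean_val        # compared against overfit_threshold
bias_proxy = 1.0 - mean_val        # compared against high_bias_threshold
val_std = val_scores[-1].std()     # spread across folds, a variance proxy
print(f"gap={gap:.4f}  bias_proxy={bias_proxy:.4f}  val_std={val_std:.4f}")
```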
Multi-model functions
# All multi-model functions accept:
models = {"name": fitted_estimator, ...} # required
colors = None # list of colours, one per model
figsize = None # auto-sized if not given
save_path = None # saves figure to this path
# compare_metrics / compare_interpret also accept:
verbose = True
# plot_confusion_matrices / comparison_dashboard also accept:
class_labels = None # e.g. ["No", "Yes"]
# compare_bias_variance also accepts all bv_stats parameters
Return values
| Function | Returns |
|---|---|
| `metrics()` | dict: accuracy, f1, precision, recall, roc_auc, y_pred, y_prob, report |
| `interpret()` | str: full interpretation text |
| `bv_stats()` | dict: lc_sizes, lc_train, lc_val, mean_train, mean_val, val_std, gap, bias_proxy, diagnosis, explanation |
| `compare_metrics()` | pandas.DataFrame: one row per model |
| `compare_interpret()` | dict: {model_name: interpretation_string} |
| all `plot_*` functions | None |
| `model_summary()` | None |
| `comparison_dashboard()` | None |
| `bias_variance()` | None |
| `compare_bias_variance()` | None |
Roadmap
- v1.2 — Threshold optimiser (best decision threshold for F1 / Recall)
- v1.3 — Probability calibration plot (reliability diagram)
- v1.4 — Feature importance comparison across models
- v2.0 — Auto PDF report
Contributing
Issues and pull requests are welcome.
License
MIT