
Model Evaluation Toolkit — bias-variance analysis, ROC curves, model summary & multi-model comparison


ml-evaluator

Stop copy-pasting evaluation code.


pip install ml-evaluator

What is it?

ml-evaluator turns model evaluation from a chore into a single function call.

Every function is standalone — pass your model, X_test, and y_test and you're done. No pipelines to build, no intermediate objects to store, no boilerplate to copy.

Built for people who:

  • know how to train a model but want clean, reproducible evaluation
  • are learning ML and want plots + plain-English interpretation, not just numbers
  • are tired of writing the same confusion-matrix / ROC / bias-variance code every project

Install

pip install ml-evaluator
import ml_evaluator as ev

Requirements: Python 3.8+ · numpy · pandas · matplotlib · scikit-learn


API

Every piece of evaluation is its own function — use exactly what you need.

🔹 Single model — text output

ev.metrics(model, X_test, y_test)           # numbers only, no plot
ev.interpret(model, X_test, y_test)         # plain-English interpretation, no plot
ev.classification_report(model, X_test, y_test)  # classification report table, no plot

Example output — ev.metrics()
=======================================================
  Model Summary — Random Forest
=======================================================
  Accuracy  : 0.9625
  F1        : 0.9634
  Precision : 0.9518
  Recall    : 0.9753
  ROC-AUC   : 0.9930

Example output — ev.interpret()
  Interpretation — Random Forest
  ─────────────────────────────────────────────────────
  ✅  Accuracy 0.963 — high overall correctness.
  ✅  F1 0.963 — strong balance between precision and recall.
  ✅  Precision (0.952) ≈ Recall (0.975) — well balanced.
       AUC = 0.993 — Excellent. The model clearly separates the two classes.

Example output — ev.classification_report()
  Classification Report — Random Forest
  ────────────────────────────────────────────────────────
  Class             Precision   Recall  F1-Score  Support
  ────────────────────────────────────────────────────────
  0                     0.970    0.952     0.961      100
  1                     0.953    0.971     0.962      100
  ────────────────────────────────────────────────────────
  Accuracy                                 0.963      200
  macro avg             0.961    0.962     0.961
  weighted avg          0.961    0.962     0.961
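
The accuracy, precision, recall and F1 figures in these reports all derive from the four confusion-matrix counts. The sketch below shows that relationship in plain Python for a binary problem. It is an illustration only, not ml-evaluator's implementation; `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 4 TP, 4 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

When precision and recall are close, as in the example output above, F1 sits between them; a large gap between the two pulls F1 toward the lower value.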

🔹 Single model — individual plots

ev.plot_confusion_matrix(model, X_test, y_test)   # confusion matrix only
ev.plot_roc_curve(model, X_test, y_test)           # ROC curve only
ev.plot_metrics_bar(model, X_test, y_test)         # metrics bar chart only
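
For readers new to ROC curves, here is a rough sketch of what such a plot represents: sweep the decision threshold from high to low, record the (false positive rate, true positive rate) pair at each step, and integrate the curve to get the AUC. This is plain Python for illustration and is not how ml-evaluator computes it; `roc_points`, `auc` and the toy data are hypothetical.

```python
def roc_points(y_true, y_score):
    """Return ROC points (fpr, tpr), starting at (0, 0) and ending at (1, 1)."""
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predicted probabilities for the positive class.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
curve = roc_points(y_true, y_score)
area = auc(curve)
print(curve)
print(round(area, 4))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the 0.993 in the example output is described as excellent. The sketch assumes both classes are present in `y_true`.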

🔹 Single model — all-in-one

ev.model_summary(model, X_test, y_test)

Produces a 2×2 dashboard in one figure:

┌─────────────────────┬─────────────────────┐
│  A · Confusion      │  B · ROC Curve      │
│      Matrix         │                     │
├─────────────────────┼─────────────────────┤
│  C · Metrics        │  D · Classification │
│      Bar Chart      │      Report         │
└─────────────────────┴─────────────────────┘

model_summary() also prints the metrics and the interpretation to the terminal.


🔸 Bias–Variance — single model

ev.bv_stats(model, X_train, y_train)          # stats + diagnosis only, no plot
ev.plot_learning_curve(model, X_train, y_train)   # learning curve plot only
ev.bias_variance(model, X_train, y_train)     # stats + plot together

Example output — ev.bv_stats()
  ✅ Random Forest
     Train Acc : 1.0000
     Val   Acc : 0.9359
     Gap       : 0.0641   (threshold: 0.10)
     Val Std   : 0.0167   (variance proxy)
     Diagnosis : Good Fit
     Train–val gap is 0.064 and val error is 0.064.
     The model generalises well — no strong signs of overfit or underfit.
     → Next step: evaluate on the held-out test set.
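
As a rough guide to how figures like these can be derived, the sketch below summarises per-fold cross-validation scores into a gap, a variance proxy and a bias proxy. The key names follow the bv_stats() return value, but the formulas are assumptions: the gap is taken as mean train score minus mean validation score, the variance proxy as the population standard deviation of the validation scores, and the bias proxy as 1 minus the mean validation score. `summarise_folds` and the fold scores are hypothetical.

```python
from statistics import mean, pstdev


def summarise_folds(train_scores, val_scores):
    """Summarise per-fold accuracies (assumed formulas, see lead-in)."""
    mean_train = mean(train_scores)
    mean_val = mean(val_scores)
    return {
        "mean_train": mean_train,
        "mean_val": mean_val,
        "gap": mean_train - mean_val,      # train-validation gap
        "val_std": pstdev(val_scores),     # spread across folds
        "bias_proxy": 1.0 - mean_val,      # validation error
    }


# Hypothetical accuracies from a 5-fold run.
train_scores = [1.0, 1.0, 1.0, 1.0, 1.0]
val_scores = [0.92, 0.94, 0.95, 0.93, 0.94]
stats = summarise_folds(train_scores, val_scores)
for key, value in stats.items():
    print(f"{key:<11}: {value:.4f}")
```

With a perfect training score, the gap and the bias proxy coincide, which matches the example output where both are reported as 0.064.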

Diagnosis logic:

Condition                  Diagnosis      What it means
──────────────────────────────────────────────────────────────────────────────────────
train − val gap > 0.10     🔴 Overfit     Model memorises training data, fails to generalise
1 − val_accuracy > 0.15    🟡 Underfit    Model too simple to capture the pattern
otherwise                  ✅ Good Fit    Model generalises well

Thresholds are configurable:

ev.bias_variance(
    model, X_train, y_train,
    overfit_threshold=0.05,      # stricter
    high_bias_threshold=0.10,
)
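
The diagnosis rule and its two thresholds can be sketched as a small function. This is a minimal illustration, not the library's code: `diagnose` is a hypothetical name, and it assumes the conditions are checked in the order listed, so a model that meets both conditions is labelled Overfit.

```python
def diagnose(train_acc, val_acc,
             overfit_threshold=0.10, high_bias_threshold=0.15):
    """Classify a fit from mean train and validation accuracy."""
    gap = train_acc - val_acc      # train-validation gap
    val_error = 1.0 - val_acc      # validation error
    if gap > overfit_threshold:
        return "Overfit"
    if val_error > high_bias_threshold:
        return "Underfit"
    return "Good Fit"


# Values from the example output: train 1.0000, validation 0.9359.
default_result = diagnose(1.0, 0.9359)
strict_result = diagnose(1.0, 0.9359,
                         overfit_threshold=0.05, high_bias_threshold=0.10)
print(default_result)
print(strict_result)
```

With the default thresholds the gap of 0.0641 stays under 0.10, so the model is a good fit; under the stricter 0.05 threshold the same model is flagged as overfit.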

🔸 Multiple models — text output

ev.compare_metrics(models, X_test, y_test)     # metrics table + winner per metric, no plot
ev.compare_interpret(models, X_test, y_test)   # interpretation per model, no plot

Example output — ev.compare_metrics()
=================================================================
  Model Comparison
=================================================================
  Model                   Accuracy        F1 Precision    Recall   ROC-AUC
  ──────────────────────────────────────────────────────────────
  Random Forest             0.9625    0.9634    0.9518    0.9753    0.9930
  Logistic Regression       0.9375    0.9412    0.8989    0.9877    0.9833

  Winners by metric:
    Accuracy     → Random Forest        (0.9625)
    F1           → Random Forest        (0.9634)
    Precision    → Random Forest        (0.9518)
    Recall       → Logistic Regression  (0.9877)
    ROC-AUC      → Random Forest        (0.9930)

  ✅  Overall recommendation: Random Forest
      (weighted score — F1 & Recall weighted highest)
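
To make the "winner per metric" and "weighted score" ideas concrete, here is a small sketch using the numbers from the table above. The per-metric winner is simply the model with the highest value. The overall recommendation uses a weighted sum, but the weights below are invented for illustration: the output only says that F1 and Recall are weighted highest, and the library's actual weights are not documented here.

```python
# Metric values copied from the example comparison table.
results = {
    "Random Forest": {
        "accuracy": 0.9625, "f1": 0.9634, "precision": 0.9518,
        "recall": 0.9753, "roc_auc": 0.9930,
    },
    "Logistic Regression": {
        "accuracy": 0.9375, "f1": 0.9412, "precision": 0.8989,
        "recall": 0.9877, "roc_auc": 0.9833,
    },
}

# Hypothetical weights (assumption): F1 and recall count the most.
HYPOTHETICAL_WEIGHTS = {
    "accuracy": 0.15, "f1": 0.30, "precision": 0.10,
    "recall": 0.30, "roc_auc": 0.15,
}


def winners_by_metric(results):
    """Map each metric name to the model with the highest value."""
    metric_names = list(next(iter(results.values())))
    return {m: max(results, key=lambda name: results[name][m])
            for m in metric_names}


def weighted_score(metrics, weights):
    """Weighted sum of a single model's metrics."""
    return sum(weights[m] * metrics[m] for m in weights)


winners = winners_by_metric(results)
overall = max(results,
              key=lambda name: weighted_score(results[name], HYPOTHETICAL_WEIGHTS))
print(winners)
print("Overall:", overall)
```

A single weighted score is convenient, but the per-metric winners are worth reading too: here Logistic Regression still has the better recall, which may matter more than the overall score when missed positives are costly.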

🔸 Multiple models — individual plots

ev.plot_confusion_matrices(models, X_test, y_test)   # one matrix per model
ev.compare_roc_curves(models, X_test, y_test)         # overlaid ROC curves
ev.plot_metrics_comparison(models, X_test, y_test)    # grouped bar chart

🔸 Multiple models — all-in-one shortcuts

ev.comparison_dashboard(models, X_test, y_test)
ev.compare_bias_variance(models, X_train, y_train)

comparison_dashboard — full 3-row dashboard:

Row 1 — Confusion matrix per model
Row 2 — Overlaid ROC curves  +  Metrics bar chart
Row 3 — Colour-coded summary table (green = top, red = bottom)

compare_bias_variance — two-row B-V dashboard:

Row 1 — Learning curve per model with diagnosis label
Row 2 — Bias proxy  ·  Variance  ·  Overfitting gap comparison

All parameters

Single model functions

ev.metrics(model, X_test, y_test, model_name="Model", verbose=True)
ev.interpret(model, X_test, y_test, model_name="Model", verbose=True)
ev.classification_report(model, X_test, y_test, model_name="Model", class_labels=None)

ev.plot_confusion_matrix(model, X_test, y_test,
    model_name="Model",
    class_labels=None,       # e.g. ["Not Churned", "Churned"]
    color="#2E75B6",
    figsize=(5, 4.5),
    save_path=None,
)

ev.plot_roc_curve(model, X_test, y_test,
    model_name="Model",
    color="#2E75B6",
    figsize=(6, 5),
    save_path=None,
)

ev.plot_metrics_bar(model, X_test, y_test,
    model_name="Model",
    color="#2E75B6",
    figsize=(7, 4),
    save_path=None,
)

ev.model_summary(model, X_test, y_test,
    model_name="Model",
    class_labels=None,
    color="#2E75B6",
    figsize=(14, 10),
    save_path=None,
)

Bias–Variance functions

ev.bv_stats(model, X_train, y_train,
    model_name="Model",
    n_splits=5,
    random_state=42,
    train_sizes=None,            # default: linspace(0.1, 1.0, 8)
    scoring="accuracy",
    overfit_threshold=0.10,
    high_bias_threshold=0.15,
    verbose=True,
)

ev.plot_learning_curve(model, X_train, y_train,
    model_name="Model",
    n_splits=5,
    random_state=42,
    train_sizes=None,
    scoring="accuracy",
    overfit_threshold=0.10,
    high_bias_threshold=0.15,
    color="#2E75B6",
    figsize=(8, 5),
    save_path=None,
)

# bias_variance() accepts all parameters above
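
The default `train_sizes` of linspace(0.1, 1.0, 8) means eight evenly spaced fractions of the available training data. The sketch below expands those fractions into sample counts in plain Python. It assumes the scikit-learn `learning_curve` convention, where fractions are relative to the largest training fold and truncated to integers; whether ml-evaluator follows this exactly is an assumption. The 1000-sample dataset and the helper names are hypothetical.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop, both ends included."""
    values = [start + (stop - start) * i / (num - 1) for i in range(num)]
    values[-1] = stop  # pin the endpoint against float rounding
    return values


def absolute_train_sizes(n_samples, n_splits, fractions):
    """Convert fractions into sample counts for k-fold cross-validation."""
    # With k folds, each training split holds (k - 1) / k of the samples.
    max_train = n_samples * (n_splits - 1) // n_splits
    return [int(f * max_train) for f in fractions]


fractions = linspace(0.1, 1.0, 8)
sizes = absolute_train_sizes(n_samples=1000, n_splits=5, fractions=fractions)
print([round(f, 3) for f in fractions])
print(sizes)
```

Passing a shorter list for `train_sizes` reduces the number of model fits, since each size is trained once per fold.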

Multi-model functions

# All multi-model functions accept:
models    = {"name": fitted_estimator, ...}   # required
colors    = None     # list of colours, one per model
figsize   = None     # auto-sized if not given
save_path = None     # saves figure to this path

# compare_metrics / compare_interpret also accept:
verbose   = True

# plot_confusion_matrices / comparison_dashboard also accept:
class_labels = None  # e.g. ["No", "Yes"]

# compare_bias_variance also accepts all bv_stats parameters

Return values

Function                  Returns
────────────────────────────────────────────────────────────────────────────────
metrics()                 dict: accuracy, f1, precision, recall, roc_auc, y_pred, y_prob, report
interpret()               str: full interpretation text
bv_stats()                dict: lc_sizes, lc_train, lc_val, mean_train, mean_val, val_std, gap, bias_proxy, diagnosis, explanation
compare_metrics()         pandas.DataFrame: one row per model
compare_interpret()       dict: {model_name: interpretation_string}
all plot_* functions      None
model_summary()           None
comparison_dashboard()    None
bias_variance()           None
compare_bias_variance()   None

Roadmap

  • v1.2 — Threshold optimiser (best decision threshold for F1 / Recall)
  • v1.3 — Probability calibration plot (reliability diagram)
  • v1.4 — Feature importance comparison across models
  • v2.0 — Auto PDF report

Contributing

Issues and pull requests are welcome.


License

MIT
