
Optimize decision boundary/threshold for predicted probabilities from binary classification

Project description

threshold_optimizer

This Python library helps you evaluate predicted probabilities from a binary classification task by reporting the optimum probability threshold for each metric you choose.

Introduction

Classification tasks in machine learning involve models or algorithms learning to assign class labels to elements of a set. Binary classification is the process of assigning each element to one of two class labels on the basis of a classification rule. Examples of binary classification include classifying mail as 'spam' or 'not spam', medical tests ('cancer detected' or 'cancer not detected'), and churn prediction ('churn' or 'not').

Evaluating machine learning models is an important part of building them. Evaluation relies on classification metrics, and the right metric depends on the nature of the problem you're solving and the cost of false predictions. Common choices include the confusion matrix, accuracy, precision, recall, F1 score, and the ROC curve. However, most of these metrics are computed from class labels, so their values depend on the threshold used to turn probabilities into labels.

For instance, to map a probability from logistic regression to a binary category, you must define a classification threshold (also called the decision threshold). In a cancer screening model, a value above the threshold indicates "Patient has cancer"; a value below indicates "Patient does not have cancer." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent and are therefore values that you must tune.
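To make the effect of the threshold concrete, here is a minimal sketch using NumPy only. The probabilities and the helper name `to_classes` are made up for illustration and are not part of this library.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class for six samples.
probabilities = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])


def to_classes(probabilities, threshold):
    """Label a sample 1 when its probability exceeds the threshold, else 0."""
    return (probabilities > threshold).astype(int)


# The same probabilities produce different labels under different thresholds.
default_classes = to_classes(probabilities, 0.5)
strict_classes = to_classes(probabilities, 0.8)
lenient_classes = to_classes(probabilities, 0.2)

print("threshold 0.5:", default_classes)
print("threshold 0.8:", strict_classes)
print("threshold 0.2:", lenient_classes)
```

A stricter threshold yields fewer positive predictions and a more lenient one yields more, which is why every label-based metric shifts with the threshold.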

This library outputs the optimum threshold value for the metric you use to evaluate your classification model. Optimum thresholds are available for the following metrics:

Accuracy

F1 Score

Recall

Specificity

Precision
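One way to picture what "optimum threshold for a metric" means is a simple grid search over candidate thresholds. The sketch below is an assumption about the general idea, not this library's implementation; the names `confusion_counts`, `safe_divide`, `METRICS` and `best_threshold`, the 101-point grid, and the sample data are all hypothetical.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn


def safe_divide(numerator, denominator):
    """Return 0.0 instead of failing when a metric is undefined."""
    return numerator / denominator if denominator else 0.0


# The five metrics listed above, written in terms of confusion-matrix counts.
METRICS = {
    "accuracy": lambda tp, tn, fp, fn: safe_divide(tp + tn, tp + tn + fp + fn),
    "f1": lambda tp, tn, fp, fn: safe_divide(2 * tp, 2 * tp + fp + fn),
    "recall": lambda tp, tn, fp, fn: safe_divide(tp, tp + fn),
    "specificity": lambda tp, tn, fp, fn: safe_divide(tn, tn + fp),
    "precision": lambda tp, tn, fp, fn: safe_divide(tp, tp + fp),
}


def best_threshold(y_true, y_score, metric, thresholds=None):
    """Return (threshold, score) for the first grid point with the top score."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    scorer = METRICS[metric]
    best_t, best_score = None, -1.0
    for t in thresholds:
        y_pred = (y_score > t).astype(int)
        score = scorer(*confusion_counts(y_true, y_pred))
        if score > best_score:  # strict comparison keeps the earliest winner
            best_t, best_score = float(t), score
    return best_t, best_score


# Hypothetical labels and positive-class probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.60, 0.65, 0.90])

acc_threshold, acc_score = best_threshold(y_true, y_score, "accuracy")
rec_threshold, rec_score = best_threshold(y_true, y_score, "recall")
print("accuracy:", acc_threshold, acc_score)
print("recall:", rec_threshold, rec_score)
```

Note that recall on its own is maximized by predicting every sample as positive, so the search returns a threshold of 0.0 for it here. Single-sided metrics such as recall, specificity and precision are usually more informative when read alongside a second metric.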

Requirements

scikit-learn == 0.24.0

pandas == 0.25.1

numpy == 1.17.1

Installation

Usage

Code To Follow

from threshold_optimizer import ThresholdOptimizer
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load data sets
X, y = datasets.load_breast_cancer(return_X_y=True)

# train, val, test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

# fit estimator
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
# predict probabilities
predicted_probabilities = clf.predict_proba(X_val)

# apply optimization
thresh_opt = ThresholdOptimizer(
        y_score = predicted_probabilities,
        y_true = y_val
    )

# optimize for accuracy and f1 score
thresh_opt.optimize_metrics(
        metrics=['accuracy', 'f1'],
        verbose=True
    )

# display results
print(thresh_opt.optimized_metrics)

# access threshold per metric
accuracy_threshold = thresh_opt.optimized_metrics.accuracy.best_threshold
f1_threshold = thresh_opt.optimized_metrics.f1.best_threshold

# use best accuracy threshold for test set to convert probabilities to classes
predicted_probabilities = clf.predict_proba(X_test)
classes = np.where(predicted_probabilities[:, 1] > accuracy_threshold, 1, 0)
print(classes)
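Once probabilities have been converted to classes, it is worth checking how the tuned threshold performs on the held-out test set. The following is a minimal sketch using standard scikit-learn metric functions; the arrays and the threshold value are hypothetical stand-ins for `y_test`, `clf.predict_proba(X_test)[:, 1]` and `accuracy_threshold` from the example above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical stand-ins for the test labels and positive-class probabilities.
y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1])
positive_probabilities = np.array([0.20, 0.85, 0.55, 0.62, 0.91, 0.10, 0.48, 0.77])

# Placeholder value; in practice this comes from tuning on the validation set.
accuracy_threshold = 0.5

# Convert probabilities to class labels with the tuned threshold.
classes = np.where(positive_probabilities > accuracy_threshold, 1, 0)

# Score the thresholded predictions against the true test labels.
test_accuracy = accuracy_score(y_test, classes)
test_f1 = f1_score(y_test, classes)
tn, fp, fn, tp = confusion_matrix(y_test, classes).ravel()

print("accuracy:", test_accuracy)
print("f1:", test_f1)
print("tn, fp, fn, tp:", tn, fp, fn, tp)
```

Because the threshold is tuned on the validation split, a large gap between validation and test scores suggests the threshold may have overfit the validation data.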

Key Terminologies

:TODO


Source Distribution

threshold_optimizer-0.0.1a2.tar.gz (5.7 kB view details)

Uploaded Source

Built Distribution

threshold_optimizer-0.0.1a2-py3-none-any.whl (7.0 kB view details)

Uploaded Python 3

