nproc: Neyman-Pearson (NP) Classification Algorithms and NP Receiver Operating Characteristic (NP-ROC) Curves

Python Implementation Authors

Richard Zhao, Yang Feng, Jingyi Jessica Li and Xin Tong

In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit the type I error rate (i.e., the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes the type II error rate (i.e., the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, alpha, on the type I error rate. Although the NP paradigm has a century-long history in hypothesis testing, it has not been well recognized and implemented in classification schemes. Common practices that directly limit the empirical type I error rate to no more than alpha do not satisfy the type I error rate control objective, because the resulting classifiers are still likely to have type I error rates much larger than alpha; as a result, the NP paradigm has not been properly implemented for many classification scenarios in practice. In this work, we develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, including popular methods such as logistic regression, support vector machines, and random forests. Powered by this umbrella algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular receiver operating characteristic (ROC) curves. NP-ROC bands help compare different NP classifiers and choose among them in a data-adaptive way.
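
In symbols (a restatement of the paradigm above, with phi a classifier, X the covariates, and Y the 0/1 label), an NP classifier solves

    \min_{\phi} \; R_1(\phi) = P(\phi(X) = 0 \mid Y = 1) \quad \text{subject to} \quad R_0(\phi) = P(\phi(X) = 1 \mid Y = 0) \le \alpha .

Since R_0 can only be estimated from data, the umbrella algorithm enforces the constraint with high probability: the type I error rate of the returned classifier exceeds alpha with probability at most delta, a user-chosen violation tolerance (the delta argument below).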

Details

See details in: https://doi.org/10.1126/sciadv.aao1659

Application to social media data: https://doi.org/10.1080/01621459.2020.1740711

Usage

npc(x, y, method = ("logistic", "svm", "nb", "nb_m", "rf", "dt", "keras"), model = None, alpha = 0.05, delta = 0.05, split = 1, split_ratio = 0.5, n_cores = 1, band = False, randSeed = 0)

Arguments

x           n * p observation matrix: n observations, p covariates.
y           n-vector of 0/1 class labels.
method      logistic: Logistic Regression.
            svm: Support Vector Machine.
            nb: Gaussian Naive Bayes.
            nb_m: Multinomial Naive Bayes.
            rf: Random Forest.
            dt: Decision Tree.
            keras: Keras Deep Learning. A model must be provided.
model       use the specified model object instead of method. Default is None, in which case a model is created from method.
alpha       the desired upper bound on the type I error rate. Default = 0.05.
delta       the tolerated probability that the type I error rate exceeds alpha. Default = 0.05.
split       the number of random splits of the class 0 sample. Default = 1. For the ensemble version, choose split > 1.
split_ratio the fraction of each class 0 split used to train the classifier; the remainder is used to choose the threshold. Default = 0.5.
n_cores     the number of cores used for parallel computing. Default = 1.
band        whether to generate both lower and upper bounds on the type II error rate (for NP-ROC bands). Default = False.
randSeed    the random seed used in the algorithm. Default = 0.
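
For illustration, here is a sketch of a call with non-default arguments. It assumes the npc object `test` and the data (x, y) constructed in the example below; the argument values are illustrative only.

fit_ensemble = test.npc(x, y, method='rf',
                        alpha=0.05,       # upper bound on the type I error rate
                        delta=0.05,       # tolerated violation probability of alpha
                        split=11,         # ensemble over 11 random splits of the class 0 sample
                        split_ratio=0.5,  # fraction of each split used for training
                        n_cores=2,
                        band=True,        # also compute type II error rate bounds
                        randSeed=0)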

Example

import numpy as np
import os
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from nproc import npc


# Create an npc object, which exposes the npc and predict methods.
test = npc()

# Fix the NumPy seed so the toy data are reproducible.
np.random.seed(0)

# Create a dataset (x,y) with 2 features, binary label and sample size 10000.
n = 10000
x = np.random.normal(0, 1, (n,2))
c = 1+3*x[:,0]
y = np.random.binomial(1, 1/(1+np.exp(-c)), n)

# Call the npc function to construct Neyman-Pearson classifiers.
# The default type I error rate upper bound is alpha=0.05.
fit = test.npc(x, y, 'logistic', n_cores=os.cpu_count())

# Evaluate the prediction of the NP classifier fit on a test set (x_test, y_test).
x_test = np.random.normal(0, 1, (n,2))
c_test = 1+3*x_test[:,0]
y_test = np.random.binomial(1, 1/(1+np.exp(-c_test)), n)

# Calculate the overall accuracy of the classifier as well as the realized
# type I error rate on the test data.
# Strictly speaking, to demonstrate the effectiveness of the fitted classifier
# under the NP paradigm, we should repeat this experiment many times and
# show that in at least a 1 - delta fraction of the repetitions, the type I
# error rate is no larger than alpha (see the sketch after this example).

fitted_score = test.predict(fit, x)
print("Accuracy on training set:", accuracy_score(y, fitted_score[0]))
pred_score = test.predict(fit, x_test)
print("Accuracy on test set:", accuracy_score(y_test, pred_score[0]))

cm = confusion_matrix(y_test, pred_score[0])
print("Confusion matrix:")
print(cm)
tn, fp, fn, tp = cm.ravel()
print("Type I error rate: {:.5f}".format(fp/(fp+tn)))
print("Type II error rate: {:.5f}".format(fn/(fn+tp)))
