A Python package for simultaneous regression and binary classification for educational analytics.

Project description

dualPredictor

by D

dualPredictor is a Python package that can provide simultaneous regression and binary classification results for tabular datasets.

  • Simultaneous Predictions: A single model that performs regression and binary classification at the same time.
  • Regressor Selection (choose one): Choose Lasso, Ridge, or LinearRegression (OLS) as the base regression model.
  • Dynamic Cut-off Tuning (choose one metric): Automatically tunes the cut-off value to maximize the Youden index, F1 score, or F2 score. Users choose the metric type.

1. Youden Index (J)

$$J = Recall + Specificity - 1$$

J is a measure of the overall performance of a binary classifier, calculated as the sum of recall and specificity minus 1. A high J statistic indicates that the classifier performs well on both positive and negative cases.

  • Recall measures a classifier's ability to identify positive cases correctly. High recall means the classifier avoids missed detections (false negatives).
  • Specificity measures a classifier's ability to identify negative cases correctly. High specificity means the classifier avoids false alarms (false positives).
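As a concrete check on these definitions, recall, specificity, and J can be computed from a confusion matrix; the labels below are made-up toy data, not from the package's examples:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 1 = positive (e.g., at-risk), 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)        # sensitivity: share of positives caught
specificity = tn / (tn + fp)   # share of negatives correctly rejected
j = recall + specificity - 1   # Youden's J
```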

2. F scores (Option F1, F2 in Package)

$$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

The F1 score is another measure of the overall performance of a binary classifier, calculated as the harmonic mean of precision and recall. A high F1 score indicates that the classifier achieves both high precision and high recall.

$$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$$

The F-score with factor beta generalizes the F1 score by weighting precision and recall differently: a beta value less than 1 favors precision, while a beta value greater than 1 favors recall (e.g., F2 weights recall more heavily).
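Both scores are available directly in scikit-learn. A quick sketch comparing F1 and F2 on the same toy predictions (here recall exceeds precision, so the recall-leaning F2 comes out higher):

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy labels (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]  # precision = 3/5, recall = 3/4

f1 = f1_score(y_true, y_pred)             # harmonic mean of precision, recall
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta > 1: recall-leaning
```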

In educational settings

In educational settings, avoiding missed detections (i.e., failing to identify at-risk students) is important, but so is avoiding false alarms (i.e., flagging students as at-risk when they are not). A recall-leaning measure is therefore often desirable, such as an F-score with beta > 1 (the F2 score). Youden's J statistic and the F1 score both balance missed detections against false alarms; however, Youden's J is less sensitive to false alarms than the F1 score, because specificity reacts less strongly to false alarms than precision does.
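The dynamic cut-off tuning described above can be sketched as a simple sweep over candidate cut-offs, keeping the one that maximizes the chosen metric. This is an illustration of the idea, not the package's internal code; it assumes (consistent with the examples below) that label 1 means a value below the cut-off:

```python
import numpy as np

def tune_cut_off(y_pred, y_label_true):
    """Sweep candidate cut-offs over the predicted values and return the one
    that maximizes Youden's J against the given binary labels."""
    best_cut, best_j = None, -np.inf
    for cut in np.unique(y_pred):
        y_label_pred = (y_pred < cut).astype(int)  # 1 = at-risk (below cut-off)
        tp = np.sum((y_label_true == 1) & (y_label_pred == 1))
        fn = np.sum((y_label_true == 1) & (y_label_pred == 0))
        tn = np.sum((y_label_true == 0) & (y_label_pred == 0))
        fp = np.sum((y_label_true == 0) & (y_label_pred == 1))
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        j = recall + specificity - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```

Maximizing the F1 or F2 score instead only changes the expression computed inside the loop.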

Installation

Install dualPredictor directly from PyPI using pip:

pip install dualPredictor

Or install directly from the GitHub repo:

pip install git+https://github.com/098765d/dualPredictor.git

Dependencies

dualPredictor requires:

  • numpy
  • scikit-learn
  • matplotlib
  • seaborn

DualModel

The DualModel class is a custom regressor that combines a base regression model (lasso, ridge, or OLS) with a dual classification approach. It allows for tuning an optimal cut-off value to classify samples into two classes based on the predicted regression values.

Parameters

  • model_type (str, default='lasso'): The base regression model to use. Supported options are 'lasso', 'ridge', and 'ols' (Ordinary Least Squares).

  • metric (str, default='youden_index'): The metric used to tune the optimal cut-off value. Supported options are 'f1_score', 'f2_score', and 'youden_index'.

  • default_cut_off (float, default=0.5): The default cut-off value used to create binary labels. Samples with regression values below the cut-off are labeled as 1 (e.g., at-risk), and samples at or above the cut-off are labeled as 0.

Methods

  • fit(X, y): Fit the DualModel to the training data.

    • Parameters:

      • X (array-like of shape (n_samples, n_features)): The input training data.
      • y (array-like of shape (n_samples,)): The target values.
    • Returns:

      • self: Fitted DualModel instance.
  • predict(X): Predict the input data's regression values and binary classifications.

    • Parameters:

      • X (array-like of shape (n_samples, n_features)): The input data for prediction.
    • Returns:

      • grade_predictions (array-like of shape (n_samples,)): The predicted regression values.
      • class_predictions (array-like of shape (n_samples,)): The predicted binary classifications based on the optimal cut-off.

Attributes

  • alpha_: The alpha value of the model. Only available for Lasso and Ridge models (OLS has no alpha).
  • coef_: The coefficients of the model.
  • intercept_: The intercept of the model.
  • feature_names_in_: The names of the features used to train the model.
  • optimal_cut_off: The optimal cut-off value determined by the specified metric.
  • y_label_true_: The true binary labels generated using the default cut-off value.

Example

# Import the DualModel class
from dual_model import DualModel

# Initializing and fitting the DualModel
# 'ols' for Ordinary Least Squares, a default cut-off value is provided
# The metric parameter specifies the method to tune the optimal cut-off
dual_clf = DualModel(model_type='ols', metric='youden_index', default_cut_off=1)
dual_clf.fit(X, y)

# Accessing the true binary labels generated based on the default cut-off
y_label_true = dual_clf.y_label_true_

# Retrieving the optimal cut-off value tuned based on the Youden Index
optimal_cut_off = dual_clf.optimal_cut_off

# Predicting grades (y_pred) and binary classification (at-risk or not) based on the optimal cut-off (y_label_pred)
y_pred, y_label_pred = dual_clf.predict(X)

Examples of Model Performances Plot

# Visualizations: plot_scatter, plot_cm, and plot_feature_coefficients are
# plotting helpers shipped with the package (import them per its documentation)
# Plotting the actual vs. predicted values to assess regression performance
scatter_plot_fig = plot_scatter(y_pred, y)

# Plotting the confusion matrix to evaluate binary classification performance
cm_plot = plot_cm(y_label_true, y_label_pred)

# Plotting the non-zero coefficients of the regression model to interpret feature importance
feature_plot = plot_feature_coefficients(coef=dual_clf.coef_, feature_names=dual_clf.feature_names_in_)

Example 1: UCI Student Performance Dataset

Link to UCI student Performance Dataset

https://www.kaggle.com/code/ddatad/dual-predictor-demo?scriptVersionId=167940301

Train/Test Data Information:

  • Number of data points in training set: 454 (70.0%)
  • Number of data points in test set: 195 (30.0%)

If the default cut-off = 10 (label = 1 marks failing students), selecting lasso + youden_index:

Train set performance

  • Number of data points: 454
  • Number of total positive (label=1): 74
  • Number of miss detects: 2
  • Number of false alarms: 61
  • Classification rate: 0.861
  • R2 = 0.83, MSE = 1.68

Test set performance

  • Number of data points: 195
  • Number of total positive (label=1): 26
  • Number of miss detects: 1
  • Number of false alarms: 22
  • Classification rate: 0.882
  • R2 = 0.88, MSE = 1.3
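The classification rates reported above follow directly from the counts: the share of samples that are neither missed detections nor false alarms, i.e. (n − miss detects − false alarms) / n. A quick check:

```python
def classification_rate(n, miss_detects, false_alarms):
    # Fraction of samples that are neither missed detections nor false alarms
    return (n - miss_detects - false_alarms) / n

train_rate = classification_rate(454, 2, 61)  # train set counts from above
test_rate = classification_rate(195, 1, 22)   # test set counts from above
```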

Example 2: a Local University Students Program GPA Prediction

Since the test-set students do not have y-labels, only the train-set performance can be shown. Default cut-off = 2.5, lasso + youden_index.

Train set performance

  • Number of data points: 154
  • Number of true positive (label=1): 5
  • Number of miss detects: 0
  • Number of false alarms: 6
  • Classification rate: 0.961
  • R2 = 0.96
  • Optimal cut-off: 2.70

Test set performance

  • Number of data points: 71
  • Number of label = 1 prediction: 3

Example 3: Object Oriented Programming Class Student Grades from Mugla Sitki Kocman University | '19 OOP Class Student Grades

https://www.kaggle.com/datasets/onurduman/grades/data

Train/Test Data Information:

  • Number of data points in training set: 33 (60.0%)
  • Number of data points in test set: 22 (40.0%)

If the default cut-off = 50 (label = 1 marks failing students), selecting ols + youden_index:

Train set performance

  • Number of data points: 33
  • Number of true positive (label=1): 21
  • Number of miss detects: 1
  • Number of false alarms: 1
  • Classification rate: 0.939
  • R2 = 0.94
  • Optimal cut-off: 50

Test set performance

  • Number of data points: 22
  • Number of true positive (label=1): 13
  • Number of miss detects: 2
  • Number of false alarms: 0
  • Classification rate: 0.909
  • R2 = 0.65

