
A Python package for simultaneous regression and binary classification for educational analytics.

Project description

dualPredictor

by D

dualPredictor is a Python package that can provide simultaneous regression and binary classification results for tabular datasets.

  • Simultaneous Predictions: A single model that performs regression and binary classification tasks simultaneously.
  • Regressor Selection (choose one): Choose from Lasso, Ridge, or LinearRegression(OLS) as the base regression model.
  • Dynamic Cutoff Tuning (choose one metric): Automatically tunes the cutoff value to maximize the Youden index, F1 score, or F2 score, as selected by the user.

1. Youden Index (J)

https://miro.medium.com/v2/resize:fit:842/1*LVilqC3cy4AgyC1wD4RH-A.png

$$J = Recall + Specificity - 1$$

J is a measure of the overall performance of a binary classifier. It is calculated as the sum of the recall and specificity minus 1. A high J statistic indicates that the classifier performs well on both positive and negative cases.

  • Recall measures a classifier's ability to identify positive cases correctly. A high recall means that the classifier avoids missed detections (false negatives).
  • Specificity measures a classifier's ability to identify negative cases correctly. A high specificity means that the classifier avoids false alarms (false positives).
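
As a small worked illustration of the formula above, the following sketch computes J from confusion-matrix counts. The function name `youden_index` and the counts are hypothetical and are not part of the dualPredictor API.

```python
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Return J = recall + specificity - 1 from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of actual positives that were detected
    specificity = tn / (tn + fp)   # share of actual negatives that were cleared
    return recall + specificity - 1


# Hypothetical counts: 40 of 50 positives detected, 90 of 100 negatives cleared.
j = youden_index(tp=40, fn=10, tn=90, fp=10)
print(round(j, 3))  # 0.7
```

Here recall is 0.8 and specificity is 0.9, so J is 0.7. A classifier that labels every sample the same way scores J = 0.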

2. F scores (Option F1, F2 in Package)

$$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

The F1 score is another measure of the overall performance of a binary classifier. It is calculated as the harmonic mean of precision and recall. A high F1 score indicates that the classifier achieves both high precision and high recall on the positive class; true negatives do not enter the calculation.

$$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$$

The F-score with factor beta generalizes the F1 score by allowing different weights on precision and recall. A beta value less than 1 weights the score toward precision, while a beta value greater than 1 weights it toward recall (the F2 option uses beta = 2).
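
To make the effect of beta concrete, here is a minimal sketch that evaluates the formula directly. The helper `f_beta` and the precision and recall values are hypothetical and chosen only for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Evaluate the F-beta formula; return 0.0 when the denominator is zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical classifier with higher recall than precision.
precision, recall = 0.5, 0.8
f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print(round(f1, 3), round(f2, 3))  # 0.615 0.714
```

Because recall exceeds precision in this example, moving from beta = 1 to beta = 2 raises the score: F2 rewards the strong recall more than F1 does.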

In educational settings

In educational settings, avoiding missed detections (i.e., failing to identify at-risk students) is important. However, it is also important to avoid false alarms (i.e., identifying students as at-risk when they are not). When missing an at-risk student is considered the costlier error, a recall-weighted measure is often desirable, such as the F-beta score with beta > 1 (for example, F2). Youden's J statistic and the F1 score both balance the avoidance of missed detections against the avoidance of false alarms. However, Youden's J statistic is less sensitive to false alarms than the F1 score, because it penalizes them through specificity rather than precision, and specificity typically changes less per false alarm than precision when not-at-risk students outnumber at-risk students.
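
The difference in sensitivity to false alarms can be seen in a small numeric sketch. The cohort sizes and counts below are made up for illustration, and the effect shown depends on the negative class being much larger than the positive class.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


# Hypothetical cohort: 20 at-risk students (positives), 180 not at risk (negatives).
# 16 at-risk students are flagged in both scenarios; only the false alarms change.
tp, negatives = 16, 180

for fp in (5, 20):
    tn = negatives - fp
    print(f"false alarms={fp:2d}  precision={precision(tp, fp):.3f}  "
          f"specificity={specificity(tn, fp):.3f}")

precision_drop = precision(tp, 5) - precision(tp, 20)
specificity_drop = specificity(negatives - 5, 5) - specificity(negatives - 20, 20)
```

Raising the false alarms from 5 to 20 lowers precision from 0.762 to 0.444, while specificity only falls from 0.972 to 0.889. Recall is unchanged, so the F1 score reacts more strongly to the extra false alarms than J does.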

Installation

Install dualPredictor directly from PyPI using pip:

pip install dualPredictor

Dependencies

dualPredictor requires:

  • numpy
  • scikit-learn
  • matplotlib
  • seaborn

DualModel

The DualModel class is a custom regressor that combines a base regression model (lasso, ridge, or OLS) with a dual classification approach. It allows for tuning an optimal cut-off value to classify samples into two classes based on the predicted regression values.

Parameters

  • model_type (str, default='lasso'): The base regression model to use. Supported options are 'lasso', 'ridge', and 'ols' (Ordinary Least Squares).

  • metric (str, default='youden_index'): The metric used to tune the optimal cut-off value. Supported options are 'f1_score', 'f2_score', and 'youden_index'.

  • default_cut_off (float, default=0.5): The default cut-off value used to create binary labels. Samples with regression values below the cut-off are labeled as 0, and samples above or equal to the cut-off are labeled as 1.
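
The cut-off tuning idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's implementation: the helpers `youden` and `tune_cut_off`, the candidate grid (the default cut-off plus every distinct predicted value), and the data are all hypothetical. It follows the labeling rule given above: values below the cut-off become 0, all others become 1.

```python
def youden(y_true, y_pred):
    """Youden index J = recall + specificity - 1 for 0/1 label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return recall + specificity - 1


def tune_cut_off(y, y_pred, default_cut_off):
    """Pick the cut-off on the predicted values that maximizes J."""
    # True labels come from the default cut-off applied to the real targets.
    y_label_true = [int(v >= default_cut_off) for v in y]
    best_cut, best_score = default_cut_off, float("-inf")
    # Assumed candidate grid: the default plus every distinct predicted value.
    for cut in sorted(set(y_pred) | {default_cut_off}):
        labels = [int(v >= cut) for v in y_pred]
        score = youden(y_label_true, labels)
        if score > best_score:  # strict '>' keeps the smallest cut-off on ties
            best_cut, best_score = cut, score
    return best_cut, best_score


# Hypothetical true targets and regression outputs (predictions shrink toward the mean).
y = [1.0, 1.5, 2.0, 2.4, 2.6, 3.0, 3.5, 4.0]
y_pred = [2.0, 2.2, 2.3, 2.6, 2.8, 2.9, 3.2, 3.4]
best_cut, best_j = tune_cut_off(y, y_pred, default_cut_off=2.5)
print(best_cut, best_j)  # 2.8 1.0
```

In this toy data, applying the default cut-off of 2.5 directly to the predictions mislabels the sample predicted at 2.6 (J = 0.75), whereas the tuned cut-off of 2.8 separates the two classes perfectly (J = 1.0). This is why a tuned cut-off can differ from the default one.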

Methods

  • fit(X, y): Fit the DualModel to the training data.

    • Parameters:

      • X (array-like of shape (n_samples, n_features)): The input training data.
      • y (array-like of shape (n_samples,)): The target values.
    • Returns:

      • self: Fitted DualModel instance.
  • predict(X): Predict the input data's regression values and binary classifications.

    • Parameters:

      • X (array-like of shape (n_samples, n_features)): The input data for prediction.
    • Returns:

      • grade_predictions (array-like of shape (n_samples,)): The predicted regression values.
      • class_predictions (array-like of shape (n_samples,)): The predicted binary classifications based on the optimal cut-off.

Attributes

  • alpha_: The alpha (regularization strength) of the model. Available only when the model is Lasso or Ridge; OLS has no alpha.
  • coef_: The coefficients of the model.
  • intercept_: The intercept of the model.
  • feature_names_in_: The names of the features used to train the model.
  • optimal_cut_off: The optimal cut-off value determined by the specified metric.
  • y_label_true_: The true binary labels generated by applying the default cut-off value to the target values.

Example

# Import the DualModel class
from dual_model import DualModel

# Initializing and fitting the DualModel
# X (features) and y (continuous target, e.g., grades) are assumed to be loaded already
# 'ols' selects Ordinary Least Squares; default_cut_off sets the threshold that defines the true binary labels
# The metric parameter specifies the method to tune the optimal cut-off
dual_clf = DualModel(model_type='ols', metric='youden_index', default_cut_off=1)
dual_clf.fit(X, y)

# Accessing the true binary labels generated based on the default cut-off
y_label_true = dual_clf.y_label_true_

# Retrieving the optimal cut-off value tuned based on the Youden Index
optimal_cut_off = dual_clf.optimal_cut_off

# Predicting grades (y_pred) and binary classification (at-risk or not) based on the optimal cut-off (y_label_pred)
y_pred, y_label_pred = dual_clf.predict(X)

Examples of Model Performance Plots

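# Visualizations
# plot_scatter, plot_cm, and plot_feature_coefficients are assumed to be imported from the package already
# Plotting the actual vs. predicted values to assess regression performance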
scatter_plot_fig = plot_scatter(y_pred, y)

# Plotting the confusion matrix to evaluate binary classification performance
cm_plot = plot_cm(y_label_true, y_label_pred)

# Plotting the non-zero coefficients of the regression model to interpret feature importance
feature_plot = plot_feature_coefficients(coef=dual_clf.coef_, feature_names=dual_clf.feature_names_in_)
