A Python package for simultaneous regression and binary classification for educational analytics.
dualPredictor: An Open-Source Tool for Simultaneous Grade Prediction and At-Risk Student Classification
by Dong, Cheng, and Kan
1. Introduction
The dualPredictor tool combines regression analysis with binary classification to forecast student academic outcomes and identify at-risk students. This user guide provides a step-by-step walkthrough on how to install and use the dualPredictor package. The figure below illustrates the mechanism of how dualPredictor generates dual output (regression and classification) by combining a regressor and a metric.
1.1 How does dualPredictor provide dual prediction output?
- Output 1 = Grade prediction: from the trained regressor (e.g., Lasso)
- Optimal cut-off: The default cut-off is the ground-truth criterion for identifying at-risk students (for example, an actual grade below 2.5). The optimal cut-off is a tuned value, applied to the predicted grades, that maximizes the chosen metric (e.g., Youden Index) for a given regressor and its default cut-off value.
- Output 2 = Binary label prediction:
- if predicted grade < optimal cut-off: label = 1
- if predicted grade >= optimal cut-off: label = 0
Fig 1: How does dualPredictor provide dual prediction output?
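The cut-off tuning described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the helper names (`label_at_risk`, `youden_index`, `tune_cut_off`), the toy grades, and the choice of searching over the distinct predicted grades are all assumptions made for this example.

```python
def label_at_risk(grades, cut_off):
    """Return 1 (at-risk) where grade < cut_off, else 0."""
    return [1 if g < cut_off else 0 for g in grades]


def youden_index(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1, with 1 = at-risk."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


def tune_cut_off(actual, predicted, default_cut_off):
    """Search for the cut-off on predicted grades that maximizes Youden's J."""
    # Ground-truth labels come from the actual grades and the default cut-off.
    y_true = label_at_risk(actual, default_cut_off)
    best_cut = default_cut_off
    best_score = youden_index(y_true, label_at_risk(predicted, default_cut_off))
    # Assumed search grid: every distinct predicted grade.
    for candidate in sorted(set(predicted)):
        score = youden_index(y_true, label_at_risk(predicted, candidate))
        if score > best_score:
            best_cut, best_score = candidate, score
    return best_cut, best_score


# Hypothetical data: predictions are shrunk toward the mean, so the default
# cut-off of 2.5 would flag nobody when applied to the predicted grades.
actual = [2.0, 2.3, 2.8, 3.0, 3.5, 3.8]
predicted = [2.6, 2.7, 2.9, 3.0, 3.3, 3.4]

best_cut, best_score = tune_cut_off(actual, predicted, default_cut_off=2.5)
labels = label_at_risk(predicted, best_cut)
print(best_cut, best_score, labels)  # 2.9 1.0 [1, 1, 0, 0, 0, 0]
```

In this toy case the tuned cut-off (2.9) sits above the default (2.5) because the regressor's predictions are compressed toward the mean, which is the situation the tuning step is meant to correct.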
1.2 How does dualPredictor provide model explanations?
- Global level model explanations: Model's feature coefficients plot
- Local level model explanations: Model's feature contribution for a specific data point
- How to get the feature contribution?
Given a linear model with a total number of M features, the model can be represented as:
f(x) = \sum_{j=1}^{M} w_j x_j + b
The contribution of the j-th feature to the prediction for the i-th data point x^{(i)} can be approximated by:
\phi_j(f, x^{(i)}) = w_j (x_j^{(i)} - E[x_j])
where E[x_j] is the expected value of the j-th feature. This formula can be seen as a simple approximation of the Shapley value; see page 6 of Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
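As a worked illustration of the contribution formula, the sketch below computes the per-feature contributions for one data point of a toy linear model. The weights, intercept, data, and function names are hypothetical, and E[x_j] is assumed here to be estimated by the column mean of the data.

```python
def predict_linear(weights, intercept, x):
    """f(x) = sum_j w_j * x_j + b for a single data point x."""
    return sum(w * xj for w, xj in zip(weights, x)) + intercept


def feature_contributions(weights, x, feature_means):
    """Contribution of each feature j: w_j * (x_j - E[x_j])."""
    return [w * (xj - mu) for w, xj, mu in zip(weights, x, feature_means)]


# Hypothetical linear model with M = 2 features and three data points.
weights = [0.5, 0.1]
intercept = 0.2
X = [[3.0, 10.0], [2.0, 14.0], [4.0, 12.0]]

# Assumption: E[x_j] is estimated by the mean of column j.
feature_means = [sum(col) / len(col) for col in zip(*X)]

x_i = X[1]
phi = feature_contributions(weights, x_i, feature_means)

# For a linear model the contributions sum to f(x_i) minus the prediction
# at the feature means, which serves as the baseline.
baseline = predict_linear(weights, intercept, feature_means)
gap = predict_linear(weights, intercept, x_i) - baseline
print(phi, sum(phi), gap)
```

The additivity check in the last lines is what makes these values useful as a local explanation: each contribution says how far one feature moves this student's predicted grade away from the baseline prediction.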
2. Motivation
The motivation behind dualPredictor is to make complex models as simple as possible for all users, regardless of their coding experience. The package uses the same syntax as the popular scikit-learn models, so users with scikit-learn experience can start using dualPredictor quickly. The model attributes and methods (model.fit(X, y) and model.predict(X)) are intentionally designed to mimic the scikit-learn model object, providing a familiar experience.
from dualPredictor import DualModel
# initialize the model and specify its parameters
model = DualModel(model_type='lasso', metric='f1_score', default_cut_off=2.5)
Table 1: Model methods and attributes (same style as sklearn model object)
Model Methods | Description |
---|---|
fit(X, y) | X: The input training data (pandas DataFrame). y: The target values (the grades to be predicted). Returns: the fitted DualModel instance. |
predict(X) | X: The input data (pandas DataFrame). Returns: the predicted grades and the predicted at-risk labels. |
Model Attributes | Description |
---|---|
alpha_ | The penalization strength in Lasso and ridge; for OLS, alpha = 0 |
coef_ | The coefficients of the model |
intercept_ | The intercept value of the model |
feature_names_in_ | Names of the features seen during model training |
optimal_cut_off | The optimal cut-off value that maximizes the metric |
3. Installation
You can install the dualPredictor package via PyPI or GitHub. Choose one of the following methods:
PyPI Installation
pip install dualPredictor
GitHub Installation (Recommended; Latest Version)
pip install git+https://github.com/098765d/dualPredictor.git
4. User Guide with Examples of Code
Step 1. Import the Package: Import the dualPredictor package in your Python environment.
from dualPredictor import DualModel, model_plot
Step 2. Model Initialization: Create a DualModel instance by specifying the regression model type ('lasso', 'ridge', or 'ols'), the metric for cutoff tuning ('f1_score', 'f2_score', or 'youden_index'), and a default cutoff value.
model = DualModel(model_type='lasso', metric='youden_index', default_cut_off=2.5)
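To make the three metric options concrete, the following sketch computes F1, F2, and the Youden Index from confusion-matrix counts using their standard definitions. The function names and the toy labels are hypothetical; how the package computes these metrics internally is not shown in this guide.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = at-risk as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f_beta(tp, fp, fn, beta):
    """F-beta score; beta = 1 gives F1, beta = 2 weights recall more heavily."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def youden_index(tp, fp, tn, fn):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


# Hypothetical labels: 3 at-risk students, 5 not at-risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
f1 = f_beta(tp, fp, fn, beta=1)
f2 = f_beta(tp, fp, fn, beta=2)
j = youden_index(tp, fp, tn, fn)
print(round(f1, 3), round(f2, 3), round(j, 3))
```

F2 penalizes missed at-risk students (false negatives) more than false alarms, while the Youden Index balances sensitivity and specificity, so the choice of metric shifts where the tuned cut-off lands.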
Step 3. Model Fitting: Fit the model to your dataset using the fit method.
model.fit(X_train, y_train)
- X: The input training data (pandas DataFrame).
- y: The target values (the grades to be predicted).
Step 4. Predictions: Use the predict method to generate grade predictions and at-risk classifications.
# example for demo only: the model returns two outputs
y_train_pred, y_train_label_pred = model.predict(X_train)
# example of 1st model output = predicted scores (regression result)
y_train_pred
array([3.11893389, 3.06013236, 3.05418893, 3.09776197, 3.14898782,
2.37679417, 2.99367804, 2.77202421, 2.9603209 , 3.01052573,
2.99974477, 3.11286716, 3.14708887, 2.78737598, 2.88134869,
3.07517748, 3.17370297, 3.26615469, 3.2328493 , 2.98423656,
3.02108518, 2.87746064, 3.03491596, 2.89875586, 3.11079315,
3.23177653, 3.34291929, 2.57402463, 3.27019917, 3.20073168,
2.94514418, 3.25307175, 3.19145494, 3.15909904, 3.01481681,
3.07551728, 2.70973767, 3.07226583, 3.04692613, 2.8883649 ,
2.63833457, 3.03978663, 3.20974038, 3.13091091, 3.42223703,
3.07012029, 3.01981077, 3.22368756, 2.69376153, 2.93594929,
2.91493381, 3.22273808, 2.59310411, 3.00767959, 3.21869359,
2.86065334, 3.16865551, 3.11258742, 2.87948289, 2.64564212,
2.88646595, 3.48716006, 3.14482003, 3.15513751, 3.05299286,
3.20858237, 2.63172024, 2.42824269, 2.88352738, 3.0479989 ,
2.82405611, 3.16516577, 2.94324523, 3.4453079 , 2.48497569,
3.00081754, 3.04180887, 3.32979373, 3.12686642, 2.90359338,
2.95509896, 2.96429385, 3.44471154, 3.20251564, 3.08765075,
2.5607482 , 3.23986551, 3.19644891, 3.16032825, 2.68092384,
3.04907167, 2.8159268 , 3.05030088, 3.178372 ])
# example of 2nd model output = predicted at-risk status (binary label)
y_train_label_pred
array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0])
- y_train_pred: Predicted grades (regression result).
- y_train_label_pred: Predicted at-risk status (binary label).
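The two outputs can be checked against known outcomes separately. The sketch below assumes that the true at-risk labels are derived from the actual grades with the default cut-off, as described in Section 1.1; all values and variable names are hypothetical stand-ins for real model output.

```python
import math

default_cut_off = 2.5

# Hypothetical actual grades and the two model outputs for five students.
y_true = [2.1, 3.4, 2.4, 3.0, 2.9]
y_pred = [2.6, 3.2, 2.5, 3.1, 2.8]   # output 1: predicted grades
y_label_pred = [1, 0, 1, 0, 0]       # output 2: predicted at-risk labels

# Ground-truth labels: at-risk (1) when the actual grade is below the default cut-off.
y_label_true = [1 if g < default_cut_off else 0 for g in y_true]

# Regression quality: root mean squared error of the predicted grades.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Classification quality: 2x2 confusion matrix, rows = true label, columns = predicted label.
cm = [[0, 0], [0, 0]]
for t, p in zip(y_label_true, y_label_pred):
    cm[t][p] += 1

print(y_label_true, round(rmse, 3), cm)
```

Labels built this way are the kind of true-label input a confusion-matrix plot needs alongside the predicted labels.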
Step 5. Visualization (optional): Visualize the model's performance using the model_plot module.
# Scatter plot for regression analysis - a
model_plot.plot_scatter(y_pred, y_true)
# Confusion matrix for binary classification - b
model_plot.plot_cm(y_label_true, y_label_pred)
# Model's global explanation: Feature importance plot - c
model_plot.plot_feature_coefficients(coef=model.coef_, feature_names=model.feature_names_in_)
# Model's local explanation: Feature contributions for each data point - d
# 'idx' is the index value used to locate a specific row in the dataframe
# called through model_plot, since Step 1 imports only DualModel and model_plot
model_plot.plot_local_shap(X=X_test, model=model, idx='E115CCCD')
Fig 2: Sample plots by the model_plot modules
References
[1] Fluss, R., Faraggi, D., & Reiser, B. (2005). Estimation of the Youden Index and its associated cutoff point. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 47(4), 458-472.
[2] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
[3] Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
[4] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.
[5] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288.