A Python package for simultaneous regression and binary classification for educational analytics.

Empowering Educators with An Open-Source Tool for Simultaneous Grade Prediction and At-risk Student Identification

by D

PyPI Link: https://pypi.org/project/dualPredictor/

Github Repo: https://github.com/098765d/dualPredictor/

1. Introduction

The dualPredictor package combines regression analysis with binary classification to forecast student academic outcomes. It is designed to simplify the use of these algorithms: users can train a model, make predictions, and visualize results with one line of code each. This makes the approach accessible to educators with varying levels of IT expertise. The package can be installed from either GitHub or PyPI.

The accompanying figure (Fig 1) illustrates how dualPredictor generates dual output—regression and classification—by combining a regressor and a metric.

Fig 1: How does dualPredictor provide dual prediction output?

How does the model generate both regression and binary classification results simultaneously?

  • Step 1: Grade Prediction Using the Trained Regressor (Fig 1, Step 1)

    Fit the linear model f(x) on the training data. The fitted model then generates the grade prediction, where x_j is the j-th of M input features, w_j is its coefficient, and b is the intercept:

        y\_pred = f(x) = \sum_{j=1}^{M} w_j x_j + b
    
  • Step 2: Determining the Optimal Cut-off (Fig 1, Step 2)

    The goal is to find the cut-off (c) that maximizes binary classification performance. Three metrics are offered for measuring it: the Youden index, f1_score, and f2_score. The user first specifies the metric used for the model (e.g., Youden index). Denoting the metric function as g(y_true_label, y_pred_label), the optimal cut-off is:

    \text{optimal\_cut\_off} = \arg\max_c g(y_{\text{true\_label}}, y_{\text{pred\_label}}(c))
    

    This formula searches for the cut-off value that produces the highest value of the metric function g, where:

    • c: The tuned cut-off that determines y_pred_label
    • y_true_label: True label of the data point, based on the default cut-off (e.g., 1 for at-risk, 0 for normal)
    • y_pred_label: Predicted label of the data point, based on the tuned cut-off value
  • Step 3: Binary Label Prediction (Fig 1, Step 3)

    • y_pred_label = 1 (at-risk): if y_pred < optimal_cut_off
    • y_pred_label = 0 (normal): if y_pred >= optimal_cut_off
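
The three steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes ordinary least squares as the regressor, the Youden index as g, candidate cut-offs taken from the unique predicted values, and true labels defined as y < default_cut_off. The helper names (fit_linear, youden_index, tune_cut_off) and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Step 1: least-squares fit of y = X @ w + b; returns (w, b)."""
    A = np.column_stack([X, np.ones(len(X))])  # extra column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def youden_index(y_true_label, y_pred_label):
    """Youden index J = sensitivity + specificity - 1 (label 1 = at-risk)."""
    tp = np.sum((y_true_label == 1) & (y_pred_label == 1))
    fn = np.sum((y_true_label == 1) & (y_pred_label == 0))
    tn = np.sum((y_true_label == 0) & (y_pred_label == 0))
    fp = np.sum((y_true_label == 0) & (y_pred_label == 1))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0

def tune_cut_off(y_pred, y_true_label, metric=youden_index):
    """Step 2: return the candidate cut-off c that maximizes the metric."""
    candidates = np.unique(y_pred)  # assumption: search over the predicted values
    scores = [metric(y_true_label, (y_pred < c).astype(int)) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 + X @ np.array([0.4, -0.2, 0.1]) + rng.normal(scale=0.2, size=200)

default_cut_off = 2.5
y_true_label = (y < default_cut_off).astype(int)  # assumed labelling rule

w, b = fit_linear(X, y)                                # Step 1: fit the regressor
y_pred = X @ w + b                                     # Step 1: grade prediction
optimal_cut_off = tune_cut_off(y_pred, y_true_label)   # Step 2: tune the cut-off
y_pred_label = (y_pred < optimal_cut_off).astype(int)  # Step 3: binary label
```

Because the labelling produced by the default cut-off is reproduced by one of the candidates, the tuned cut-off scores at least as well as the default one on the training data.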

2. The Model Object (Parameters & Methods)

The dualPredictor package aims to simplify complex models for users of all coding levels. It adheres to the syntax of the scikit-learn library. The core part of the package is the model object called DualModel, which can be imported from the dualPredictor library.

Table 0: Model Parameters

Parameter | Description | Default Value
model_type | Type of regression model to use. Options: 'lasso' (Lasso regression), 'ridge' (Ridge regression), 'ols' (Ordinary Least Squares regression) | None
metric | Metric used for optimizing the cut-off value. Options: 'f1_score' (F1 score), 'f2_score' (F2 score), 'youden_index' (Youden's Index) | None
default_cut_off | Initial cut-off value used for binary classification | None
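
For reference, the three metric options can be written in terms of confusion-matrix counts. The sketch below uses the standard textbook definitions with label 1 (at-risk) as the positive class. The function names are hypothetical, and this is not necessarily how the package computes the metrics internally.

```python
import numpy as np

def confusion_counts(y_true_label, y_pred_label):
    """Return (tp, fp, fn, tn) with label 1 (at-risk) as the positive class."""
    t = np.asarray(y_true_label)
    p = np.asarray(y_pred_label)
    tp = int(np.sum((t == 1) & (p == 1)))
    fp = int(np.sum((t == 0) & (p == 1)))
    fn = int(np.sum((t == 1) & (p == 0)))
    tn = int(np.sum((t == 0) & (p == 0)))
    return tp, fp, fn, tn

def f_beta(y_true_label, y_pred_label, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 gives F2 (recall weighted more)."""
    tp, fp, fn, _ = confusion_counts(y_true_label, y_pred_label)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0

def youden_index(y_true_label, y_pred_label):
    """Youden index J = sensitivity + specificity - 1."""
    tp, fp, fn, tn = confusion_counts(y_true_label, y_pred_label)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0

# Made-up labels for illustration: 3 at-risk students, 5 normal students
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 1, 0, 0, 0]

f1 = f_beta(y_true, y_hat, beta=1.0)
f2 = f_beta(y_true, y_hat, beta=2.0)
j = youden_index(y_true, y_hat)
print(f"F1={f1:.4f}  F2={f2:.4f}  Youden={j:.4f}")
```

F2 weights recall more heavily than precision, so it penalizes missed at-risk students more than false alarms. The Youden index balances sensitivity and specificity equally.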

Table 1: Model methods (scikit-learn style)

Model Methods | Description
fit(X, y) | X: the input training data (pandas DataFrame). y: the target values (the grade to be predicted). Returns: the fitted DualModel instance.
predict(X) | X: the input data (pandas DataFrame). Returns: the predicted grades and the predicted binary labels.

Table 2: Model attributes (scikit-learn style)

Model Attributes | Description
alpha_ | The penalization strength in Lasso and Ridge (for OLS, alpha = 0)
coef_ | The coefficients of the model
intercept_ | The intercept value of the model
feature_names_in_ | Names of the features seen during model training
optimal_cut_off | The optimal cut-off value that maximizes the metric

Example Usage - fit the model with just one line of code

from dualPredictor import DualModel

# Initialize the model and specify the parameters
model = DualModel(model_type='lasso', metric='f1_score', default_cut_off=2.5)

# Using model methods for training and predicting
model.fit(X_train, y_train)
grade_predictions, class_predictions = model.predict(X_train)

# Accessing model attributes
print("Alpha (regularization strength):", model.alpha_)
print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)
print("Feature names:", model.feature_names_in_)
print("Optimal cut-off value:", model.optimal_cut_off)

3. User Guide

3.1 Dependencies Installation

dualPredictor requires the following libraries to be installed:

  • NumPy: A fundamental package for scientific computing with Python.
  • scikit-learn: Simple and efficient tools for predictive data analysis.
  • Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python.
  • Seaborn: A Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics.

You can install all the dependencies at once using the following command:

pip install numpy scikit-learn matplotlib seaborn

3.2 Package Installation

You can install the dualPredictor package from PyPI or from GitHub (recommended). Choose one of the following methods:

# Option 1: install from PyPI
pip install dualPredictor

# Option 2: install from GitHub (recommended)
pip install git+https://github.com/098765d/dualPredictor.git
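
To confirm that the installation succeeded, the installed version can be queried from Python's standard library. This is a small sketch; the helper name installed_version is hypothetical, and the distribution name is assumed to match the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Prints the version string if dualPredictor is installed, otherwise None
print(installed_version("dualPredictor"))
```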

3.3 Example Code

After installation, start with:

Step 1. Import the Package: Import the dualPredictor package into your Python environment.

from dualPredictor import DualModel, model_plot

Step 2. Model Initialization: Create a DualModel instance by specifying the regressor type ('lasso', 'ridge', or 'ols'), the metric for cutoff tuning ('f1_score', 'f2_score', or 'youden_index'), and a default cutoff value.

# model_type options: 'lasso', 'ridge', or 'ols'
# metric options: 'f1_score', 'f2_score', or 'youden_index'
model = DualModel(model_type='lasso', metric='youden_index', default_cut_off=2.5)

Step 3. Model Fitting: Fit the model to your dataset using the fit method.

model.fit(X_train, y_train)
  • X: The input training data (type: pandas DataFrame).
  • y: The target values (type: pandas Series).
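
The examples in this guide assume that X_train and y_train already exist. The sketch below shows one way to prepare them as a pandas DataFrame and Series. The column names, the synthetic values, and the 80/20 split are all assumptions made for illustration and do not come from the package.

```python
import numpy as np
import pandas as pd

# Synthetic student records for illustration only (hypothetical columns)
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "attendance_rate": rng.uniform(0.5, 1.0, n),
    "quiz_avg": rng.uniform(40, 100, n),
    "assignment_avg": rng.uniform(40, 100, n),
})
# Hypothetical target: a GPA-like grade clipped to the range [0, 4]
df["gpa"] = (
    0.5
    + 1.5 * df["attendance_rate"]
    + 0.01 * df["quiz_avg"]
    + 0.005 * df["assignment_avg"]
    + rng.normal(scale=0.2, size=n)
).clip(0, 4)

# Shuffle the rows, then hold out the last 20% as a test set
shuffled = df.sample(frac=1.0, random_state=0)
n_train = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

X_train, y_train = train.drop(columns="gpa"), train["gpa"]  # DataFrame, Series
X_test, y_test = test.drop(columns="gpa"), test["gpa"]
```

X_train and y_train can then be passed to model.fit(X_train, y_train), and X_test can be used to check predictions on unseen students.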

Step 4. Predictions: Use the model's predict method to generate grade predictions and at-risk classifications.

# example for demo only: the model's dual prediction output
y_train_pred, y_train_label_pred = model.predict(X_train)

# example of 1st model output = predicted scores (regression result)
y_train_pred
array([3.11893389, 3.06013236, 3.05418893, 3.09776197, 3.14898782,
     2.37679417, 2.99367804, 2.77202421, 2.9603209 , 3.01052573,
     2.99974477, 3.11286716, 3.14708887, 2.78737598, 2.88134869,
     3.07517748, 3.17370297, 3.26615469, 3.2328493 , 2.98423656,
     3.02108518, 2.87746064, 3.03491596, 2.89875586, 3.11079315,
     3.23177653, 3.34291929, 2.57402463, 3.27019917, 3.20073168,
     2.94514418, 3.25307175, 3.19145494, 3.15909904, 3.01481681,
     3.07551728, 2.70973767, 3.07226583, 3.04692613, 2.8883649 ,
     2.63833457, 3.03978663, 3.20974038, 3.13091091, 3.42223703,
     3.07012029, 3.01981077, 3.22368756, 2.69376153, 2.93594929,
     2.91493381, 3.22273808, 2.59310411, 3.00767959, 3.21869359,
     2.86065334, 3.16865551, 3.11258742, 2.87948289, 2.64564212,
     2.88646595, 3.48716006, 3.14482003, 3.15513751, 3.05299286,
     3.20858237, 2.63172024, 2.42824269, 2.88352738, 3.0479989 ,
     2.82405611, 3.16516577, 2.94324523, 3.4453079 , 2.48497569,
     3.00081754, 3.04180887, 3.32979373, 3.12686642, 2.90359338,
     2.95509896, 2.96429385, 3.44471154, 3.20251564, 3.08765075,
     2.5607482 , 3.23986551, 3.19644891, 3.16032825, 2.68092384,
     3.04907167, 2.8159268 , 3.05030088, 3.178372])

# example of 2nd model output = predicted at-risk status (binary label)
y_train_label_pred
array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
     0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
     1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
     0, 1, 0, 0, 0, 0])
  • y_train_pred: Predicted grades (regression result).
  • y_train_label_pred: Predicted at-risk status (binary label).
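
To evaluate the two outputs, the true at-risk labels are needed alongside the true grades. The sketch below assumes the true labels follow the same rule as the predicted ones (at-risk when the grade is below the cut-off) with the default cut-off of 2.5. The arrays and the tuned cut-off of 2.55 are made-up values for illustration, not output from the package.

```python
import numpy as np

default_cut_off = 2.5
optimal_cut_off = 2.55  # hypothetical tuned value

# Made-up true and predicted grades for six students
y_true = np.array([3.1, 2.2, 2.9, 2.4, 3.5, 2.6])
y_pred = np.array([3.0, 2.4, 2.8, 2.7, 3.3, 2.5])

# Labels: 1 = at-risk, 0 = normal
y_label_true = (y_true < default_cut_off).astype(int)
y_label_pred = (y_pred < optimal_cut_off).astype(int)

# Regression quality: root mean squared error of the predicted grades
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Classification quality: confusion-matrix counts and recall for at-risk students
tp = int(np.sum((y_label_true == 1) & (y_label_pred == 1)))
fn = int(np.sum((y_label_true == 1) & (y_label_pred == 0)))
fp = int(np.sum((y_label_true == 0) & (y_label_pred == 1)))
tn = int(np.sum((y_label_true == 0) & (y_label_pred == 0)))
recall = tp / (tp + fn) if tp + fn else 0.0

print(f"RMSE={rmse:.3f}  TP={tp} FN={fn} FP={fp} TN={tn}  recall={recall:.2f}")
```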

Step 5. Visualization: Visualize the model's performance with one line of code per plot.

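# y_pred and y_label_pred are the two outputs of model.predict;
# y_true and y_label_true are the actual grades and at-risk labels

# Scatter plot for regression analysis
model_plot.plot_scatter(y_pred, y_true)

# Confusion matrix for binary classification
model_plot.plot_cm(y_label_true, y_label_pred)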

# Model's global explanation: Feature importance plot
model_plot.plot_feature_coefficients(coef=model.coef_, feature_names=model.feature_names_in_)
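
For readers who want to see what a coefficient plot involves, the following is a rough sketch using plain Matplotlib. It is not the package's plot_feature_coefficients implementation. The function name plot_coefficients, the sorting by absolute value, and the example feature names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_coefficients(coef, feature_names):
    """Draw a horizontal bar chart of coefficients, largest magnitude on top."""
    coef = np.asarray(coef, dtype=float)
    order = np.argsort(np.abs(coef))  # ascending, so the largest bar is drawn last (top)
    names = [feature_names[i] for i in order]
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.barh(names, coef[order])
    ax.axvline(0, color="black", linewidth=0.8)  # reference line at zero
    ax.set_xlabel("Coefficient")
    ax.set_title("Feature coefficients")
    fig.tight_layout()
    return ax

# Hypothetical coefficients and feature names for illustration
ax = plot_coefficients([0.4, -0.2, 0.0], ["quiz_avg", "absences", "age"])
```

A coefficient of exactly zero, as Lasso can produce, shows up as an empty bar, which makes dropped features easy to spot. Calling ax.figure.savefig with a file name would write the chart to disk.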

Fig 2: Visualization Module Sample Outputs
