Coarse Approximation Linear Function with cross-validation
CalfCV
A binomial classifier that implements the Coarse Approximation Linear Function (CALF).
Contact
Rolf Carlson hrolfrc@gmail.com
Install
Use pip to install calfcv.
pip install calfcv
Introduction
This is a Python implementation of the Coarse Approximation Linear Function (CALF). The implementation is based on the greedy forward selection algorithm described in the paper referenced below.
Currently, CalfCV provides classification and prediction for two classes, the binomial case. Multinomial classification with more than two classes is not implemented.
The feature matrix is scaled to zero mean and unit variance. Cross-validation is used to identify the optimal score and coefficients. CalfCV is designed to be used with scikit-learn pipelines and composite estimators.
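The greedy forward selection idea can be sketched in a few lines of NumPy. This is a minimal, illustrative sketch of the coarse-weight approach, not the library's implementation; `greedy_coarse_fit` is a hypothetical helper written here for exposition. Each pass assigns one unused feature a weight of +1 or -1, keeping the single choice that most improves the AUC of the running linear combination, and stops when no addition helps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def greedy_coarse_fit(X, y):
    """Greedy forward selection with coarse weights in {-1, 0, +1}.

    Each pass tries giving one unused feature a weight of +1 or -1 and
    keeps the single choice that most improves the AUC of X @ w.
    Stops when no addition improves the score.
    """
    w = np.zeros(X.shape[1])
    best_auc = 0.5  # AUC of the empty (all-zero) model
    while True:
        best_trial = None
        for j in np.flatnonzero(w == 0):
            for sign in (-1.0, 1.0):
                trial = w.copy()
                trial[j] = sign
                auc = roc_auc_score(y, X @ trial)
                if auc > best_auc:
                    best_auc, best_trial = auc, trial
        if best_trial is None:
            return w, best_auc
        w = best_trial

X, y = make_classification(n_samples=30, n_features=5, n_informative=2,
                           n_redundant=2, random_state=42)
X = StandardScaler().fit_transform(X)  # zero mean, unit variance
w, auc = greedy_coarse_fit(X, y)
print("coefficients:", w, "AUC:", round(auc, 3))
```

Note how the coefficients are restricted to -1, 0, and 1, which is what makes the approximation "coarse".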
Example
from calfcv import CalfCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np
Make a classification problem
seed = 42
X, y = make_classification(
    n_samples=30,
    n_features=5,
    n_informative=2,
    n_redundant=2,
    n_classes=2,
    random_state=seed
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
Train the classifier
The best score is the best mean AUC over the cross-validation folds
cls = CalfCV().fit(X_train, y_train)
cls.best_score_
0.95
The coefficients for the best score
cls.best_coef_
[-1, 1, 0, 1, 1]
The probabilities of class 1 are in the right column
We vertically stack the ground truth on the top with the probabilities of 1 on the bottom. We show the first 5 entries.
np.round(np.vstack((y_train, cls.predict_proba(X_train).T))[:, 0:5], 2)
array([[0.  , 1.  , 1.  , 0.  , 0.  ],
       [0.71, 0.05, 0.19, 0.34, 0.54],
       [0.29, 0.95, 0.81, 0.66, 0.46]])
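The layout of that display can be reproduced with plain NumPy. The probabilities below are the illustrative class-1 values from the bottom row above; `predict_proba` returns one column per class, and the two columns of each row sum to 1.

```python
import numpy as np

# Illustrative probabilities of class 1 for five samples (from the example above).
p1 = np.array([0.29, 0.95, 0.81, 0.66, 0.46])

# predict_proba layout: column 0 is P(class 0), column 1 is P(class 1).
proba = np.column_stack((1 - p1, p1))

# Stack the ground truth on top of the transposed probabilities.
y = np.array([0, 1, 1, 0, 0])
stacked = np.round(np.vstack((y, proba.T)), 2)
print(stacked)
```

Row 0 is the ground truth, row 1 is P(class 0), and row 2 is P(class 1).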
Predicting on the training data should give a slightly higher score than best_score_
That is what we see here. The reason is that best_score_ is the mean AUC over the cross-validation folds, while the score below is computed on the full training set.
roc_auc_score(y_true=y_train, y_score=cls.predict_proba(X_train)[:, 1])
0.9750000000000001
The classifier has not seen the test data
Scores on unseen data are often lower; in this case, however, the test score is higher.
roc_auc_score(y_true=y_test, y_score=cls.predict_proba(X_test)[:, 1])
1.0
Predicting the classes produces a lower score than using the class probabilities
The ground truth is on the top and the predicted class is on the bottom. Sample 6 (zero-based) of y_test is predicted incorrectly; the others are correct.
y_pred = cls.predict(X_test)
np.vstack((y_test, y_pred))
array([[0, 1, 1, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 1, 0]])
The AUC computed from the hard class predictions is expected to be lower than the AUC computed from the class probabilities.
roc_auc_score(y_true=y_test, y_score=y_pred)
0.9
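The reason hard labels score lower is that thresholding discards the ranking information inside each predicted class. A small self-contained example, with hand-picked probabilities chosen to mirror the test set above (the values are illustrative, not output of CalfCV):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 0])
proba = np.array([0.2, 0.9, 0.8, 0.1, 0.7, 0.3, 0.6, 0.4])

# Thresholding at 0.5 turns sample 6 (proba 0.6) into a false positive.
hard = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)  # -> 1.0: every positive outranks every negative
auc_hard = roc_auc_score(y_true, hard)    # -> 0.9: ties at score 1 cost ranking information
print(auc_proba, auc_hard)
```

With the probabilities, every positive sample still outranks every negative one, so the AUC is perfect even though a 0.5 threshold misclassifies one sample.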
References
Jeffries, C.D., Ford, J.R., Tilson, J.L. et al. A greedy regression algorithm with coarse weights offers novel advantages. Sci Rep 12, 5440 (2022). https://doi.org/10.1038/s41598-022-09415-2