A Python package for single-target and multi-target regression tasks.
Scikit-physlearn is a Python package for single-target and multi-target regression. It is designed to amalgamate Scikit-learn, LightGBM, XGBoost, CatBoost, and Mlxtend regressors into a unified Regressor, which:
Follows the Scikit-learn API.
Represents data in pandas.
Supports base boosting.
The repository was started by Alex Wozniakowski during his graduate studies at Nanyang Technological University.
Installation
Scikit-physlearn can be installed from PyPI:
pip install scikit-physlearn
Quick Start
A multi-target regression example:
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
from physlearn import Regressor
# Load an example dataset from Sklearn
bunch = load_linnerud(as_frame=True) # returns a Bunch instance
X, y = bunch['data'], bunch['target']
# Split the data in a supervised fashion
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42)
# Select a regressor, e.g., LGBMRegressor from LightGBM, with a case-insensitive string.
reg = Regressor(regressor_choice='lgbmregressor', cv=5, n_jobs=-1,
                scoring='neg_mean_absolute_error')
# Automatically build the pipeline with final estimator MultiOutputRegressor
# from Sklearn, then exhaustively search over the (hyper)parameters.
search_params = dict(boosting_type=['gbdt', 'goss'],
                     n_estimators=[6, 8, 10, 20])
reg.search(X_train, y_train, search_params=search_params,
           search_method='gridsearchcv')
# Generate predictions with the refit regressors, then
# compute the average mean absolute error.
y_pred = reg.fit(X_train, y_train).predict(X_test)
score = reg.score(y_test, y_pred)
print(score['mae'].mean().round(decimals=2))
Example output:
8.04
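To make the reported number concrete, the score aggregates as the mean absolute error per single-target subtask, then the mean across targets, mirroring `score['mae'].mean()`. A minimal pure-Python illustration (the numbers below are made up, not the Linnerud data):

```python
# Hypothetical per-target ground truth and predictions (illustrative numbers only).
y_true = {"Weight": [180.0, 160.0], "Waist": [36.0, 34.0]}
y_pred = {"Weight": [175.0, 165.0], "Waist": [35.0, 35.0]}

def mae(truth, pred):
    """Mean absolute error for a single target."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

# MAE per single-target subtask, then the average across targets.
per_target = {name: mae(y_true[name], y_pred[name]) for name in y_true}
average_mae = sum(per_target.values()) / len(per_target)
print(per_target, average_mae)
```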
A SHAP visualization example of a single-target regression subtask:
from physlearn.datasets import load_benchmark
from physlearn.supervised import ShapInterpret
# Load the training data from a quantum device calibration application.
X_train, _, y_train, _ = load_benchmark(return_split=True)
# Pick single-target regression subtask 2, i.e., index 1 with Python's zero-based indexing.
index = 1
# Select a regressor, e.g., RidgeCV from Sklearn.
interpret = ShapInterpret(regressor_choice='ridgecv', target_index=index)
# Generate a SHAP force plot, and visualize the subtask predictions.
interpret.force_plot(X_train, y_train)
Example output (this plot is interactive in a notebook):
For additional examples, check out the basics directory.
Base boosting
Inspired by the process of human research, wherein scientific progress builds on prior scientific knowledge, base boosting is a modification of standard gradient boosting designed to emulate the paradigm of "standing on the shoulders of giants": it initializes the additive expansion with an existing base regressor's predictions rather than a constant.
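The idea admits a compact sketch in plain Python (illustrative only, not the package's implementation; the mean-valued basis function is a hypothetical stand-in for, e.g., a stacking regressor): start the additive expansion from the base regressor's predictions and correct their residuals stage by stage.

```python
def base_boost(base_pred, y, fit_basis, n_stages=1, learning_rate=1.0):
    """Additively refine a base regressor's predictions toward the targets."""
    pred = list(base_pred)
    for _ in range(n_stages):
        # Squared-error pseudo-residuals: targets minus the current expansion.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_basis(residuals)  # fit a basis function to the pseudo-residuals
        pred = [pi + learning_rate * hi for pi, hi in zip(pred, h)]
    return pred

def mean_basis(residuals):
    """Toy basis function: predict the mean residual everywhere."""
    m = sum(residuals) / len(residuals)
    return [m] * len(residuals)

base = [1.0, 2.0, 3.0]    # base regressor's initial predictions
target = [1.5, 2.5, 3.5]  # observations the expansion should approach
refined = base_boost(base, target, mean_basis)
print(refined)  # each prediction shifted by the mean residual, 0.5
```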
To get started with base boosting, consider the following example, which compares non-nested and nested cross-validation in a quantum device calibration application with a limited supply of experimental data:
from physlearn import Regressor
from physlearn.datasets import load_benchmark, paper_params
from physlearn.supervised import plot_cv_comparison
# Number of random trials.
n_trials = 30
# Number of withheld folds in k-fold cross-validation.
n_splits = 5
# Load the training data from a quantum device calibration application, wherein
# X_train denotes the base regressor's initial predictions and y_train denotes
# the multi-target experimental observations, i.e., the eigenenergies.
X_train, _, y_train, _ = load_benchmark(return_split=True)
# Select a basis function, e.g., StackingRegressor from Sklearn with first
# layer regressors: Ridge and RandomForestRegressor from Sklearn and final
# layer regressor: KNeighborsRegressor from Sklearn.
basis_fn = 'stackingregressor'
stack = dict(regressors=['ridge', 'randomforestregressor'],
             final_regressor='kneighborsregressor')
# Number of basis functions in the noise term of the additive expansion.
n_regressors = 1
# Choice of squared error loss function for the pseudo-residual computation.
boosting_loss = 'ls'
# Choice of parameters for the line search computation.
line_search_regularization = 0.1
line_search_options = dict(init_guess=1, opt_method='minimize',
                           alg='Nelder-Mead', tol=1e-7,
                           options={"maxiter": 10000},
                           niter=None, T=None,
                           loss='lad')
# (Hyper)parameters to exhaustively search over, namely the regularization strength
# in ridge regression and the number of neighbors in k-nearest neighbors.
search_params = {'0__alpha': [0.5, 1.0, 1.5],
                 'final_estimator__n_neighbors': [3, 5, 10]}
# Choose single-target regression subtask 5, i.e., index 4 with Python's zero-based indexing.
index = 4
# Make an instance of Regressor.
reg = Regressor(regressor_choice=basis_fn, stacking_layer=stack,
                scoring='neg_mean_absolute_error', target_index=index,
                n_regressors=n_regressors, boosting_loss=boosting_loss,
                line_search_regularization=line_search_regularization,
                line_search_options=line_search_options)
# Obtain the non-nested and the nested cross-validation scores.
non_nested_scores, nested_scores = reg.nested_cross_validate(X=X_train, y=y_train,
                                                             search_params=search_params,
                                                             n_splits=n_splits,
                                                             search_method='gridsearchcv',
                                                             n_trials=n_trials)
# Illustrate the difference between the scores.
plot_cv_comparison(non_nested_scores=non_nested_scores, nested_scores=nested_scores,
                   n_trials=n_trials)
Example output:
Average difference of -0.011309 with standard deviation of 0.013053.
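As an aside on the line search configured above: conceptually, it selects the coefficient of the newest basis function by minimizing a regularized loss over the training data (here least absolute deviations, matching loss='lad' and line_search_regularization=0.1). A rough pure-Python sketch of that step, substituting a simple grid scan for Nelder-Mead (illustrative only, not the package's implementation):

```python
def line_search(y, pred, h, regularization=0.1, grid=None):
    """Pick the expansion coefficient minimizing a regularized LAD loss."""
    if grid is None:
        grid = [i / 100 for i in range(201)]  # candidate coefficients in [0, 2]
    def loss(alpha):
        # Mean absolute deviation of the updated expansion, plus a penalty
        # discouraging large coefficients.
        lad = sum(abs(yi - (pi + alpha * hi))
                  for yi, pi, hi in zip(y, pred, h)) / len(y)
        return lad + regularization * abs(alpha)
    return min(grid, key=loss)

y = [1.0, 2.0, 3.0]
pred = [0.5, 1.5, 2.5]  # current additive expansion
h = [1.0, 1.0, 1.0]     # basis function fit to the pseudo-residuals
alpha = line_search(y, pred, h)
print(alpha)  # the residual is 0.5 everywhere, so the best coefficient is 0.5
```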
For additional examples, check out the paper results directory:
Generate an augmented learning curve.
Establish a proxy of expert human-level performance on the calibration benchmark task with the base regressor.
Boost the initial predictions, generated by the base regressor, and evaluate the test error of the returned regressor.
Examine the utility of the base regressor, as a data preprocessor, with a SHAP summary plot.
Citation
If you use this package, please consider adding the corresponding citation:
@article{wozniakowski_2020_boosting,
  title={Boosting on the shoulders of giants in quantum device calibration},
  author={Wozniakowski, Alex and Thompson, Jayne and Gu, Mile and Binder, Felix},
  journal={arXiv preprint arXiv:2005.06194},
  year={2020}
}