A Python package for Gaussian Process Regression with hyperparameter optimization using Hyperopt and cross-validation, focusing on optimizing cross-validated loss.
Bayesian GP CVLoss: Gaussian Process Regression with Cross-Validated Hyperparameter Optimization
bayesian_gp_cvloss is a Python package designed to simplify the process of training Gaussian Process (GP) models by finding optimal hyperparameters through Bayesian optimization (using Hyperopt) with k-fold cross-validation. The key feature of this package is its direct optimization of the cross-validated Root Mean Squared Error (RMSE), aligning the hyperparameter tuning process closely with the model's predictive performance.
This package is particularly useful for researchers and practitioners who want to apply GP models without manually tuning hyperparameters or relying solely on maximizing marginal likelihood, offering a more direct approach to achieving good generalization on unseen data.
Core Idea
The traditional approach to training GP models often involves maximizing the log marginal likelihood of the model parameters. While effective, this doesn't always directly translate to the best predictive performance on unseen data, especially when the model assumptions are not perfectly met or when working with smaller datasets.
This library implements an alternative strategy:
- Define a search space for the GP kernel parameters (e.g., length scales, kernel variance) and likelihood parameters (e.g., noise variance).
- Use Bayesian optimization (Hyperopt) to intelligently search this space.
- For each set of hyperparameters evaluated by Hyperopt, perform k-fold cross-validation on the training data.
- The objective function for Hyperopt is the mean RMSE across these k folds.
- The set of hyperparameters yielding the minimum average cross-validated RMSE is selected as optimal.
- A final GP model is then refitted on the entire training dataset using these best-found hyperparameters.
This method directly targets the minimization of prediction error, which can be a more robust approach for many real-world regression tasks.
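The strategy above can be sketched with stock scikit-learn pieces. Note this is an illustrative toy, not the package's implementation: scikit-learn's `GaussianProcessRegressor` stands in for GPflow, and a plain grid search stands in for Hyperopt's TPE, but the objective (mean k-fold validation RMSE with per-fold target centering) is the same idea:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(size=(80, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=80)

def cv_rmse(length_scale, noise, n_splits=5):
    """Mean validation RMSE of a fixed-hyperparameter GP over k folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    rmses = []
    for train_idx, val_idx in kf.split(X):
        y_tr = y[train_idx]
        y_mean = y_tr.mean()                      # per-fold target centering
        kernel = RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise)
        gp = GaussianProcessRegressor(kernel=kernel, optimizer=None)
        gp.fit(X[train_idx], y_tr - y_mean)
        pred = gp.predict(X[val_idx]) + y_mean    # undo centering
        rmses.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return float(np.mean(rmses))

# A small grid stands in for Hyperopt's TPE search of the space
candidates = [(ls, nv) for ls in (0.1, 0.3, 1.0) for nv in (1e-3, 1e-2, 1e-1)]
best = min(candidates, key=lambda p: cv_rmse(*p))
print("best (length_scale, noise):", best)
```

The winning hyperparameters would then be used to refit a final model on all of the training data, which is what this package automates with GPflow and Hyperopt.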
Features
- Automated hyperparameter optimization for GP models using Hyperopt.
- Cross-validation (k-fold) integrated into the optimization loop to find parameters that generalize well.
- Directly optimizes for mean cross-validated RMSE.
- Supports various GPflow kernels (RBF, Matern32, Matern52, and RationalQuadratic by default; easily extensible).
- Data-dependent default hyperparameter search space generation based on the target variable's statistics.
- Handles mean centering of the target variable internally for potentially improved stability.
- Simple API: provide your preprocessed numerical `X_train` and `y_train` data.
Installation
```
pip install bayesian-gp-cvloss
```
Alternatively, to install the latest version directly from the source (e.g., for development):
```
git clone https://github.com/Shifa-Zhong/bayesian-gp-cvloss.git
cd bayesian-gp-cvloss
pip install .
```
Dependencies
- gpflow >= 2.0.0
- hyperopt >= 0.2.0
- scikit-learn >= 0.23.0
- pandas >= 1.0.0
- numpy >= 1.18.0
Users are responsible for their own data preprocessing (e.g., encoding categorical features, feature scaling) before using this library. The optimizer expects purely numerical `X_train` and `y_train` inputs.
Quick Start
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from bayesian_gp_cvloss import GPCrossValidatedOptimizer

# 0. (User Responsibility) Load and Preprocess Data
# Ensure X is purely numerical. All encoding and scaling is up to the user.
# Create some synthetic data for demonstration.
np.random.seed(42)
N_train = 100
N_features = 3
X_synth = np.random.rand(N_train, N_features)
y_synth = np.sin(X_synth[:, 0] * 2 * np.pi) + X_synth[:, 1]**2 + np.random.randn(N_train) * 0.1
X_df = pd.DataFrame(X_synth, columns=[f'feature_{i}' for i in range(N_features)])
y_series = pd.Series(y_synth, name='target')

# Split data
X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(
    X_df, y_series, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_data)
X_test_scaled = scaler.transform(X_test_data)
y_train_np = y_train_data.values

# 1. Initialize the Optimizer
# Pass preprocessed X_train and y_train directly to the constructor.
# A data-dependent default hyperparameter search space is generated automatically.
optimizer = GPCrossValidatedOptimizer(
    X_train=X_train_scaled,
    y_train=y_train_np,
    n_splits=5,        # Number of CV folds
    random_state=42    # For reproducibility
)

# 2. Run Optimization
# This finds the best hyperparameters based on cross-validated RMSE
# and automatically refits a final model on the full training data.
best_params = optimizer.optimize(max_evals=50)
print(f"Best hyperparameters found: {best_params}")

# Access the best trial's CV RMSE from the trials object
trials = optimizer.get_optimization_results()
if trials.best_trial:
    print(f"Best CV RMSE: {trials.best_trial['result']['loss']:.4f}")
    print(f"Best CV Train RMSE: {trials.best_trial['result']['train_loss']:.4f}")

# 3. Make Predictions
# The predict method uses the refitted model and returns predictions
# on the original (uncentered) scale.
y_pred_test, y_pred_var_test = optimizer.predict(X_test_scaled)

# 4. Evaluate
from sklearn.metrics import mean_squared_error
rmse_test = np.sqrt(mean_squared_error(y_test_data.values, y_pred_test))
print(f"Test RMSE: {rmse_test:.4f}")
```
How it Works Internally
- `__init__(X_train, y_train, hyperopt_space=None, n_splits=5, random_state=None)`: Stores the preprocessed training data, computes `y_train_mean_` for internal centering, and generates a data-dependent default hyperparameter search space if `hyperopt_space` is not provided.
- `optimize(max_evals=100, tpe_algo=tpe.suggest, early_stop_fn=None, rstate_seed=None)`:
  - Initializes `hyperopt.Trials()`.
  - Runs `hyperopt.fmin()` with the `_objective` function, the defined search space, the `tpe.suggest` algorithm, and `max_evals`.
  - Stores the best parameters in `self.best_params`.
  - Calls `refit_best_model()` to train a final GPR model on the full training data using `self.best_params`.
  - Returns `self.best_params`.
- `_objective(params)`:
  - This is the function minimized by Hyperopt.
  - It takes a dictionary of `params` (hyperparameters for a single trial).
  - Performs k-fold cross-validation:
    - For each fold, splits `X_train`, `y_train` into training and validation subsets.
    - Important: the target variable in each fold is centered by subtracting the mean of the current fold's training target.
    - Constructs a GPflow GPR model using the hyperparameters from `params` and the current fold's training data.
    - Predicts on the validation fold and calculates RMSE.
  - Averages the RMSEs from all validation folds.
  - Returns a dictionary including `{'loss': avg_val_rmse, 'status': STATUS_OK, ...}`.
- `_get_default_data_dependent_space()`: Defines the Hyperopt search space for each hyperparameter:
  - `lengthscales_{i}`: `hp.quniform` between 0.1 and 100 (step 0.01) for each input dimension.
  - `kernel_variance`: `hp.uniform` between 1e-6 and `y_train.var()`.
  - `likelihood_noise_variance`: `hp.loguniform` between `(y_train.std()/100)**2` and `(y_train.std()/2)**2` (with safety checks for small/zero standard deviation).
  - `kernel_name`: `hp.choice` among the default kernels (Matern32, Matern52, RBF, RationalQuadratic).
- `refit_best_model()`: Trains a new GPflow GPR model using `self.best_params` on the entire training data (centered using `self.y_train_mean_`), and stores this model as `self.best_model_`.
- `predict(X_new_processed)`:
  - Takes new, preprocessed data `X_new_processed`.
  - Uses `self.best_model_` to predict mean and variance.
  - Adds back `self.y_train_mean_` to the predicted mean to return predictions on the original scale.
  - Returns `(pred_mean, pred_var)` as NumPy arrays.
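The data-dependent bounds used by the default search space amount to simple statistics of the target variable. A minimal sketch (the exact guard value for a near-constant target is an assumption, not taken from the package source):

```python
import numpy as np

y_train = np.array([1.2, 0.7, 2.3, 1.9, 0.4, 1.5])  # example target values

# Assumed safety guard so the bounds stay valid when y is (nearly) constant
y_std = max(y_train.std(), 1e-6)

noise_low = (y_std / 100) ** 2    # lower bound for likelihood_noise_variance
noise_high = (y_std / 2) ** 2     # upper bound for likelihood_noise_variance
kernel_var_high = y_train.var()   # upper bound for kernel_variance (lower is 1e-6)

print(noise_low, noise_high, kernel_var_high)
```

Since `(y_std / 2) ** 2` is a quarter of the target variance, the noise search range always sits below the kernel-variance upper bound for non-degenerate targets.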
Customization
- Kernels: Modify `DEFAULT_KERNELS` in `bayesian_gp_cvloss.optimizer` or provide a custom `hyperopt_space` with your desired `kernel_name` choices.
- Hyperparameter Space: Pass a custom `hyperopt_space` dictionary to the `GPCrossValidatedOptimizer` constructor. The space must include keys for `lengthscales_{i}` (for each feature), `kernel_variance`, `likelihood_noise_variance`, and `kernel_name`.
- Cross-Validation: Change `n_splits` and `random_state` in the constructor.
- Hyperopt: Adjust `max_evals` and `rstate_seed` in the `optimize()` method.
Contributing
Contributions are welcome! If you have suggestions for improvements or find any issues, please open an issue or submit a pull request to the GitHub repository: https://github.com/Shifa-Zhong/bayesian-gp-cvloss
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Shifa Zhong (sfzhong@tongji.edu.cn) GitHub: Shifa-Zhong