A Python package for Gaussian Process Regression with hyperparameter optimization using Hyperopt and cross-validation, focusing on optimizing cross-validated loss.
Bayesian GP CVLoss: Gaussian Process Regression with Cross-Validated Hyperparameter Optimization
bayesian_gp_cvloss is a Python package designed to simplify the process of training Gaussian Process (GP) models by finding optimal hyperparameters through Bayesian optimization (using Hyperopt) with k-fold cross-validation. The key feature of this package is its direct optimization of the cross-validated loss, aligning the hyperparameter tuning process closely with the model's predictive performance.
This package is particularly useful for researchers and practitioners who want to apply GP models without manually tuning hyperparameters or relying solely on maximizing marginal likelihood, offering a more direct approach to achieving good generalization on unseen data.
Core Idea
The traditional approach to training GP models maximizes the log marginal likelihood with respect to the kernel and noise hyperparameters. While effective, this does not always translate into the best predictive performance on unseen data, especially when the model assumptions are imperfectly met or when working with smaller datasets.
This library implements an alternative strategy:
- Define a search space for the GP kernel parameters (e.g., length scales, kernel variance) and likelihood parameters (e.g., noise variance).
- Use Bayesian optimization (Hyperopt) to intelligently search this space.
- For each set of hyperparameters evaluated by Hyperopt, perform k-fold cross-validation on the training data.
- The objective function is configurable: cross-validated RMSE, Negative Log Predictive Density (NLPD), or a weighted combination.
- The set of hyperparameters yielding the minimum loss is selected as optimal.
- A final GP model is then refitted on the entire training dataset using these best-found hyperparameters.
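The loop above can be sketched in a few lines. This is a minimal stand-in, not the package's implementation: it uses scikit-learn's `GaussianProcessRegressor` (with `optimizer=None` to keep hyperparameters fixed) in place of GPflow, and a small candidate list in place of the Hyperopt search; the hyperparameter dictionary keys are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

def make_kernel(p):
    # Kernel variance * RBF(lengthscale) + observation noise.
    return (ConstantKernel(p["variance"]) * RBF(length_scale=p["lengthscale"])
            + WhiteKernel(noise_level=p["noise"]))

def cv_rmse(p, X, y, n_splits=5, seed=0):
    """Mean k-fold RMSE for one fixed hyperparameter setting."""
    losses = []
    for tr, va in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        gp = GaussianProcessRegressor(kernel=make_kernel(p), optimizer=None)
        gp.fit(X[tr], y[tr])
        resid = y[va] - gp.predict(X[va])
        losses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
X = rng.random((60, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)

# Stand-in for the Hyperopt search: score a few candidates, keep the best.
candidates = [{"variance": 1.0, "lengthscale": ls, "noise": 1e-2}
              for ls in (0.05, 0.2, 1.0)]
best = min(candidates, key=lambda p: cv_rmse(p, X, y))

# Refit on the full training set with the winning hyperparameters.
final_gp = GaussianProcessRegressor(kernel=make_kernel(best), optimizer=None).fit(X, y)
```

Hyperopt replaces the candidate list with a guided search over a continuous space, but the objective it minimises is exactly this kind of `cv_rmse` function.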
Features
- Automated hyperparameter optimization for GP models using Hyperopt.
- Cross-validation (k-fold) integrated into the optimization loop.
- Three scoring objectives:
"cv_rmse"— Minimise cross-validated RMSE (prediction accuracy)."nlpd"— Minimise Negative Log Predictive Density (prediction accuracy + uncertainty calibration)."combined"— Weighted combination of both, balancing accuracy and calibration.
- Automatic Leave-One-Out (LOO): when the dataset is smaller than `n_splits`, the splitter falls back to LOO automatically.
- Supports various GPflow kernels (RBF, Matern32, Matern52, RationalQuadratic by default).
- Smart data-dependent defaults: search ranges are automatically computed from the training data.
- Flexible overrides: fine-tune individual search ranges without building a full Hyperopt space.
- Simple API: provide your preprocessed numerical `X_train` and `y_train` data.
Installation
pip install bayesian-gp-cvloss
Alternatively, install from source:
git clone https://github.com/Shifa-Zhong/bayesian-gp-cvloss.git
cd bayesian-gp-cvloss
pip install .
Dependencies
- gpflow >= 2.0.0
- hyperopt >= 0.2.0
- scikit-learn >= 0.23.0
- pandas >= 1.0.0
- numpy >= 1.18.0
Quick Start
import numpy as np
from bayesian_gp_cvloss import GPCrossValidatedOptimizer
# Create synthetic data
np.random.seed(42)
X = np.random.rand(100, 3)
y = np.sin(X[:, 0] * 2 * np.pi) + X[:, 1]**2 + np.random.randn(100) * 0.1
# --- Option A: Classic RMSE objective (default, backward-compatible) ---
optimizer = GPCrossValidatedOptimizer(
    X_train=X, y_train=y,
    n_splits=5, random_state=42
)
best_params = optimizer.optimize(max_evals=50)
# --- Option B: NLPD objective (accuracy + uncertainty calibration) ---
optimizer = GPCrossValidatedOptimizer(
    X_train=X, y_train=y,
    scoring="nlpd",          # <-- NEW
    n_splits=5, random_state=42
)
best_params = optimizer.optimize(max_evals=50)
# --- Option C: Combined objective ---
optimizer = GPCrossValidatedOptimizer(
    X_train=X, y_train=y,
    scoring="combined",      # <-- NEW
    nlpd_weight=0.5,         # <-- NEW: weight for the NLPD term
    n_splits=5, random_state=42
)
best_params = optimizer.optimize(max_evals=50)
# Access results — both RMSE and NLPD are always recorded
trials = optimizer.get_optimization_results()
if trials.best_trial:
    result = trials.best_trial['result']
    print(f"Best CV RMSE: {result['cv_rmse']:.4f}")
    print(f"Best CV NLPD: {result['cv_nlpd']:.4f}")
    print(f"Best Train RMSE: {result['train_loss']:.4f}")
# Predict on new inputs (X_test: an array with the same number of features as X)
y_pred, y_var = optimizer.predict(X_test)
Scoring Objectives Explained
"cv_rmse" (default)
Minimises the mean cross-validated Root Mean Squared Error. This directly targets prediction accuracy and is equivalent to the behaviour of v0.1.x.
"nlpd" — Negative Log Predictive Density
Treats the GP prediction as a Gaussian distribution N(mu, sigma^2) and evaluates how likely the true observation is under that distribution:
NLPD = 0.5 * log(2*pi) + 0.5 * log(sigma^2) + 0.5 * (y - mu)^2 / sigma^2
This simultaneously penalises:
- Inaccurate means: large `(y - mu)^2`
- Overconfident predictions: small `sigma^2` when the prediction is wrong
- Underconfident predictions: large `sigma^2` when the prediction is right
This is particularly important for Bayesian optimisation, where acquisition functions (EI, UCB, etc.) depend on both the predicted mean and variance.
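The formula is easy to check numerically. The three cases below (with illustrative values) mirror the penalties listed above:

```python
import numpy as np

def nlpd(y, mu, var):
    # Negative log density of y under N(mu, var), term by term as in the formula above.
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.log(var) + 0.5 * (y - mu) ** 2 / var

good           = nlpd(1.0, 1.05, 1e-2)  # small error, honest variance
overconfident  = nlpd(1.0, 1.05, 1e-4)  # same error, variance far too small
underconfident = nlpd(1.0, 1.00, 1e2)   # perfect mean, huge variance
# good < underconfident < overconfident
```

Note that the overconfident prediction is punished hardest: shrinking `sigma^2` blows up the `(y - mu)^2 / sigma^2` term faster than the `log(sigma^2)` reward can compensate.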
"combined"
A weighted sum of normalised RMSE and NLPD:
loss = (1 - nlpd_weight) * norm_RMSE + nlpd_weight * norm_NLPD
Both metrics are min-max normalised using the optimisation history so that the weight is meaningful regardless of scale. The default nlpd_weight=0.5 gives equal importance to accuracy and calibration.
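A minimal sketch of that normalisation, assuming the per-trial RMSE and NLPD histories are available as plain lists (the function name is hypothetical, not the package's API):

```python
import numpy as np

def combined_loss(rmse_hist, nlpd_hist, nlpd_weight=0.5):
    """Weighted loss per trial, with each metric min-max normalised
    over the optimisation history so the weight is scale-free."""
    def norm(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    return (1 - nlpd_weight) * norm(rmse_hist) + nlpd_weight * norm(nlpd_hist)

loss = combined_loss([0.30, 0.10, 0.20], [-0.5, 2.0, -1.0])
# The trial at index 2 wins: middling RMSE but the best-calibrated NLPD.
```

Without the normalisation, whichever metric happens to have the larger numeric range would dominate the sum regardless of `nlpd_weight`.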
Automatic Leave-One-Out (LOO)
When the training set has fewer samples than n_splits, the optimizer automatically switches to Leave-One-Out cross-validation. This avoids empty validation folds and provides the most data-efficient evaluation for very small datasets (common in materials optimisation with expensive experiments).
# With only 8 samples and n_splits=10, LOO is used automatically
optimizer = GPCrossValidatedOptimizer(
    X_train=X_small,  # shape (8, 3)
    y_train=y_small,
    n_splits=10,      # auto-switches to LOO (8 folds)
    random_state=42
)
Customization
- Scoring: `scoring="cv_rmse"`, `"nlpd"`, or `"combined"`.
- NLPD weight: `nlpd_weight=0.5` (only for `"combined"` mode).
- Kernels: `kernels=["RBF", "Matern52"]` to search only specific kernels.
- Lengthscale range: `lengthscale_bounds=(0.05, 50.0)`.
- Kernel variance range: `kernel_variance_bounds=(1e-4, 10.0)`.
- Noise variance range: `noise_variance_bounds=(1e-6, 1.0)`.
- Full custom space: `hyperopt_space={...}` for complete control.
- Cross-validation: `n_splits` and `random_state`.
- Hyperopt: `max_evals` and `rstate_seed` in `optimize()`.
Contributing
Contributions are welcome! If you have suggestions for improvements or find any issues, please open an issue or submit a pull request to the GitHub repository: https://github.com/Shifa-Zhong/bayesian-gp-cvloss
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Shifa Zhong (sfzhong@tongji.edu.cn) GitHub: Shifa-Zhong