
A Python package for Gaussian Process Regression that tunes hyperparameters with Hyperopt and k-fold cross-validation, directly minimizing the cross-validated loss.


Bayesian GP CVLoss: Gaussian Process Regression with Cross-Validated Hyperparameter Optimization


bayesian_gp_cvloss is a Python package designed to simplify the process of training Gaussian Process (GP) models by finding optimal hyperparameters through Bayesian optimization (using Hyperopt) with k-fold cross-validation. The key feature of this package is its direct optimization of the cross-validated loss, aligning the hyperparameter tuning process closely with the model's predictive performance.

This package is particularly useful for researchers and practitioners who want to apply GP models without manually tuning hyperparameters or relying solely on maximizing marginal likelihood, offering a more direct approach to achieving good generalization on unseen data.

Core Idea

The traditional approach to training GP models is to maximize the log marginal likelihood with respect to the kernel and likelihood hyperparameters. While effective, this does not always translate into the best predictive performance on unseen data, especially when the model assumptions are imperfectly met or the dataset is small.

This library implements an alternative strategy:

  1. Define a search space for the GP kernel parameters (e.g., length scales, kernel variance) and likelihood parameters (e.g., noise variance).
  2. Use Bayesian optimization (Hyperopt) to intelligently search this space.
  3. For each set of hyperparameters evaluated by Hyperopt, perform k-fold cross-validation on the training data.
  4. The objective function is configurable: cross-validated RMSE, Negative Log Predictive Density (NLPD), or a weighted combination.
  5. The set of hyperparameters yielding the minimum loss is selected as optimal.
  6. A final GP model is then refitted on the entire training dataset using these best-found hyperparameters.
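The steps above can be sketched in a few lines. This is an illustrative toy, not the package's implementation: it uses scikit-learn's GaussianProcessRegressor in place of GPflow, and a plain grid in place of Hyperopt's Bayesian search, but the structure (per-candidate k-fold CV RMSE, then a refit on all data) is the same.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=40)

def cv_rmse(lengthscale, noise_var, X, y, n_splits=5):
    """Mean cross-validated RMSE for one fixed hyperparameter setting."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    errs = []
    for tr, va in kf.split(X):
        gp = GaussianProcessRegressor(
            kernel=RBF(length_scale=lengthscale),
            alpha=noise_var,   # noise variance added to the kernel diagonal
            optimizer=None,    # keep hyperparameters fixed; CV does the tuning
        )
        gp.fit(X[tr], y[tr])
        pred = gp.predict(X[va])
        errs.append(np.sqrt(np.mean((pred - y[va]) ** 2)))
    return float(np.mean(errs))

# Step 2's Bayesian search is replaced by a small grid for brevity
candidates = [(ls, nv) for ls in (0.05, 0.2, 1.0) for nv in (1e-4, 1e-2)]
best = min(candidates, key=lambda p: cv_rmse(p[0], p[1], X, y))

# Step 6: refit on the entire training set with the best-found setting
final_gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=best[0]), alpha=best[1], optimizer=None
).fit(X, y)
```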

Features

  • Automated hyperparameter optimization for GP models using Hyperopt.
  • Cross-validation (k-fold) integrated into the optimization loop.
  • Three scoring objectives:
    • "cv_rmse" — Minimise cross-validated RMSE (prediction accuracy).
    • "nlpd" — Minimise Negative Log Predictive Density (prediction accuracy + uncertainty calibration).
    • "combined" — Weighted combination of both, balancing accuracy and calibration.
  • Automatic Leave-One-Out (LOO): when the dataset is smaller than n_splits, the splitter falls back to LOO automatically.
  • Supports various GPflow kernels (RBF, Matern32, Matern52, RationalQuadratic by default).
  • Smart data-dependent defaults: search ranges are automatically computed from the training data.
  • Flexible overrides: fine-tune individual search ranges without building a full Hyperopt space.
  • Simple API: provide your preprocessed numerical X_train and y_train data.

Installation

pip install bayesian-gp-cvloss

Alternatively, install from source:

git clone https://github.com/Shifa-Zhong/bayesian-gp-cvloss.git
cd bayesian-gp-cvloss
pip install .

Dependencies

  • gpflow >= 2.0.0
  • hyperopt >= 0.2.0
  • scikit-learn >= 0.23.0
  • pandas >= 1.0.0
  • numpy >= 1.18.0

Quick Start

import numpy as np
from bayesian_gp_cvloss import GPCrossValidatedOptimizer

# Create synthetic data
np.random.seed(42)
X = np.random.rand(100, 3)
y = np.sin(X[:, 0] * 2 * np.pi) + X[:, 1]**2 + np.random.randn(100) * 0.1

# --- Option A: Classic RMSE objective (default, backward-compatible) ---
optimizer = GPCrossValidatedOptimizer(
    X_train=X, y_train=y,
    n_splits=5, random_state=42
)
best_params = optimizer.optimize(max_evals=50)

# --- Option B: NLPD objective (accuracy + uncertainty calibration) ---
optimizer = GPCrossValidatedOptimizer(
    X_train=X, y_train=y,
    scoring="nlpd",           # <-- NEW
    n_splits=5, random_state=42
)
best_params = optimizer.optimize(max_evals=50)

# --- Option C: Combined objective ---
optimizer = GPCrossValidatedOptimizer(
    X_train=X, y_train=y,
    scoring="combined",       # <-- NEW
    nlpd_weight=0.5,          # <-- NEW: weight for NLPD term
    n_splits=5, random_state=42
)
best_params = optimizer.optimize(max_evals=50)

# Access results — both RMSE and NLPD are always recorded
trials = optimizer.get_optimization_results()
if trials.best_trial:
    result = trials.best_trial['result']
    print(f"Best CV RMSE: {result['cv_rmse']:.4f}")
    print(f"Best CV NLPD: {result['cv_nlpd']:.4f}")
    print(f"Best Train RMSE: {result['train_loss']:.4f}")

# Predict on new inputs (must match the training feature dimension)
X_test = np.random.rand(10, 3)
y_pred, y_var = optimizer.predict(X_test)

Scoring Objectives Explained

"cv_rmse" (default)

Minimises the mean cross-validated Root Mean Squared Error. This directly targets prediction accuracy and is equivalent to the behaviour of v0.1.x.

"nlpd" — Negative Log Predictive Density

Treats the GP prediction as a Gaussian distribution N(mu, sigma^2) and evaluates how likely the true observation is under that distribution:

NLPD = 0.5 * log(2*pi) + 0.5 * log(sigma^2) + 0.5 * (y - mu)^2 / sigma^2

This simultaneously penalises:

  • Inaccurate means: large (y - mu)^2
  • Overconfident predictions: small sigma^2 when the prediction is wrong
  • Underconfident predictions: large sigma^2 when the prediction is right

This is particularly important for Bayesian optimisation, where acquisition functions (EI, UCB, etc.) depend on both the predicted mean and variance.
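The formula above is straightforward to evaluate directly; a minimal NumPy version (illustrative, not the package's internal code):

```python
import numpy as np

def nlpd(y, mu, var):
    """Negative log predictive density of observation y under N(mu, var)."""
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.log(var) + 0.5 * (y - mu) ** 2 / var

# Same mean error, different claimed variance: the confidently wrong
# prediction (small var) is penalised far more heavily.
print(nlpd(1.0, mu=0.0, var=0.01))  # overconfident: ~48.6
print(nlpd(1.0, mu=0.0, var=1.0))   # calibrated:   ~1.42
```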

"combined"

A weighted sum of normalised RMSE and NLPD:

loss = (1 - nlpd_weight) * norm_RMSE + nlpd_weight * norm_NLPD

Both metrics are min-max normalised using the optimisation history so that the weight is meaningful regardless of scale. The default nlpd_weight=0.5 gives equal importance to accuracy and calibration.
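One way to implement this normalised combination (a sketch under the assumption that the loss is computed against the running history of trials; the helper name is hypothetical):

```python
import numpy as np

def combined_loss(rmse_history, nlpd_history, nlpd_weight=0.5):
    """Combined loss of the latest trial, min-max normalised over the history."""
    def minmax(h):
        h = np.asarray(h, dtype=float)
        span = h.max() - h.min()
        # Degenerate history (all equal): treat every trial as 0
        return (h - h.min()) / span if span > 0 else np.zeros_like(h)
    return float((1 - nlpd_weight) * minmax(rmse_history)[-1]
                 + nlpd_weight * minmax(nlpd_history)[-1])
```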

Automatic Leave-One-Out (LOO)

When the training set has fewer samples than n_splits, the optimizer automatically switches to Leave-One-Out cross-validation. This avoids empty validation folds and provides the most data-efficient evaluation for very small datasets (common in materials optimisation with expensive experiments).

# With only 8 samples and n_splits=10, LOO is used automatically
optimizer = GPCrossValidatedOptimizer(
    X_train=X_small,  # shape (8, 3)
    y_train=y_small,
    n_splits=10,      # Auto-switches to LOO (8 folds)
    random_state=42
)
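The fallback logic amounts to choosing a splitter based on the sample count. A sketch with scikit-learn's splitters (`make_splitter` is a hypothetical helper, not a function exported by the package):

```python
from sklearn.model_selection import KFold, LeaveOneOut

def make_splitter(n_samples, n_splits, random_state=42):
    """Use k-fold when there are enough samples; otherwise fall back to LOO."""
    if n_samples < n_splits:
        return LeaveOneOut()  # one fold per sample
    return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
```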

Customization

  • Scoring: scoring="cv_rmse", "nlpd", or "combined".
  • NLPD weight: nlpd_weight=0.5 (only for "combined" mode).
  • Kernels: kernels=["RBF", "Matern52"] to search only specific kernels.
  • Lengthscale range: lengthscale_bounds=(0.05, 50.0).
  • Kernel variance range: kernel_variance_bounds=(1e-4, 10.0).
  • Noise variance range: noise_variance_bounds=(1e-6, 1.0).
  • Full custom space: hyperopt_space={...} for complete control.
  • Cross-Validation: n_splits and random_state.
  • Hyperopt: max_evals and rstate_seed in optimize().

Contributing

Contributions are welcome! If you have suggestions for improvements or find any issues, please open an issue or submit a pull request to the GitHub repository: https://github.com/Shifa-Zhong/bayesian-gp-cvloss

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Shifa Zhong (sfzhong@tongji.edu.cn)
GitHub: Shifa-Zhong
