
LightGBM Classifier with integrated Bayesian hyperparameter optimization


[Logo: a gecko looking at the camera, with Bayesian math in white on a pink and green background]

(100)gecs

Bayesian hyperparameter tuning for LGBMClassifier, LGBMRegressor, CatBoostClassifier and CatBoostRegressor with a scikit-learn API


Project Overview

gecs is a tool that automates hyperparameter tuning for boosting classifiers and regressors, which can save significant time and computational resources during model building and optimization. GEC stands for Good Enough Classifier: it gets you a good-enough model automatically, so you can focus on other tasks such as feature engineering. If you deploy 100 of them, you get 100GECs.

Introduction

The primary class in this package is LightGEC, which is derived from LGBMClassifier. Like its parent, LightGEC can be used to build and train gradient boosting models, but with the added feature of automated Bayesian hyperparameter optimization. It can be imported from gecs.lightgec and used in place of LGBMClassifier, with the same API.

By default, LightGEC optimizes num_leaves, boosting_type, learning_rate, reg_alpha, reg_lambda, min_child_samples, min_child_weight, colsample_bytree, subsample_freq, subsample and, optionally, n_estimators. Which hyperparameters to tune is fully customizable.

Installation

The installation requires CMake, which can be installed using apt on Linux or Homebrew on macOS, for example:
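
sudo apt install cmake   # Linux (assumes a Debian-based distribution)
brew install cmake       # macOS (assumes Homebrew is installed)

Then install (100)gecs using pip: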

pip install gecs

Usage

The LightGEC class provides the same API as lightgbm's LGBMClassifier, and additionally:

  • two additional parameters to the fit method:

    • n_iter: Defines the number of hyperparameter combinations that the model should try. More iterations could lead to better model performance, but at the expense of computational resources

    • fixed_hyperparameters: Allows the user to specify hyperparameters that the GEC should not optimize. By default, only n_estimators is fixed. Any of the LGBMClassifier init arguments can be fixed; subsample_freq and subsample can also be fixed, but only jointly, by passing the value bagging (see the sketch after this list)

  • the methods serialize and deserialize, which store the LightGEC state of the hyperparameter optimization process, but not the fitted LGBMClassifier parameters, to a JSON file. To store the boosted tree model itself, you have to provide your own serialization or use pickle

  • the methods freeze and unfreeze, which turn the LightGEC functionally into an LGBMClassifier and back
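
A minimal sketch of the two fit parameters described above. The list format for fixed_hyperparameters, with the value bagging standing in for subsample and subsample_freq jointly, is inferred from the description; check the signature in your installed version:

from sklearn.datasets import load_iris
from gecs.lightgec import LightGEC

X, y = load_iris(return_X_y=True)

gec = LightGEC()
# try 20 hyperparameter combinations; keep n_estimators fixed (the default)
# and additionally fix subsample and subsample_freq jointly via "bagging"
gec.fit(X, y, n_iter=20, fixed_hyperparameters=["n_estimators", "bagging"])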

Example

The default use of LightGEC would look like this:

from sklearn.datasets import load_iris
from gecs.lightgec import LightGEC # LGBMClassifier with hyperparameter optimization
from gecs.lightger import LightGER # LGBMRegressor with hyperparameter optimization
from gecs.catgec import CatGEC # CatBoostClassifier with hyperparameter optimization
from gecs.catger import CatGER # CatBoostRegressor with hyperparameter optimization


X, y = load_iris(return_X_y=True)


# fit and infer GEC
gec = LightGEC()
gec.fit(X, y)
yhat = gec.predict(X)


# manage GEC state
path = "./gec.json"
gec.serialize(path) # stores gec data and settings, but not underlying LGBMClassifier attributes
gec2 = LightGEC.deserialize(path, X, y) # X and y are necessary to fit the underlying LGBMClassifier
gec.freeze() # freeze GEC so that it behaves like an LGBMClassifier
gec.unfreeze() # unfreeze to enable GEC hyperparameter optimization


# benchmark against LGBMClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = LGBMClassifier()
lgbm_score = np.mean(cross_val_score(clf, X, y))

gec.freeze()
gec_score = np.mean(cross_val_score(gec, X, y))

print(f"{gec_score = }, {lgbm_score = }")
assert gec_score > lgbm_score, "GEC doesn't outperform LGBMClassifier"

# check what hyperparameter combinations were tried
gec.tried_hyperparameters()
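
Because serialize stores only the optimization state, persisting the fitted booster itself is left to you; a minimal sketch using the standard library's pickle, continuing the example above:

import pickle

# persist the full fitted model, including the underlying boosted trees
with open("gec.pkl", "wb") as f:
    pickle.dump(gec, f)

# restore and predict without refitting
with open("gec.pkl", "rb") as f:
    gec_restored = pickle.load(f)

yhat = gec_restored.predict(X)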

Contributing

If you want to contribute, please reach out and I'll design a process around it.

License

MIT

Contact Information

You can find my contact information on my website: https://leonluithlen.eu
