Forust
A lightweight gradient boosting package
Forust is a lightweight package for building gradient boosted decision tree ensembles. All of the algorithm code is written in Rust, with a Python wrapper. The Rust package can be used directly; however, most examples shown here are for the Python wrapper. For a self-contained Rust example, see here. It implements the same algorithm as the XGBoost package, and in many cases will give nearly identical results.
I developed this package for a few reasons: mainly to better understand the XGBoost algorithm, to have a fun project to work on in Rust, and to be able to experiment with adding new features to the algorithm in a smaller, simpler codebase.
All of the Rust code for the package can be found in the src directory, while all of the Python wrapper code is in the py-forust directory.
Installation
The package can be installed directly from PyPI.
pip install forust
To use in a Rust project, add the following to your Cargo.toml file under the [dependencies] section.
forust-ml = "0.2.7"
Usage
The GradientBooster class is currently the only public-facing class in the package, and can be used to train gradient boosted decision tree ensembles with multiple objective functions. It can be initialized with the following arguments.
objective_type (str, optional): The name of the objective function used to optimize. Valid options include "LogLoss" to use logistic loss as the objective function (binary classification), or "SquaredLoss" to use squared error as the objective function (continuous regression). Defaults to "LogLoss".
iterations (int, optional): Total number of trees to train in the ensemble. Defaults to 100.
learning_rate (float, optional): Step size to use at each iteration. Each leaf weight is multiplied by this number. The smaller the value, the more conservative the weights will be. Defaults to 0.3.
max_depth (int, optional): Maximum depth of an individual tree. Valid values are 0 to infinity. Defaults to 5.
max_leaves (int, optional): Maximum number of leaves allowed on a tree. Valid values are 0 to infinity. This is the total number of final nodes. Defaults to sys.maxsize.
l2 (float, optional): L2 regularization term applied to the weights of the tree. Valid values are 0 to infinity. Defaults to 1.0.
gamma (float, optional): The minimum amount of loss required to further split a node. Valid values are 0 to infinity. Defaults to 0.0.
min_leaf_weight (float, optional): Minimum sum of the hessian values of the loss function required to be in a node. Defaults to 1.0.
base_score (float, optional): The initial prediction value of the model. Defaults to 0.5.
nbins (int, optional): Number of bins to calculate to partition the data. Setting this to a smaller number will result in faster training time, while potentially sacrificing accuracy. If there are more bins than unique values in a column, all unique values will be used. Defaults to 256.
parallel (bool, optional): Should multiple cores be used when training and predicting with this model? Defaults to True.
allow_missing_splits (bool, optional): Allow for splits to be made such that all missing values go down one branch, and all non-missing values go down the other, if this results in the greatest reduction of loss. If this is False, splits will only be made on non-missing values. If create_missing_branch is set to True, setting this parameter to True will result in the missing branch being further split; if this parameter is False, the missing branch will always be a terminal node. Defaults to True.
monotone_constraints (dict[Any, int], optional): Constraints that are used to enforce a specific relationship between the training features and the target variable. A dictionary should be provided where the keys are the feature index value if the model will be fit on a numpy array, or a feature name if it will be fit on a pandas DataFrame. The values of the dictionary should be an integer value of -1, 1, or 0 to specify the relationship that should be estimated between the respective feature and the target variable. Use a value of -1 to enforce a negative relationship, 1 a positive relationship, and 0 to enforce no specific relationship at all. Features not included in the mapping will not have any constraint applied. If None is passed, no constraints will be enforced on any variable. Defaults to None.
subsample (float, optional): Percent of records to randomly sample at each iteration when training a tree. Defaults to 1.0, meaning all data is used for training.
seed (integer, optional): Integer value used to seed any randomness used in the algorithm. Defaults to 0.
missing (float, optional): Value to consider missing when training and predicting with the booster. Defaults to np.nan.
create_missing_branch (bool, optional): An experimental parameter that, if True, will create a separate branch for missing values, creating a ternary tree; the missing node will be given the same weight value as the parent node. If this parameter is False, missing values will be sent down either the left or right branch, creating a binary tree. Defaults to False.
sample_method (str | None, optional): Optional string value used to determine the method to sample the data while training. If this is None, no sample method will be used. If the subsample parameter is less than 1 and no sample_method is provided, sample_method will be automatically set to "random". Valid options are "goss" and "random". Defaults to None.
evaluation_metric (str | None, optional): Optional string value used to define an evaluation metric that will be calculated at each iteration if an evaluation_dataset is provided at fit time. The metric can be one of "AUC", "LogLoss", "RootMeanSquaredLogError", or "RootMeanSquaredError". If no evaluation_metric is passed, but an evaluation_dataset is passed, then "LogLoss" will be used with the "LogLoss" objective function, and "RootMeanSquaredLogError" will be used with "SquaredLoss".
early_stopping_rounds (int | None, optional): If this is specified, and an evaluation_dataset is passed during fit, then an improvement in the evaluation_metric must be seen after at least this many iterations of training, otherwise training will be cut short.
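For example, a booster configured with several of the parameters above might look like the following sketch; the specific values are illustrative only.
from forust import GradientBooster
# Illustrative configuration; every parameter shown is documented above.
model = GradientBooster(
    objective_type="LogLoss",
    iterations=500,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,  # sample_method will default to "random" since subsample < 1
    monotone_constraints={"age": -1},
    seed=42,
)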
Training and Predicting
Once the booster has been initialized, it can be fit on a provided dataset and target. After fitting, the model can be used to predict on a dataset. In the case of this example, the predictions are the log odds of a given record being 1.
# Small example dataset
from seaborn import load_dataset
df = load_dataset("titanic")
X = df.select_dtypes("number").drop(columns=["survived"])
y = df["survived"]
# Initialize a booster with defaults.
from forust import GradientBooster
model = GradientBooster(objective_type="LogLoss")
model.fit(X, y)
# Predict on data
model.predict(X.head())
# array([-1.94919663, 2.25863229, 0.32963671, 2.48732194, -3.00371813])
# predict contributions
model.predict_contributions(X.head())
# array([[-0.63014213, 0.33880048, -0.16520798, -0.07798772, -0.85083578,
# -1.07720813],
# [ 1.05406709, 0.08825999, 0.21662544, -0.12083538, 0.35209258,
# -1.07720813],
The fit method accepts the following arguments.
X (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
y (ArrayLike): Either a pandas Series, or a 1 dimensional numpy array. If "LogLoss" was the objective type specified, then this should only contain 1 or 0 values, where 1 is the positive class being predicted. If "SquaredLoss" is the objective type, then any continuous variable can be provided.
sample_weight (Optional[ArrayLike], optional): Instance weights to use when training the model. If None is passed, a weight of 1 will be used for every record. Defaults to None.
evaluation_data (tuple[FrameLike, ArrayLike, ArrayLike] | tuple[FrameLike, ArrayLike], optional): An optional list of tuples, where each tuple should contain a dataset, an equal length target array, and optionally an equal length sample weight array. If this is provided, metric values will be calculated at each iteration of training. If early_stopping_rounds is supplied, the first entry of this list will be used to determine whether performance has improved over the last set of iterations; if no improvement is seen within early_stopping_rounds iterations, training will be cut short.
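As a sketch, fitting with a held-out evaluation set and early stopping might look like the following; the scikit-learn train_test_split helper is used here purely for illustration and is not part of this package.
from sklearn.model_selection import train_test_split
from forust import GradientBooster
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = GradientBooster(objective_type="LogLoss", early_stopping_rounds=5)
# Metrics are calculated against the evaluation data at each iteration,
# and training stops early if no improvement is seen.
model.fit(X_train, y_train, evaluation_data=[(X_valid, y_valid)])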
The predict method accepts the following arguments.
X (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
parallel (Optional[bool], optional): Optionally specify if the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
The predict_contributions method will predict with the fitted booster on new data, returning the feature contribution matrix. The last column is the bias term. It accepts the following arguments.
X (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
method (str, optional): Method to calculate the contributions. If "average" is specified, the average internal node values are calculated; this is equivalent to the approx_contribs parameter in XGBoost. The other supported method is "weight", which uses the internal leaf weights to calculate the contributions. This is the same as what is described by Saabas here.
parallel (Optional[bool], optional): Optionally specify if the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
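As a quick sketch, the rows of the contribution matrix (including the bias column) would be expected to sum to the raw predictions, mirroring how XGBoost's contribution outputs behave; treat this property as an assumption rather than something documented above.
import numpy as np
contribs = model.predict_contributions(X.head(), method="weight")
preds = model.predict(X.head())
# Each row of contributions plus the bias term should roughly equal the log-odds prediction.
np.allclose(contribs.sum(axis=1), preds)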
The maximum iteration to be used when predicting can be set using the set_prediction_iteration method. If early_stopping_rounds has been set, this will default to the best iteration, otherwise all of the trees will be used. It accepts a single value.
iteration (int): Iteration number to use. This will use all trees, up to and including this index.
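A minimal usage sketch, assuming the booster has already been fit with more than 50 trees:
# Only use trees up to and including index 49 when predicting.
model.set_prediction_iteration(49)
model.predict(X.head())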
If early stopping was used, the evaluation history can be retrieved with the get_evaluation_history method.
model = GradientBooster(objective_type="LogLoss")
model.fit(X, y, evaluation_data=[(X, y)])
model.get_evaluation_history()[0:3]
# array([[588.9158873 ],
# [532.01055803],
# [496.76933646]])
Inspecting the Model
Once the booster has been fit, each individual tree structure can be retrieved in text form using the text_dump method. This method returns a list, the same length as the number of trees in the model.
model.text_dump()[0]
# 0:[0 < 3] yes=1,no=2,missing=2,gain=91.50833,cover=209.388307
# 1:[4 < 13.7917] yes=3,no=4,missing=4,gain=28.185467,cover=94.00148
# 3:[1 < 18] yes=7,no=8,missing=8,gain=1.4576768,cover=22.090348
# 7:[1 < 17] yes=15,no=16,missing=16,gain=0.691266,cover=0.705011
# 15:leaf=-0.15120,cover=0.23500
# 16:leaf=0.154097,cover=0.470007
The json_dump method performs the same action, but returns the model as a JSON representation rather than a text string.
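For example, assuming json_dump returns a JSON string, the model structure could be parsed for further inspection like so (a sketch, not part of the documented examples).
import json
# Parse the dumped model into Python objects for programmatic inspection.
model_structure = json.loads(model.json_dump())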
To see an estimate for how a given feature is used in the model, the partial_dependence method is provided. This method calculates the partial dependence values of a feature. For each unique value of the feature, it gives the estimate of the predicted value for that feature, with the effects of all other features averaged out. This information gives an estimate of how a given feature impacts the model.
The partial_dependence method takes the following parameters...
X (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array. This should be the same data passed into the model's fit or predict, with the columns in the same order.
feature (Union[str, int]): The feature for which to calculate the partial dependence values. This can be the name of a column, if the provided X is a pandas DataFrame, or the index of the feature.
samples (int | None, optional): Number of evenly spaced samples to select. If None is passed, all unique values will be used. Defaults to 100.
exclude_missing (bool, optional): Should missing values be excluded from the feature? Defaults to True.
percentile_bounds (tuple[float, float], optional): Upper and lower percentiles to start at when calculating the samples. Defaults to (0.2, 0.98), so the samples selected are capped between the 20th and 98th percentiles.
This method returns a 2 dimensional numpy array, where the first column is the sorted unique values of the feature, and the second column is the partial dependence value for each feature value.
This information can be plotted to visualize how a feature is used in the model, like so.
from seaborn import lineplot
import matplotlib.pyplot as plt
pd_values = model.partial_dependence(X=X, feature="age", samples=None)
fig = lineplot(x=pd_values[:,0], y=pd_values[:,1],)
plt.title("Partial Dependence Plot")
plt.xlabel("Age")
plt.ylabel("Log Odds")
We can see how this is impacted if a model is created where a specific constraint is applied to the feature using the monotone_constraints parameter.
model = GradientBooster(
objective_type="LogLoss",
monotone_constraints={"age": -1},
)
model.fit(X, y)
pd_values = model.partial_dependence(X=X, feature="age")
fig = lineplot(
x=pd_values[:, 0],
y=pd_values[:, 1],
)
plt.title("Partial Dependence Plot with Monotonicity")
plt.xlabel("Age")
plt.ylabel("Log Odds")
Saving the model
To save and subsequently load a trained booster, the save_booster and load_booster methods can be used. Each accepts a path, which is used to write the model to or read it from. The model is saved and loaded as a JSON object.
trained_model.save_booster("model_path.json")
# To load a model from a json path.
loaded_model = GradientBooster.load_booster("model_path.json")
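As a quick sanity-check sketch, the reloaded booster would be expected to reproduce the original predictions.
import numpy as np
# Predictions from the reloaded model should match the original booster.
np.allclose(trained_model.predict(X.head()), loaded_model.predict(X.head()))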