A lightweight gradient boosting implementation in Rust.

Forust

A lightweight gradient boosting package

Forust is a lightweight package for building gradient boosted decision tree ensembles. All of the algorithm code is written in Rust, with a Python wrapper. The Rust package can be used directly; however, most examples shown here use the Python wrapper. It implements the same algorithm as the XGBoost package, and in many cases will give nearly identical results.

I developed this package for a few reasons: mainly to better understand the XGBoost algorithm, but also to have a fun project to work on in Rust, and to be able to experiment with adding new features to the algorithm in a smaller, simpler codebase.

Usage

The GradientBooster class is currently the only public-facing class in the package. It can be used to train gradient boosted decision tree ensembles with multiple objective functions.

It can be initialized with the following arguments.

objective_type (str, optional): The name of the objective function to optimize. Valid options include "LogLoss", which uses logistic loss as the objective function, and "SquaredLoss", which uses squared error. Defaults to "LogLoss".
iterations (int, optional): Total number of trees to train in the ensemble. Defaults to 100.
learning_rate (float, optional): Step size to use at each iteration. Each leaf weight is multiplied by this number. The smaller the value, the more conservative the weights will be. Defaults to 0.3.
max_depth (int, optional): Maximum depth of an individual tree. Valid values are 0 to infinity. Defaults to 5.
max_leaves (int, optional): Maximum number of leaves allowed on a tree. Valid values are 0 to infinity. This is the total number of final nodes. Defaults to sys.maxsize.
l2 (float, optional): L2 regularization term applied to the weights of the tree. Valid values are 0 to infinity. Defaults to 1.0.
gamma (float, optional): The minimum loss reduction required to further split a node. Valid values are 0 to infinity. Defaults to 0.0.
min_leaf_weight (float, optional): Minimum sum of the hessian values of the loss function required to be in a node. Defaults to 0.0.
base_score (float, optional): The initial prediction value of the model. Defaults to 0.5.
nbins (int, optional): Number of bins used to partition the data. Setting this to a smaller number will result in faster training, potentially at the cost of accuracy. If there are more bins than unique values in a column, all unique values will be used. Defaults to 256.
parallel (bool, optional): Whether multiple cores should be used when training and predicting with this model. Defaults to True.
dtype (Union[np.dtype, str], optional): Datatype used for the model. Valid options are a NumPy 32-bit float or a NumPy 64-bit float. Using a 32-bit float could be faster in some instances, but may lead to less precise results. Defaults to "float64".
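
To make the roles of l2, gamma, min_leaf_weight and learning_rate more concrete, here is a minimal sketch of the leaf weight and split gain formulas from the standard XGBoost formulation, which Forust is described above as following. The function names are hypothetical and this is not Forust's internal code; the Rust implementation may differ in detail (for example in constant factors applied to the gain).

```python
def leaf_weight(grad_sum, hess_sum, l2=1.0, learning_rate=0.3):
    """Regularized leaf weight, shrunk by the learning rate.

    grad_sum and hess_sum are the sums of the first and second
    derivatives of the loss over the records that fall in the leaf.
    """
    return -learning_rate * grad_sum / (hess_sum + l2)


def split_gain(g_left, h_left, g_right, h_right, l2=1.0, gamma=0.0):
    """Loss reduction from splitting a node, net of the gamma penalty."""

    def score(g, h):
        # Larger l2 shrinks the score, making splits less attractive.
        return g * g / (h + l2)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


def is_valid_split(g_left, h_left, g_right, h_right,
                   l2=1.0, gamma=0.0, min_leaf_weight=0.0):
    """Accept a split only if both children carry enough hessian
    weight and the penalized gain is positive."""
    if h_left < min_leaf_weight or h_right < min_leaf_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, l2, gamma) > 0.0
```

Under these assumptions, raising l2 or gamma makes the same candidate split less likely to be accepted, and a smaller learning_rate scales every leaf weight toward zero.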

Once the booster has been initialized, it can be fit on a provided dataset and performance field (the target, y in the example below). After fitting, the model can be used to predict on a dataset. In the case of this example, the predictions are the log odds of a given record being 1.

# Small example dataset
from seaborn import load_dataset

df = load_dataset("titanic")
X = df.select_dtypes("number").drop(columns=["survived"])
y = df["survived"]

# Initialize a booster with defaults.
from forust import GradientBooster
model = GradientBooster(objective_type="LogLoss")
model.fit(X, y)

# Predict on data
model.predict(X.head())
# array([-1.94919663,  2.25863229,  0.32963671,  2.48732194, -3.00371813])
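
Because the predictions in this example are log odds, they can be mapped to probabilities with the logistic function. The following is a small sketch using only the standard library; the helper name log_odds_to_probability is hypothetical and not part of Forust, and the input values are copied from the example output above.

```python
import math


def log_odds_to_probability(log_odds):
    """Map a log-odds value to a probability with the logistic function."""
    # Branch on the sign so math.exp never receives a large positive
    # argument, which would overflow for extreme log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Values copied from the model.predict(X.head()) output shown above.
log_odds = [-1.94919663, 2.25863229, 0.32963671, 2.48732194, -3.00371813]

probabilities = [log_odds_to_probability(v) for v in log_odds]

# A log odds of 0 corresponds to a probability of 0.5, so thresholding
# the probability at 0.5 is the same as checking the sign of the log odds.
predicted_class = [int(p >= 0.5) for p in probabilities]
print(predicted_class)  # [0, 1, 1, 1, 0]
```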

The fit method accepts the following arguments.

X (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
y (ArrayLike): Either a pandas Series, or a 1 dimensional numpy array.
sample_weight (Optional[ArrayLike], optional): Instance weights to use when training the model. If None is passed, a weight of 1 will be used for every record. Defaults to None.

The predict method accepts the following arguments.

X (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.

Once the booster has been fit, each individual tree structure can be retrieved in text form using the text_dump method. This method returns a list with one entry per tree in the model.

model.text_dump()[0]
# 0:[0 < 3] yes=1,no=2,missing=2,gain=91.50833,cover=209.388307
#       1:[4 < 13.7917] yes=3,no=4,missing=4,gain=28.185467,cover=94.00148
#             3:[1 < 18] yes=7,no=8,missing=8,gain=1.4576768,cover=22.090348
#                   7:[1 < 17] yes=15,no=16,missing=16,gain=0.691266,cover=0.705011
#                         15:leaf=-0.15120,cover=0.23500
#                         16:leaf=0.154097,cover=0.470007
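
If the dump needs to be inspected programmatically, its lines can be parsed with regular expressions. The sketch below assumes the line layout is exactly as in the example output above (split nodes and leaf nodes, plain numeric values); parse_dump_line is a hypothetical helper, not part of Forust, and other versions of the package may format the dump differently.

```python
import re

NUMBER = r"[-+.\deE]+"

# Split node, e.g. "0:[0 < 3] yes=1,no=2,missing=2,gain=91.5,cover=209.3"
SPLIT_RE = re.compile(
    r"^\s*(?P<node>\d+):\[(?P<feature>\S+) < (?P<threshold>[^\]]+)\] "
    r"yes=(?P<yes>\d+),no=(?P<no>\d+),missing=(?P<missing>\d+),"
    rf"gain=(?P<gain>{NUMBER}),cover=(?P<cover>{NUMBER})\s*$"
)

# Leaf node, e.g. "15:leaf=-0.15120,cover=0.23500"
LEAF_RE = re.compile(
    rf"^\s*(?P<node>\d+):leaf=(?P<leaf>{NUMBER}),cover=(?P<cover>{NUMBER})\s*$"
)


def parse_dump_line(line):
    """Parse one line of a tree dump into a dictionary."""
    m = SPLIT_RE.match(line)
    if m:
        return {
            "kind": "split",
            "node": int(m["node"]),
            "feature": m["feature"],  # kept as text; may be an index or name
            "threshold": float(m["threshold"]),
            "yes": int(m["yes"]),
            "no": int(m["no"]),
            "missing": int(m["missing"]),
            "gain": float(m["gain"]),
            "cover": float(m["cover"]),
        }
    m = LEAF_RE.match(line)
    if m:
        return {
            "kind": "leaf",
            "node": int(m["node"]),
            "leaf": float(m["leaf"]),
            "cover": float(m["cover"]),
        }
    raise ValueError(f"Unrecognized dump line: {line!r}")


# Lines copied from the example dump above (leading indentation is ignored).
example = """0:[0 < 3] yes=1,no=2,missing=2,gain=91.50833,cover=209.388307
      1:[4 < 13.7917] yes=3,no=4,missing=4,gain=28.185467,cover=94.00148
                        15:leaf=-0.15120,cover=0.23500"""

nodes = [parse_dump_line(line) for line in example.splitlines()]
```

With a fitted model, the same helper could be applied to each line of model.text_dump()[0] after splitting it on newlines.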

TODOs

This is still a work in progress.

Early stopping rounds
- The booster should accept a validation dataset and use it to determine when to stop training.
Monotonicity support
- Right now features are used in the model without any constraints.
Ability to save a model
- The way the underlying trees are structured, they would lend themselves to being saved as JSON objects.
Clean up the CI/CD pipeline.

Release history

0.4.7

Apr 12, 2024

0.4.6

Mar 24, 2024

0.4.5

Dec 17, 2023

0.4.4

Dec 13, 2023

0.4.3

Dec 5, 2023

0.4.2

Oct 20, 2023

0.4.1

Oct 17, 2023

0.4.0

Oct 16, 2023

0.3.4

Oct 13, 2023

0.3.3

Oct 10, 2023

0.3.2

Oct 9, 2023

0.3.1

Oct 2, 2023

0.3.0

Oct 2, 2023

0.2.26

Sep 19, 2023

0.2.25

Sep 19, 2023

0.2.24

Sep 12, 2023

0.2.23

Sep 7, 2023

0.2.22

Sep 6, 2023

0.2.21

Aug 23, 2023

0.2.20

Aug 8, 2023

0.2.19

Aug 2, 2023

0.2.18

Jul 13, 2023

0.2.17

Jul 5, 2023

0.2.16

Jun 29, 2023

0.2.15

Jun 24, 2023

0.2.14

Jun 19, 2023

0.2.13

Jun 9, 2023

0.2.12

May 24, 2023

0.2.11

May 22, 2023

0.2.10

May 19, 2023

0.2.9

May 18, 2023

0.2.8

May 18, 2023

0.2.7

May 15, 2023

0.2.6

May 9, 2023

0.2.5

May 8, 2023

0.2.4

May 6, 2023

0.2.3

May 1, 2023

0.2.2

Apr 23, 2023

0.2.1

Apr 23, 2023

0.2.0

Apr 20, 2023

0.1.7

Aug 20, 2022

0.1.6

Aug 19, 2022

0.1.5

Jul 31, 2022

0.1.4

Jun 18, 2022

0.1.3

Jun 17, 2022

0.1.2

Jun 9, 2022

This version

0.1.0

Jun 8, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

forust-0.1.0.tar.gz (561.5 kB)

Uploaded Jun 8, 2022 Source

Built Distributions

forust-0.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (439.8 kB)

Uploaded Jun 8, 2022 CPython 3.10 manylinux: glibc 2.5+ x86-64

forust-0.1.0-cp39-none-win_amd64.whl (363.5 kB)

Uploaded Jun 8, 2022 CPython 3.9 Windows x86-64

forust-0.1.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (440.2 kB)

Uploaded Jun 8, 2022 CPython 3.9 manylinux: glibc 2.5+ x86-64

forust-0.1.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (440.1 kB)

Uploaded Jun 8, 2022 CPython 3.8 manylinux: glibc 2.5+ x86-64

Hashes for forust-0.1.0.tar.gz

Algorithm	Hash digest
SHA256	`ff65dc6668f06ced7dee1085dcae12b5658b2b55630e491a4e1debe19ea16d5f`
MD5	`cb559a219950769863804d88f7886450`
BLAKE2b-256	`718d9caf648d81047a5aa4f31cce322eaa891d25e6e074f56af89ba71dbd92ba`

Hashes for forust-0.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`98c9410d6fc2e604031abb294ca0acc6fca15292e73fd5ea2a052d6a8d30da74`
MD5	`21300ba302216668710a502d467676cb`
BLAKE2b-256	`16877e3fc2c189cd841fa23ed9a2762995d8fc7698f10512e5c1c09e4852f610`

Hashes for forust-0.1.0-cp39-none-win_amd64.whl

Algorithm	Hash digest
SHA256	`9472bd977be2c198af0d63271b55570d90c8482b1215852a60117578c6400ed6`
MD5	`91a090d1b748dc6130d3d6a1324457a8`
BLAKE2b-256	`b72c1886241486acb6740c06e4d0d8a3791f7e3824763571da8747b7a169416d`

Hashes for forust-0.1.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`8a14d2689676ad4a76d7ad9d686e87e7d4178af9415e5d71f13773431d01a840`
MD5	`a80c546f0f7c9a2291f065af554dc40b`
BLAKE2b-256	`40c77dd20e45e47a3a26f85e55cbda79c9f8d50e21670f17685c6eeb54473521`

Hashes for forust-0.1.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`f8910e280d423f0be89d90942d8243c8263efe3236a73e3c278d2a9d8e8ff37b`
MD5	`e203bcfe16f9651ede90217c0b919746`
BLAKE2b-256	`49929e2075c7511598bbf7f7f2eae607799bdc3be0e5d86fca790fd81a892012`