Project description
xgboost-distribution
XGBoost for probabilistic prediction. Like NGBoost, but faster, and in the XGBoost scikit-learn API.
Installation
$ pip install xgboost-distribution
python_requires = >=3.8
install_requires =
    scikit-learn
    xgboost>=2.1.0
Usage
XGBDistribution follows the XGBoost scikit-learn API, with an additional keyword argument specifying the distribution, which is fit via Maximum Likelihood Estimation:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from xgboost_distribution import XGBDistribution
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = XGBDistribution(
    distribution="normal",
    n_estimators=500,
    early_stopping_rounds=10,
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
See the documentation for all available distributions.
After fitting, we can predict the parameters of the distribution:
preds = model.predict(X_test)
mean, std = preds.loc, preds.scale
Note that this returns a namedtuple of numpy arrays, one for each parameter of the distribution (we use the scipy.stats naming conventions for the parameters; see e.g. scipy.stats.norm for the normal distribution).
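Because the parameters follow the scipy.stats conventions, they can be plugged directly into the corresponding scipy distribution. For example, a minimal sketch (assuming the normal model fitted above) that computes per-sample prediction intervals and the held-out negative log-likelihood:

from scipy.stats import norm

preds = model.predict(X_test)

# 95% central prediction interval for each test sample
lower, upper = norm.interval(0.95, loc=preds.loc, scale=preds.scale)

# mean negative log-likelihood of the held-out targets
nll = -norm.logpdf(y_test, loc=preds.loc, scale=preds.scale).mean()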
NGBoost performance comparison
XGBDistribution follows the method shown in the NGBoost library, using natural gradients to estimate the parameters of the distribution.
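Concretely, the natural gradient is the ordinary gradient of the negative log-likelihood preconditioned by the inverse Fisher information matrix of the distribution's parameters. The following is a rough, illustrative sketch for a normal distribution parametrised by (loc, log scale); it is not the library's exact implementation:

import numpy as np

def natural_gradient_normal(y, loc, log_scale):
    # ordinary gradients of the negative log-likelihood of N(loc, exp(log_scale))
    var = np.exp(2 * log_scale)
    grad_loc = (loc - y) / var
    grad_log_scale = 1.0 - (y - loc) ** 2 / var

    # the Fisher information is diagonal here, diag(1 / var, 2),
    # so the natural gradient is the gradient scaled by its inverse
    nat_grad_loc = grad_loc * var  # simplifies to loc - y
    nat_grad_log_scale = grad_log_scale / 2.0
    return nat_grad_loc, nat_grad_log_scale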
We compared the performance of XGBDistribution and the NGBoost NGBRegressor on the California Housing dataset, estimating normal distributions. While the predictive performance of the two models is fairly similar (measured by the negative log-likelihood of a normal distribution and the RMSE), XGBDistribution is around 15x faster (timed over both the fit and predict steps).
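A timing harness along these lines can reproduce the comparison; the sketch below is illustrative (it assumes ngboost is installed and reuses the train/test split from above), not the exact experiment setup:

import time

from ngboost import NGBRegressor
from xgboost_distribution import XGBDistribution

def time_fit_predict(model, X_train, y_train, X_test):
    # time the combined fit and predict steps
    start = time.perf_counter()
    model.fit(X_train, y_train)
    model.predict(X_test)
    return time.perf_counter() - start

t_xgbd = time_fit_predict(XGBDistribution(distribution="normal", n_estimators=500), X_train, y_train, X_test)
t_ngb = time_fit_predict(NGBRegressor(n_estimators=500), X_train, y_train, X_test)
print(f"XGBDistribution: {t_xgbd:.1f}s, NGBRegressor: {t_ngb:.1f}s")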
Please see the experiments page for results across various datasets.
Full XGBoost features
XGBDistribution offers the full set of XGBoost features available in the XGBoost scikit-learn API, allowing, for example, probabilistic regression with monotonic constraints.
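For instance, a minimal sketch (the constraint string below is illustrative, with one entry per feature of the California Housing data used above):

from xgboost_distribution import XGBDistribution

model = XGBDistribution(
    distribution="normal",
    n_estimators=500,
    monotone_constraints="(1,0,0,0,0,0,0,0)",  # e.g. non-decreasing in the first feature
)
model.fit(X_train, y_train)
preds = model.predict(X_test)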
Acknowledgements
This package would not exist without the excellent work from:
NGBoost - Which demonstrated how gradient boosting with natural gradients can be used to estimate parameters of distributions. Much of the gradient calculation code was adapted from there.
XGBoost - Which provides the gradient boosting algorithms used here; in particular, its scikit-learn API was used as a blueprint.
Note
This project has been set up using PyScaffold 4.0.1. For details and usage information on PyScaffold see https://pyscaffold.org/.
Download files
Source Distribution: xgboost_distribution-0.3.0.tar.gz (212.2 kB)
Built Distribution: xgboost_distribution-0.3.0-py2.py3-none-any.whl (19.0 kB)
File details
Details for the file xgboost_distribution-0.3.0.tar.gz.
File metadata
- Download URL: xgboost_distribution-0.3.0.tar.gz
- Upload date:
- Size: 212.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 54ae19c936ffb885c0cee6397269ad38f45f1e18622b37976e6ce33e7cfcfb4f
MD5 | 599be10885585bbd40fb2ea0d5f2bebb
BLAKE2b-256 | 56d289ef0f726e61a36f709b9b0956082c9cc5189818c958d70f0439fa5417c4
File details
Details for the file xgboost_distribution-0.3.0-py2.py3-none-any.whl.
File metadata
- Download URL: xgboost_distribution-0.3.0-py2.py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 70365dfff78630259a1ecd9083b0ccdac7f470d7c61421a1d4fc6f629f936d79
MD5 | 31c275b059d8da2903ddd79c8b3c7ac1
BLAKE2b-256 | f172c521593f983df4ffdfabb05a5bf1d3caa5d3afffab92d81bd6d0e2486ec9