HRBoost: Hierarchical Refined Boost - GBDT with Non-monotonic Bayesian Hierarchical Clustering

HRBoost (Hierarchical Refined Boost)

HRBoost is a fast, lightweight Gradient Boosting Decision Tree (GBDT) library built in C++ and Python. Its core engine uses a Non-monotonic Bayesian Hierarchical Clustering algorithm (LNM-BHC, $k=3$) to find optimal splits for high-cardinality categorical variables without manual parameter tuning.

HRBoost is fully compliant with the scikit-learn API and provides two estimators: HRBoostClassifier and HRBoostRegressor.


Installation

pip install hrboost

Hyperparameter Reference

HRBoostClassifier and HRBoostRegressor accept the following parameters in their constructors:

Core GBDT Parameters

  • n_estimators (int, default=200): The number of boosting rounds (trees to build).
  • learning_rate (float, default=0.1): Shrinkage rate applied to each tree's update to reduce overfitting.
  • max_depth (int, default=4): Maximum depth of each decision tree.
  • max_leaves (int, default=64): Maximum number of leaves allowed per tree.
  • reg_lambda (float, default=1.0): L2 regularization term on leaf weights. It also scales the baseline regularization used by the Bayesian Hierarchical Clustering step.
  • subsample (float, default=0.8): Fraction of training samples randomly chosen to train each tree.
  • colsample_bytree (float, default=1.0): Fraction of features randomly selected for building each tree.
  • n_bins (int, default=32): Maximum number of discrete bins used to bucket continuous features.
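
To illustrate what n_bins controls, here is a minimal sketch of quantile-based bucketing, a scheme commonly used by histogram-based GBDT libraries. Whether HRBoost uses this exact scheme internally is an assumption; the helper names quantile_bin_edges and to_bin are hypothetical and are not part of the hrboost package.

```python
from bisect import bisect_right


def quantile_bin_edges(values, n_bins=32):
    """Return at most n_bins - 1 strictly increasing cut points.

    Illustrative only: the binning used inside HRBoost may differ.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    edges = []
    for i in range(1, n_bins):
        # Take the value sitting at the i-th of n_bins evenly spaced ranks.
        cut = ordered[min(n - 1, (i * n) // n_bins)]
        # Skip duplicates so that every bin boundary is distinct.
        if not edges or cut > edges[-1]:
            edges.append(cut)
    return edges


def to_bin(value, edges):
    """Map a continuous value to a bin index in the range 0..len(edges)."""
    return bisect_right(edges, value)


edges = quantile_bin_edges(range(100), n_bins=4)
bins = [to_bin(v, edges) for v in (0, 25, 60, 99)]
```

A feature with few distinct values yields fewer cut points than n_bins - 1, which is why n_bins is a maximum rather than an exact bin count.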

Split Constraints

  • min_child_weight (float, default=0.1): Minimum sum of instance Hessians required in a child node.
  • gamma (float, default=0.0): Minimum loss reduction required to make a split.
  • max_delta_step (float, default=0.0): Maximum absolute step allowed for each tree's leaf output (useful for highly imbalanced classes).
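
The following sketch shows how these three constraints typically interact in second-order gradient boosting. It assumes HRBoost follows the conventional regularized gain and leaf-weight formulas (as popularized by XGBoost), and that max_delta_step=0.0 means "no limit"; neither is confirmed here. The function names are hypothetical.

```python
def leaf_weight(g, h, reg_lambda=1.0, max_delta_step=0.0):
    """Leaf output -G / (H + lambda), optionally clipped to +/- max_delta_step."""
    w = -g / (h + reg_lambda)
    if max_delta_step > 0.0:  # assumption: 0.0 disables clipping
        w = max(-max_delta_step, min(max_delta_step, w))
    return w


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, minus the gamma penalty."""

    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=0.1):
    """Reject splits with a too-light child or a non-positive penalized gain."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    gain = split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma)
    return gain > 0.0


# Gradient sums (g) and Hessian sums (h) for two candidate children.
ok = split_allowed(-4.0, 2.0, 4.0, 2.0)
too_light = split_allowed(-4.0, 0.05, 4.0, 2.0)
```

Raising gamma or min_child_weight makes the tree more conservative, while max_delta_step bounds how far a single leaf can move a prediction.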

System & Features

  • cat_features (list of int, default=None): List of feature indices to be treated as categorical features.
  • random_state (int, default=0): Seed for the random number generators used for row subsampling and column sampling.
  • verbose (bool, default=True): Controls C++ engine logging during training.
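
As a sketch of how a cat_features index list might be prepared, the helper below replaces string values in selected columns with integer codes. It assumes HRBoost expects categorical columns to hold numeric codes, which is not stated above; encode_categoricals is a hypothetical helper, not part of the hrboost package.

```python
def encode_categoricals(rows, cat_features):
    """Replace values in the given column indices with integer codes.

    Codes are assigned in order of first appearance. Returns the encoded
    rows and the per-column value-to-code mappings.
    """
    mappings = {j: {} for j in cat_features}
    encoded = []
    for row in rows:
        new_row = list(row)
        for j in cat_features:
            codes = mappings[j]
            # A value seen for the first time gets the next free code.
            new_row[j] = codes.setdefault(row[j], len(codes))
        encoded.append(new_row)
    return encoded, mappings


rows = [
    [1.5, "red", "NY"],
    [2.0, "blue", "LA"],
    [0.3, "red", "LA"],
]
cat_features = [1, 2]  # column indices to treat as categorical
X, mappings = encode_categoricals(rows, cat_features)

# The same index list would then be passed to the estimator, e.g.:
#   clf = HRBoostClassifier(cat_features=cat_features)
#   clf.fit(X, y)
```

Keep the returned mappings so that the same codes can be applied to test data before calling predict.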

Environment Variables for Advanced Tuning

HRBoost exposes some internal engine settings through environment variables instead of constructor arguments, which keeps the hyperparameter list short:

  • COHESION_REG (float, default=0.3):
    • Controls the intensity of the Dynamic Cohesion Regularization during tree splitting.
    • A cohesion penalty factor is computed dynamically based on the difference in predicted leaf values between prospective children. If child leaf predictions diverge excessively, L2 regularization is dynamically increased.
    • Set COHESION_REG=0.0 (for example, export COHESION_REG=0.0 in the shell) to disable this penalty. High-noise categorical settings may benefit from higher values (e.g., 0.5 or 1.0).
  • MIN_CAT_COUNT (float, default=automatically scaled):
    • The minimum count required for a categorical bin to participate in BHC clustering. It filters out extremely rare categorical values.
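
These variables can also be set from Python rather than the shell. The sketch below uses a small context manager (engine_env is a hypothetical helper, not part of the hrboost package) that sets the variables and restores the previous values afterwards. When the engine actually reads them (at import time or at fit time) is not documented above, so setting them before importing hrboost is the safer assumption.

```python
import os
from contextlib import contextmanager


@contextmanager
def engine_env(**overrides):
    """Temporarily set environment variables, restoring old values on exit."""
    previous = {name: os.environ.get(name) for name in overrides}
    try:
        for name, value in overrides.items():
            os.environ[name] = str(value)  # environment values are strings
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


with engine_env(COHESION_REG=0.5):
    inside = os.environ["COHESION_REG"]
    # Training would go here, e.g.:
    #   from hrboost import HRBoostClassifier
    #   HRBoostClassifier().fit(X_train, y_train)
```

Restoring the previous values keeps one experiment's settings from leaking into the next run in the same process.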

Quick Start

1. Classification

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from hrboost import HRBoostClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

clf = HRBoostClassifier(n_estimators=100, learning_rate=0.1, max_depth=4)
clf.fit(X_train, y_train)

print(f"Test Accuracy: {clf.score(X_test, y_test):.4f}")

2. Regression

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from hrboost import HRBoostRegressor

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

reg = HRBoostRegressor(n_estimators=150, learning_rate=0.08, max_depth=4)
reg.fit(X_train, y_train)

print(f"Test R2 Score: {reg.score(X_test, y_test):.4f}")
