HRBoost: Hierarchical Refined Boost - GBDT with Non-monotonic Bayesian Hierarchical Clustering

HRBoost (Hierarchical Refined Boost)

HRBoost is a fast, lightweight Gradient Boosting Decision Tree (GBDT) library built in C++ and Python. Its core engine uses a Non-monotonic Bayesian Hierarchical Clustering algorithm (LNM-BHC, $k=3$) to find splits for high-cardinality categorical variables without manual parameter tuning.

HRBoost follows the scikit-learn estimator API and provides two estimators: HRBoostClassifier and HRBoostRegressor.


Installation

pip install hrboost

Hyperparameter Reference

HRBoostClassifier and HRBoostRegressor accept the following parameters in their constructors:

Core GBDT Parameters

  • n_estimators (int, default=200): The number of boosting rounds (trees to build).
  • learning_rate (float, default=0.1): Shrinkage rate applied to each tree's update to reduce overfitting.
  • max_depth (int, default=4): Maximum depth of each decision tree.
  • max_leaves (int, default=64): Maximum number of leaves allowed per tree.
  • reg_lambda (float, default=1.0): L2 regularization term on weights. It also scales the baseline regularization for Bayesian Hierarchical Clustering.
  • subsample (float, default=0.8): Fraction of training samples randomly chosen to train each tree.
  • colsample_bytree (float, default=1.0): Fraction of features randomly selected for building each tree.
  • n_bins (int, default=32): Maximum number of discrete bins used to bucket continuous features.
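
To make the roles of reg_lambda and learning_rate concrete, here is a minimal sketch of how a leaf value is typically computed in second-order GBDT libraries. It assumes the standard XGBoost-style formulation, which this README does not confirm for the HRBoost engine; the function names leaf_weight and boosted_update are hypothetical and are not part of the hrboost package.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Leaf value under the usual second-order approximation.

    Assumption: w = -G / (H + lambda), as in XGBoost-style GBDT.
    A larger reg_lambda shrinks the leaf value toward zero.
    """
    return -grad_sum / (hess_sum + reg_lambda)


def boosted_update(prediction, grad_sum, hess_sum,
                   learning_rate=0.1, reg_lambda=1.0):
    """Add one tree's shrunken leaf value to the current prediction."""
    return prediction + learning_rate * leaf_weight(grad_sum, hess_sum, reg_lambda)


# Squared-error example: two samples with targets 3.0 and 5.0 and a current
# prediction of 0.0 give gradients (pred - y) = -3, -5 and Hessians 1, 1.
G, H = -8.0, 2.0

unregularized = leaf_weight(G, H, reg_lambda=0.0)  # plain mean residual
regularized = leaf_weight(G, H, reg_lambda=1.0)    # pulled toward zero
after_one_tree = boosted_update(0.0, G, H, learning_rate=0.1, reg_lambda=1.0)

print(unregularized, regularized, after_one_tree)
```

With a small learning_rate each tree moves the prediction only a fraction of the way, which is why lower values usually need a larger n_estimators.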

Split Constraints

  • min_child_weight (float, default=0.1): Minimum sum of instance Hessians required in a child node.
  • gamma (float, default=0.0): Minimum loss reduction required to make a split.
  • max_delta_step (float, default=0.0): Maximum delta step allowed for each tree's leaf output (useful for highly imbalanced classes).
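
The three constraints above can be sketched as follows. This is an illustration under the assumption that HRBoost uses the conventional second-order split gain found in XGBoost-style libraries, and that a max_delta_step of 0.0 means "no limit"; neither is stated in this README. All function names are hypothetical.

```python
def _score(grad_sum, hess_sum, reg_lambda):
    # Structure score of a node: G^2 / (H + lambda).
    return grad_sum * grad_sum / (hess_sum + reg_lambda)


def split_gain(gl, hl, gr, hr, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty."""
    parent = _score(gl + gr, hl + hr, reg_lambda)
    children = _score(gl, hl, reg_lambda) + _score(gr, hr, reg_lambda)
    return 0.5 * (children - parent) - gamma


def is_valid_split(gl, hl, gr, hr, reg_lambda=1.0, gamma=0.0,
                   min_child_weight=0.1):
    """Reject splits whose children are too light or whose gain is not positive."""
    if hl < min_child_weight or hr < min_child_weight:
        return False
    return split_gain(gl, hl, gr, hr, reg_lambda, gamma) > 0.0


def clip_leaf(value, max_delta_step=0.0):
    """Clamp a leaf output to [-max_delta_step, max_delta_step].

    Assumption: 0.0 disables the clamp.
    """
    if max_delta_step <= 0.0:
        return value
    return max(-max_delta_step, min(max_delta_step, value))


gain = split_gain(gl=-6.0, hl=3.0, gr=4.0, hr=2.0)
print(f"gain = {gain:.4f}")
print(is_valid_split(-6.0, 3.0, 4.0, 2.0))             # both children heavy enough
print(is_valid_split(-6.0, 3.0, 4.0, 0.05))            # right child below min_child_weight
print(is_valid_split(-6.0, 3.0, 4.0, 2.0, gamma=10.0)) # gamma exceeds the raw gain
print(clip_leaf(2.5, max_delta_step=1.0))
```

Raising gamma or min_child_weight prunes more candidate splits, which tends to give smaller, more conservative trees.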

System & Features

  • cat_features (list of int, default=None): List of feature indices to be treated as categorical features.
  • random_state (int, default=0): Seed for the random number generators used in row subsampling (subsample) and column sampling (colsample_bytree).
  • verbose (bool, default=True): Controls C++ engine logging during training.
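
Because cat_features takes column positions rather than names, a small helper can derive the indices from a list of column names. The helper categorical_indices and the example column names below are hypothetical and are not part of hrboost; only the cat_features parameter itself comes from the reference above.

```python
def categorical_indices(columns, categorical_names):
    """Return the positions of the categorical columns, in column order."""
    wanted = set(categorical_names)
    missing = wanted - set(columns)
    if missing:
        # Fail early instead of silently dropping a misspelled column name.
        raise KeyError(f"unknown columns: {sorted(missing)}")
    return [i for i, name in enumerate(columns) if name in wanted]


# Hypothetical feature layout of a training matrix.
columns = ["age", "city", "income", "device"]
cat_features = categorical_indices(columns, ["city", "device"])
print(cat_features)  # [1, 3]

# The resulting list would then be passed to the constructor, for example:
# clf = HRBoostClassifier(cat_features=cat_features)
```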

Environment Variables for Advanced Tuning

HRBoost exposes some internal engine settings through environment variables rather than constructor arguments, to keep the hyperparameter list short:

  • COHESION_REG (float, default=0.3):
    • Controls the intensity of the Dynamic Cohesion Regularization applied during tree splitting.
    • A cohesion penalty factor is computed dynamically from the difference in predicted leaf values between the two prospective children. When the child predictions diverge excessively, L2 regularization is increased.
    • Set COHESION_REG=0.0 (for example, export COHESION_REG=0.0) to disable this penalty. High-noise categorical settings may benefit from higher values (e.g., 0.5 or 1.0).
  • MIN_CAT_COUNT (float, default=automatically scaled):
    • The minimum count required for a categorical bin to participate in BHC clustering. It helps filter out extremely rare categorical values.
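
When training from a script or notebook, the same variables can be set from Python instead of the shell. The sketch below uses only the standard library; the context manager engine_env is a hypothetical helper, and it assumes the C++ engine reads the variables when training starts, which this README does not state. If the engine reads them at import time instead, they need to be set before importing hrboost.

```python
import os
from contextlib import contextmanager


@contextmanager
def engine_env(**overrides):
    """Temporarily set environment variables, then restore the old values."""
    previous = {name: os.environ.get(name) for name in overrides}
    try:
        for name, value in overrides.items():
            os.environ[name] = str(value)  # environment values must be strings
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)  # variable was unset before
            else:
                os.environ[name] = old


before = os.environ.get("COHESION_REG")

with engine_env(COHESION_REG=0.5):
    # Training would go here, e.g. clf.fit(X_train, y_train).
    inside = os.environ["COHESION_REG"]

after = os.environ.get("COHESION_REG")
print(inside, after == before)
```

Scoping the override this way keeps one experiment's setting from leaking into later fits in the same process.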

Quick Start

1. Classification

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from hrboost import HRBoostClassifier

# Load the 10-class handwritten digits dataset and hold out 20% for testing.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Train a classifier with 100 boosting rounds.
clf = HRBoostClassifier(n_estimators=100, learning_rate=0.1, max_depth=4)
clf.fit(X_train, y_train)

# score() reports mean accuracy on the held-out set.
print(f"Test Accuracy: {clf.score(X_test, y_test):.4f}")

2. Regression

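from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from hrboost import HRBoostRegressor

# Load the diabetes regression dataset and hold out 20% for testing.
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

# Train a regressor with 150 boosting rounds and a slightly lower learning rate.
reg = HRBoostRegressor(n_estimators=150, learning_rate=0.08, max_depth=4)
reg.fit(X_train, y_train)

# score() reports the R2 coefficient of determination on the held-out set.
print(f"Test R2 Score: {reg.score(X_test, y_test):.4f}")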
