Rust-first gradient boosting for regression, classification, and ranking with time-aware validation and Python bindings
Project description
AlloyGBM
AlloyGBM is a Rust-first gradient boosting library with Python bindings, supporting regression, binary and multi-class classification, and learning-to-rank. It is built for fast native execution, deterministic training, and time-aware tabular workflows.
AlloyGBM is strongest on panel and finance-style problems where leakage-aware validation and practical iteration speed matter. It also performs competitively on general tabular benchmarks and includes native artifact prediction, TreeSHAP explanations, and purged time-series split helpers.
When To Use AlloyGBM
AlloyGBM is a good fit when you want:
- a native Rust-backed gradient boosting library with regression, classification, and ranking
- deterministic CPU training and inference
- sklearn-compatible estimators (GBMRegressor, GBMClassifier, GBMRanker)
- time-aware validation helpers for forecasting or panel-style workflows
- native prediction from serialized artifacts
- TreeSHAP explanations and global feature importances
- NaN/missing value support out of the box
- model persistence via pickle, save/load, or artifact export
Installation
PyPI:
pip install alloygbm
From source:
python -m pip install --upgrade maturin
maturin develop --manifest-path bindings/python/Cargo.toml --release
AlloyGBM targets Python 3.11+ and uses a native Rust extension module.
Wheel targets for 0.7.0:
- macOS arm64
- Linux x86_64 (manylinux)
- source distribution for other platforms
Quick Examples
Regression
from alloygbm import GBMRegressor, rmse
model = GBMRegressor(
    learning_rate=0.05,
    max_depth=6,
    n_estimators=1200,
    deterministic=True,
    seed=7,
)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
print(rmse(y_test, model.predict(X_test)))
Binary Classification
from alloygbm import GBMClassifier, accuracy, log_loss
model = GBMClassifier(
    learning_rate=0.05,
    max_depth=6,
    n_estimators=500,
    deterministic=True,
    seed=7,
)
model.fit(X_train, y_train)
labels = model.predict(X_test) # [0, 1, 1, 0, ...]
probas = model.predict_proba(X_test) # [[P(0), P(1)], ...]
print("accuracy:", accuracy(y_test, labels))
print("log_loss:", log_loss(y_test, probas[:, 1]))
Learning-to-Rank
from alloygbm import GBMRanker, ndcg
model = GBMRanker(
    ranking_objective="rank:ndcg",
    learning_rate=0.05,
    max_depth=6,
    n_estimators=300,
    deterministic=True,
    seed=7,
)
model.fit(X_train, y_train, group=query_ids_train)
scores = model.predict(X_test)
print("NDCG@10:", ndcg(y_test, scores, group=query_ids_test, k=10))
MorphBoost (Adaptive Split Criterion)
MorphBoost is an opt-in training mode that blends the standard gradient gain
with a normalized information-theoretic term. Across rounds, the blend ramps
in via a tanh(iter/20) warmup, an EMA over per-class gradient statistics
shapes split selection, and leaf magnitudes are scaled by a depth penalty
and per-iteration shrinkage. See the
MorphBoost paper for the formulation.
from alloygbm import GBMRegressor
# Constant LR (default) with morph adaptive split criterion
model = GBMRegressor(
    n_estimators=1200,
    max_depth=6,
    learning_rate=0.05,
    training_mode="morph",    # opt in
    morph_rate=0.1,           # per-round leaf shrinkage
    info_score_weight=0.3,    # blend weight for info-theoretic term
    depth_penalty_base=0.9,   # multiplier per depth level
    balance_penalty=True,     # penalize highly imbalanced splits
    seed=7,
)
model.fit(X_train, y_train)
# With warmup-cosine LR schedule (good fit for very-low-LR runs)
model = GBMRegressor(
    n_estimators=5000,
    learning_rate=0.01,
    training_mode="morph",
    lr_schedule="warmup_cosine",
    lr_warmup_frac=0.1,  # fraction of n_estimators spent in warmup
    seed=7,
)
training_mode="morph" works with GBMClassifier and GBMRanker too, with
identical parameter semantics.
DRO Leaf Solver (Robust Scalar Leaves)
Set leaf_solver="dro" to use a fast Wasserstein-inspired robust Newton update
for scalar leaves. The solver penalizes each candidate leaf by within-leaf
gradient dispersion, reducing sensitivity to noisy or weak leaf signals while
keeping prediction speed identical to standard constant leaves.
from alloygbm import GBMRegressor
model = GBMRegressor(
    n_estimators=600,
    max_depth=6,
    learning_rate=0.05,
    leaf_solver="dro",
    dro_radius=0.05,
    dro_metric="wasserstein",
    seed=7,
)
model.fit(X_train, y_train)
leaf_solver="dro" works with GBMRegressor, GBMClassifier, and
GBMRanker, and composes with training_mode="morph". In v0.7.0 it requires
leaf_model="constant"; piecewise-linear leaves still use the standard PL
solver. dro_radius=0.0 preserves standard-leaf predictions while retaining
DRO metadata in the artifact.
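Since leaf_solver="dro" composes with training_mode="morph", here is a minimal combined sketch; the hyperparameter values are illustrative, not tuned recommendations:

```python
from alloygbm import GBMRegressor

# MorphBoost split selection combined with DRO-robust scalar leaves.
model = GBMRegressor(
    n_estimators=600,
    max_depth=6,
    learning_rate=0.05,
    training_mode="morph",
    leaf_solver="dro",
    dro_radius=0.05,
    seed=7,
)
model.fit(X_train, y_train)
```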
Factor-Neutral Boosting
Use neutralization="per_round_gradient" with fit(..., factor_exposures=F) to project each boosting round's pseudo-residuals away from user-supplied nuisance factors. This is useful when common factors explain high-variance signal that you do not want the model to spend tree capacity learning.
This is a training-time regularization tool. It does not guarantee prediction-time zero exposure unless predictions are neutralized against evaluation-time factors outside the model.
Constructor parameters:
GBMRegressor(
    neutralization="none",              # "none" | "pre_target" | "per_round_gradient" | "split_penalty"
    factor_neutralization_lambda=1e-6,  # finite, >= 0 ridge added to F^T W F
    factor_penalty=0.0,                 # finite, >= 0; only active for neutralization="split_penalty"
)
factor_exposures is dense, row-major, finite, and shaped
(n_rows, n_factors). It is fit data, not constructor state, so sklearn
cloning remains clean and large matrices are not embedded in estimator params.
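Putting the pieces together, a minimal sketch of a neutralized fit; factor_loadings is a hypothetical user-supplied array standing in for your exposure data:

```python
import numpy as np
from alloygbm import GBMRegressor

# Dense, finite (n_rows, n_factors) exposures aligned with the training rows.
F_train = np.asarray(factor_loadings, dtype=np.float64)

model = GBMRegressor(
    neutralization="per_round_gradient",
    factor_neutralization_lambda=1e-6,
    seed=7,
)
# Exposures are fit data, not constructor state.
model.fit(X_train, y_train, factor_exposures=F_train)
```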
Mode semantics:
neutralization="none" preserves current behavior and ignores
factor_exposures unless a non-None matrix is provided with an inactive mode,
in which case Python raises a clear validation error to prevent silent user
mistakes.
neutralization="pre_target" residualizes the regression target once before
training:
y_perp = y - F (F^T W F + lambda I)^-1 F^T W y
This mode is supported for GBMRegressor only. It is rejected for
classification and ranking because target residualization is not well-defined
for class labels or ranking relevance. eval_set is also rejected for
pre_target in this release because the public API does not yet accept
validation-set factor exposures to residualize validation targets consistently.
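For intuition, the residualization is a standard weighted ridge projection; a standalone numpy sketch of the formula above (not the library's internal code path):

```python
import numpy as np

def residualize_target(y, F, w, lam=1e-6):
    """Return y_perp = y - F (F^T W F + lam I)^-1 F^T W y, with W = diag(w)."""
    Fw = F * w[:, None]                      # row-scale F by sample weights
    A = F.T @ Fw + lam * np.eye(F.shape[1])  # F^T W F + lam I
    beta = np.linalg.solve(A, Fw.T @ y)      # ridge coefficients
    return y - F @ beta
```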
neutralization="per_round_gradient" projects objective gradients before each
boosting round:
g_perp = g - F (F^T W F + lambda I)^-1 F^T W g
Hessians are unchanged. This mode is supported for regression, binary classification, multiclass, and ranking. For multiclass, each class-gradient column is projected independently against the same factor projector.
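The same projector applies to gradients; a numpy sketch covering the per-column multiclass case (again illustrative, not the internal implementation):

```python
import numpy as np

def project_gradients(G, F, w, lam=1e-6):
    """g_perp = g - F (F^T W F + lam I)^-1 F^T W g, column-wise for multiclass."""
    Fw = F * w[:, None]                      # w: per-row weights
    A = F.T @ Fw + lam * np.eye(F.shape[1])
    # G may be (n_rows,) for scalar objectives or (n_rows, n_classes) for
    # multiclass; np.linalg.solve accepts both right-hand-side shapes.
    return G - F @ np.linalg.solve(A, Fw.T @ G)
```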
neutralization="split_penalty" includes per-round gradient projection and
subtracts a factor-load penalty from split gain:
penalty = factor_penalty * || F_L^T update_L + F_R^T update_R ||^2 / max(row_count, 1)
gain_final = gain_after_existing_modes - penalty
For scalar leaves, update_L and update_R are the candidate scalar leaf
values before any final MorphBoost depth/iteration leaf scaling. For DRO
leaves, the scalar values use the DRO effective gradients. For MorphBoost, the
order is: project gradients, compute standard/DRO gradient gain, blend
MorphBoost information score, subtract factor penalty, then apply MorphBoost
leaf scaling when storing leaves. split_penalty performs additional
factor-exposure work during split search and should be treated as the slowest
neutralization mode until production-scale benchmarks justify stronger claims.
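As a toy illustration of the penalty arithmetic for one candidate split with scalar leaves (this is my reading of the formula above; update_left and update_right are the candidate scalar leaf values):

```python
import numpy as np

def factor_load_penalty(F_left, F_right, update_left, update_right,
                        factor_penalty, row_count):
    """penalty = factor_penalty * ||F_L^T u_L + F_R^T u_R||^2 / max(row_count, 1)."""
    # With scalar leaves, F^T u reduces to u times the per-factor exposure sums.
    load = update_left * F_left.sum(axis=0) + update_right * F_right.sum(axis=0)
    return factor_penalty * float(load @ load) / max(row_count, 1)

# gain_final = gain_after_existing_modes - factor_load_penalty(...)
```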
Compatibility:
| Feature | pre_target | per_round_gradient | split_penalty |
|---|---|---|---|
| GBMRegressor | supported | supported | supported |
| GBMClassifier | rejected | supported | supported |
| GBMRanker | rejected | supported | supported |
| training_mode="morph" | supported | supported | supported |
| leaf_solver="dro" | supported | supported | supported |
| leaf_model="linear" | supported | supported | rejected |
| warm start | rejected in this release | rejected in this release | rejected in this release |
Exposure matrices are not persisted in the estimator or artifact. For
this release, neutralized warm-start and init_model continuation are rejected
because artifacts do not yet persist neutralization metadata needed to prove
that the previous model and current estimator have matching neutralization
contracts.
Piecewise-Linear Leaves
Set leaf_model="linear" on any estimator to replace scalar leaves with small
closed-form linear models (f_s(x) = b_s + Σ α_j x_j). Weights are solved in
closed form via the ridge system α* = -(XᵀHX + λI)⁻¹ Xᵀg, with λ set by
lambda_l2. This typically converges in fewer rounds on data with linear
within-node residual structure (e.g. California Housing), at a 2–8× per-round
training overhead.
from alloygbm import GBMRegressor
model = GBMRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    leaf_model="linear",
    lambda_l2=0.01,  # recommended >= 0.01 with linear leaves
    seed=7,
)
model.fit(X_train, y_train)
leaf_model="linear" works with GBMClassifier and GBMRanker, and composes
with training_mode="morph". SHAP currently requires leaf_model="constant".
Time-Aware Validation
from alloygbm import GBMRegressor, purged_time_series_splits, rmse
splits = purged_time_series_splits(time_index, n_splits=5, purge_gap=1, embargo=0)
for train_idx, test_idx in splits:
    model = GBMRegressor(deterministic=True, seed=7)
    model.fit(
        [rows[i] for i in train_idx],
        [targets[i] for i in train_idx],
    )
    score = rmse(
        [targets[i] for i in test_idx],
        model.predict([rows[i] for i in test_idx]),
    )
    print(score)
For panel data, use purged_panel_splits(...).
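A hypothetical sketch of the panel variant, assuming a signature parallel to purged_time_series_splits with an added per-row entity identifier (the actual argument names may differ; check the docs):

```python
from alloygbm import purged_panel_splits

# time_index and entity_ids are aligned per-row arrays (names assumed here).
splits = purged_panel_splits(time_index, entity_ids, n_splits=5, purge_gap=1, embargo=0)
for train_idx, test_idx in splits:
    ...  # same fit/score loop as above
```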
Model Persistence
import pickle
# Pickle round-trip
with open("model.pkl", "wb") as f:
pickle.dump(model, f)
with open("model.pkl", "rb") as f:
model = pickle.load(f)
# Native save/load
model.save_model("model.agbm")
loaded = GBMRegressor.load_model("model.agbm")
# Artifact export for deployment
artifact_bytes = model.artifact_bytes
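The exported bytes can then be served without unpickling the estimator; a hedged sketch using predict_from_artifact, where the argument order is an assumption:

```python
from alloygbm import predict_from_artifact

# Persist the raw artifact for deployment, then score from it directly.
with open("model.artifact", "wb") as f:
    f.write(artifact_bytes)

with open("model.artifact", "rb") as f:
    preds = predict_from_artifact(f.read(), X_test)
```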
Feature Summary
Estimators
- GBMRegressor -- squared-error regression with dataset-aware training_policy
- GBMClassifier -- binary classification with log-loss objective, predict_proba, sklearn ClassifierMixin
- GBMRanker -- learning-to-rank with 5 objectives: rank:pairwise, rank:ndcg, rank:xendcg, queryrmse, yetirank
- All estimators are sklearn-compatible (get_params, set_params, score, pipeline integration)
Training Features
- NaN/missing value support with learned split direction
- Sample weights via fit(..., sample_weight=...)
- Monotone constraints via monotone_constraints
- Feature importance weighting via feature_weights
- Leaf-wise (best-first) tree growth via tree_growth="leaf"
- Warm-starting / incremental training via warm_start=True
- Up to 65,535 bins per feature (continuous_binning_max_bins)
- Multiple categorical column support via categorical_feature_indices
- Early stopping with best_iteration_, best_score_, evals_result_
- Objective-aware training metric tracking (RMSE, log-loss, accuracy, NDCG)
- Adaptive split criterion via training_mode="morph" (MorphBoost)
- Per-iteration learning-rate schedules: lr_schedule="constant" (default) or "warmup_cosine"
- DRO-style robust scalar leaves via leaf_solver="dro" (closed-form gradient-uncertainty penalty)
- Piecewise-linear leaves via leaf_model="linear" (closed-form ridge solve, faster convergence on linear-trend data)
Inference and Explanations
- Zero-copy numpy prediction from native artifacts
- TreeSHAP explanations via shap_values(...) (polynomial-time, no feature limit)
- Global feature importance via feature_importances(...)
- Artifact-backed prediction via predict_from_artifact(...)
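A hedged usage sketch for the explanation APIs; only the names come from the list above, and the call shapes are assumptions:

```python
# Per-row, per-feature attributions (requires leaf_model="constant").
shap = model.shap_values(X_test)

# Global importance scores across the ensemble.
importances = model.feature_importances()
```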
Validation Helpers
- purged_time_series_splits(...) -- leakage-aware time-series cross-validation
- purged_panel_splits(...) -- panel-data cross-validation
Metrics
- Regression: rmse, mae, r2_score
- Classification: accuracy, log_loss
- Ranking: ndcg
- Finance: pearson_correlation, rank_ic, hit_rate, icir
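A hedged sketch of the finance metrics on held-out predictions; the (y_true, y_pred) argument order is an assumption:

```python
from alloygbm import pearson_correlation, rank_ic, hit_rate

preds = model.predict(X_test)
print("IC:", pearson_correlation(y_test, preds))
print("Rank IC:", rank_ic(y_test, preds))
print("Hit rate:", hit_rate(y_test, preds))
```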
Benchmark Snapshot
The benchmark suite compares AlloyGBM against XGBoost, LightGBM, and CatBoost across regression, classification, and ranking tasks.
Regression:
- AlloyGBM is strongest on panel_time_series
- AlloyGBM is strong on dow_jones_financial
- AlloyGBM is competitive on dense_numeric, trails on california_housing and bike_sharing
Classification:
- AlloyGBM is competitive with established libraries on breast_cancer and synthetic_classification
Ranking:
- AlloyGBM competes on synthetic_ranking using its native LambdaMART implementation
Benchmark tooling and methodology live in benchmarks/README.md.
Current Limitations
- CPU-only runtime (GPU backend is architecturally planned but not implemented)
- No interaction constraints
- No dart/goss boosting modes
- SHAP not yet supported with leaf_model="linear" (use "constant" for now)
- leaf_solver="dro" is a robust scalar leaf update, not a full raw-distribution Wasserstein DRO guarantee
Documentation
- Docs index: docs/README.md
- Benchmark guide: benchmarks/README.md
- Current roadmap: docs/roadmap/current.md
- Archive: docs/archive/README.md
License
MIT. See LICENSE.
Project details
Download files
File details
Details for the file alloygbm-0.7.0.tar.gz.
File metadata
- Download URL: alloygbm-0.7.0.tar.gz
- Upload date:
- Size: 288.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 329ec1e369f8ec0bfb41824b19463b307299d5b6f9062a888ef73fe8fc2f6415 |
| MD5 | b7aa21f84837bb88bd01470d5e5ea8bb |
| BLAKE2b-256 | a804d1789966bf787d59b0bf809f29e64a7be575e64c2eefdd19dcf6f6619e58 |
Provenance
The following attestation bundles were made for alloygbm-0.7.0.tar.gz:
Publisher: publish.yml on LGA-Personal/AlloyGBM

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alloygbm-0.7.0.tar.gz
- Subject digest: 329ec1e369f8ec0bfb41824b19463b307299d5b6f9062a888ef73fe8fc2f6415
- Sigstore transparency entry: 1525464800
- Sigstore integration time:
- Permalink: LGA-Personal/AlloyGBM@954ebec8398fcc3abaaec61ef5fb714590b8e1cd
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/LGA-Personal
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@954ebec8398fcc3abaaec61ef5fb714590b8e1cd
- Trigger Event: release
File details
Details for the file alloygbm-0.7.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: alloygbm-0.7.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.0 MB
- Tags: CPython 3.11+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e4b3fe255e29aae45a0efaa683690df74e8ccc5c564b3726b8b97bb3ac13140b |
| MD5 | dc5b6038e2b93a8039857d2ca9b5047d |
| BLAKE2b-256 | 296cd9b036dbbea7683083e92828691780663eb749c2d7c15a5b5c701c4e4d5d |
Provenance
The following attestation bundles were made for alloygbm-0.7.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:
Publisher: publish.yml on LGA-Personal/AlloyGBM

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alloygbm-0.7.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Subject digest: e4b3fe255e29aae45a0efaa683690df74e8ccc5c564b3726b8b97bb3ac13140b
- Sigstore transparency entry: 1525464846
- Sigstore integration time:
- Permalink: LGA-Personal/AlloyGBM@954ebec8398fcc3abaaec61ef5fb714590b8e1cd
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/LGA-Personal
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@954ebec8398fcc3abaaec61ef5fb714590b8e1cd
- Trigger Event: release
File details
Details for the file alloygbm-0.7.0-cp311-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: alloygbm-0.7.0-cp311-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 941.9 kB
- Tags: CPython 3.11+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 81ca52be4405e0a5c7b1cbb41f72a599838d80f5308e0a8bf23735962efc1057 |
| MD5 | d486a74b94e41b88f714ea9d7e2d8078 |
| BLAKE2b-256 | e4b614ecf0c8cd90c6466e2b35ba08bced241088ed2c90b3703fe65d7edbf14c |
Provenance
The following attestation bundles were made for alloygbm-0.7.0-cp311-abi3-macosx_11_0_arm64.whl:
Publisher: publish.yml on LGA-Personal/AlloyGBM

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alloygbm-0.7.0-cp311-abi3-macosx_11_0_arm64.whl
- Subject digest: 81ca52be4405e0a5c7b1cbb41f72a599838d80f5308e0a8bf23735962efc1057
- Sigstore transparency entry: 1525464888
- Sigstore integration time:
- Permalink: LGA-Personal/AlloyGBM@954ebec8398fcc3abaaec61ef5fb714590b8e1cd
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/LGA-Personal
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@954ebec8398fcc3abaaec61ef5fb714590b8e1cd
- Trigger Event: release