
Rust-first gradient boosting for regression, classification, and ranking with time-aware validation and Python bindings


AlloyGBM

AlloyGBM is a Rust-first gradient boosting library with Python bindings, supporting regression, binary and multi-class classification, and learning-to-rank. It is built for fast native execution, deterministic training, and time-aware tabular workflows.

AlloyGBM is strongest on panel and finance-style problems where leakage-aware validation and practical iteration speed matter. It also performs competitively on general tabular benchmarks and includes native artifact prediction, TreeSHAP explanations, and purged time-series split helpers.

When To Use AlloyGBM

AlloyGBM is a good fit when you want:

  • a native Rust-backed gradient boosting library with regression, classification, and ranking
  • deterministic CPU training and inference
  • sklearn-compatible estimators (GBMRegressor, GBMClassifier, GBMRanker)
  • time-aware validation helpers for forecasting or panel-style workflows
  • native prediction from serialized artifacts
  • TreeSHAP explanations and global feature importances
  • NaN/missing value support out of the box
  • model persistence via pickle, save/load, or artifact export

Installation

PyPI:

pip install alloygbm

From source:

python -m pip install --upgrade maturin
maturin develop --manifest-path bindings/python/Cargo.toml --release

AlloyGBM targets Python 3.11+ and uses a native Rust extension module.

Wheel targets for 0.7.0:

  • macOS arm64
  • Linux x86_64 (manylinux)
  • source distribution for other platforms

Quick Examples

Regression

from alloygbm import GBMRegressor, rmse

model = GBMRegressor(
    learning_rate=0.05,
    max_depth=6,
    n_estimators=1200,
    deterministic=True,
    seed=7,
)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
print(rmse(y_test, model.predict(X_test)))

Binary Classification

from alloygbm import GBMClassifier, accuracy, log_loss

model = GBMClassifier(
    learning_rate=0.05,
    max_depth=6,
    n_estimators=500,
    deterministic=True,
    seed=7,
)
model.fit(X_train, y_train)

labels = model.predict(X_test)            # [0, 1, 1, 0, ...]
probas = model.predict_proba(X_test)      # [[P(0), P(1)], ...]

print("accuracy:", accuracy(y_test, labels))
print("log_loss:", log_loss(y_test, probas[:, 1]))

Learning-to-Rank

from alloygbm import GBMRanker, ndcg

model = GBMRanker(
    ranking_objective="rank:ndcg",
    learning_rate=0.05,
    max_depth=6,
    n_estimators=300,
    deterministic=True,
    seed=7,
)
model.fit(X_train, y_train, group=query_ids_train)

scores = model.predict(X_test)
print("NDCG@10:", ndcg(y_test, scores, group=query_ids_test, k=10))

MorphBoost (Adaptive Split Criterion)

MorphBoost is an opt-in training mode that blends the standard gradient gain with a normalized information-theoretic term. Across rounds, the blend ramps in via a tanh(iter/20) warmup; an EMA over per-class gradient statistics shapes split selection; and leaf magnitudes are scaled by a depth penalty and per-iteration shrinkage. See the MorphBoost paper for the formulation.

from alloygbm import GBMRegressor

# Constant LR (default) with morph adaptive split criterion
model = GBMRegressor(
    n_estimators=1200,
    max_depth=6,
    learning_rate=0.05,
    training_mode="morph",      # opt in
    morph_rate=0.1,             # per-round leaf shrinkage
    info_score_weight=0.3,      # blend weight for info-theoretic term
    depth_penalty_base=0.9,     # multiplier per depth level
    balance_penalty=True,       # penalize highly imbalanced splits
    seed=7,
)
model.fit(X_train, y_train)

# With warmup-cosine LR schedule (good fit for very-low-LR runs)
model = GBMRegressor(
    n_estimators=5000,
    learning_rate=0.01,
    training_mode="morph",
    lr_schedule="warmup_cosine",
    lr_warmup_frac=0.1,         # fraction of n_estimators spent in warmup
    seed=7,
)

training_mode="morph" works with GBMClassifier and GBMRanker too, with identical parameter semantics.

DRO Leaf Solver (Robust Scalar Leaves)

Set leaf_solver="dro" to use a fast Wasserstein-inspired robust Newton update for scalar leaves. The solver penalizes each candidate leaf by within-leaf gradient dispersion, reducing sensitivity to noisy or weak leaf signals while keeping prediction speed identical to standard constant leaves.

from alloygbm import GBMRegressor

model = GBMRegressor(
    n_estimators=600,
    max_depth=6,
    learning_rate=0.05,
    leaf_solver="dro",
    dro_radius=0.05,
    dro_metric="wasserstein",
    seed=7,
)
model.fit(X_train, y_train)

leaf_solver="dro" works with GBMRegressor, GBMClassifier, and GBMRanker, and composes with training_mode="morph". In v0.7.0 it requires leaf_model="constant"; piecewise-linear leaves still use the standard PL solver. dro_radius=0.0 preserves standard-leaf predictions while retaining DRO metadata in the artifact.

Factor-Neutral Boosting

Use neutralization="per_round_gradient" with fit(..., factor_exposures=F) to project each boosting round's pseudo-residuals away from user-supplied nuisance factors. This is useful when common factors explain high-variance signal that you do not want the model to spend tree capacity learning.

This is a training-time regularization tool. It does not guarantee prediction-time zero exposure unless predictions are neutralized against evaluation-time factors outside the model.

Constructor parameters:

GBMRegressor(
    neutralization="none",                 # "none" | "pre_target" | "per_round_gradient" | "split_penalty"
    factor_neutralization_lambda=1e-6,      # finite, >= 0; ridge added to F^T W F
    factor_penalty=0.0,                     # finite, >= 0; only active for neutralization="split_penalty"
)

factor_exposures is dense, row-major, finite, and shaped (n_rows, n_factors). It is fit data, not constructor state, so sklearn cloning remains clean and large matrices are not embedded in estimator params.
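
A minimal end-to-end sketch combining the constructor and fit contracts above (X_train and y_train as in the earlier examples; the exposure matrix here is random and purely for shape illustration):

import numpy as np
from alloygbm import GBMRegressor

# Three nuisance factors per training row, shaped (n_rows, n_factors).
F = np.random.default_rng(0).normal(size=(len(X_train), 3))

model = GBMRegressor(
    neutralization="per_round_gradient",
    factor_neutralization_lambda=1e-6,
    seed=7,
)
model.fit(X_train, y_train, factor_exposures=F)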

Mode semantics:

neutralization="none" preserves current behavior and ignores factor_exposures unless a non-None matrix is provided with an inactive mode, in which case Python raises a clear validation error to prevent silent user mistakes.

neutralization="pre_target" residualizes the regression target once before training:

y_perp = y - F (F^T W F + lambda I)^-1 F^T W y

This mode is supported for GBMRegressor only. It is rejected for classification and ranking because target residualization is not well-defined for class labels or ranking relevance. eval_set is also rejected for pre_target in this release because the public API does not yet accept validation-set factor exposures to residualize validation targets consistently.

neutralization="per_round_gradient" projects objective gradients before each boosting round:

g_perp = g - F (F^T W F + lambda I)^-1 F^T W g

Hessians are unchanged. This mode is supported for regression, binary classification, multiclass, and ranking. For multiclass, each class-gradient column is projected independently against the same factor projector.
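
Both modes apply the same weighted ridge projector, once to the target (pre_target) or to each round's gradients (per_round_gradient). A minimal NumPy sketch of that algebra, for intuition only (the real projection runs inside the Rust core):

import numpy as np

def project_out(v, F, w, lam=1e-6):
    # v - F (F^T W F + lam I)^-1 F^T W v, with W = diag(w)
    FtW = F.T * w                                # (n_factors, n_rows)
    A = FtW @ F + lam * np.eye(F.shape[1])
    beta = np.linalg.solve(A, FtW @ v)
    return v - F @ beta

For multiclass, this projector would be applied to each class-gradient column independently, matching the semantics above.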

neutralization="split_penalty" includes per-round gradient projection and subtracts a factor-load penalty from split gain:

penalty = factor_penalty * || F_L^T update_L + F_R^T update_R ||^2 / max(row_count, 1)
gain_final = gain_after_existing_modes - penalty

For scalar leaves, update_L and update_R are the candidate scalar leaf values before any final MorphBoost depth/iteration leaf scaling. For DRO leaves, the scalar values use the DRO effective gradients. For MorphBoost, the order is: project gradients, compute standard/DRO gradient gain, blend MorphBoost information score, subtract factor penalty, then apply MorphBoost leaf scaling when storing leaves. split_penalty performs additional factor-exposure work during split search and should be treated as the slowest neutralization mode until production-scale benchmarks justify stronger claims.
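
For scalar leaves, F_L^T update_L reduces to the per-factor column sums of F_L scaled by the candidate leaf value. An illustrative rendering of the penalty arithmetic for a single candidate split (not library code):

import numpy as np

def factor_load_penalty(F_left, F_right, leaf_left, leaf_right,
                        factor_penalty, row_count):
    # || F_L^T u_L + F_R^T u_R ||^2 for scalar candidate leaves u_L, u_R
    load = F_left.sum(axis=0) * leaf_left + F_right.sum(axis=0) * leaf_right
    return factor_penalty * float(load @ load) / max(row_count, 1)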

Compatibility:

Feature                    pre_target                 per_round_gradient         split_penalty
GBMRegressor               supported                  supported                  supported
GBMClassifier              rejected                   supported                  supported
GBMRanker                  rejected                   supported                  supported
training_mode="morph"      supported                  supported                  supported
leaf_solver="dro"          supported                  supported                  supported
leaf_model="linear"        supported                  supported                  rejected
warm start                 rejected in this release   rejected in this release   rejected in this release

Exposure matrices are not persisted in the estimator or artifact. For this release, neutralized warm-start and init_model continuation are rejected because artifacts do not yet persist neutralization metadata needed to prove that the previous model and current estimator have matching neutralization contracts.

Piecewise-Linear Leaves

Set leaf_model="linear" on any estimator to replace scalar leaves with small closed-form linear models (f_s(x) = b_s + Σ α_j x_j). Weights are solved via ridge regression α* = -(XᵀHX + λI)⁻¹ Xᵀg regularised by lambda_l2. This typically converges in fewer rounds on data with linear within-node residual structure (e.g. California Housing), at a 2–8× per-round training overhead.

from alloygbm import GBMRegressor

model = GBMRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    leaf_model="linear",
    lambda_l2=0.01,    # recommended >= 0.01 with linear leaves
    seed=7,
)
model.fit(X_train, y_train)

leaf_model="linear" works with GBMClassifier and GBMRanker, and composes with training_mode="morph". SHAP currently requires leaf_model="constant".

Time-Aware Validation

from alloygbm import GBMRegressor, purged_time_series_splits, rmse

splits = purged_time_series_splits(time_index, n_splits=5, purge_gap=1, embargo=0)

for train_idx, test_idx in splits:
    model = GBMRegressor(deterministic=True, seed=7)
    model.fit(
        [rows[i] for i in train_idx],
        [targets[i] for i in train_idx],
    )
    score = rmse(
        [targets[i] for i in test_idx],
        model.predict([rows[i] for i in test_idx]),
    )

For panel data, use purged_panel_splits(...).

Model Persistence

import pickle

# Pickle round-trip
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Native save/load
model.save_model("model.agbm")
loaded = GBMRegressor.load_model("model.agbm")

# Artifact export for deployment
artifact_bytes = model.artifact_bytes
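
This summary does not spell out the serving call; as an assumption-labeled sketch, if predict_from_artifact (listed under Inference and Explanations below) accepts these bytes directly, usage would look like:

from alloygbm import predict_from_artifact  # assumed import location

preds = predict_from_artifact(artifact_bytes, X_test)  # hypothetical call shape; see API docs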

Feature Summary

Estimators

  • GBMRegressor -- squared-error regression with dataset-aware training_policy
  • GBMClassifier -- binary classification with log-loss objective, predict_proba, sklearn ClassifierMixin
  • GBMRanker -- learning-to-rank with 5 objectives: rank:pairwise, rank:ndcg, rank:xendcg, queryrmse, yetirank
  • All estimators are sklearn-compatible (get_params, set_params, score, pipeline integration)

Training Features

  • NaN/missing value support with learned split direction
  • Sample weights via fit(..., sample_weight=...)
  • Monotone constraints via monotone_constraints
  • Feature importance weighting via feature_weights
  • Leaf-wise (best-first) tree growth via tree_growth="leaf"
  • Warm-starting / incremental training via warm_start=True
  • Up to 65,535 bins per feature (continuous_binning_max_bins)
  • Multiple categorical column support via categorical_feature_indices
  • Early stopping with best_iteration_, best_score_, evals_result_ (see the sketch after this list)
  • Objective-aware training metric tracking (RMSE, log-loss, accuracy, NDCG)
  • Adaptive split criterion via training_mode="morph" (MorphBoost)
  • Per-iteration learning-rate schedules: lr_schedule="constant" (default) or "warmup_cosine"
  • DRO-style robust scalar leaves via leaf_solver="dro" (closed-form gradient-uncertainty penalty)
  • Piecewise-linear leaves via leaf_model="linear" (closed-form ridge solve, faster convergence on linear-trend data)
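
For example, the early-stopping attributes listed above can be read back after fitting with an eval_set (names as in the regression quick example):

model = GBMRegressor(n_estimators=2000, seed=7)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))

print(model.best_iteration_)   # iteration achieving the best validation score
print(model.best_score_)       # best validation metric value
print(model.evals_result_)     # per-iteration metric history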

Inference and Explanations

  • Zero-copy numpy prediction from native artifacts
  • TreeSHAP explanations via shap_values(...) (polynomial-time, no feature limit)
  • Global feature importance via feature_importances(...)
  • Artifact-backed prediction via predict_from_artifact(...)

Validation Helpers

  • purged_time_series_splits(...) -- leakage-aware time-series cross-validation
  • purged_panel_splits(...) -- panel-data cross-validation

Metrics

  • Regression: rmse, mae, r2_score
  • Classification: accuracy, log_loss
  • Ranking: ndcg
  • Finance: pearson_correlation, rank_ic, hit_rate, icir

Benchmark Snapshot

The benchmark suite compares AlloyGBM against XGBoost, LightGBM, and CatBoost across regression, classification, and ranking tasks.

Regression:

  • AlloyGBM is strongest on panel_time_series
  • AlloyGBM is strong on dow_jones_financial
  • AlloyGBM is competitive on dense_numeric; it trails on california_housing and bike_sharing

Classification:

  • AlloyGBM is competitive with established libraries on breast_cancer and synthetic_classification

Ranking:

  • AlloyGBM competes on synthetic_ranking using its native LambdaMART implementation

Benchmark tooling and methodology live in benchmarks/README.md.

Current Limitations

  • CPU-only runtime (GPU backend is architecturally planned but not implemented)
  • No interaction constraints
  • No dart/goss boosting modes
  • SHAP not yet supported with leaf_model="linear" (use "constant" for now)
  • leaf_solver="dro" is a robust scalar leaf update, not a full raw-distribution Wasserstein DRO guarantee

License

MIT. See LICENSE.
