A Fast XGBoost Feature Selection Algorithm

BoostARoota

A fast, practical feature selection algorithm built on XGBoost — with support for other scikit-learn tree-based models too.

Boruta was a great step forward for automated feature selection with Random Forests, but it can be slow on high-dimensional data and doesn't always transfer well to boosting models or other modern algorithms. Regularized linear methods like LASSO, Ridge, and Elastic Net have the opposite problem: they work well for linear models but not so much for trees and ensembles.

BoostARoota takes the core idea from Boruta — compare real features against randomized "shadow" features — and adapts it for XGBoost. In practice this means much faster runtimes and better feature sets for gradient boosting, while keeping the API familiar if you've used scikit-learn before.

Installation

pip install boostaroota

Requires Python 3.9+, pandas, numpy, scikit-learn, and xgboost. See requirements.txt for tested version ranges.

Quick start

BoostARoota expects a pandas DataFrame with numeric columns. If you have categoricals, one-hot encode them first (e.g. with pd.get_dummies). This is important — the shadow feature logic assumes numeric input, and string columns that get expanded can blow up your feature space.

from boostaroota import BoostARoota
import pandas as pd

# X (feature DataFrame) and y (target) are assumed to be loaded already.
# One-hot encode categoricals
X = pd.get_dummies(X)

# Pick an XGBoost metric you like. For multiclass, use "mlogloss".
br = BoostARoota(metric="logloss")

br.fit(X, y)

# Selected features
print(br.keep_vars_)

# Filter down to just the useful columns
X_selected = br.transform(X)

That's the basic flow: fit, inspect keep_vars_, then transform.

A couple of gotchas I've run into:

  • If a numeric column is read in as object/string, get_dummies will explode it into lots of dummy columns. Cast to numeric first if that's not what you want.
  • For multiclass problems, BoostARoota currently only supports mlogloss as the eval metric.
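To make the first gotcha concrete, here is a small pandas-only sketch. The column names and values are made up for illustration; the point is only that casting a string-typed numeric column before get_dummies keeps it as a single column.

```python
import pandas as pd

# Toy frame (hypothetical data): "age" was read in as strings,
# "city" is a genuine categorical.
df = pd.DataFrame({
    "age": ["34", "51", "27"],
    "city": ["Oslo", "Lima", "Oslo"],
})

# Without a cast, get_dummies one-hot encodes "age" too,
# producing one dummy column per distinct value.
exploded = pd.get_dummies(df)

# Cast first. errors="coerce" turns unparseable values into NaN,
# so check for NaNs afterwards if the column may be dirty.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
encoded = pd.get_dummies(df)

print(list(exploded.columns))
print(list(encoded.columns))
```

With the cast in place, only the real categorical is expanded.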

You can see a more complete walkthrough in odsc_west/demo.py.

Using other tree models

You aren't limited to XGBoost. Any scikit-learn tree-based estimator with feature_importances_ will work, though you may need to tune cutoff, iters, etc. a bit since the defaults were chosen with XGBoost in mind.

from sklearn.ensemble import ExtraTreesClassifier
from boostaroota import BoostARoota

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
br = BoostARoota(clf=clf)

X_new = br.fit_transform(X, y)

If you pass both metric and clf, clf takes precedence and metric is ignored (you'll get a warning).
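If you're unsure whether an estimator qualifies, one quick check is to fit it on a small sample and look for the feature_importances_ attribute. The helper name exposes_importances and the toy data below are assumptions for illustration; this only tests the attribute mentioned above and says nothing about any validation BoostARoota may do internally.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset, just enough to fit the estimators.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)


def exposes_importances(estimator, X, y):
    """Fit the estimator and report whether it has feature_importances_."""
    fitted = estimator.fit(X, y)
    return hasattr(fitted, "feature_importances_")


tree_ok = exposes_importances(
    ExtraTreesClassifier(n_estimators=20, random_state=0), X_demo, y_demo
)
linear_ok = exposes_importances(LogisticRegression(), X_demo, y_demo)

print(tree_ok)    # tree ensembles expose feature_importances_
print(linear_ok)  # linear models expose coef_ instead
```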

Parameters

Defaults work well for most tabular datasets, but here's what you can tweak:

  • metric (str, default=None) – XGBoost eval metric like "logloss", "auc", "rmse", "mlogloss", etc. Required if you aren't passing your own clf. For multiclass, use "mlogloss".
  • clf (estimator, default=None) – A scikit-learn tree model. Leave as None to use XGBoost internally.
  • cutoff (float > 0, default=4) – Shadow importance is averaged and divided by this value to set the removal threshold. Higher = more conservative (fewer features removed). Lower = more aggressive.
  • iters (int > 0, default=10) – How many times to retrain per round to smooth out importance estimates. Don't use 1 — there's too much variance. Runtime scales linearly with this.
  • max_rounds (int > 0, default=100) – Hard cap on elimination rounds. The default is intentionally high; you'll rarely hit it unless the data is pathological or delta is very small.
  • delta (float, 0 < delta <= 1, default=0.1) – Minimum fraction of features that must be removed to continue to the next round. 0.1 means at least 10% need to go. Higher values stop the loop sooner; set to 1.0 to force a single round. Very small values let rounds continue after only a few removals and can over-prune.
  • silent (bool, default=False) – Suppress per-iteration progress output. Warnings and errors still show.
  • task ({"auto", "classification", "regression"}, default="auto") – How to configure XGBoost. Auto-detects based on y, but you can override.
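The way cutoff and delta interact is easier to see with numbers. The sketch below is plain arithmetic based on the descriptions above, not the library's code; the function names removal_threshold and should_continue, and the importance value 0.04, are made up for illustration.

```python
def removal_threshold(mean_shadow_importance, cutoff=4):
    """Importance a real feature must reach to survive a round."""
    return mean_shadow_importance / cutoff


def should_continue(n_before, n_after, delta=0.1):
    """True if at least a `delta` fraction of features was removed."""
    removed = n_before - n_after
    return removed >= delta * n_before


# A larger cutoff lowers the threshold, so fewer features are removed.
mean_shadow = 0.04  # hypothetical average shadow importance
for cutoff in (1, 4, 8):
    print(cutoff, removal_threshold(mean_shadow, cutoff))

# With delta=0.1, removing 15 of 100 features triggers another round;
# removing only 5 stops the loop.
print(should_continue(100, 85))
print(should_continue(100, 95))

# With delta=1.0 another round would need every feature removed,
# so in practice only a single round runs.
print(should_continue(100, 95, delta=1.0))
```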

How it works

The intuition is straightforward:

  1. Start with a one-hot encoded feature matrix.
  2. Make a copy of every column and randomly shuffle each copy. These are the "shadow" features: each has the same marginal distribution as its original but, because of the shuffle, no relationship to the target.
  3. Train XGBoost (or your chosen tree model) on the combined real + shadow matrix. Repeat iters times with different shuffles to get stable importance estimates.
  4. For each feature, average its importance across iterations. Do the same for shadows.
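  5. Compute a cutoff: mean shadow importance divided by the cutoff parameter (default 4). Dividing sets the bar below the average shadow importance, which makes it harder for a real feature to be removed: it is dropped only if it scores well under random noise.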
  6. Drop any real feature whose mean importance is below that cutoff.
  7. Repeat from step 2 with the reduced feature set until fewer than delta fraction of features are removed in a round, or max_rounds is hit.
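The steps above can be sketched in a few dozen lines. This is an illustrative reimplementation under stated assumptions, not BoostARoota's actual source: it uses scikit-learn's ExtraTreesClassifier in place of XGBoost, and the names add_shadows, run_round, and select_features are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier


def add_shadows(X, rng):
    """Step 2: append one independently shuffled copy of every column."""
    shadows = pd.DataFrame(
        {f"shadow_{col}": rng.permutation(X[col].to_numpy()) for col in X.columns},
        index=X.index,
    )
    return pd.concat([X, shadows], axis=1)


def run_round(X, y, cutoff=4, iters=10, seed=0):
    """Steps 3-6: return the names of the real features that survive."""
    rng = np.random.default_rng(seed)
    total = np.zeros(2 * X.shape[1])
    for i in range(iters):
        combined = add_shadows(X, rng)  # fresh shuffle every iteration
        model = ExtraTreesClassifier(n_estimators=50, random_state=seed + i)
        model.fit(combined, y)
        total += model.feature_importances_
    mean_imp = pd.Series(total / iters, index=combined.columns)
    real = mean_imp[X.columns]
    shadow = mean_imp.drop(X.columns)
    threshold = shadow.mean() / cutoff  # step 5
    return list(real[real >= threshold].index)


def select_features(X, y, cutoff=4, iters=10, max_rounds=100, delta=0.1):
    """Step 7: repeat rounds until fewer than `delta` of features are removed."""
    keep = list(X.columns)
    for round_no in range(max_rounds):
        new_keep = run_round(X[keep], y, cutoff=cutoff, iters=iters, seed=round_no)
        removed = len(keep) - len(new_keep)
        stop = removed < delta * len(keep) or not new_keep
        keep = new_keep
        if stop:
            break
    return keep


# Synthetic data: only f0 and f1 determine the target.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(400, 6)), columns=[f"f{i}" for i in range(6)])
y = (X["f0"] + X["f1"] > 0).astype(int)

kept = select_features(X, y, iters=3)
print(kept)
```

Because ExtraTrees assigns some importance to every column, pure-noise features often sit near the shadow average and survive the default cutoff of 4, so the exact output depends on the backend and the seed.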

What you get back is the set of features whose importance consistently clears the shadow-derived threshold, a simple but effective signal that they're actually useful to the model.

Testing and examples

Install deps and run the suite:

pip install -r requirements.txt
pytest tests/test_boostaroota.py -q
# or
make test

For a quick end-to-end check across classification, regression, and sklearn backends:

make example
# or
python examples/run_example.py

See TESTING.md for full details on what's covered.

Notes

  • Input must be a pandas DataFrame. Numpy arrays will need to be wrapped first.
  • One-hot encoding is on you — BoostARoota doesn't do it automatically so you stay in control of how categoricals are handled.
  • If you hit weird results, double-check dtypes after get_dummies and make sure the target y is in the expected format for your chosen metric.
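As a sketch of the first and third notes, this is one way to wrap a NumPy array in a DataFrame and sanity-check dtypes before fitting. The column names f0, f1, ... are arbitrary placeholders; use meaningful names if you have them.

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix held as a plain NumPy array.
arr = np.random.default_rng(0).normal(size=(5, 3))

# Wrap it in a DataFrame with explicit string column names.
X = pd.DataFrame(arr, columns=[f"f{i}" for i in range(arr.shape[1])])

# List any columns that are neither numeric nor boolean;
# these would need casting or encoding first.
non_numeric = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(X.dtypes)
print(non_numeric)
```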

License

MIT — see LICENSE.
