Modern BorutaShap - Feature selection with SHAP values, NumPy 2.0+ compatible

borutashap-modern

PyPI version | Python 3.12+ | License: MIT

A modernized fork of BorutaShap that works with NumPy 2.0+ and current versions of SciPy and scikit-learn. It adds performance improvements and bug fixes for SHAP-based feature selection.

Installation

# Install from PyPI (recommended)
pip install borutashap-modern

# With LightGBM support (recommended for speed)
# Quote the extras so shells such as zsh do not expand the brackets
pip install "borutashap-modern[lightgbm]"

# With all optional dependencies
pip install "borutashap-modern[all]"

Key Improvements

Compatibility Fixes

  • NumPy 2.0+ support: Replaced np.NaN (removed in NumPy 2.0) with np.nan
  • SciPy 1.11+ support: Replaced binom_test with binomtest while keeping backward compatibility with older SciPy releases
  • Python 3.12+ support: Requires Python 3.12 or higher
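
The two compatibility fixes above can be pictured with a minimal sketch. This is not the fork's actual code: the helper name binomial_p_value is hypothetical, and the one-sided "greater" alternative is an assumption made for illustration.

```python
import numpy as np

try:
    # Newer SciPy: binomtest returns a result object with a .pvalue attribute.
    from scipy.stats import binomtest

    def binomial_p_value(k, n, p=0.5, alternative="greater"):
        """Hypothetical helper: p-value for k successes in n trials."""
        return float(binomtest(k, n, p, alternative=alternative).pvalue)

except ImportError:
    # Older SciPy: binom_test returns the p-value directly.
    from scipy.stats import binom_test

    def binomial_p_value(k, n, p=0.5, alternative="greater"):
        """Hypothetical helper: p-value for k successes in n trials."""
        return float(binom_test(k, n, p, alternative=alternative))

# np.NaN no longer exists in NumPy 2.0; the lowercase spelling works everywhere.
missing = np.nan

# A feature that beat the shadow features in 10 out of 10 trials.
print(binomial_p_value(10, 10))
```

Wrapping both SciPy functions behind one helper keeps the calling code identical regardless of which SciPy release is installed.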

Bug Fixes

  • RandomForest + SHAP: Fixed 3D array handling and indexing issues
  • RandomForest + Gini: Fixed premature feature_importances_ check
  • Missing imports: Added required imports (inspect, defaultdict)
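
To illustrate the kind of 3D array handling the RandomForest + SHAP fix refers to, here is a hedged sketch. It assumes that SHAP output for a classifier can arrive either as a list of per-class 2D arrays or as a single (n_samples, n_features, n_classes) array, depending on the SHAP version. The function name mean_abs_shap and the choice to sum absolute values across classes are assumptions, not the fork's implementation.

```python
import numpy as np

def mean_abs_shap(shap_values):
    """Collapse SHAP output to one importance score per feature (sketch)."""
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class.
        stacked = np.stack([np.abs(v) for v in shap_values], axis=0)
        return stacked.sum(axis=0).mean(axis=0)

    values = np.abs(np.asarray(shap_values))
    if values.ndim == 3:
        # 3D layout: (n_samples, n_features, n_classes).
        return values.sum(axis=2).mean(axis=0)

    # 2D layout: (n_samples, n_features), e.g. regression.
    return values.mean(axis=0)

# Both layouts of the same synthetic data give the same per-feature scores.
rng = np.random.default_rng(0)
class_a = rng.normal(size=(4, 3))
class_b = rng.normal(size=(4, 3))
from_list = mean_abs_shap([class_a, class_b])
from_3d = mean_abs_shap(np.stack([class_a, class_b], axis=2))
print(np.allclose(from_list, from_3d))
```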

Performance Insights

Based on extensive benchmarking:

  • LightGBM: Best overall performer (0.6s avg SHAP time, F1=0.875)
  • XGBoost: Good balance (1.6s avg SHAP time, F1=0.868)
  • RandomForest: Best F1 on small datasets (F1=0.935 @ 1k samples)
  • GradientBoosting: Highest accuracy but slow (13s avg SHAP time)

Requirements

  • Python 3.12+
  • NumPy 2.0+
  • pandas 2.0+
  • scikit-learn 1.3+
  • SHAP 0.45+
  • LightGBM 4.0+ (optional, recommended)
  • XGBoost 2.0+ (optional)

Quick Start

from BorutaShap import BorutaShap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
import pandas as pd

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])

# Initialize with LightGBM (recommended for speed)
model = LGBMClassifier(n_estimators=50, max_depth=5, verbose=-1)

# Run BorutaShap
fs = BorutaShap(
    model=model,
    importance_measure='shap',  # or 'gini' for tree-based models
    classification=True
)

fs.fit(X=X, y=y, n_trials=100, random_state=42)

# Get results
print(f"Accepted features: {fs.accepted}")
print(f"Rejected features: {fs.rejected}")
print(f"Tentative features: {fs.tentative}")
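
Once the accepted, rejected, and tentative lists are available, a common next step is to restrict the DataFrame to the selected columns. The sketch below uses only pandas; keep_selected is a hypothetical helper that is not part of the package, and the demo frame stands in for real data.

```python
import pandas as pd

def keep_selected(X, accepted, tentative=(), include_tentative=False):
    """Return X restricted to the selected columns, in X's original column order."""
    wanted = set(accepted)
    if include_tentative:
        wanted |= set(tentative)
    return X.loc[:, [col for col in X.columns if col in wanted]]

# Small stand-in frame; with the Quick Start objects this would be
# keep_selected(X, fs.accepted, fs.tentative).
demo = pd.DataFrame({"feature_0": [1, 2], "feature_1": [3, 4], "feature_2": [5, 6]})
reduced = keep_selected(demo, accepted=["feature_2", "feature_0"], tentative=["feature_1"])
print(list(reduced.columns))  # ['feature_0', 'feature_2']
```

Passing include_tentative=True keeps undecided features as well, which can be useful when a later modeling step is expected to do its own pruning.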

Performance Recommendations

Model Selection Guide

Use Case                    Recommended Model   F1 Score   SHAP Speed
Small data (<5k samples)    RandomForest        0.935      0.15s
Medium data (5-50k)         LightGBM            0.90       0.5-2s
Large data (>50k)           LightGBM            0.89       2-5s
Best accuracy               GradientBoosting    0.91       10-50s
Production/speed critical   LightGBM            0.88       <2s
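
The guide above can be encoded as a small lookup, for example to choose a default model in a pipeline. This is only a sketch of the table: recommend_model and its priority parameter are hypothetical names, and the 5,000-sample threshold is taken directly from the first row.

```python
def recommend_model(n_samples, priority="balanced"):
    """Map the Model Selection Guide to a model family name (sketch).

    priority is one of "balanced", "speed", or "accuracy" (hypothetical options).
    """
    if priority == "accuracy":
        return "GradientBoosting"
    if priority == "speed":
        return "LightGBM"
    if priority != "balanced":
        raise ValueError(f"unknown priority: {priority!r}")
    # Balanced case: RandomForest for small data, LightGBM otherwise.
    return "RandomForest" if n_samples < 5_000 else "LightGBM"

print(recommend_model(1_000))                       # RandomForest
print(recommend_model(20_000))                      # LightGBM
print(recommend_model(1_000, priority="accuracy"))  # GradientBoosting
```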

Dataset Size Impact

  • Samples: More samples → better F1 (all models improve 5-9%)
  • Features: More features → worse F1 (especially RandomForest: -15% from 10→200 features)
  • Sweet spot: 5-10k samples with ≤50 features

Feature Importance Methods

  • SHAP: More accurate but ~11x slower than Gini
  • Gini: Fast but only for tree-based models (not XGBoost)
  • Recommendation: Use SHAP for final models, Gini for exploration

Supported Models

Fully Supported:

  • LightGBM (fastest SHAP)
  • XGBoost (SHAP only)
  • RandomForest (both SHAP and Gini)
  • ExtraTrees (both SHAP and Gini)
  • GradientBoosting (both SHAP and Gini)

Not Supported:

  • BaggingClassifier (SHAP TreeExplainer incompatible)
  • SVM, Neural Networks (no tree structure)
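
The support lists above can be checked before starting a long selection run. The sketch below is a plain lookup table built from those lists. SUPPORTED_MEASURES and check_measure are hypothetical names, and LightGBM is listed with SHAP only because the list above does not say whether Gini is supported for it.

```python
# Importance measures per model family, transcribed from the lists above.
SUPPORTED_MEASURES = {
    "LightGBM": {"shap"},  # only SHAP is confirmed above (assumption: Gini unknown)
    "XGBoost": {"shap"},
    "RandomForest": {"shap", "gini"},
    "ExtraTrees": {"shap", "gini"},
    "GradientBoosting": {"shap", "gini"},
}

def check_measure(model_family, measure):
    """Raise ValueError if the model/measure pair is not in the table."""
    allowed = SUPPORTED_MEASURES.get(model_family)
    if allowed is None:
        raise ValueError(f"{model_family} is not in the supported list")
    if measure not in allowed:
        options = ", ".join(sorted(allowed))
        raise ValueError(f"{model_family} supports: {options}; got {measure!r}")

check_measure("RandomForest", "gini")  # passes silently
try:
    check_measure("XGBoost", "gini")
except ValueError as err:
    print(err)
```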

Testing

# Run basic test
python examples/test_basic.py

# Run performance comparison
python examples/compare_models.py

# Test with your data
python examples/test_custom.py --data your_data.csv

Changes from Original

  1. Fixed NumPy 2.0 compatibility (src/BorutaShap.py:L384-394)
  2. Fixed SciPy binomial test import (src/BorutaShap.py:L8-13)
  3. Fixed RandomForest SHAP 3D array handling (src/BorutaShap.py:L250-260)
  4. Fixed RandomForest Gini importance check (src/BorutaShap.py:L150-155)
  5. Added Python 3.12 support (setup.py)
  6. Added comprehensive benchmarks (examples/benchmark.py)

Citation

If you use this fork, please cite both the original and this fork:

# Original BorutaShap
@software{boruta_shap,
  author = {Eoghan Keany},
  title = {BorutaShap: A wrapper feature selection method using Boruta and SHAP},
  url = {https://github.com/Ekeany/Boruta-Shap},
  year = {2020}
}

# This fork
@software{boruta_shap_modern,
  author = {BlackArbsCEO},
  title = {BorutaShap Modern Fork: Compatible with NumPy 2.0+},
  url = {https://github.com/BlackArbsCEO/Boruta-Shap},
  year = {2024}
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Run tests with Python 3.12+
  4. Submit a pull request

License

MIT License (same as original)

Acknowledgments
