Skip to main content

A Sci-Kit Learn compatible Numba and CUDA-accelerated implementation of various feature selection algorithms.

Project description

Fast-Select: Accelerated Feature Selection for Modern Datasets

PyPI version Build Status Python Versions License Code style: black Code style: ruff DOI

A high-performance Python library powered by Numba and CUDA, offering accelerated algorithms for feature selection. Initially built to optimize the complete Relief family of algorithms, fast-select aims to expand and accelerate a wide range of feature selection methods to empower machine learning on large-scale datasets.

Key Features

  • Blazing Fast Performance: Leverages Numba for JIT compilation, Joblib for multi-core parallelism, and Numba CUDA for GPU acceleration, providing unmatched performance while scaling with modern hardware.

  • ML Pipeline Integration: Fully compatible with Scikit-Learn, making it easy to fit into any machine learning pipeline with a familiar .fit(), .transform(), .fit_transform() interface.

  • Flexible Backends: Offers dual execution modes for both CPU (Joblib) and GPU (CUDA). Automatically detects hardware with an easy-to-use backend parameter.

  • Feature-Rich Implementation: Provides lightning-fast implementations of ReliefF, SURF, SURF*, MultiSURF, MultiSURF*, and TuRF—with plans to support additional feature selection algorithms in future releases.

  • Lightweight & Simple: Avoids heavy dependencies like TensorFlow or PyTorch while delivering state-of-the-art acceleration for feature selection workflows.

Table of Contents

  1. Installation
  2. Quickstart
  3. Backend Selection
  4. Benchmarking Highlights
  5. Algorithm Implementations
  6. Future Directions
  7. Contributing
  8. License
  9. How to Cite
  10. Acknowledgments

Installation

Install fast-select directly from PyPI:

pip install fast-select

For development versions (with testing and documentation dependencies):

git clone https://github.com/GavinLynch04/FastSelect.git
cd fast-select
pip install -e .[dev]

Quickstart

Using fast-select is simple and seamless for anyone familiar with Scikit-Learn.

from fast_select.estimators import MultiSURF
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression  # Example classifier

# 1. Generate a synthetic dataset
X, y = make_classification(
    n_samples=500, 
    n_features=1000, 
    n_informative=20, 
    n_redundant=100, 
    random_state=42
)

# 2. Use the MultiSURF estimator to select the top 15 features
selector = MultiSURF(n_features_to_select=15)
X_selected = selector.fit_transform(X, y)
print(f"Original feature count: {X.shape[1]}")
print(f"Selected feature count: {X_selected.shape[1]}")
print(f"Top 15 feature indices: {selector.top_features_}")

# 3. Integrate into a Scikit-Learn Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selector', MultiSURF(n_features_to_select=10, backend='cpu')),
    ('classifier', LogisticRegression())
])

# Fit the pipeline (now featuring fast feature selection!)
# pipeline.fit(X, y)

Backend Selection (CPU vs. GPU)

You can control the computational backend with the backend parameter during initialization:

  • backend='auto': Automatically detects if an NVIDIA GPU is available. Falls back to CPU if a GPU is not available.

  • backend='gpu': Explicitly runs on GPU. Will raise a RuntimeError if no compatible GPU is found.

  • backend='cpu': Forces CPU computations, even if a GPU is available.

Example usage:

# Force CPU usage
cpu_selector = MultiSURF(n_features_to_select=10, backend='cpu')

# Force GPU usage
gpu_selector = MultiSURF(n_features_to_select=10, backend='gpu')

Benchmarking Highlights

Fast-Select delivers groundbreaking improvements in runtime and memory efficiency. Benchmarks show 50-100x speed-ups compared to scikit-rebate and R's CORElearn library, particularly on large datasets exceeding 10,000 samples and features. Benchmarking scripts are available in the repository for further testing.

Runtime vs. Number of Samples (n >> p)

Runtime Benchmark N-Dominant

Runtime vs. Number of Features (p >> n)

Memory Benchmark P-Dominant


Algorithm Implementations

Currently supported:

  • Relief-Family Algorithms:
    • ReliefF
    • SURF
    • SURF*
    • MultiSURF
    • MultiSURF*
    • TuRF

Future plans include additional feature selection algorithms, such as wrappers, embedded methods, and more filter-based approaches.

Contributing

Contributions are highly encouraged. Whether you're fixing bugs, improving performance, or proposing new algorithms, your work is invaluable. Please ensure your submissions include relevant test cases and documentation updates.

License

This project is licensed under the MIT License. See the LICENSE file for full details.

How to Cite

If you use fast-select in your research, please cite the software:

** Citing the Software (Specific Version):**

Use the version-specific DOI provided via Zenodo on the GitHub Release Page.

Acknowledgments

This library builds on the exceptional work of the following:

  • The Numba team for enabling JIT compilation and GPU acceleration.
  • The scikit-rebate authors for their inspiring Relief-based library.
  • The original researchers behind the Relief algorithms for their foundational contributions to feature selection.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fast_select-0.1.3.tar.gz (22.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fast_select-0.1.3-py3-none-any.whl (20.5 kB view details)

Uploaded Python 3

File details

Details for the file fast_select-0.1.3.tar.gz.

File metadata

  • Download URL: fast_select-0.1.3.tar.gz
  • Upload date:
  • Size: 22.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for fast_select-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d5851209a76fd7336b3fa54661b8083c655e5a0b3b7f2fa563afdd4f3616f132
MD5 fed91b51c460a499b509f2d3d26820d7
BLAKE2b-256 b0b619a596a0df203a8e6fa8295c0c8b747662370b88d51eb49535d3e946b7fa

See more details on using hashes here.

File details

Details for the file fast_select-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: fast_select-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for fast_select-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 40166475005dd1d37fe983a8c250aa5f13949c9e318b407a0fea80bb6ff95022
MD5 c5d4540a292f1419b286a4438b4e55ee
BLAKE2b-256 b9c30c7c8a47e9851e1bde4d38224086e38fe24cec5452355b2331b4cf16fd6d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page