
A robust feature selection library using ensemble mRMR and optional refinement.


pyrobustfs

A robust feature selection library for Python, leveraging ensemble Minimum Redundancy Maximum Relevance (mRMR) with an optional refinement step. Designed for seamless integration into scikit-learn pipelines.

Features

  • Ensemble mRMR: Improves robustness and stability of feature selection by running mRMR on bootstrapped/subsampled data and aggregating results.
  • Scikit-learn Compatibility: Implements BaseEstimator and TransformerMixin for easy integration into scikit-learn pipelines, GridSearchCV, and RandomizedSearchCV.
  • Flexible Refinement: Allows for an optional second stage of model-specific feature selection using any scikit-learn compatible estimator (e.g., RFE, SelectFromModel).
  • Classification and Regression Support: Handles both classification and regression tasks, estimating feature relevance with the mutual information measure appropriate to each target type.
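The ensemble idea can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the greedy mRMR scoring and the bootstrap voting below are assumptions built from scikit-learn's mutual information utilities, shown only to make the "run mRMR on resamples, then aggregate" idea concrete.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=42)

def mrmr(X, y, k):
    """Greedy mRMR: pick features maximizing relevance to y minus mean redundancy
    with the features already selected."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # Redundancy: mean MI between candidate j and already-selected features
            redundancy = np.mean([
                mutual_info_regression(X[:, [s]], X[:, j], random_state=0)[0]
                for s in selected
            ])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Ensemble step: run mRMR on bootstrap resamples and keep the features
# that are chosen most often across runs.
votes = Counter()
for _ in range(5):
    idx = rng.integers(0, len(X), len(X))
    votes.update(mrmr(X[idx], y[idx], k=4))
final = [f for f, _ in votes.most_common(4)]
print(final)
```

Voting across resamples is what makes the selection stable: a feature that wins only on one particular sample split is unlikely to survive the aggregation.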

Installation

pyrobustfs is available on PyPI (pip install pyrobustfs). You can also install it directly from the source code:

  1. Clone the repository:

    git clone https://github.com/yourusername/pyrobustfs.git
    cd pyrobustfs
    
  2. Install in editable mode (for development) or standard mode:

    # For development (changes to code are immediately reflected)
    pip install -e .
    
    # For standard installation
    # pip install .
    

Usage

Basic Feature Selection

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from pyrobustfs.selectors import RobustMRMRSelector

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and fit the selector
# Select 5 features using 10 ensemble runs for a classification task
selector = RobustMRMRSelector(n_features_to_select=5, n_ensembles=10, classification=True, random_state=42)
selector.fit(X_train, y_train)

# Get the names of the selected features
selected_features = selector.get_feature_names_out()
print(f"Selected features: {selected_features}")

# Transform the data to keep only the selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

print(f"Original X_train shape: {X_train.shape}")
print(f"Transformed X_train shape: {X_train_selected.shape}")
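For regression targets, the same workflow presumably applies with classification=False, and the underlying relevance measure becomes mutual information for regression. A standalone sketch of that relevance scoring, using only scikit-learn (the informative-feature count and sizes here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

# Synthetic regression data: 10 features, 3 of which drive the target
X, y = make_regression(n_samples=500, n_features=10, n_informative=3, random_state=0)

# Mutual-information relevance of each feature to a continuous target
relevance = mutual_info_regression(X, y, random_state=0)
top3 = np.argsort(relevance)[::-1][:3]
print(sorted(top3.tolist()))
```

The redundancy term and the ensemble aggregation would work exactly as in the classification case; only the relevance estimator changes.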

Using with a Refiner Estimator

You can provide an optional refiner_estimator for a second stage of feature selection. This is useful for model-specific refinement.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE # Recursive Feature Elimination
from pyrobustfs.selectors import RobustMRMRSelector

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define a refiner estimator (e.g., RFE with Logistic Regression)
# The refiner will operate on the features pre-selected by the ensemble mRMR.
refiner = RFE(estimator=LogisticRegression(solver='liblinear', random_state=42), n_features_to_select=3)

# Initialize RobustMRMRSelector with the refiner
selector_with_refiner = RobustMRMRSelector(
    n_features_to_select=5, # target for the ensemble mRMR stage; the refiner then narrows to its own n_features_to_select
    n_ensembles=10,
    refiner_estimator=refiner,
    classification=True,
    random_state=42
)

selector_with_refiner.fit(X_train, y_train)
selected_features_refiner = selector_with_refiner.get_feature_names_out()
print(f"Selected features (with refiner): {selected_features_refiner}")
print(f"Number of features selected by refiner: {len(selected_features_refiner)}")

# Transform data
X_train_refined = selector_with_refiner.transform(X_train)
print(f"Transformed X_train shape (with refiner): {X_train_refined.shape}")

Integrating into a Scikit-learn Pipeline

RobustMRMRSelector can be seamlessly integrated into a scikit-learn Pipeline.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from pyrobustfs.selectors import RobustMRMRSelector

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline with feature selection and a classifier
pipeline = Pipeline([
    ('feature_selection', RobustMRMRSelector(n_features_to_select=5, n_ensembles=10, classification=True, random_state=42)),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the pipeline
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.4f}")

# Access selected features from the pipeline step
selected_features_pipeline = pipeline.named_steps['feature_selection'].get_feature_names_out()
print(f"Selected features from pipeline: {selected_features_pipeline}")
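Because the selector follows the scikit-learn estimator API, pipeline hyperparameters can also be tuned with GridSearchCV using the step__parameter naming convention; with RobustMRMRSelector the key would presumably be feature_selection__n_features_to_select. The pattern is sketched below with scikit-learn's SelectKBest standing in for the selector, so the example runs without pyrobustfs installed:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)

pipeline = Pipeline([
    ('feature_selection', SelectKBest(mutual_info_classif)),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42)),
])

# '<step name>__<parameter>' addresses a step's hyperparameter inside the grid;
# for RobustMRMRSelector this would be 'feature_selection__n_features_to_select'.
grid = GridSearchCV(pipeline, {'feature_selection__k': [3, 5, 10]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Cross-validating the number of selected features this way avoids committing to an arbitrary choice up front.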

Development

To contribute or run tests, clone the repository and install in editable mode:

git clone https://github.com/yourusername/pyrobustfs.git
cd pyrobustfs
pip install -e .
pip install pytest
pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.
