

pyrobustfs

A robust feature selection library for Python, leveraging ensemble Minimum Redundancy Maximum Relevance (mRMR) with an optional refinement step. Designed for seamless integration into scikit-learn pipelines.

Features

  • Ensemble mRMR: Improves robustness and stability of feature selection by running mRMR on bootstrapped/subsampled data and aggregating results.
  • Scikit-learn Compatibility: Implements BaseEstimator and TransformerMixin for easy integration into scikit-learn pipelines, GridSearchCV, and RandomizedSearchCV.
  • Flexible Refinement: Allows for an optional second stage of model-specific feature selection using any scikit-learn compatible estimator (e.g., RFE, SelectFromModel).
  • Classification and Regression Support: Handles both task types, computing relevance with mutual information for classification targets or mutual information for regression targets, as appropriate.
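The ensemble mechanism can be sketched with plain NumPy and scikit-learn. Note this is a conceptual illustration under simplified assumptions (absolute Pearson correlation as the redundancy term, simple vote counting as the aggregation), not pyrobustfs's actual implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def mrmr_rank(X, y, k, random_state=0):
    """One greedy mRMR pass: repeatedly pick the feature maximizing
    relevance (MI with y) minus redundancy (mean |corr| with picks so far)."""
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

def ensemble_mrmr(X, y, k=5, n_ensembles=10, random_state=42):
    """Run mRMR on bootstrap resamples and keep the k most-voted features."""
    rng = np.random.default_rng(random_state)
    votes = np.zeros(X.shape[1])
    for i in range(n_ensembles):
        idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample
        for j in mrmr_rank(X[idx], y[idx], k, random_state=random_state + i):
            votes[j] += 1
    return sorted(np.argsort(votes)[::-1][:k].tolist())

X, y = make_classification(n_samples=200, n_features=15, n_informative=5,
                           random_state=42)
stable = ensemble_mrmr(X, y)
print(f"Most stable features: {stable}")
```

Because each bootstrap run sees slightly different data, features that survive across runs are less sensitive to sampling noise than a single mRMR pass.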

Installation

You can install pyrobustfs directly from the source code:

  1. Clone the repository:

    git clone https://github.com/yourusername/pyrobustfs.git
    cd pyrobustfs
    
  2. Install in editable mode (for development) or standard mode:

    # For development (changes to code are immediately reflected)
    pip install -e .
    
    # For standard installation
    # pip install .
    

Usage

Basic Feature Selection

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from pyrobustfs.selectors import RobustMRMRSelector

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and fit the selector
# Select 5 features using 10 ensemble runs for a classification task
selector = RobustMRMRSelector(n_features_to_select=5, n_ensembles=10, classification=True, random_state=42)
selector.fit(X_train, y_train)

# Get the names of the selected features
selected_features = selector.get_feature_names_out()
print(f"Selected features: {selected_features}")

# Transform the data to keep only the selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

print(f"Original X_train shape: {X_train.shape}")
print(f"Transformed X_train shape: {X_train_selected.shape}")

Using with a Refiner Estimator

You can provide an optional refiner_estimator for a second stage of feature selection. This is useful for model-specific refinement.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE # Recursive Feature Elimination
from pyrobustfs.selectors import RobustMRMRSelector

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define a refiner estimator (e.g., RFE with Logistic Regression)
# The refiner will operate on the features pre-selected by the ensemble mRMR.
refiner = RFE(estimator=LogisticRegression(solver='liblinear', random_state=42), n_features_to_select=3)

# Initialize RobustMRMRSelector with the refiner
selector_with_refiner = RobustMRMRSelector(
    n_features_to_select=5, # Pre-selection target for ensemble mRMR; the refiner narrows this further (to 3 here)
    n_ensembles=10,
    refiner_estimator=refiner,
    classification=True,
    random_state=42
)

selector_with_refiner.fit(X_train, y_train)
selected_features_refiner = selector_with_refiner.get_feature_names_out()
print(f"Selected features (with refiner): {selected_features_refiner}")
print(f"Number of features selected by refiner: {len(selected_features_refiner)}")

# Transform data
X_train_refined = selector_with_refiner.transform(X_train)
print(f"Transformed X_train shape (with refiner): {X_train_refined.shape}")
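The filter-then-refine design behind the refiner can also be expressed with plain scikit-learn pieces. The sketch below uses SelectKBest as a stand-in for the ensemble mRMR pre-selection so it runs without pyrobustfs installed; the point is that the wrapper method (RFE) only ever sees the pre-filtered features, which keeps the expensive model-based stage cheap:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=0)

two_stage = Pipeline([
    # Stage 1: fast filter pre-selection (stand-in for ensemble mRMR).
    ('prefilter', SelectKBest(score_func=mutual_info_classif, k=10)),
    # Stage 2: model-specific refinement on the 10 surviving features only.
    ('refine', RFE(LogisticRegression(solver='liblinear', random_state=0),
                   n_features_to_select=3)),
])

X_refined = two_stage.fit_transform(X, y)
print(X_refined.shape)  # (300, 3)
```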

Integrating into a Scikit-learn Pipeline

RobustMRMRSelector can be seamlessly integrated into a scikit-learn Pipeline.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from pyrobustfs.selectors import RobustMRMRSelector

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline with feature selection and a classifier
pipeline = Pipeline([
    ('feature_selection', RobustMRMRSelector(n_features_to_select=5, n_ensembles=10, classification=True, random_state=42)),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the pipeline
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.4f}")

# Access selected features from the pipeline step
selected_features_pipeline = pipeline.named_steps['feature_selection'].get_feature_names_out()
print(f"Selected features from pipeline: {selected_features_pipeline}")
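Because the selector exposes its constructor arguments in scikit-learn style, they can be tuned with GridSearchCV using the usual step__parameter naming (e.g. feature_selection__n_features_to_select). The sketch below demonstrates that pattern with scikit-learn's SelectKBest standing in for RobustMRMRSelector, so the snippet runs without pyrobustfs installed:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

pipe = Pipeline([
    # Swap in RobustMRMRSelector(...) here; the tuning pattern is identical.
    ('feature_selection', SelectKBest(score_func=mutual_info_classif)),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'feature_selection__k': [5, 10, 15],
    'classifier__C': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Cross-validating the number of selected features this way avoids committing to an arbitrary cutoff up front.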

Development

To contribute or run tests, clone the repository and install in editable mode:

git clone https://github.com/yourusername/pyrobustfs.git
cd pyrobustfs
pip install -e .
pip install pytest
pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

pyrobustfs-0.1.0.tar.gz (11.9 kB)

Built Distribution

pyrobustfs-0.1.0-py3-none-any.whl (9.3 kB)

File details

Details for the file pyrobustfs-0.1.0.tar.gz.

File metadata

  • Download URL: pyrobustfs-0.1.0.tar.gz
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.5

File hashes

  • SHA256: 218a862cf0d9476c4a024b16f28a951c10c5d8617c2bad49b96236d812407fc5
  • MD5: 73dc6b5c627e317a37d9665c2079b02c
  • BLAKE2b-256: 18e754fb261e7f294cb798a5ed60e851d034830b5d54754d371c8ae2efbc4ef1

File details

Details for the file pyrobustfs-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pyrobustfs-0.1.0-py3-none-any.whl
  • Size: 9.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.5

File hashes

  • SHA256: 3f9e5f6b56c6146557b154dc442ebe1303f14579ef473c3720a8c5857cb51872
  • MD5: 6d6fc6a3a93712d3e29765c29e4a8e07
  • BLAKE2b-256: ed0a4d9fab688ae81b105c2f5301064667bfb4adbe59dc27efe7912d5552b38d
