A robust feature selection library using ensemble mRMR and optional refinement.
pyrobustfs
A robust feature selection library for Python, leveraging ensemble Minimum Redundancy Maximum Relevance (mRMR) with an optional refinement step. Designed for seamless integration into scikit-learn pipelines.
Features
- Ensemble mRMR: Improves robustness and stability of feature selection by running mRMR on bootstrapped/subsampled data and aggregating results.
- Scikit-learn Compatibility: Implements BaseEstimator and TransformerMixin for easy integration into scikit-learn pipelines, GridSearchCV, and RandomizedSearchCV.
- Flexible Refinement: Allows for an optional second stage of model-specific feature selection using any scikit-learn compatible estimator (e.g., RFE, SelectFromModel).
- Classification and Regression Support: Handles both classification and regression tasks, using the mutual information estimator appropriate to each.
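To make the ensemble idea concrete, here is a minimal sketch of bootstrap-aggregated mRMR. It is not pyrobustfs's implementation: the function names `mrmr_once` and `ensemble_mrmr`, the "relevance minus mean redundancy" score, and the vote-count aggregation are all assumptions chosen for illustration. Only scikit-learn's `mutual_info_classif` and `mutual_info_regression` are real library calls.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def mrmr_once(X, y, k, seed):
    """Greedy mRMR on one sample (illustrative, not the library's code)."""
    # Relevance: mutual information between each feature and the class label.
    relevance = mutual_info_classif(X, y, random_state=seed)
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(X.shape[1])
    while len(selected) < k:
        last = selected[-1]
        # Redundancy: mutual information between the latest pick and every feature.
        redundancy_sum += mutual_info_regression(X, X[:, last], random_state=seed)
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never pick the same feature twice
        selected.append(int(np.argmax(score)))
    return selected


def ensemble_mrmr(X, y, k, n_ensembles=5, seed=0):
    """Run mRMR on bootstrap samples and keep the k most frequently chosen features."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1], dtype=int)
    for run in range(n_ensembles):
        # Bootstrap: draw row indices with replacement.
        rows = rng.integers(0, X.shape[0], size=X.shape[0])
        votes[mrmr_once(X[rows], y[rows], k, seed + run)] += 1
    # Stable sort on negated votes, so ties go to the lower column index.
    return np.argsort(-votes, kind="stable")[:k].tolist()


# Small synthetic problem to exercise the sketch.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
selected = ensemble_mrmr(X_demo, y_demo, k=4)
print(f"Selected column indices: {selected}")
```

Counting votes across runs is one simple aggregation rule; rank averaging would be another reasonable choice, and the library may use either or something else entirely.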
Installation
You can install pyrobustfs directly from the source code:
- Clone the repository:

      git clone https://github.com/yourusername/pyrobustfs.git
      cd pyrobustfs

- Install in editable mode (for development) or standard mode:

      # For development (changes to code are immediately reflected)
      pip install -e .
      # For standard installation
      # pip install .
Usage
Basic Feature Selection
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from pyrobustfs.selectors import RobustMRMRSelector
# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and fit the selector
# Select 5 features using 10 ensemble runs for a classification task
selector = RobustMRMRSelector(n_features_to_select=5, n_ensembles=10, classification=True, random_state=42)
selector.fit(X_train, y_train)
# Get the names of the selected features
selected_features = selector.get_feature_names_out()
print(f"Selected features: {selected_features}")
# Transform the data to keep only the selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
print(f"Original X_train shape: {X_train.shape}")
print(f"Transformed X_train shape: {X_train_selected.shape}")
Using with a Refiner Estimator
You can provide an optional refiner_estimator for a second stage of feature selection. This is useful for model-specific refinement.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE # Recursive Feature Elimination
from pyrobustfs.selectors import RobustMRMRSelector
# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define a refiner estimator (e.g., RFE with Logistic Regression)
# The refiner will operate on the features pre-selected by the ensemble mRMR.
refiner = RFE(estimator=LogisticRegression(solver='liblinear', random_state=42), n_features_to_select=3)
# Initialize RobustMRMRSelector with the refiner
selector_with_refiner = RobustMRMRSelector(
    n_features_to_select=5,  # Target for the ensemble mRMR stage; the refiner may select fewer
    n_ensembles=10,
    refiner_estimator=refiner,
    classification=True,
    random_state=42,
)
selector_with_refiner.fit(X_train, y_train)
selected_features_refiner = selector_with_refiner.get_feature_names_out()
print(f"Selected features (with refiner): {selected_features_refiner}")
print(f"Number of features selected by refiner: {len(selected_features_refiner)}")
# Transform data
X_train_refined = selector_with_refiner.transform(X_train)
print(f"Transformed X_train shape (with refiner): {X_train_refined.shape}")
Integrating into a Scikit-learn Pipeline
RobustMRMRSelector can be seamlessly integrated into a scikit-learn Pipeline.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from pyrobustfs.selectors import RobustMRMRSelector
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a pipeline with feature selection and a classifier
pipeline = Pipeline([
    ('feature_selection', RobustMRMRSelector(n_features_to_select=5, n_ensembles=10, classification=True, random_state=42)),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42)),
])
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Evaluate the pipeline
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.4f}")
# Access selected features from the pipeline step
selected_features_pipeline = pipeline.named_steps['feature_selection'].get_feature_names_out()
print(f"Selected features from pipeline: {selected_features_pipeline}")
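Because the selector is described as compatible with GridSearchCV, the number of selected features can in principle be tuned like any other hyperparameter. The sketch below shows the mechanism with a hypothetical stand-in class, `TopKMutualInfoSelector`, so that it runs without pyrobustfs installed. The stand-in is an assumption for illustration and does not implement mRMR; swapping in RobustMRMRSelector should work the same way, assuming it exposes `n_features_to_select` through `get_params` as its constructor signature suggests.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TopKMutualInfoSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: keeps the k features with the highest mutual information."""

    def __init__(self, n_features_to_select=5, random_state=None):
        # Storing constructor arguments unchanged lets get_params/set_params and clone work.
        self.n_features_to_select = n_features_to_select
        self.random_state = random_state

    def fit(self, X, y):
        scores = mutual_info_classif(X, y, random_state=self.random_state)
        top = np.argsort(scores)[::-1][: self.n_features_to_select]
        self.selected_indices_ = np.sort(top)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_indices_]


X_gs, y_gs = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

search_pipeline = Pipeline([
    ("feature_selection", TopKMutualInfoSelector(random_state=42)),
    ("classifier", LogisticRegression(solver="liblinear", random_state=42)),
])

# The "<step name>__<parameter>" syntax routes the value to the selector step.
param_grid = {"feature_selection__n_features_to_select": [3, 5, 8]}
search = GridSearchCV(search_pipeline, param_grid, cv=3)
search.fit(X_gs, y_gs)
print(f"Best parameters: {search.best_params_}")
```

Tuning the selector inside the pipeline keeps feature selection within each cross-validation fold, which avoids leaking information from the validation split into the selection step.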
Development
To contribute or run tests, clone the repository and install in editable mode:
git clone https://github.com/yourusername/pyrobustfs.git
cd pyrobustfs
pip install -e .
pip install pytest
pytest
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file pyrobustfs-0.1.1.tar.gz.
File metadata
- Download URL: pyrobustfs-0.1.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29b3155dc45857e16ecb7575e39908c577eb783b2890d0831667a9b18b0b5994 |
| MD5 | 71f69875adf51baa2cd477eba8c63906 |
| BLAKE2b-256 | 255cac3f1a1a566c9dd24d79bfc90d66fb3c55f1876d5a54161d80bc0b358210 |
File details
Details for the file pyrobustfs-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pyrobustfs-0.1.1-py3-none-any.whl
- Upload date:
- Size: 9.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd5408944b547c92b9186b4397079e22366ddae82f95915f26844bb40bdd1f98 |
| MD5 | 8e18c1ef84f0822d4639144648498e35 |
| BLAKE2b-256 | 1a9f6618865593fe0ee77d17b05cf2a970cdfea5f51eeb92b494be476c6f437f |