Selective: Feature Selection Library

Selective is a white-box feature selection library that supports supervised and unsupervised selection methods for classification and regression tasks.

The library provides:

  • Simple to complex selection methods: Variance, Correlation, Statistical, Linear, Tree-based, or Customized.
  • Text-based selection to maximize diversity in text embeddings and metadata coverage.
  • Interoperability with data frames as the input.
  • Automated task detection. No need to know what feature selection method works with what machine learning task.
  • Benchmarking multiple selectors using cross-validation with built-in parallelization.
  • Inspection of the results and feature importance.

Selective also provides optimized item selection based on diversity of text embeddings via TextWiser and coverage of binary labels via multi-objective optimization (AMAI'24, CPAIOR'21, DSO@IJCAI'22). This approach speeds up online experimentation and boosts recommender systems significantly, as presented at NVIDIA GTC'22.

Selective is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments.

Quick Start

# Import Selective and SelectionMethod
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import Selective, SelectionMethod

# Data
data, label = get_data_label(fetch_california_housing())

# Feature selectors from simple to more complex
# (each line is an alternative; the last assignment is the one used below)
selector = Selective(SelectionMethod.Variance(threshold=0.0))
selector = Selective(SelectionMethod.Correlation(threshold=0.5, method="pearson"))
selector = Selective(SelectionMethod.Statistical(num_features=3, method="anova"))
selector = Selective(SelectionMethod.Linear(num_features=3, regularization="none"))
selector = Selective(SelectionMethod.TreeBased(num_features=3))

# Feature reduction
subset = selector.fit_transform(data, label)
print("Reduction:", list(subset.columns))
print("Scores:", list(selector.get_absolute_scores()))

Available Methods

Method                          Options
------------------------------  ------------------------------------------------------------
Variance per Feature            threshold
Correlation pairwise Features   Pearson Correlation Coefficient
                                Kendall Rank Correlation Coefficient
                                Spearman's Rank Correlation Coefficient
Statistical Analysis            ANOVA F-test Classification
                                F-value Regression
                                Chi-Square
                                KL Divergence
                                Mutual Information Classification
                                Variance Inflation Factor
Linear Methods                  Linear Regression
                                Logistic Regression
                                Lasso Regularization
                                Ridge Regularization
Tree-based Methods              Decision Tree
                                Random Forest
                                Extra Trees Classifier
                                XGBoost
                                LightGBM
                                AdaBoost
                                CatBoost
                                Gradient Boosting Tree
Text-based Methods              featurization_method = TextWiser
                                optimization_method = ["exact", "greedy", "kmeans", "random"]
                                cost_metric = ["unicost", "diverse"]
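
To give a feel for what the two simplest rows in the list of methods do, here is a minimal standard-library sketch of variance and correlation filtering. It is an illustration only, not Selective's implementation: the helper names (`pearson`, `variance_filter`, `correlation_filter`), the toy data, and the rule of keeping the earlier column of a correlated pair are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def variance_filter(columns, threshold=0.0):
    """Keep the columns whose population variance is strictly above threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]


def correlation_filter(columns, threshold=0.5):
    """Walk the columns in order and keep one only if its absolute Pearson
    correlation with every already-kept column is at most threshold.
    (Assumed tie-breaking rule: the earlier column of a pair survives.)"""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: column name -> values
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],    # almost a multiple of "a"
    "c": [5.0, 5.0, 5.0, 5.0],    # constant, zero variance
    "d": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with "a"
}

# Drop constant columns first so the correlation is always defined
non_constant = variance_filter(toy, threshold=0.0)
survivors = correlation_filter({n: toy[n] for n in non_constant}, threshold=0.5)
print("After variance filter:", non_constant)
print("After correlation filter:", survivors)
```

Running the variance filter first matters in this sketch because the Pearson coefficient is undefined for a constant column.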

Benchmarking

# Imports
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from xgboost import XGBClassifier, XGBRegressor
from feature.selector import SelectionMethod, benchmark, calculate_statistics

# Data
data, label = get_data_label(fetch_california_housing())

# Selectors
corr_threshold = 0.5
num_features = 3
tree_params = {"n_estimators": 50, "max_depth": 5, "random_state": 111, "n_jobs": 4}
selectors = {

  # Correlation methods
  "corr_pearson": SelectionMethod.Correlation(corr_threshold, method="pearson"),
  "corr_kendall": SelectionMethod.Correlation(corr_threshold, method="kendall"),
  "corr_spearman": SelectionMethod.Correlation(corr_threshold, method="spearman"),
  
  # Statistical methods
  "stat_anova": SelectionMethod.Statistical(num_features, method="anova"),
  "stat_chi_square": SelectionMethod.Statistical(num_features, method="chi_square"),
  "stat_kl_divergence": SelectionMethod.Statistical(num_features, method="kl_divergence"),
  "stat_mutual_info": SelectionMethod.Statistical(num_features, method="mutual_info"),
  
  # Linear methods
  "linear": SelectionMethod.Linear(num_features, regularization="none"),
  "lasso": SelectionMethod.Linear(num_features, regularization="lasso", alpha=1000),
  "ridge": SelectionMethod.Linear(num_features, regularization="ridge", alpha=1000),
  
  # Non-linear tree-based methods
  "random_forest": SelectionMethod.TreeBased(num_features),
  "xgboost_classif": SelectionMethod.TreeBased(num_features, estimator=XGBClassifier(**tree_params)),
  "xgboost_regress": SelectionMethod.TreeBased(num_features, estimator=XGBRegressor(**tree_params))
}

# Benchmark (sequential)
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Benchmark (in parallel)
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5, n_jobs=4)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Get benchmark statistics by feature
stats_df = calculate_statistics(score_df, selected_df)
print(stats_df)

Text-based Selection

This example shows how to use text-based selection. In this scenario, we would like to select a subset of articles that is most diverse in the text embedding space and covers a range of topics.

# Import Selective and TextWiser
import pandas as pd
from feature.selector import Selective, SelectionMethod
from textwiser import TextWiser, Embedding, Transformation

# Data with the text content of each article
data = pd.DataFrame({"article_1": ["article text here"],
                     "article_2": ["article text here"],
                     "article_3": ["article text here"],
                     "article_4": ["article text here"],
                     "article_5": ["article text here"]})

# Labels to denote 0/1 coverage metadata for each article 
# across four labels, e.g., sports, international, entertainment, science    
labels = pd.DataFrame({"article_1": [1, 1, 0, 1],
                       "article_2": [0, 1, 0, 0],
                       "article_3": [0, 0, 1, 0],
                       "article_4": [0, 0, 1, 1],
                       "article_5": [1, 1, 1, 0]},
                      index=["label_1", "label_2", "label_3", "label_4"])

# TextWiser featurization method to create text embeddings
textwiser = TextWiser(Embedding.TfIdf(), Transformation.NMF(n_components=20))

# Text-based selection
# The goal is to select a subset of articles 
# that is most diverse in the text embedding space of articles
# and covers the most labels in each topic
selector = Selective(SelectionMethod.TextBased(num_features=2, featurization_method=textwiser))

# Feature reduction
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))
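
The coverage half of this objective can be illustrated with a small greedy heuristic over the same 0/1 labels. This is a hedged sketch, not the optimization Selective performs: it ignores embedding diversity entirely, and the function name `greedy_max_coverage` and its tie-breaking rule (earliest article wins) are assumptions made for the example.

```python
def greedy_max_coverage(coverage, k):
    """Pick up to k items; at each step take the item that covers the most
    labels not yet covered. Ties go to the item that appears first."""
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for item, labels in coverage.items():
            if item in selected:
                continue
            gain = len(labels - covered)
            if gain > best_gain:
                best, best_gain = item, gain
        if best is None:  # no remaining item adds new coverage
            break
        selected.append(best)
        covered |= coverage[best]
    return selected, covered


# Same 0/1 coverage as the labels data frame in the example, written as sets
coverage = {
    "article_1": {"label_1", "label_2", "label_4"},
    "article_2": {"label_2"},
    "article_3": {"label_3"},
    "article_4": {"label_3", "label_4"},
    "article_5": {"label_1", "label_2", "label_3"},
}

selected, covered = greedy_max_coverage(coverage, k=2)
print("Selected:", selected)
print("Covered labels:", sorted(covered))
```

With these labels, two articles are enough to cover all four labels. The library's text-based method additionally weighs diversity in the embedding space, so its selection may differ from this coverage-only heuristic.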

Visualization

import pandas as pd
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import SelectionMethod, Selective, plot_importance

# Data
data, label = get_data_label(fetch_california_housing())

# Feature Selector
selector = Selective(SelectionMethod.Linear(num_features=8, regularization="none"))
subset = selector.fit_transform(data, label)

# Plot Feature Importance
df = pd.DataFrame(selector.get_absolute_scores(), index=data.columns)
plot_importance(df)

Installation

Selective requires Python 3.8+ and can be installed from PyPI using pip install selective.
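
One way to confirm that the install succeeded is to ask Python for the installed distribution's version. The sketch below uses only the standard library; the helper name `installed_version` is made up for this example, and it assumes the distribution is registered under the name `selective`, as in the pip command above.

```python
from importlib import metadata


def installed_version(package="selective"):
    """Return the installed version string of a distribution,
    or None if it is not installed in the current environment."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("selective version:", version if version else "not installed")
```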

Source

Alternatively, you can build a wheel package on your platform from scratch using the source code:

git clone https://github.com/fidelity/selective.git
cd selective
pip install setuptools wheel # if wheel is not installed
python setup.py sdist bdist_wheel
pip install dist/selective-X.X.X-py3-none-any.whl

Test your setup

cd selective
python -m unittest discover tests

Citation

If you use Selective in a publication, please cite it as:

    @article{DBLP:journals/amai/HaDVH98,
    author       = {Kad\i{}o\u{g}lu, Serdar and Kleynhans, Bernard and Wang, Xin},
    title        = {Integrating optimized item selection with active learning for continuous exploration in recommender systems},
    journal      = {Ann. Math. Artif. Intell.},
    year         = {2024},
    url          = {https://doi.org/10.1007/s10472-024-09941-x},
    doi          = {10.1007/s10472-024-09941-x},
    }

Support

Please submit bug reports and feature requests as Issues.

License

Selective is licensed under the Apache License 2.0.
