
Feature Selection using Metaheuristics Made Easy: Open Source MAFESE Library in Python


MAFESE




MAFESE (Metaheuristic Algorithms for FEature SElection) is the largest open-source Python library dedicated to the feature selection (FS) problem using metaheuristic algorithms. It contains filter, wrapper, embedded, and unsupervised-based methods with modern optimization techniques. Whether you're tackling classification or regression tasks, MAFESE helps automate and enhance feature selection to improve model performance.


🔥 Key Features

  • 🆓 Free software: GNU General Public License (GPL) V3 license
  • 🔄 Total Wrapper-based (Metaheuristic Algorithms): > 200 methods
  • 📊 Total Filter-based (Statistical-based): > 15 methods
  • 🌳 Total Embedded-based (Tree and Lasso): > 10 methods
  • 🔍 Total Unsupervised-based: ≥ 4 methods
  • 📂 Built-in Datasets: ≥ 30 datasets (47 classification, 7 regression)
  • 📈 Total performance metrics: ≥ 61 (45 for regression and 16 for classification)
  • ⚙️ Total objective functions (as fitness functions): ≥ 61 (45 for regression and 16 for classification)
  • 📖 Documentation: https://mafese.readthedocs.io/en/latest/
  • 🐍 Python versions: ≥ 3.8.x
  • 📦 Dependencies: numpy, scipy, scikit-learn, pandas, mealpy, permetrics, plotly, kaleido

🎯 Goals

MAFESE aims to cover the main families of state-of-the-art feature selection (FS) methods:

  • 🧠 Unsupervised-based FS

  • 🔎 Filter-based FS

  • 🌲 Embedded-based FS

    • Regularization (Lasso-based)
    • Tree-based methods
  • ⚙️ Wrapper-based FS

    • Sequential-based: forward and backward
    • Recursive-based
    • MHA-based: Metaheuristic Algorithms
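
As a rough illustration of the wrapper idea listed above, the sketch below runs a greedy forward sequential search in plain Python. The `forward_select` and `toy_score` functions and the `WEIGHTS` values are hypothetical stand-ins for a real estimator's validation score; this is not MAFESE's implementation.

```python
def forward_select(n_total, n_select, score):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    remaining = list(range(n_total))
    while len(selected) < n_select and remaining:
        # Try each remaining feature and keep the one giving the best score.
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Hypothetical per-feature usefulness; a real wrapper would train and
# validate an estimator on each candidate subset instead.
WEIGHTS = [0.1, 0.9, 0.3, 0.7, 0.05]


def toy_score(subset):
    return sum(WEIGHTS[j] for j in subset)


chosen = forward_select(n_total=5, n_select=3, score=toy_score)
print(chosen)
```

A backward search works the same way in reverse: start from all features and repeatedly drop the one whose removal hurts the score least.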

📝 Citation

Please cite the following papers if you use this library:

@article{van2024feature,
  title={Feature selection using metaheuristics made easy: Open source MAFESE library in Python},
  author={Van Thieu, Nguyen and Nguyen, Ngoc Hung and Heidari, Ali Asghar},
  journal={Future Generation Computer Systems},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/j.future.2024.06.006},
  url={https://doi.org/10.1016/j.future.2024.06.006},
}

@article{van2023mealpy,
  title={MEALPY: An open-source library for latest meta-heuristic algorithms in Python},
  author={Van Thieu, Nguyen and Mirjalili, Seyedali},
  journal={Journal of Systems Architecture},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.sysarc.2023.102871}
}

Installation

Install the latest release from PyPI:

$ pip install mafese

After installation, check the version:

$ python
>>> import mafese
>>> mafese.__version__

🚀 Quick Start

1. Load Dataset

Use a built-in dataset:

from mafese import get_dataset
data = get_dataset("Arrhythmia")

Or load your own:

import pandas as pd
from mafese import Data

df = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = df[:, :-1], df[:, -1]
data = Data(X, y)

2. Prepare your dataset

Split Train/Test

data.split_train_test(test_size=0.2)
print(data.X_train[:2].shape)
print(data.y_train[:2].shape)

Scale Features and Labels

data.X_train, scaler_X = data.scale(data.X_train, scaling_methods=("standard", "minmax"))
data.X_test = scaler_X.transform(data.X_test)

data.y_train, scaler_y = data.encode_label(data.y_train)  # Classification only
data.y_test = scaler_y.transform(data.y_test)

3. Select Feature Selection Method

## First way (recommended)
from mafese import UnsupervisedSelector, FilterSelector, LassoSelector, TreeSelector
from mafese import SequentialSelector, RecursiveSelector, MhaSelector, MultiMhaSelector

## Second way
from mafese.unsupervised import UnsupervisedSelector
from mafese.filter import FilterSelector
from mafese.embedded.lasso import LassoSelector
from mafese.embedded.tree import TreeSelector
from mafese.wrapper.sequential import SequentialSelector
from mafese.wrapper.recursive import RecursiveSelector
from mafese.wrapper.mha import MhaSelector, MultiMhaSelector

4. Create an instance of the Selector class you want to use:

feat_selector = UnsupervisedSelector(problem='classification', method='DR', n_features=5)

feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)

feat_selector = LassoSelector(problem="classification", estimator="lasso", estimator_paras={"alpha": 0.1})

feat_selector = TreeSelector(problem="classification", estimator="tree")

feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=3, direction="forward")

feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)

feat_selector = MhaSelector(problem="classification", obj_name="AS",
                            estimator="knn", estimator_paras=None,
                            optimizer="BaseGA", optimizer_paras=None,
                            mode='single', n_workers=None, termination=None, seed=None, verbose=True)

feat_selector = MultiMhaSelector(problem="classification", obj_name="AS",
                                 estimator="knn", estimator_paras=None,
                                 list_optimizers=("OriginalWOA", "OriginalGWO", "OriginalTLO", "OriginalGSKA"), 
                                 list_optimizer_paras=[{"epoch": 10, "pop_size": 30}, ]*4,
                                 mode='single', n_workers=None, termination=None, seed=None, verbose=True)
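
A note on the `[{"epoch": 10, "pop_size": 30}, ]*4` expression above: in Python, multiplying a list repeats references to the same dict rather than copying it. That is harmless as long as nothing mutates the dicts, but if you intend to tune each optimizer separately, a comprehension gives independent dicts. A small sketch in plain Python (the variable names are illustrative only):

```python
base = {"epoch": 10, "pop_size": 30}

shared = [base] * 4                           # four references to ONE dict
independent = [dict(base) for _ in range(4)]  # four separate copies

shared[0]["epoch"] = 50       # visible through every entry of `shared`
independent[0]["epoch"] = 50  # changes only the first entry

print([p["epoch"] for p in shared])
print([p["epoch"] for p in independent])
```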

5. Fit the model to X_train and y_train

feat_selector.fit(data.X_train, data.y_train)

6. Get the information

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

7. Call transform() on the X you want to reduce to the selected features

X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
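
The relationship between the mask, the index list, and `transform()` can be pictured with a plain-Python sketch. The mask and data values below are made up for illustration, and `apply_mask` is a hypothetical helper, not part of MAFESE; it assumes `transform()` keeps exactly the columns whose mask entry is true.

```python
mask = [True, False, True, False, False]  # illustrative mask, one entry per feature

# Positions of the selected features, derived from the mask.
indexes = [i for i, keep in enumerate(mask) if keep]


def apply_mask(rows, mask):
    """Keep only the columns whose mask entry is True."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


X = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0, 9.0, 10.0]]
X_selected = apply_mask(X, mask)

print(indexes)
print(X_selected)
```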

8. You can build your own evaluation method or use ours.

If you use our method, don't transform the data.

8.1 You can use a different estimator from the one used in the feature selection process

feat_selector.evaluate(estimator="svm", data=data, metrics=["AS", "PS", "RS"])

## Here, we pass the data object loaded above, which contains both the train and test sets, so the results will look like this:
{'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}
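
Each key in the returned dict joins a metric abbreviation and the data split with an underscore. If you prefer one dict per split, a small helper can regroup it. The sketch below is plain Python over the example output above; `group_by_split` is a hypothetical helper, not part of MAFESE.

```python
results = {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205,
           'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}


def group_by_split(results):
    """Turn {'AS_train': v, ...} into {'train': {'AS': v, ...}, 'test': {...}}."""
    grouped = {}
    for key, value in results.items():
        # Split on the last underscore: metric name first, split name second.
        metric, split = key.rsplit("_", 1)
        grouped.setdefault(split, {})[metric] = value
    return grouped


grouped = group_by_split(results)
print(grouped["test"])
```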

8.2 You can use the same estimator as in the feature selection process

X_test, y_test = data.X_test, data.y_test
feat_selector.evaluate(estimator=None, data=data, metrics=["AS", "PS", "RS"])

For more usage examples, please see the examples folder.

❓ Troubleshooting

  1. Where can I find the supported metrics, such as ["AS", "PS", "RS"] above, and what do they mean?

These are permetrics abbreviations (AS, PS, and RS are the accuracy, precision, and recall scores). You can find the full list here: https://github.com/thieu1995/permetrics or print it with:

from mafese import MhaSelector

print(MhaSelector.SUPPORTED_REGRESSION_METRICS)
print(MhaSelector.SUPPORTED_CLASSIFICATION_METRICS)

  2. How do I know which estimators and methods my Selector supports?

print(feat_selector.SUPPORT)

Or, better, read the documentation at https://mafese.readthedocs.io/en/latest/

  3. I got this type of error. How do I solve it?

raise ValueError("Existed at least one new label in y_pred.")
ValueError: Existed at least one new label in y_pred.

This occurs only when you are working on a classification problem with a small dataset that has many classes. For instance, the "Zoo" dataset contains only 101 samples, but it has 7 classes. If you split the dataset into a training and testing set with a ratio of around 80% - 20%, there is a chance that one or more classes may appear in the testing set but not in the training set. As a result, when you calculate the performance metrics, you may encounter this error. You cannot predict or assign new data to a new label because you have no knowledge about the new label. There are several solutions to this problem.

  • 1st: Use the SMOTE method to address imbalanced data and ensure that all classes have the same number of samples.

from imblearn.over_sampling import SMOTE
import pandas as pd
from mafese import Data

dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]

X_new, y_new = SMOTE().fit_resample(X, y)
data = Data(X_new, y_new)

  • 2nd: Use different random_state numbers in split_train_test() function.

import pandas as pd
from mafese import Data

dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]
data = Data(X, y)
data.split_train_test(test_size=0.2, random_state=10)   # Try different random_state value 
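
To confirm that an unseen label is what triggered the error, you can compare the label sets of the two splits before fitting. The sketch below uses plain Python; `unseen_labels` is a hypothetical helper, not part of MAFESE, and the label lists are made up.

```python
def unseen_labels(y_train, y_test):
    """Return labels present in y_test but absent from y_train."""
    return sorted(set(y_test) - set(y_train))


y_train = [0, 0, 1, 1, 2, 2, 3]
y_test = [0, 1, 4]

missing = unseen_labels(y_train, y_test)
if missing:
    print(f"Labels only in the test set: {missing}")
```

With MAFESE you would pass `data.y_train` and `data.y_test` after calling `split_train_test()`; if the result is non-empty, try one of the fixes above.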

📞 Community & Support


Developed by: Thieu @ 2023
