PyImpetus is a Markov Blanket based feature selection algorithm which considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features

Operating System
- OS Independent

PyImpetus

PyImpetus is a Markov Blanket based feature selection algorithm that selects a subset of features by considering their performance both individually and as a group. As a result, it selects not just the features that perform best on their own but the set of features that work best together. For example, the best-performing individual feature might not combine well with the others, while the remaining features, taken together, could outperform it. PyImpetus takes this into account and returns the best combination it finds. Because the algorithm produces a minimal feature subset, you do not have to decide how many features to keep; PyImpetus chooses the set for you.
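The point about features that only shine as a group can be made concrete with a small, self-contained sketch. This is not PyImpetus code: the toy dataset and the `best_accuracy` helper are made up for illustration. The target is `a XOR b`, so `a` and `b` are useless alone but perfect together, while `c` is the best single feature and still loses to the pair.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical toy dataset: y = a XOR b, and c agrees with y on 6 of 8 rows.
rows = []
for a, b, k in product([0, 1], repeat=3):
    y = a ^ b
    c = 1 - y if (k == 1 and b == 0) else y  # flip c on two rows
    rows.append({"a": a, "b": b, "c": c, "y": y})


def best_accuracy(features):
    """Accuracy of the best lookup-table classifier that sees only `features`."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        groups[key][row["y"]] += 1
    # For every feature-value combination, predict the majority label.
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


for subset in (["a"], ["b"], ["c"], ["a", "b"]):
    print(subset, best_accuracy(subset))
# ['a'] 0.5
# ['b'] 0.5
# ['c'] 0.75
# ['a', 'b'] 1.0
```

A selector that ranks features one at a time would pick `c` here; one that also evaluates features jointly can find the `a`, `b` pair.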

PyImpetus has been completely revamped and now supports binary classification, multi-class classification, and regression tasks. It uses a novel CV-based aggregation method to recommend the most robust minimal set of features (the Markov Blanket, MB).
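This page does not spell out how the CV-based aggregation works, so the following is only a sketch of one plausible scheme: find a feature subset on every fold, then keep the features that a majority of folds agree on. The function name `aggregate_blankets`, the `min_fraction` parameter, and the feature names are all assumptions made for illustration, not part of the PyImpetus API.

```python
from collections import Counter


def aggregate_blankets(fold_blankets, min_fraction=0.5):
    """Keep features that were selected in at least `min_fraction` of the folds."""
    # Count each feature at most once per fold.
    votes = Counter(f for blanket in fold_blankets for f in set(blanket))
    needed = min_fraction * len(fold_blankets)
    kept = [f for f, count in votes.items() if count >= needed]
    # Most-voted features first; ties are broken alphabetically.
    return sorted(kept, key=lambda f: (-votes[f], f))


# Hypothetical per-fold results for cv=5.
fold_blankets = [
    ["age", "income", "region"],
    ["age", "income"],
    ["age", "tenure"],
    ["age", "income", "tenure"],
    ["age", "income"],
]
print(aggregate_blankets(fold_blankets))  # ['age', 'income']
```

Features that show up in only one or two folds (`region`, `tenure`) are dropped, which is the intuition behind calling the aggregated subset more robust than a subset found on a single split.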

PyImpetus was tested on 13 datasets and outperformed state-of-the-art Markov Blanket learning algorithms on all of them, as well as traditional feature selection algorithms such as Forward Feature Selection, Backward Feature Elimination, and Recursive Feature Elimination.

How to install?

pip install PyImpetus

Functions and parameters

# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBC is for classification
model = PPIMBC(model, p_val_thresh, num_simul, cv, verbose, random_state, n_jobs)

model - estimator object, default=DecisionTreeClassifier() The model used to perform classification in order to find feature importance via a significance test.
p_val_thresh - float, default=0.05 The p-value (in this case, the feature importance) below which a feature is considered a candidate for the final MB.
num_simul - int, default=10 (This parameter has a huge impact on speed.) Number of train-test splits performed to check the usefulness of each feature. For large datasets, reduce this value considerably, but do not go below 5.
cv - cv object/int, default=5 Determines the number of splits for cross-validation. A sklearn CV object can also be passed.
verbose - int, default=0 Controls the verbosity: the higher the value, the more messages are printed.
random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
n_jobs - int, default=-1 The number of CPUs to use for the computation.
- None means 1 unless in a joblib.parallel_backend context.
- -1 means using all processors.

# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBR is for regression
model = PPIMBR(model, p_val_thresh, num_simul, cv, verbose, random_state, n_jobs)

model - estimator object, default=DecisionTreeRegressor() The model used to perform regression in order to find feature importance via a significance test.
p_val_thresh - float, default=0.05 The p-value (in this case, the feature importance) below which a feature is considered a candidate for the final MB.
num_simul - int, default=10 (This parameter has a huge impact on speed.) Number of train-test splits performed to check the usefulness of each feature. For large datasets, reduce this value considerably, but do not go below 5.
cv - cv object/int, default=5 Determines the number of splits for cross-validation. A sklearn CV object can also be passed.
verbose - int, default=0 Controls the verbosity: the higher the value, the more messages are printed.
random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
n_jobs - int, default=-1 The number of CPUs to use for the computation.
- None means 1 unless in a joblib.parallel_backend context.
- -1 means using all processors.
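To give a feel for how num_simul train-test splits can turn into a p-value that is then compared against p_val_thresh, here is a generic permutation-test sketch. It is an assumption-laden illustration, not the significance test PyImpetus actually runs: the nearest-centroid model, the helper names, and the synthetic data are all hypothetical. A feature counts as useful when shuffling its values in the test split rarely leaves the score as good as before.

```python
import random


def nearest_centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on `train` and score it on `test`."""
    sums, counts = {}, {}
    for x, y in train:
        total = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            total[i] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return sum(predict(x) == y for x, y in test) / len(test)


def permutation_p_value(samples, feature_idx, num_simul=30, seed=27):
    """Empirical p-value: how often does a shuffled feature score at least as well?"""
    rng = random.Random(seed)
    not_worse = 0
    for _ in range(num_simul):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        split = int(0.75 * len(shuffled))
        train, test = shuffled[:split], shuffled[split:]
        real_score = nearest_centroid_accuracy(train, test)
        # Break the link between this feature and the target in the test split.
        column = [x[feature_idx] for x, _ in test]
        rng.shuffle(column)
        permuted_test = [(x[:feature_idx] + [v] + x[feature_idx + 1:], y)
                         for (x, y), v in zip(test, column)]
        if nearest_centroid_accuracy(train, permuted_test) >= real_score:
            not_worse += 1
    return (not_worse + 1) / (num_simul + 1)


# Synthetic data: feature 0 separates the classes, feature 1 is pure noise.
data_rng = random.Random(0)
samples = []
for i in range(200):
    label = i % 2
    informative = 10.0 * label + data_rng.uniform(-0.5, 0.5)
    noise = data_rng.uniform(0.0, 1.0)
    samples.append(([informative, noise], label))

p_informative = permutation_p_value(samples, 0)
p_noise = permutation_p_value(samples, 1)
print(p_informative < 0.05, p_noise < 0.05)
```

In this scheme the smallest reachable p-value is 1 / (num_simul + 1), which is one way to see why very small num_simul values make it hard for any feature to pass a 0.05 threshold.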

# To fit PyImpetus on provided dataset and find recommended features
fit(data, target)

data - A pandas dataframe upon which feature selection is to be applied
target - A numpy array, denoting the target variable

# This function returns the pruned data, keeping only the columns that form the MB (these are the recommended features)
transform(data)

data - A pandas dataframe which needs to be pruned

# To fit PyImpetus on provided dataset and return pruned data
fit_transform(data, target)

data - A pandas dataframe upon which feature selection is to be applied
target - A numpy array, denoting the target variable
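The relationship between fit, transform, and fit_transform can be sketched with a toy stand-in. `ToySelector` below is hypothetical: it works on plain dicts of columns instead of pandas dataframes and uses a simple correlation cutoff (`min_abs_corr`, an invented parameter) rather than a Markov Blanket search. Only the calling pattern is meant to mirror the functions described above.

```python
class ToySelector:
    """Hypothetical stand-in that mimics the fit / transform / fit_transform pattern."""

    def __init__(self, min_abs_corr=0.5):
        self.min_abs_corr = min_abs_corr
        self.MB = []  # names of the selected columns, filled in by fit()

    @staticmethod
    def _abs_corr(xs, ys):
        """Absolute Pearson correlation between two equal-length lists."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0

    def fit(self, data, target):
        # Remember which columns pass the cutoff.
        self.MB = [name for name, col in data.items()
                   if self._abs_corr(col, target) >= self.min_abs_corr]
        return self

    def transform(self, data):
        # Prune any dataset down to the columns found during fit().
        return {name: data[name] for name in self.MB}

    def fit_transform(self, data, target):
        return self.fit(data, target).transform(data)


train = {"signal": [1, 2, 3, 4, 5, 6], "noise": [2, 5, 2, 5, 2, 5]}
target = [0, 0, 0, 1, 1, 1]
test = {"signal": [7, 8], "noise": [2, 5]}

selector = ToySelector()
pruned_train = selector.fit_transform(train, target)
pruned_test = selector.transform(test)  # reuses the columns chosen on the training data
print(selector.MB)  # ['signal']
```

The design point is that the selection is learned once on the training data and then applied unchanged to the test data, so no information from the test set influences which features are kept.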

# To plot XGBoost style feature importance
feature_importance()

How to import?

from PyImpetus import PPIMBC, PPIMBR

Usage

# Import the algorithm. PPIMBC is for classification and PPIMBR is for regression
from PyImpetus import PPIMBC, PPIMBR
# SVC is the estimator used in this example
from sklearn.svm import SVC
# Initialize the PyImpetus object
model = PPIMBC(model=SVC(random_state=27, class_weight="balanced"), p_val_thresh=0.05, num_simul=30, cv=5, random_state=27, n_jobs=-1, verbose=2)
# The fit_transform function is a wrapper for the fit and transform functions, individually.
# The fit function finds the MB for given data while transform function provides the pruned form of the dataset
df_train = model.fit_transform(df_train.drop("Response", axis=1), df_train["Response"].values)
df_test = model.transform(df_test)
# Check out the MB
print(model.MB)
# Check out the feature importance scores for the selected feature subset
print(model.feat_imp_scores)
# Get a plot of the feature importance scores
model.feature_importance()

For better accuracy

Increase the cv value
Increase the num_simul value

For better speeds

Decrease the cv value. For large datasets, cv might not be required; in that case, set cv=0 to disable the aggregation step. This results in a less robust feature subset but runs much faster
Decrease the num_simul value, but don't decrease it below 5
Set n_jobs to -1

For selection of fewer features

Try reducing the p_val_thresh value
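The effect of lowering p_val_thresh is easy to see with a few made-up numbers. The p-values and feature names below are hypothetical, and `candidates` is an illustrative helper rather than a PyImpetus function; it simply applies the "below the threshold" rule from the parameter description.

```python
# Hypothetical p-values, one per feature (made up for illustration).
p_values = {"age": 0.001, "income": 0.008, "tenure": 0.03, "region": 0.20}


def candidates(p_values, p_val_thresh):
    """Features whose p-value is below the threshold."""
    return [name for name, p in p_values.items() if p < p_val_thresh]


print(candidates(p_values, 0.05))  # ['age', 'income', 'tenure']
print(candidates(p_values, 0.01))  # ['age', 'income']
```

A stricter threshold can only remove candidates, never add them, which is why reducing it yields a smaller subset.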

Timeit!

On a dataset of 381,110 samples and 10 features, PyImpetus took 77.6 seconds to find the best set of minimal features. The previous version of PyImpetus took 609 seconds on the same dataset. This test was performed on a 10th-generation Intel Core i7 with n_jobs set to -1.
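For anyone who wants to repeat this kind of measurement, a minimal timing sketch follows. `time_call` is a hypothetical helper built on the standard library; the page does not say how the timings above were taken. The second half just works out the speed-up implied by the two quoted figures.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Speed-up implied by the two timings quoted above.
previous_seconds, current_seconds = 609.0, 77.6
speedup = previous_seconds / current_seconds
print(f"{speedup:.2f}x faster")  # 7.85x faster

# Stand-in workload; replace it with the call you want to measure.
result, elapsed = time_call(sum, range(1_000_000))
print(f"took {elapsed:.4f} s")
```

With a fitted selector in hand, the same helper could wrap a call such as `time_call(model.fit, data, target)`.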

Tutorials

You can find a usage tutorial here.

Future Ideas

Let me know

Feature Request

Drop me an email at atif.hit.hassan@gmail.com if you want any particular feature

Please cite this work as

Reference to the upcoming paper will be added here

Release history

Version	Release date
4.1.2	Feb 22, 2022
4.1.1	Feb 22, 2022
4.1.0	Feb 22, 2022
4.0.1	Apr 9, 2021
4.0	Apr 8, 2021
3.1.4	Apr 4, 2021
3.1.3	Apr 3, 2021
3.1.2	Apr 2, 2021
3.1.1	Mar 31, 2021
3.1	Mar 30, 2021
3.0.1	Mar 30, 2021
3.0	Mar 30, 2021
2.4 (this version)	Mar 29, 2021
2.3	Mar 29, 2021
2.2	Mar 28, 2021
2.0.4	Jan 5, 2021
2.0.3	Jan 5, 2021
2.0.2	Jan 5, 2021
2.0.1	Jan 5, 2021
2.0.0	Jan 5, 2021
1.1.4	Oct 7, 2020
1.1.3	Sep 26, 2020
1.1.2	Sep 22, 2020
1.1.1	Sep 22, 2020
1.1.0	Sep 22, 2020
1.0.1	Sep 20, 2020
1.0.0	Sep 20, 2020

Download files

Source Distribution

PyImpetus-2.4.tar.gz (9.5 kB view details)

Uploaded Mar 29, 2021 Source

Built Distribution

PyImpetus-2.4-py3-none-any.whl (8.0 kB view details)

Uploaded Mar 29, 2021 Python 3

File details

Details for the file PyImpetus-2.4.tar.gz.

File metadata

Download URL: PyImpetus-2.4.tar.gz
Upload date: Mar 29, 2021
Size: 9.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.8.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.8

File hashes

Hashes for PyImpetus-2.4.tar.gz
Algorithm	Hash digest
SHA256	`7daae95bc8c11f7f413d7a4d25f84001fdf9f2bd0187eb597b61666b8a10b94e`
MD5	`27668a9756491f1af984590b781cc9d6`
BLAKE2b-256	`11f524cb93f0a77f6cbbf8ba14282d258538d84f1835871f734f55e5d48e1c08`
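A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The helper names below are illustrative, and the sketch assumes the file has already been saved locally under its original name.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed above for PyImpetus-2.4.tar.gz.
EXPECTED_SHA256 = "7daae95bc8c11f7f413d7a4d25f84001fdf9f2bd0187eb597b61666b8a10b94e"


def matches_expected(path, expected=EXPECTED_SHA256):
    """True when the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

After downloading, `matches_expected("PyImpetus-2.4.tar.gz")` should return True for an intact file; a False result means the file differs from the one that was uploaded.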

File details

Details for the file PyImpetus-2.4-py3-none-any.whl.

File metadata

Download URL: PyImpetus-2.4-py3-none-any.whl
Upload date: Mar 29, 2021
Size: 8.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.8.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.8

File hashes

Hashes for PyImpetus-2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d756538af7b70ee59de80ed70bdd4d15d637fddfe478a30ea73ee1daad5b8fdc`
MD5	`bbcc76363730a4c3a4eb60c2e5948176`
BLAKE2b-256	`34e2aa3ba14075824e252e18328ce535b3724087e760ca29db57c6cf455d24a8`
