Select Best Features from your data set - any size - now with XGBoost!


featurewiz

featurewiz is a new Python library for quickly selecting the best features in your data set. (featurewiz logo created using Wix)

Two methods are used in this version of featurewiz:
1. SULOV: SULOV stands for Searching for Uncorrelated List of Variables. The SULOV method is explained in the chart below. Here is a simple way of explaining how it works:

Find all the pairs of highly correlated variables exceeding a correlation threshold (say an absolute value of 0.7).
Then find their MIS (Mutual Information Score) with respect to the target variable. MIS is a non-parametric scoring method, so it's suitable for all kinds of variables and targets.
Now take each pair of correlated variables and knock off the one with the lower MIS.
What's left are the variables with the highest information scores and the least correlation with each other.

[Chart: the SULOV method]
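The SULOV steps above can be sketched in a few lines of plain Python. This is only an illustration, not featurewiz's actual implementation: the function name `sulov` and its inputs are hypothetical. It assumes the pairwise correlations and the MIS values have already been computed, for example with pandas `DataFrame.corr()` and scikit-learn's `mutual_info_classif` or `mutual_info_regression`.

```python
def sulov(corr, mis, corr_limit=0.7):
    """Hypothetical sketch of SULOV.

    corr: dict mapping (var_a, var_b) -> correlation between the two variables
    mis:  dict mapping var -> mutual information score with the target
    """
    removed = set()
    for (a, b), r in corr.items():
        # Only pairs whose absolute correlation exceeds the threshold matter.
        if abs(r) > corr_limit:
            # Knock off the variable with the lower MIS
            # (on a tie this sketch arbitrarily drops the second one).
            removed.add(a if mis[a] < mis[b] else b)
    # Keep the original variable order for everything that survived.
    return [var for var in mis if var not in removed]


# Made-up numbers, purely for illustration.
corr = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): -0.8}
mis = {"a": 0.5, "b": 0.3, "c": 0.4}
kept = sulov(corr, mis)
print(kept)  # ['a', 'c']
```

Here `b` is highly correlated with both `a` and `c` and has the lowest MIS of the three, so it is the only variable dropped.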

2. Recursive XGBoost: Once SULOV has selected variables that have high mutual information scores and the least correlation amongst them, we use XGBoost to repeatedly find the best features among the remaining variables. The Recursive XGBoost method is explained in the chart below. Here is how it works:

Select all the variables in the data set and split the full data into train and validation sets.
Find the top X features (X could be 10) on train, using the validation set for early stopping (to prevent over-fitting).
Then take the next set of variables and find the top X.
Do this 5 times. Combine all the selected features and de-duplicate them.

[Chart: the Recursive XGBoost method]
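The loop described above might look roughly like the sketch below. It makes two assumptions that are not stated in this README: that each round drops an equal-sized chunk from the front of the variable list to form the "next set", and that a caller-supplied function `top_features` (a hypothetical stand-in for training XGBoost with early stopping and reading its feature importances) returns the best `top_x` names from a subset.

```python
def recursive_select(columns, top_features, rounds=5, top_x=10):
    """Hypothetical sketch: run `rounds` selection passes and merge the results."""
    selected = []
    for i in range(rounds):
        # Assumed scheme: round i starts i/rounds of the way into the list.
        start = i * len(columns) // rounds
        subset = columns[start:]
        if not subset:
            continue
        for name in top_features(subset, top_x):
            # De-duplicate while keeping first-seen order.
            if name not in selected:
                selected.append(name)
    return selected


# Made-up importances standing in for what a trained model would report.
weights = {"f0": 0.9, "f1": 0.1, "f2": 0.8, "f3": 0.2, "f4": 0.7,
           "f5": 0.3, "f6": 0.6, "f7": 0.4, "f8": 0.5, "f9": 0.05}

def fake_top_features(subset, k):
    # Rank the subset by the made-up importance and keep the best k.
    return sorted(subset, key=weights.get, reverse=True)[:k]

chosen = recursive_select(list(weights), fake_top_features, rounds=5, top_x=2)
print(chosen)  # ['f0', 'f2', 'f4', 'f6', 'f8', 'f9']
```

Because later rounds no longer see the strongest early variables, features such as `f6` and `f8` get a chance to be picked even though they never make the top 2 of the full set.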

Performing Feature Engineering: One of the gaps in open source AutoML tools, and especially Auto_ViML, has been the lack of the feature engineering capabilities that high-powered competitions like Kaggle require. Creating "interaction" variables, adding "group-by" features or "target-encoding" categorical variables was difficult, and sifting through those hundreds of new features was painstaking and left only to "experts". Now there is some good news: featurewiz (https://lnkd.in/eGep5uG) enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that, it will use the SULOV method and Recursive XGBoost to sift through those variables and find only the least correlated and most important features among them. All in one step!
Building the simplest and most "interpretable" model: featurewiz represents the "next best" step you must perform after feature engineering, since you might have added some highly correlated or even useless features when using automated feature engineering. featurewiz ensures you have the least number of features needed to build a high-performing or equivalent model.

A WORD OF CAUTION: Just because you can, doesn't mean you should. Make sure you understand feature engineered variables before you attempt to build your model any further. featurewiz displays the SULOV chart, which can show you the hundreds of newly created variables added to your dataset.
But you still have two problems:

How to interpret those newly created features?
Does the model overfit now on these many features?

Both are very important questions and you must be very careful using this feature_engg option in featurewiz. Otherwise, you can create a "garbage in, garbage out" problem. Caveat Emptor!

To upgrade to the best, most stable and full-featured version always do the following:
Use $ pip install featurewiz --upgrade --ignore-installed
or pip install git+https://github.com/AutoViML/featurewiz.git

Background
Install
Usage
API
Maintainers
Contributing
License

Background

To learn more about how featurewiz works under the hood, watch this video

featurewiz was designed for selecting High Performance variables with the fewest steps.

In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).

featurewiz is every Data Scientist's feature wizard that will:

Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify the variables and label-encode the categorical ones to help XGBoost processing. It classifies variables as numeric, categorical, NLP or date-time automatically so it can use them correctly in the model.
Perform feature engineering automatically: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option.
Perform feature reduction automatically. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
Explain SULOV method graphically using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph.

*** Notes of Gratitude ***:

featurewiz is built using xgboost, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "XGBoost" and "networkx" library.
We use "networkx" library for charts and interpretability.
But if you don't have these libraries, featurewiz will install those for you automatically.
Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (though with some modifications).
Category Encoders library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html

Install

Prerequisites:

Anaconda

To clone featurewiz, it is better to create a new environment, and install the required dependencies:

To install from PyPi:

conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # conda < 4.6: `activate <your_env_name>` on Windows, `source activate <your_env_name>` on Linux/macOS
pip install featurewiz
or
pip install git+https://github.com/AutoViML/featurewiz.git

To install from source:

cd <featurewiz_Destination>
git clone git@github.com:AutoViML/featurewiz.git
# or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # conda < 4.6: `activate <your_env_name>` on Windows, `source activate <your_env_name>` on Linux/macOS
cd featurewiz
pip install -r requirements.txt

Usage

In the same directory, open a Jupyter Notebook and use this line to import the .py file:

from featurewiz import featurewiz

Load a data set (any CSV or text file) into a Pandas dataframe and give it the name of the target(s) variable. If you have more than one target, it will handle multi-label targets too. Just give it a list of variables in that case. If you don't have a dataframe, you can simply enter the name and path of the file to load into featurewiz:

featurewiz(dataname, target, corr_limit=0.7, verbose=0, sep=",", header=0,
                      test_data='', feature_engg='', category_encoders='')

Output: a tuple containing the list of selected features, the dataframe modified with new features, and the modified test data.
This list of selected features is ready for you to use in further modeling.

featurewiz works on any Multi-Class, Multi-Label Data Set. So you can have as many target labels as you want.
You don't have to tell featurewiz whether it is a Regression or Classification problem. It will decide that automatically.

## API

**Arguments**

- `dataname`: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically.
- `target`: name of the target variable in the data set.
- `corr_limit`: if you want to set your own threshold for removing variables as highly correlated, then give it here. The default is 0.7, which means variables with a Pearson's correlation less than -0.7 or greater than 0.7 will be candidates for removal.
- `verbose`: This has 3 possible states:
  - `0` limited output. Great for running this silently and getting fast results.
  - `1` more verbiage. Great for understanding how the results were obtained and making changes to the input flags.
  - `2` SULOV charts and output. Great for finding out what happens under the hood in the SULOV method.
- `test_data`: If you want to transform test data in the same way you are transforming `dataname`, you can. `test_data` could be a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is an empty string.
- `feature_engg`: You can let featurewiz select its best encoders for your data set by setting this flag for adding feature engineering. There are three choices. You can choose one, two or all three. Default is an empty string (which means no additional features).
  - `'interactions'`: This will add interaction features to your data such as `x1*x2`, `x2*x3`, `x1**2`, `x2**2`, etc.
  - `'groupby'`: This will generate Group By features for your numeric vars by grouping on all categorical vars.
  - `'target'`: This will encode and transform all your categorical features using certain target encoders.
- `category_encoders`: Instead of the above method, you can choose your own category encoders from the list below. We recommend you do not use more than two of these; featurewiz will automatically select only two from your list. Default is an empty string (which means no encoding of your categorical features).
  - `'HashingEncoder'`, `'SumEncoder'`, `'PolynomialEncoder'`, `'BackwardDifferenceEncoder'`, `'OneHotEncoder'`, `'HelmertEncoder'`, `'OrdinalEncoder'`, `'FrequencyEncoder'`, `'BaseNEncoder'`, `'TargetEncoder'`, `'CatBoostEncoder'`, `'WOEEncoder'`, `'JamesSteinEncoder'`
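To give a feel for what the `'interactions'` option produces, here is a minimal sketch in plain Python. It is not featurewiz's code: `add_interactions` is a hypothetical helper that works on a dict of column lists rather than a dataframe, and it only builds the pairwise products and squares mentioned above.

```python
from itertools import combinations


def add_interactions(columns):
    """Hypothetical helper: columns is a dict mapping name -> list of numbers."""
    out = dict(columns)
    # Pairwise products such as x1*x2.
    for a, b in combinations(columns, 2):
        out[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    # Squares such as x1**2.
    for a in columns:
        out[f"{a}**2"] = [x * x for x in columns[a]]
    return out


data = {"x1": [1, 2], "x2": [3, 4]}
engineered = add_interactions(data)
print(list(engineered))  # ['x1', 'x2', 'x1*x2', 'x1**2', 'x2**2']

# Feature counts grow quadratically: 20 inputs already give 230 columns.
wide = {f"x{i}": [1.0, 2.0] for i in range(20)}
print(len(add_interactions(wide)))  # 230
```

With n numeric columns this sketch yields n + n(n-1)/2 + n columns, which is why the selection step afterwards (and the caution about overfitting) matters.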

**Return values**

If you don't want any feature_engg, then featurewiz will return just one thing:
- `features`: the fewest number of features in your model to make it perform well

Otherwise, featurewiz can output either one dataframe or two, depending on what you send in as input:
1. `trainm`: the modified train dataframe, with engineered and selected features from `dataname`.
2. `testm`: the modified test dataframe, with engineered and selected features from `test_data`.

## Maintainers

* [@AutoViML](https://github.com/AutoViML)

## Contributing

See [the contributing file](CONTRIBUTING.md)!

PRs accepted.

## License

Apache License 2.0 © 2020 Ram Seshadri

## DISCLAIMER
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

This version: 0.0.9 (Dec 24, 2020)
