Automatically build data and model pipelines using scikit-learn in a single line of code


lazytransform

Automatically transform all categorical, date-time and NLP variables in your dataset to numeric in a single line of code, for a dataset of any size.

- What is lazytransform
- How to use lazytransform
- How to install lazytransform
- Usage
- API
- Maintainers
- Contributing
- License

## What is lazytransform?

lazytransform is a new Python library for automatically transforming your entire dataset to numeric format using category encoders, NLP text vectorizers and pandas date-time processing functions. All in a single line of code!

lazytransform can be used in one of two ways. Both are explained below.

## How to use lazytransform

1. Using lazytransform as a simple pandas data transformation pipeline

The first method is probably the most popular way to use lazytransform. Its transformer converts the categorical, date-time and NLP (text) features in your dataset to numeric form and creates new features from them. The transformer is fully compatible with scikit-learn's `Pipeline`, so you can combine it with other steps into more complex pipelines using `make_pipeline` from `sklearn.pipeline`. Let us see an example:

![lazy_code1](lazy_code1.png)

2. Using lazytransform as a sklearn pipeline with sklearn models or XGBoost or LightGBM models

The second method is a great way to create an entire data transformation and model training pipeline. `lazytransform` lets you pass in a model object (only the models listed below are supported), and it will automatically transform the data, create new features and train the model using sklearn pipelines. This method can be seen as follows:

![lazy_code2](lazy_code2.png)

The following models are currently supported:

- All sklearn models
- All MultiOutput models from the `sklearn.multioutput` library
- XGBoost models
- LightGBM models

However, you must install and import those models on your own and define them as model variables before passing those variables to lazytransform.

## How to install lazytransform

**Prerequisites:**

lazytransform is built using the pandas, numpy, scikit-learn, category_encoders and imbalanced-learn libraries. It should run on most Python 3 Anaconda installations without additional installs; the only libraries you are likely to need to install separately are imbalanced-learn and category_encoders.

- [Anaconda](https://docs.anaconda.com/anaconda/install/)

To install from PyPI:

    pip install lazytransform

or, to install directly from the GitHub repository:

    pip install git+https://github.com/AutoViML/lazytransform.git

To install from source:

    cd <lazytransform_Destination>
    git clone git@github.com:AutoViML/lazytransform.git
    # or download and unzip https://github.com/AutoViML/lazytransform/archive/master.zip
    conda create -n <your_env_name> python=3.7 anaconda
    conda activate <your_env_name> # on older conda versions: `source activate <your_env_name>` (Linux/macOS) or `activate <your_env_name>` (Windows)
    cd lazytransform
    pip install -r requirements.txt

## Usage

You can invoke lazytransform as a scikit-learn compatible fit-and-transform pipeline or as a fit-and-predict pipeline. See the syntax below.

```
from lazytransform import LazyTransformer

lazy = LazyTransformer(model=False, encoders='auto', scalers=None, date_to_string=False, transform_target=False, imbalanced=False)
```

If you are not using a model in the pipeline, you must use fit and transform:

```
# Transform the train features and target, then apply the fitted transformer to the test set
X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)
X_testm = lazy.transform(X_test)
```

If you are using a model in the pipeline, you must use fit and predict only:

```
from sklearn.ensemble import RandomForestClassifier

lazy = LazyTransformer(model=RandomForestClassifier(), encoders='auto', scalers=None,
        date_to_string=False, transform_target=False, imbalanced=False)

lazy.fit(X_train, y_train)
lazy.predict(X_test)
```

## API

lazytransform has a very simple API with the following inputs. You need to create a sklearn-compatible transformer pipeline object by importing `LazyTransformer` from the lazytransform library.

Once you import it, you can define the object by setting several options, such as:

**Arguments**

`model`: any scikit-learn model (including multioutput models), or a model from the popular XGBoost and LightGBM libraries.

`encoders`: one or more encoders, given as a string or a list of strings. Each encoder string can be any one of the 10+ encoders from the category_encoders library. The available encoder strings are:

- `auto`: uses one-hot encoding for low-cardinality variables and label encoding for high-cardinality variables.
- `onehot`: One Hot Encoding, applied to all categorical features irrespective of cardinality.
- `label`: Label Encoding, applied to all categorical features irrespective of cardinality.
- `hashing` or `hash`: Hashing (or Hash) Encoding
- `helmert`: Helmert Encoding
- `bdc`: BDC (Backward Difference) Encoding
- `sum`: Sum Encoding
- `loo`: Leave One Out Encoding
- `base`: BaseN Encoding
- `woe`: Weight of Evidence Encoding
- `james`: James-Stein Encoding
- `target`: Target Encoding
- `count`: Count Encoding
- `glm`, `glmm`: Generalized Linear Model Encoding
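
The `auto` rule described above can be pictured with a small, dependency-free sketch. It is not lazytransform's implementation: the helper names and the cardinality cutoff of 10 are assumptions made for illustration, since the text does not say where the library draws the line between low and high cardinality.

```python
from typing import Hashable, Iterable

# Hypothetical cutoff: the README does not state the real threshold.
LOW_CARDINALITY_MAX = 10


def choose_encoder(values: Iterable[Hashable], max_onehot: int = LOW_CARDINALITY_MAX) -> str:
    """Return 'onehot' for low-cardinality columns and 'label' otherwise."""
    n_unique = len(set(values))
    return "onehot" if n_unique <= max_onehot else "label"


def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def onehot_encode(values):
    """Return the sorted category list and one 0/1 row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "red", "blue"]       # 3 distinct values
user_ids = [f"user_{i}" for i in range(50)]    # 50 distinct values

print(choose_encoder(colors))     # onehot
print(choose_encoder(user_ids))   # label
```

One-hot encoding adds one column per category, which is why a rule of this kind falls back to a single integer column once the number of categories grows.
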

`scalers`: one of the three main scalers used in scikit-learn to transform numeric features, or `None` for no scaling. Default is `None`. The scaler is applied as the last step of the pipeline, to all features after they have been transformed. You might want to avoid scaling NLP datasets, since scaling features after TF-IDF vectorization may not make sense, but it is up to you. The 4 options are:

- `None`: no scaler. Great for almost all datasets. Test it first and then try one of the scalers below.
- `std`: standard scaler. Great for almost all datasets.
- `minmax`: minmax scaler. Great for datasets where you need the values distributed between 0 and 1.
- `maxabs`: max absolute scaler. Great for scaling, but negative values remain negative.
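
As a rough guide to what the three scalers compute, the sketch below applies each formula to one numeric column in plain Python. It follows the standard definitions used by scikit-learn's `StandardScaler`, `MinMaxScaler` and `MaxAbsScaler` with default settings. The helper names are illustrative, and the mapping from the `std`, `minmax` and `maxabs` strings to those scikit-learn classes is an assumption rather than something the text confirms.

```python
import math


def std_scale(xs):
    """Center to mean 0 and scale to unit variance (population standard deviation)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]  # assumes the column is not constant


def minmax_scale(xs):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]  # assumes the column is not constant


def maxabs_scale(xs):
    """Divide by the largest absolute value; signs are preserved."""
    biggest = max(abs(x) for x in xs)
    return [x / biggest for x in xs]  # assumes at least one non-zero value


column = [-4.0, 0.0, 2.0, 6.0]
print(std_scale(column))
print(minmax_scale(column))  # [0.0, 0.4, 0.6, 1.0]
print(maxabs_scale(column))  # first value stays negative, last value is 1.0
```
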

`date_to_string`: default is `False`. With `False`, which is the default and the best option, date variables are converted to date-time format and up to 20 features are extracted from each date-time column. Set it to `True` if you want to treat date variables as strings (categorical); you can use this option when there are very few dates in your dataset.
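
To make the date-time option more concrete, here is a minimal standard-library sketch of deriving numeric features from a single timestamp. The `date_features` helper and the particular features it returns are assumptions for illustration; the text only says that up to 20 features are extracted and does not list them.

```python
from datetime import datetime


def date_features(ts: datetime) -> dict:
    """Derive a few numeric features from one timestamp (illustrative subset)."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "dayofweek": ts.weekday(),             # Monday=0 ... Sunday=6
        "dayofyear": ts.timetuple().tm_yday,
        "quarter": (ts.month - 1) // 3 + 1,
        "is_weekend": int(ts.weekday() >= 5),
    }


stamp = datetime.strptime("2022-04-03 14:30:00", "%Y-%m-%d %H:%M:%S")
print(date_features(stamp))
```

Each derived value becomes its own numeric column, which is how a single date column can expand into many features.
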

`transform_target`: default is `False`. If you want to transform your target variable(s), set it to `True` and your target(s) will be converted to numeric using Label Encoding as well as multi-label binary classes. This is a great option when you have categorical target variables.

`imbalanced`: default is `False`. If you have an imbalanced dataset, set it to `True` and your train data will be transformed using BorderlineSMOTE or SMOTENC, which are both great options. The right SMOTE function is selected automatically.

`verbose`: controls how much output is printed. It has 3 possible states:

- `0`: silent output. Great for running silently and getting fast results.
- `1`: more verbiage. Great for seeing how the results came out and making changes to the input flags.
- `2`: highly verbose output. Great for finding out what happens under the hood in lazytransform pipelines.
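
Because several arguments accept only a fixed set of values, a small pre-flight check can catch typos before the pipeline is built. The helper below is hypothetical and not part of lazytransform; it only compares values against the options listed above and returns them as a keyword dictionary.

```python
# Option values copied from the argument list above.
VALID_ENCODERS = {
    "auto", "onehot", "label", "hashing", "hash", "helmert", "bdc", "sum",
    "loo", "base", "woe", "james", "target", "count", "glm", "glmm",
}
VALID_SCALERS = {None, "std", "minmax", "maxabs"}
VALID_VERBOSE = {0, 1, 2}


def check_options(encoders="auto", scalers=None, verbose=0):
    """Validate option values against the documented lists and return them as kwargs."""
    names = [encoders] if isinstance(encoders, str) else list(encoders)
    unknown = [name for name in names if name not in VALID_ENCODERS]
    if unknown:
        raise ValueError(f"unknown encoder(s): {unknown}")
    if scalers not in VALID_SCALERS:
        raise ValueError(f"unknown scaler: {scalers!r}")
    if verbose not in VALID_VERBOSE:
        raise ValueError(f"verbose must be 0, 1 or 2, got {verbose!r}")
    return {"encoders": encoders, "scalers": scalers, "verbose": verbose}


kwargs = check_options(encoders=["onehot", "label"], scalers="minmax", verbose=1)
print(kwargs)
```

The returned dictionary could then be unpacked into the constructor, for example `LazyTransformer(model=False, **kwargs)`.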

## Maintainers

@AutoViML

## Contributing

See the contributing file!

PRs accepted.

## License

DISCLAIMER

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
