scikit-learn compatible transformer that turns categorical features into dense numeric embeddings

Embedding Encoder

Overview

Embedding Encoder is a scikit-learn-compliant transformer that converts categorical variables into numeric vector representations. It does this by building a small multilayer perceptron in which each categorical variable is passed through its own embedding layer; the trained embedding weights are then extracted and turned into DataFrame columns.
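
Conceptually, the idea can be sketched with a tiny Keras model (this is an illustration of the approach, not Embedding Encoder's actual internals):

import tensorflow as tf

# One categorical variable with 4 levels embedded into 3 dimensions inside a small MLP.
n_categories, embedding_dim = 4, 3

inp = tf.keras.Input(shape=(1,))
emb = tf.keras.layers.Embedding(n_categories, embedding_dim)(inp)
x = tf.keras.layers.Flatten()(emb)
x = tf.keras.layers.Dense(8, activation="relu")(x)
out = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")

# The embedding layer's weight matrix has one row per category; after training,
# each row is that category's dense vector and becomes a set of DataFrame columns.
weights = model.layers[1].get_weights()[0]
print(weights.shape)  # (4, 3)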

Installation and dependencies

Embedding Encoder can be installed with

pip install embedding-encoder

Embedding Encoder has the following dependencies

  • scikit-learn
  • TensorFlow
  • numpy
  • pandas

Documentation

Full documentation, including this readme and the API reference, can be found at RTD.

Usage

Embedding Encoder works like any scikit-learn transformer, the only difference being that it requires y to be passed, since y is the neural network's target. By default it converts categorical variables into integer arrays by applying scikit-learn's OrdinalEncoder.

Embedding Encoder will assume that all input columns are categorical and will calculate embeddings for each, unless the numeric_vars argument is passed. In that case, numeric variables will be included as an additional input to the neural network but no embeddings will be calculated for them, and they will not be included in the output transformation.

Please note that including numeric variables may reduce the interpretability of the final model as their total influence on the target variable can become difficult to disentangle.

The simplest usage example is

from embedding_encoder import EmbeddingEncoder

# X is a DataFrame of categorical columns and y is the target
ee = EmbeddingEncoder(task="regression")
ee.fit(X=X, y=y)
output = ee.transform(X=X)
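
When the DataFrame also contains numeric columns, they can be routed to the network via numeric_vars as described above. A minimal sketch, assuming numeric_vars takes a list of column names and using a made-up toy DataFrame:

import pandas as pd

from embedding_encoder import EmbeddingEncoder

# Hypothetical frame with two categorical columns and one numeric column.
X = pd.DataFrame(
    {
        "city": ["NY", "LA", "NY", "SF"],
        "plan": ["basic", "pro", "pro", "basic"],
        "age": [25, 32, 41, 29],
    }
)
y = pd.Series([0, 1, 1, 0])

# Embeddings are learned for "city" and "plan" only; "age" feeds the network
# as an extra input but is excluded from the transformed output.
ee = EmbeddingEncoder(task="classification", numeric_vars=["age"])
ee.fit(X=X, y=y)
output = ee.transform(X=X)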

Compatibility with scikit-learn

Embedding Encoder can be included in pipelines as a regular transformer, and is compatible with cross-validation and hyperparameter optimization.

In the case of pipelines, if numeric_vars is specified, Embedding Encoder has to be the first step in the pipeline. This is because an Embedding Encoder with numeric_vars requires its X input to be a DataFrame with proper column names, which cannot be guaranteed if previous transformations are applied, as scikit-learn transformers typically return arrays without column names.

Alternatively, previous transformations can be included provided they are held inside the ColumnTransformerWithNames class in this library, which retains feature names.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

from embedding_encoder import EmbeddingEncoder
from embedding_encoder.compose import ColumnTransformerWithNames

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# numeric_vars and categorical_vars are lists of column names in X
ee = EmbeddingEncoder(task="classification", numeric_vars=numeric_vars)
num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
cat_transformer = SimpleImputer(strategy="most_frequent")
col_transformer = ColumnTransformerWithNames([("num_transformer", num_pipe, numeric_vars),
                                              ("cat_transformer", cat_transformer, categorical_vars)])

pipe = make_pipeline(col_transformer,
                     ee,
                     LogisticRegression())
pipe.fit(X_train, y_train)
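
Once fitted, the pipeline behaves like any other scikit-learn estimator, so it can be scored on held-out data or cross-validated as a whole (a sketch continuing from the example above):

from sklearn.model_selection import cross_val_score

# Score on the held-out split.
print(pipe.score(X_test, y_test))

# Cross-validate the full pipeline, Embedding Encoder included.
scores = cross_val_score(pipe, X_train, y_train, cv=3)
print(scores.mean())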

Like scikit-learn transformers, Embedding Encoder also has an inverse_transform method that recomposes the original input.
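
For instance, continuing from the fitted encoder in the simplest example above:

embeddings = ee.transform(X)                  # categorical columns replaced by embedding columns
original = ee.inverse_transform(embeddings)   # maps each embedding back to its original category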

Advanced usage

Embedding Encoder gives some control over the neural network. In particular, its constructor allows setting how deep and large the network should be (by modifying layers_units), as well as the dropout rate between dense layers. Epochs and batch size can also be modified.
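
For illustration, a sketch of a customized encoder; layers_units is named above, while the dropout, epochs and batch_size keyword names are assumptions and may differ in the actual API:

from embedding_encoder import EmbeddingEncoder

ee = EmbeddingEncoder(
    task="classification",
    layers_units=[64, 32],  # two hidden dense layers of 64 and 32 units
    dropout=0.2,            # assumed name for the dropout rate between dense layers
    epochs=20,              # assumed name for the number of training epochs
    batch_size=64,          # assumed name for the mini-batch size
)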

These hyperparameters can be optimized with regular scikit-learn hyperparameter optimization techniques.
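
For example, a sketch of a grid search over the pipeline defined earlier (the step prefix follows make_pipeline's lowercased class-name convention; grid values are purely illustrative):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "embeddingencoder__layers_units": [[32], [64, 32]],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)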

The training loop includes an early stopping callback that restores the best weights (by default, the ones that minimize the validation loss).
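
Conceptually this corresponds to a Keras callback along these lines (a sketch of the idea, not necessarily the library's exact configuration):

import tensorflow as tf

# Stop when the validation loss stops improving and keep the best weights seen.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    restore_best_weights=True,
)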
