PyTorch autoencoder with additional embeddings layer for categorical data.

The Autoembedder

Introduction

The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.

Installation

If you are using Poetry, you can install the package with the following command:

poetry add autoembedder

If you are using pip, you can install the package with the following command:

pip install autoembedder

Installing dependencies

With Poetry:

poetry install

With pip:

pip install -r requirements.txt

Usage

0. Some imports

from autoembedder import Autoembedder, dataloader, fit

1. Create dataloaders

First, we create two dataloaders: one for the training data and one for the validation data. As a source, each accepts a path to a Parquet file, a path to a folder of Parquet files, or a Pandas/Dask DataFrame.

train_dl = dataloader(train_df)
valid_dl = dataloader(valid_df)

2. Set parameters

Now, we need to set the parameters. They are used for handling the data and training the model. In this example, only training parameters are set. All possible parameters are listed in the Parameters section below.

parameters = {
    "hidden_layers": [[25, 20], [20, 10]],
    "epochs": 10,
    "lr": 0.0001,
    "verbose": 1,
}

3. Initialize the autoembedder

Next, we initialize the autoembedder. This example uses no categorical features, so the embedding_sizes argument can be omitted.

model = Autoembedder(parameters, num_cont_features=train_df.shape[1])

4. Train the model

Everything is set up. Now we can fit the model.

fit(parameters, model, train_dl, valid_dl)

Example

Check out this Jupyter notebook for an applied example using the Credit Card Fraud Detection dataset from Kaggle.

Parameters

This is a list of all parameters that can be passed to the Autoembedder for training. When using the training script, replace each _ in a parameter name with - and pass the parameters as command-line arguments. For boolean values, see the Comment column for how to pass them.
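The naming rule above can be sketched as a small helper that turns a parameters dictionary into command-line arguments. This is only an illustration: `to_cli_args` is a hypothetical function and not part of the package, and it assumes that list values are passed as their Python string representation, as in the examples of this README.

```python
import shlex


def to_cli_args(parameters: dict) -> list[str]:
    """Convert a parameters dict into training-script arguments (hypothetical helper)."""
    args = []
    for name, value in parameters.items():
        flag = name.replace("_", "-")  # underscores become hyphens
        if isinstance(value, bool):
            # Booleans are passed as --flag / --no-flag, without a value.
            args.append(f"--{flag}" if value else f"--no-{flag}")
        else:
            # Everything else is passed as "--flag value"; lists use their
            # string representation, e.g. "[[12, 6], [6, 3]]".
            args.extend([f"--{flag}", str(value)])
    return args


params = {
    "epochs": 20,
    "hidden_layers": [[12, 6], [6, 3]],
    "xavier_init": True,
    "drop_last": False,
}
args = to_cli_args(params)

# shlex.join quotes the list argument so it survives the shell.
print(shlex.join(["python3", "training.py", *args]))
```

The boolean check comes first because `bool` is a subclass of `int` in Python; otherwise `True` would be emitted as the string "True" instead of a bare flag.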

Run the training script

You can also use the training script:

python3 training.py \
--epochs 20 \
--train-input-path "path/to/your/train_data" \
--test-input-path "path/to/your/test_data" \
--hidden-layers "[[12, 6], [6, 3]]"

For help, run:

python3 training.py --help

| Argument | Type | Required | Default value | Comment |
| --- | --- | --- | --- | --- |
| batch_size | int | False | 32 | |
| drop_last | bool | False | True | --drop-last / --no-drop-last |
| pin_memory | bool | False | True | --pin-memory / --no-pin-memory |
| num_workers | int | False | 0 | 0 means that the data will be loaded in the main process |
| use_mps | bool | False | False | --use-mps / --no-use-mps |
| model_title | str | False | autoembedder_{datetime}.bin | |
| model_save_path | str | False | | |
| n_save_checkpoints | int | False | | |
| lr | float | False | 0.001 | |
| amsgrad | bool | False | False | --amsgrad / --no-amsgrad |
| epochs | int | True | | |
| dropout_rate | float | False | 0 | Dropout rate for the dropout layers in the encoder and decoder. |
| layer_bias | bool | False | True | --layer-bias / --no-layer-bias |
| weight_decay | float | False | False | |
| l1_lambda | float | False | 0 | |
| xavier_init | bool | False | False | --xavier-init / --no-xavier-init |
| activation | str | False | tanh | Activation function; either tanh, relu, leaky_relu or elu |
| tensorboard_log_path | str | False | | |
| trim_eval_errors | bool | False | False | --trim-eval-errors / --no-trim-eval-errors; Removes the max and min loss when calculating the mean loss diff and median loss diff. This can be useful if some rows create very high losses. |
| verbose | int | False | 0 | Set this to 1 if you want to see the model summary and the validation and evaluation results. Set this to 2 if you want to see the training progress bar. 0 means no output. |
| target | str | False | | The target column. If not set, no evaluation will be performed. |
| train_input_path | str | True | | |
| test_input_path | str | True | | |
| eval_input_path | str | False | | Path to the evaluation data. If no path is provided, no evaluation will be performed. |
| hidden_layers | str | True | | Contains a string representation of a list of lists of integers which represents the hidden layer structure. E.g.: "[[64, 32], [32, 16], [16, 8]]" |
| cat_columns | str | False | "[]" | Contains a string representation of a list of lists of categorical columns (strings). The columns which use the same encoder should be together in a list. E.g.: "[['a', 'b'], ['c']]". |
| drop_cat_columns | bool | False | | --drop-cat-columns / --no-drop-cat-columns |
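
The hidden_layers and cat_columns arguments are strings that represent nested lists. How the training script parses them internally is not documented here; the following is a minimal sketch of one safe way to turn such a string into a Python list, using `ast.literal_eval` from the standard library. The helper `parse_nested_list` is hypothetical and not part of the package.

```python
import ast


def parse_nested_list(text: str, item_type: type) -> list:
    """Parse a string such as "[[64, 32], [32, 16]]" into a list of lists."""
    # literal_eval only evaluates Python literals, never arbitrary code.
    value = ast.literal_eval(text)
    if not isinstance(value, list) or not all(isinstance(g, list) for g in value):
        raise ValueError(f"expected a list of lists, got: {text!r}")
    for group in value:
        for item in group:
            if not isinstance(item, item_type):
                raise ValueError(f"expected {item_type.__name__} items, got: {item!r}")
    return value


hidden_layers = parse_nested_list("[[64, 32], [32, 16], [16, 8]]", int)
cat_columns = parse_nested_list("[['a', 'b'], ['c']]", str)
no_cat_columns = parse_nested_list("[]", str)  # the documented default

print(hidden_layers)
print(cat_columns)
```

Validating the shape before training starts gives a clear error for an input such as "[64, 32]", which is a flat list rather than a list of lists.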

Why additional embedding layers?

The additional embedding layers automatically embed all columns with the Pandas category data type. Categorical columns with any other data type are not embedded and are handled like continuous columns. Simply encoding the categorical values (e.g., with a label encoder) decreases the quality of the outcome.
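
The text above does not say why plain label encoding hurts. A common explanation, offered here as an assumption rather than as the author's reasoning, is that integer codes imply an order and distances between categories that do not exist, while an embedding gives each category its own vector of trainable values. The sketch below illustrates the difference with the standard library only; the category names and the randomly initialized table are made up for illustration and are not how the package builds its layers.

```python
import random

categories = ["blue", "green", "red"]

# Label encoding: every category is mapped to a single integer.
label_codes = {name: code for code, name in enumerate(categories)}

# Treated as continuous values, the codes suggest that "red" is twice as far
# from "blue" as "green" is, although the colors have no such order.
dist_blue_green = abs(label_codes["blue"] - label_codes["green"])
dist_blue_red = abs(label_codes["blue"] - label_codes["red"])

# Embedding lookup: the code only selects a row in a table of weights.
# In a real model these weights are learned during training; here they are
# random placeholders.
rng = random.Random(0)
embedding_dim = 2
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in categories
]


def embed(name: str) -> list:
    """Return the embedding vector for a category."""
    return embedding_table[label_codes[name]]


print(dist_blue_green, dist_blue_red)
print(embed("red"))
```

In Pandas, a column can be given the category data type with `df["column"] = df["column"].astype("category")`, which is the data type the embedding layers look for according to the paragraph above.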

