
PyTorch autoencoder with additional embeddings layer for categorical data.

Project description

The Autoembedder

Introduction

The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.
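As a minimal sketch of preparing input data (the column names here are hypothetical, not part of the package): the trainer accepts either a Parquet path or the DataFrame object itself, so a Pandas frame like the following could be passed directly or written out first.

```python
import pandas as pd

# Hypothetical toy dataset: two categorical columns and one continuous column.
df = pd.DataFrame(
    {
        "color": ["red", "blue", "red", "green"],
        "size": ["S", "M", "L", "M"],
        "price": [9.99, 14.50, 19.00, 14.50],
    }
)

# Columns that should be embedded must use the Pandas "category" dtype.
df[["color", "size"]] = df[["color", "size"]].astype("category")

# Either pass `df` directly, or store it as Parquet and pass the path
# instead (writing Parquet requires pyarrow or fastparquet):
# df.to_parquet("train_data.parquet")
```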

Installation

If you are using Poetry, you can install the package with the following command:

poetry add autoembedder

If you are using pip, you can install the package with the following command:

pip install autoembedder

Installing dependencies

With Poetry:

poetry install

With pip:

pip install -r requirements.txt

Parameters

This is a list of all parameters that can be passed to the Autoembedder for training:

| Argument | Type | Required (only when running `training.py`) | Default value | Comment |
| --- | --- | --- | --- | --- |
| batch_size | int | False | 32 | |
| drop_last | int | False | 1 | True/False |
| pin_memory | int | False | 1 | True/False |
| num_workers | int | False | 0 | 0 means the data is loaded in the main process |
| use_mps | int | False | 0 | Set this to 1 to use the MPS backend for running on a Mac with an M1 GPU. |
| model_title | str | False | autoembedder_{datetime}.bin | |
| model_save_path | str | False | | |
| n_save_checkpoints | int | False | | |
| lr | float | False | 0.001 | |
| amsgrad | int | False | 0 | True/False |
| epochs | int | True | | |
| layer_bias | int | False | 1 | True/False |
| weight_decay | float | False | 0 | |
| l1_lambda | float | False | 0 | |
| xavier_init | int | False | 0 | True/False |
| tensorboard_log_path | str | False | | |
| trim_eval_errors | int | False | 0 | Removes the max and min loss when calculating the mean and median loss diff. This can be useful if some rows create very high losses. |
| verbose | int | False | 0 | Set this to 1 to see the model summary and the validation and evaluation results; set it to 2 to also see the training progress bar. 0 means no output. |
| target | str | False | | The target column. If not set, no evaluation will be performed. |
| train_input_path | str | True | | |
| test_input_path | str | True | | |
| eval_input_path | str | False | | Path to the evaluation data. If no path is provided, no evaluation will be performed. |
| activation_for_code_layer | int | False | 0 | True/False; should the code layer have an activation |
| activation_for_final_decoder_layer | int | False | 0 | True/False; should the final decoder layer have an activation |
| hidden_layer_representation | str | True | | String representation of a list of lists of integers describing the hidden layer structure, e.g.: "[[64, 32], [32, 16], [16, 8]]" |
| cat_columns | str | False | "[]" | String representation of a list of lists of categorical column names (strings). Columns that should share an encoder belong in the same inner list, e.g.: "[['a', 'b'], ['c']]" |
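The string-valued parameters `hidden_layer_representation` and `cat_columns` are Python literals in string form; a quick way to sanity-check them before passing them in is `ast.literal_eval` from the standard library:

```python
import ast

hidden_layer_representation = "[[64, 32], [32, 16], [16, 8]]"
cat_columns = "[['a', 'b'], ['c']]"

# Both strings parse into nested Python lists; a SyntaxError or
# ValueError here means the string representation is malformed.
layers = ast.literal_eval(hidden_layer_representation)
columns = ast.literal_eval(cat_columns)

print(layers[0])   # [64, 32]
print(columns[1])  # ['c']
```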

Run

Something like this should do it:

python3 training.py --epochs 20 \
--train_input_path "path/to/your/train_data" \
--test_input_path "path/to/your/test_data" \
--hidden_layer_representation "[[12, 6], [6, 3]]"

Why additional embedding layers?

The additional embedding layers automatically embed all columns with the Pandas category data type. If categorical columns have another data type, they will not be embedded and will be handled like continuous columns. Simply encoding the categorical values (e.g., with a label encoder) decreases the quality of the outcome.
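To illustrate the distinction (the column name is hypothetical): a column converted to the category dtype is picked up by the embedding layers, while the same values replaced by integer label codes have a numeric dtype and would be treated as a continuous feature.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Paris", "Berlin", "Rome"]})

# Marked as categorical: the embedding layers will pick this column up.
embedded = df["city"].astype("category")

# Plain integer label encoding: the dtype is numeric, so the column
# would be handled like any continuous column instead of embedded.
label_encoded = embedded.cat.codes

print(embedded.dtype)       # category
print(label_encoded.dtype)  # an integer dtype
```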

Project details


Download files

Download the file for your platform.

Source Distribution

autoembedder-0.1.3.tar.gz (13.4 kB view details)

Uploaded Source

Built Distribution

autoembedder-0.1.3-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file autoembedder-0.1.3.tar.gz.

File metadata

  • Download URL: autoembedder-0.1.3.tar.gz
  • Upload date:
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.9.15 Linux/5.15.0-1022-azure

File hashes

Hashes for autoembedder-0.1.3.tar.gz
Algorithm Hash digest
SHA256 ebd959fb2987082fd5252e6e735c327702a43e10ce6d855c6c5ebf7486f0085e
MD5 472584568c243b10c4908f7aa4a1c0e2
BLAKE2b-256 eb21374a97a4d3becafa8b187098645af45deeedac58ff6a2eeade7bb458ee40

See more details on using hashes here.
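As a sketch, a downloaded file can be checked against the published SHA256 digest with nothing but the standard library:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the digest shown above, e.g.:
# sha256_of("autoembedder-0.1.3.tar.gz") == "ebd959fb2987082fd5252e6e735c327702a43e10ce6d855c6c5ebf7486f0085e"
```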

File details

Details for the file autoembedder-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: autoembedder-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 15.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.9.15 Linux/5.15.0-1022-azure

File hashes

Hashes for autoembedder-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 fbc55a60dd54300c41048cc9651b09727fa398cc697bda6e392832825c6c3af9
MD5 91ea84b61279d5fda9aff4355941600f
BLAKE2b-256 f1a59d67cdc2613f5e3a3c6236ee6617f2bceb4587907c9ee8998e4f7e95c36e

See more details on using hashes here.
