A PyTorch autoencoder with additional embedding layers for categorical data.
The Autoembedder
Introduction
The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.
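For example, a Pandas DataFrame can be written to Parquet and the resulting file passed to the training script by path. A minimal sketch (the file name and columns are illustrative):

```python
import pandas as pd

# Illustrative training data with a categorical and a continuous column.
df_train = pd.DataFrame(
    {
        "color": ["red", "green", "red", "blue"],
        "price": [1.2, 3.4, 2.2, 0.9],
    }
)

# Store the DataFrame in Parquet format; the resulting path can then be
# passed to the training script via --train_input_path (and likewise for
# the test and evaluation data).
df_train.to_parquet("train_data.parquet")
```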
Installation
If you are using Poetry, you can install the package with the following command:
```bash
poetry add autoembedder
```
If you are using pip, you can install the package with the following command:
```bash
pip install autoembedder
```
Installing dependencies
With Poetry:
```bash
poetry install
```
With pip:
```bash
pip install -r requirements.txt
```
Parameters
This is a list of all parameters that can be passed to the Autoembedder for training:
Argument | Type | Required (only when running with `training.py`) | Default value | Comment |
---|---|---|---|---|
batch_size | int | False | 32 | |
drop_last | int | False | 1 | True/False |
pin_memory | int | False | 1 | True/False |
num_workers | int | False | 0 | 0 means that the data will be loaded in the main process |
use_mps | int | False | 0 | Set this to 1 if you want to use the MPS backend for running on a Mac with an M1 GPU. |
model_title | str | False | autoembedder_{datetime}.bin | |
model_save_path | str | False | | |
n_save_checkpoints | int | False | | |
lr | float | False | 0.001 | |
amsgrad | int | False | 0 | True/False |
epochs | int | True | | |
layer_bias | int | False | 1 | True/False |
weight_decay | float | False | 0 | |
l1_lambda | float | False | 0 | |
xavier_init | int | False | 0 | True/False |
tensorboard_log_path | str | False | | |
trim_eval_errors | int | False | 0 | Removes the maximum and minimum loss when calculating the mean loss diff and median loss diff. This can be useful if some rows create very high losses. |
verbose | int | False | 0 | Set this to 1 to see the model summary and the validation and evaluation results; set it to 2 to also see the training progress bar. 0 means no output. |
target | str | False | | The target column. If not set, no evaluation will be performed. |
train_input_path | str | True | | |
test_input_path | str | True | | |
eval_input_path | str | False | | Path to the evaluation data. If no path is provided, no evaluation will be performed. |
activation_for_code_layer | int | False | 0 | True/False; whether the code layer should have an activation function |
activation_for_final_decoder_layer | int | False | 0 | True/False; whether the final decoder layer should have an activation function |
hidden_layer_representation | str | True | | Contains a string representation of a list of lists of integers which represents the hidden layer structure, e.g.: "[[64, 32], [32, 16], [16, 8]]" (see the example after this table) |
cat_columns | str | False | "[]" | Contains a string representation of a list of lists of categorical columns (strings). Columns which use the same encoder should be together in one list, e.g.: "[['a', 'b'], ['c']]" |
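Both `hidden_layer_representation` and `cat_columns` expect a Python-literal string. One way to build them from plain Python lists, as a minimal sketch (the column names are placeholders):

```python
# Build the string-encoded list arguments from plain Python lists.
hidden_layers = [[64, 32], [32, 16], [16, 8]]
cat_columns = [["a", "b"], ["c"]]  # columns sharing an encoder go in one inner list

# str() yields exactly the literal format shown in the table above:
# "[[64, 32], [32, 16], [16, 8]]" and "[['a', 'b'], ['c']]"
hidden_layer_representation_arg = str(hidden_layers)
cat_columns_arg = str(cat_columns)
print(hidden_layer_representation_arg)
print(cat_columns_arg)
```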
Run
Something like this should do it:
```bash
python3 training.py --epochs 20 \
    --train_input_path "path/to/your/train_data" \
    --test_input_path "path/to/your/test_data" \
    --hidden_layer_representation "[[12, 6], [6, 3]]"
```
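Optional flags from the parameters table can be appended in the same way. A fuller invocation might look like this (all paths and values are placeholders):

```bash
python3 training.py --epochs 20 \
    --train_input_path "path/to/your/train_data" \
    --test_input_path "path/to/your/test_data" \
    --eval_input_path "path/to/your/eval_data" \
    --cat_columns "[['a', 'b'], ['c']]" \
    --verbose 1 \
    --hidden_layer_representation "[[12, 6], [6, 3]]"
```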
Why additional embedding layers?
The additional embedding layers automatically embed all columns with the Pandas category data type. If categorical columns have another data type, they will not be embedded and will instead be handled like continuous columns. Simply encoding the categorical values (e.g., with a label encoder) decreases the quality of the outcome.
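For example, casting a column to the category data type before saving the DataFrame ensures it will be embedded. A minimal sketch (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Paris", "Berlin"], "visits": [3, 5, 2]})

# Columns with the "category" dtype are embedded automatically; columns
# left as object/int would be treated as continuous features instead.
df["city"] = df["city"].astype("category")
print(df.dtypes)  # city -> category, visits -> int64
```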