
PyTorch autoencoder with additional embeddings layer for categorical data.

Project description

The Autoembedder

Introduction

The Autoembedder is an autoencoder with additional embedding layers for the categorical columns. Its usage is flexible, and hyperparameters like the number of layers can be easily adjusted and tuned. The data provided for training can be either a path to a Dask or Pandas DataFrame stored in the Parquet format or the DataFrame object directly.
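As a minimal sketch of preparing input data (the column names here are hypothetical, not part of the package): the trainer accepts either a Parquet path or the DataFrame object itself, so a Pandas frame like the following could be passed directly or written out first.

```python
import pandas as pd

# Hypothetical toy dataset: two categorical columns and one continuous column.
df = pd.DataFrame(
    {
        "color": ["red", "blue", "red", "green"],
        "size": ["S", "M", "L", "M"],
        "price": [9.99, 14.50, 19.00, 14.50],
    }
)

# Columns that should be embedded must use the Pandas "category" dtype.
df[["color", "size"]] = df[["color", "size"]].astype("category")

# Either pass `df` directly, or store it as Parquet and pass the path
# instead (writing Parquet requires pyarrow or fastparquet):
# df.to_parquet("train_data.parquet")
```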

Installation

If you are using Poetry, you can install the package with the following command:

poetry add autoembedder

If you are using pip, you can install the package with the following command:

pip install autoembedder

Installing dependencies

With Poetry:

poetry install

With pip:

pip install -r requirements.txt

Parameters

This is a list of all parameters that can be passed to the Autoembedder for training:

| Argument | Type | Required (only when running `training.py`) | Default value | Comment |
| --- | --- | --- | --- | --- |
| batch_size | int | False | 32 | |
| drop_last | int | False | 1 | True/False |
| pin_memory | int | False | 1 | True/False |
| num_workers | int | False | 0 | 0 means the data is loaded in the main process |
| use_mps | int | False | 0 | Set this to 1 to use the MPS backend for running on a Mac with an M1 GPU. |
| model_title | str | False | autoembedder_{datetime}.bin | |
| model_save_path | str | False | | |
| n_save_checkpoints | int | False | | |
| lr | float | False | 0.001 | |
| amsgrad | int | False | 0 | True/False |
| epochs | int | True | | |
| layer_bias | int | False | 1 | True/False |
| weight_decay | float | False | 0 | |
| l1_lambda | float | False | 0 | |
| xavier_init | int | False | 0 | True/False |
| tensorboard_log_path | str | False | | |
| trim_eval_errors | int | False | 0 | Removes the max and min loss when calculating the mean and median loss diff. This can be useful if some rows create very high losses. |
| verbose | int | False | 0 | Set this to 1 to see the model summary and the validation and evaluation results; set it to 2 to also see the training progress bar. 0 means no output. |
| target | str | False | | The target column. If not set, no evaluation will be performed. |
| train_input_path | str | True | | |
| test_input_path | str | True | | |
| eval_input_path | str | False | | Path to the evaluation data. If no path is provided, no evaluation will be performed. |
| activation_for_code_layer | int | False | 0 | True/False; should the code layer have an activation |
| activation_for_final_decoder_layer | int | False | 0 | True/False; should the final decoder layer have an activation |
| hidden_layer_representation | str | True | | String representation of a list of lists of integers describing the hidden layer structure, e.g.: "[[64, 32], [32, 16], [16, 8]]" |
| cat_columns | str | False | "[]" | String representation of a list of lists of categorical column names (strings). Columns that should share an encoder belong in the same inner list, e.g.: "[['a', 'b'], ['c']]" |
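The string-valued parameters `hidden_layer_representation` and `cat_columns` are Python literals in string form; a quick way to sanity-check them before passing them in is `ast.literal_eval` from the standard library:

```python
import ast

hidden_layer_representation = "[[64, 32], [32, 16], [16, 8]]"
cat_columns = "[['a', 'b'], ['c']]"

# Both strings parse into nested Python lists; a SyntaxError or
# ValueError here means the string representation is malformed.
layers = ast.literal_eval(hidden_layer_representation)
columns = ast.literal_eval(cat_columns)

print(layers[0])   # [64, 32]
print(columns[1])  # ['c']
```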

Run

Something like this should do it:

python3 training.py --epochs 20 \
--train_input_path "path/to/your/train_data" \
--test_input_path "path/to/your/test_data" \
--hidden_layer_representation "[[12, 6], [6, 3]]"

Why additional embedding layers?

The additional embedding layers automatically embed all columns with the Pandas category data type. If categorical columns have another data type, they will not be embedded and will be handled like continuous columns. Simply encoding the categorical values (e.g., with a label encoder) decreases the quality of the outcome.
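To illustrate the distinction (the column name is hypothetical): a column converted to the category dtype is picked up by the embedding layers, while the same values replaced by integer label codes have a numeric dtype and would be treated as a continuous feature.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Paris", "Berlin", "Rome"]})

# Marked as categorical: the embedding layers will pick this column up.
embedded = df["city"].astype("category")

# Plain integer label encoding: the dtype is numeric, so the column
# would be handled like any continuous column instead of embedded.
label_encoded = embedded.cat.codes

print(embedded.dtype)       # category
print(label_encoded.dtype)  # an integer dtype
```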

Project details


Download files

Download the file for your platform.

Source Distribution

autoembedder-0.1.3.tar.gz (13.4 kB view details)

Uploaded Source

Built Distribution

autoembedder-0.1.3-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file autoembedder-0.1.3.tar.gz.

File metadata

  • Download URL: autoembedder-0.1.3.tar.gz
  • Upload date:
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.9.15 Linux/5.15.0-1022-azure

File hashes

Hashes for autoembedder-0.1.3.tar.gz
Algorithm Hash digest
SHA256 ebd959fb2987082fd5252e6e735c327702a43e10ce6d855c6c5ebf7486f0085e
MD5 472584568c243b10c4908f7aa4a1c0e2
BLAKE2b-256 eb21374a97a4d3becafa8b187098645af45deeedac58ff6a2eeade7bb458ee40

See more details on using hashes here.
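As a sketch, a downloaded file can be checked against the published SHA256 digest with nothing but the standard library:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the digest shown above, e.g.:
# sha256_of("autoembedder-0.1.3.tar.gz") == "ebd959fb2987082fd5252e6e735c327702a43e10ce6d855c6c5ebf7486f0085e"
```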

File details

Details for the file autoembedder-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: autoembedder-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 15.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.9.15 Linux/5.15.0-1022-azure

File hashes

Hashes for autoembedder-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 fbc55a60dd54300c41048cc9651b09727fa398cc697bda6e392832825c6c3af9
MD5 91ea84b61279d5fda9aff4355941600f
BLAKE2b-256 f1a59d67cdc2613f5e3a3c6236ee6617f2bceb4587907c9ee8998e4f7e95c36e

See more details on using hashes here.
