sequifier

Train a transformer model with the command line

Project description

easy sequence classification training and inference with transformers

Overview

The sequifier package enables:

the extraction of sequences for training
the configuration and training of a transformer classification model
inference on data with a trained model

Other materials

If you want to first get a more specific understanding of the transformer architecture, have a look at the Wikipedia article.

If you want to see a benchmark on a small synthetic dataset with 10k cases, against a random forest, an XGBoost model and a logistic regression, check out this notebook.

Complete example of how to build and apply a transformer sequence classifier with sequifier

create a conda environment with Python 3.9.12, activate it and run

pip install sequifier

create a new project folder (at a path referred to as PROJECT PATH later) and a configs subfolder
copy the default configs for preprocessing, training and inference from the repository
adapt the preprocess config to take the path to the data you want to preprocess and set project_path to PROJECT PATH
run

sequifier --preprocess --config_path=[PROJECT PATH]/configs/preprocess.yaml

the preprocessing step outputs a "data driven config" at [PROJECT PATH]/configs/ddconfigs/[FILE NAME]. It contains the number of classes found in the data, a map of classes to indices and the paths to the train, validation and test splits of the data. Set the dd_config parameter in train-on-preprocessed.yaml and infer.yaml to the path [PROJECT PATH]/configs/ddconfigs/[FILE NAME] and set project_path to PROJECT PATH in both configs
run

sequifier --train --on-preprocessed --config_path=[PROJECT PATH]/configs/train-on-preprocessed.yaml

adapt inference_data_path in infer.yaml
run

sequifier --infer --config_path=[PROJECT PATH]/configs/infer.yaml

find your predictions at [PROJECT PATH]/outputs/predictions/sequifier-default-best_predictions.csv
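
The three commands from the walkthrough above can also be assembled from a script. The following is a minimal sketch, not part of sequifier itself: the helper name `sequifier_commands` is hypothetical, and only the flags and config file names shown above are used. It assumes the configs have already been adapted as described, including the dd_config path that only exists after preprocessing.

```python
from pathlib import PurePosixPath


def sequifier_commands(project_path):
    """Return the three CLI invocations from the walkthrough as argument lists.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    configs = PurePosixPath(project_path) / "configs"
    preprocess = f"--config_path={configs / 'preprocess.yaml'}"
    train = f"--config_path={configs / 'train-on-preprocessed.yaml'}"
    infer = f"--config_path={configs / 'infer.yaml'}"
    return [
        ["sequifier", "--preprocess", preprocess],
        ["sequifier", "--train", "--on-preprocessed", train],
        ["sequifier", "--infer", infer],
    ]


if __name__ == "__main__":
    for command in sequifier_commands("my_project"):
        # Print each command; pass it to subprocess.run(command, check=True)
        # to execute it once the corresponding config has been adapted.
        print(" ".join(command))
```

Because the training and inference configs depend on the output of the preprocessing step, running all three commands unattended only makes sense once the [FILE NAME] of the data driven config is known.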

More detailed explanations of the three steps

Preprocessing of data into sequences for training

The preprocessing step is designed for scenarios with long series of events, where the prediction of the next event from the previous N events is of interest. For sequences where only the last item is a valid target, the preprocessing step should not be executed.

This step presupposes input data with three columns: "sequenceId", "itemId" and "timesort". "sequenceId" and "itemId" identify sequence and item, and the timesort column must provide values that enable sequential sorting. Often this will simply be a timestamp. You can find an example of the preprocessing input data at documentation/example_inputs/preprocessing_input.csv
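
As an illustration of this input format, the sketch below writes a handful of events to CSV text with the three required columns. The column names come from the description above; the event values and the helper name `to_preprocessing_csv` are made up for the example, and sorting the rows is done here only for readability.

```python
import csv
import io

# Hypothetical events as (sequenceId, itemId, timesort) tuples, in arbitrary order.
rows = [
    (0, 102, 3),
    (0, 101, 1),
    (1, 205, 1),
    (0, 103, 2),
    (1, 207, 2),
]


def to_preprocessing_csv(rows):
    """Serialise events to CSV text with the three columns named in the README."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequenceId", "itemId", "timesort"])
    # Order rows by sequence, then by timesort, so each sequence reads in order.
    writer.writerows(sorted(rows, key=lambda row: (row[0], row[2])))
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_preprocessing_csv(rows))
```

Here timesort is a small integer, but any value that sorts sequentially, such as a timestamp, plays the same role.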

The data can then be processed and split into training, validation and testing datasets of all valid subsequences in the original data with the command:

sequifier --preprocess --config_path=[CONFIG PATH]

The config path is the path to the preprocessing config. The project path, set in that config, is the path to the (preferably empty) folder to which the output files of the different steps are written.

The default config can be found on this path:

configs/preprocess.yaml
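
To make the idea of predicting the next event from the previous N events concrete, here is a minimal sliding-window sketch. It is not sequifier's implementation: the function name `extract_subsequences` is hypothetical, and it assumes that a "valid subsequence" is a full window of N items followed by one target item. The split into training, validation and test data is not shown.

```python
from collections import defaultdict


def extract_subsequences(rows, seq_length):
    """Turn (sequenceId, itemId, timesort) events into (id, window, target) examples.

    Assumption: only windows with exactly seq_length items and a following
    target item are emitted.
    """
    by_sequence = defaultdict(list)
    for sequence_id, item_id, timesort in rows:
        by_sequence[sequence_id].append((timesort, item_id))

    examples = []
    for sequence_id, events in by_sequence.items():
        # Sort each sequence by its timesort value before windowing.
        items = [item_id for _, item_id in sorted(events)]
        for end in range(seq_length, len(items)):
            window = items[end - seq_length:end]
            examples.append((sequence_id, window, items[end]))
    return examples


if __name__ == "__main__":
    events = [(0, "c", 3), (0, "a", 1), (0, "d", 4), (0, "b", 2), (1, "x", 1)]
    for example in extract_subsequences(events, seq_length=2):
        print(example)
```

In this sketch, a sequence with seq_length items or fewer yields no examples, because no target item follows a full window.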

Configuring and training the sequence classification model

The training step is executed with the command:

sequifier --train --config_path=[CONFIG PATH]

If the data on which the model is trained comes from the preprocessing step, the flag

--on-preprocessed

should also be added.

If the training data does not come from the preprocessing step, both train and validation data have to take the form of a CSV file with the columns "sequenceId", [SEQ LENGTH], [SEQ LENGTH - 1], ..., "1", "target". You can find an example of the training input data at documentation/example_inputs/training_input.csv
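
The column layout described above can be generated for any sequence length. The sketch below uses hypothetical helper names (`training_columns`, `training_row`) and assumes that the column named [SEQ LENGTH] holds the oldest item of the window and "1" the most recent one; check the example file to confirm the ordering.

```python
def training_columns(seq_length):
    """Column names: "sequenceId", seq_length down to 1 as strings, then "target"."""
    positions = [str(position) for position in range(seq_length, 0, -1)]
    return ["sequenceId"] + positions + ["target"]


def training_row(sequence_id, window, target):
    """Map one example onto the training columns.

    Assumption: window is ordered oldest to newest, so its first item lands in
    the column named after the sequence length and its last item in "1".
    """
    columns = training_columns(len(window))
    return dict(zip(columns, [sequence_id, *window, target]))


if __name__ == "__main__":
    print(training_columns(3))
    print(training_row(7, [10, 11, 12], 13))
```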

The training step is configured using a config file. The two default configs can be found here:

configs/train.yaml

configs/train-on-preprocessed.yaml

depending on whether the preprocessing step was executed.

Inferring on test data using the trained model

Inference is done using the command:

sequifier --infer --config_path=[CONFIG PATH]

and configured using a config file. The default version can be found here:

configs/infer.yaml

Release history

2.2.1

May 30, 2024

2.2.0

May 26, 2024

2.1.0

May 1, 2024

2.0.0

Apr 21, 2024

1.1.4

Mar 19, 2024

1.1.3

Mar 19, 2024

1.1.2

Mar 18, 2024

1.1.1

Mar 18, 2024

1.1.0

Mar 18, 2024

1.0.1

Mar 18, 2024

1.0.0

Feb 29, 2024

0.3.2

Mar 4, 2023

0.3.1

Mar 3, 2023

This version

0.3.0

Mar 1, 2023

0.2.9

Feb 18, 2023

0.2.8

Feb 18, 2023

0.2.7

Jan 30, 2023

0.2.6

Jan 29, 2023

0.2.5

Jan 29, 2023

0.2.4

Jan 29, 2023

0.2.3

Jan 28, 2023

0.2.2

Jan 28, 2023

0.2.1

Jan 24, 2023

0.2.0

Jan 21, 2023

0.1.0

Jan 21, 2023

0.0.1

Jan 21, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sequifier-0.3.0.tar.gz (13.9 kB)

Uploaded Mar 1, 2023 Source

Built Distribution

sequifier-0.3.0-py3-none-any.whl (15.5 kB)

Uploaded Mar 1, 2023 Python 3

Hashes for sequifier-0.3.0.tar.gz

Algorithm	Hash digest
SHA256	`c9f608eac53b4d4163c6c0d2e5c4dd00e4285b83bfadd6b8f77186cb911cda10`
MD5	`cbbf388b85bb68d58e0d63f1a2f8ecc3`
BLAKE2b-256	`03e25bd457497c5ad2f8c18e08a1ca47633fda4d4d820d04c7ff03d02b9d9816`

Hashes for sequifier-0.3.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`d020443624d381356bad9a84894ad504ef5e3d6e15d594c9e05e79bbc34f2928`
MD5	`dab7d861c93f701771f628985b6aa5d6`
BLAKE2b-256	`96410cca821f7cee4f4d59729c82aec32061d6bce1a8d2fd12e42fe468ecdc0f`