
Train a transformer model with the command line


one-to-one and many-to-one autoregression made easy

Sequifier enables sequence classification or regression on time-based sequences using transformer models, via a CLI. It covers three steps, each configured through a yaml file: preprocessing, which takes a single- or multi-variable columnar data file and creates training, validation, and test sequences; training, which trains a transformer model; and inference, which computes model outputs for data (usually the test data from preprocessing).




Overview

The sequifier package enables:

  • the extraction of sequences for training
  • the configuration and training of a transformer classification or regression model
  • using multiple input and output sequences
  • inference on data with a trained model

Other materials

If you want to first get a more specific understanding of the transformer architecture, have a look at the Wikipedia article.

If you want to see a benchmark on a small synthetic dataset with 10k cases, against a random forest, an XGBoost model, and a logistic regression, check out this notebook.

Complete example of how to build and apply a transformer sequence classifier with sequifier

  1. Create a conda environment with python >=3.9, activate it, and run
pip install sequifier
  2. To create the project folder with the config templates in the configs subfolder, run
sequifier make YOUR_PROJECT_NAME
  3. cd into the YOUR_PROJECT_NAME folder, create a data folder, add your data, and adapt the config file preprocess.yaml in the configs folder to take the path to the data
  4. Run
sequifier preprocess
  5. The preprocessing step outputs a "data driven config" at configs/ddconfigs/[FILE NAME]. It contains the number of classes found in the data, a map of classes to indices, and the paths to the train, validation, and test splits of the data. Adapt the dd_config parameter in train.yaml and infer.yaml to the path configs/ddconfigs/[FILE NAME]
  6. Adapt the config file train.yaml to specify the transformer hyperparameters you want and run
sequifier train
  7. Adapt data_path in infer.yaml to one of the files output in the preprocessing step
  8. Run
sequifier infer
  9. Find your predictions at [PROJECT PATH]/outputs/predictions/sequifier-default-best-predictions.csv
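The "data driven config" produced in step 5 can be pictured as a small mapping. The sketch below is a hypothetical illustration only: the field names (`n_classes`, `id_maps`, `split_paths`) and the parquet file names are assumptions, not the actual file format; what the text does state is the content — a class count, a class-to-index map, and paths to the three data splits.

```python
# Hypothetical structure of the data-driven config written by preprocessing.
# Field names and paths are illustrative assumptions; only the kinds of
# content (class count, class-to-index map, split paths) come from the docs.
dd_config = {
    "n_classes": {"itemId": 3},                      # classes found in the data
    "id_maps": {"itemId": {5: 0, 3: 1, 7: 2}},       # class value -> model index
    "split_paths": [
        "data/preprocessing_input-split0.parquet",   # train
        "data/preprocessing_input-split1.parquet",   # validation
        "data/preprocessing_input-split2.parquet",   # test
    ],
}
```

In train.yaml and infer.yaml, the dd_config parameter then simply points at this file, so the model picks up the class count and index mapping without manual bookkeeping.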

More detailed explanations of the three steps

Preprocessing of data into sequences for training

The preprocessing step is designed for scenarios where, for time series or time-series-like data, the goal is to predict the next data point of one or more variables from prior values of these variables and (optionally) of other variables.

This step presupposes input data with three columns: "sequenceId", "itemPosition", and a column with the variable that is the prediction target. "sequenceId" separates different sequences, and the itemPosition column provides values that enable sequential sorting; often this will simply be a timestamp. You can find an example of the preprocessing input data at documentation/example_inputs/preprocessing_input.csv
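For illustration, a toy input file in this format could be generated as follows. The column names `sequenceId` and `itemPosition` come from the description above; the target column name `itemId` is an assumption for the sketch.

```python
import csv

# Two sequences, each ordered by itemPosition (e.g. a timestamp index).
# "itemId" is a hypothetical prediction-target column.
rows = [
    {"sequenceId": 1, "itemPosition": 0, "itemId": 5},
    {"sequenceId": 1, "itemPosition": 1, "itemId": 3},
    {"sequenceId": 1, "itemPosition": 2, "itemId": 7},
    {"sequenceId": 2, "itemPosition": 0, "itemId": 2},
    {"sequenceId": 2, "itemPosition": 1, "itemId": 5},
]

with open("preprocessing_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sequenceId", "itemPosition", "itemId"])
    writer.writeheader()
    writer.writerows(rows)
```

Preprocessing would then slice each sequence into all valid subsequences and split them into train, validation, and test sets.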

The data can then be processed and split into training, validation and testing datasets of all valid subsequences in the original data with the command:

sequifier preprocess --config_path=[CONFIG PATH]

The config path specifies the path to the preprocessing config, and the project path specifies the (preferably empty) folder that the output files of the different steps are written to.

The default config can be found at this path:

configs/preprocess.yaml

Configuring and training the sequence classification model

The training step is executed with the command:

sequifier train --config_path=[CONFIG PATH]

If the data on which the model is trained DOES NOT come from the preprocessing step, the flag

--on-unprocessed

should be added.

If the training data does not come from the preprocessing step, both train and validation data have to take the form of a csv file with the columns "sequenceId", "subsequenceId", "col_name", [SEQ LENGTH], [SEQ LENGTH - 1],...,"1", "0". You can find an example of the training input data at documentation/example_inputs/training_input.csv
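This wide format can be sketched with a toy file. A sequence length of 3 is assumed here purely for illustration (giving position columns "3" down to "0"), and the `col_name` value `itemId` and the cell values are likewise hypothetical.

```python
import csv

# Hypothetical sequence length of 3, so the position columns run from "3"
# (oldest value) down to "0" (most recent). Values are illustrative only.
header = ["sequenceId", "subsequenceId", "col_name", "3", "2", "1", "0"]
rows = [
    [1, 0, "itemId", 5, 3, 7, 2],
    [1, 1, "itemId", 3, 7, 2, 5],
]

with open("training_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```

Each row is one training subsequence; successive subsequenceId values within a sequenceId shift the window by one position.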

The training step is configured using the config. The default config can be found here:

configs/train.yaml

The relevant settings depend on whether the preprocessing step was executed.

Inferring on test data using the trained model

Inference is done using the command:

sequifier infer --config_path=[CONFIG PATH]

and configured using a config file. The default version can be found here:

configs/infer.yaml

