Train a transformer model with the command line
one-to-one and many-to-one autoregression made easy
Sequifier enables sequence classification or regression on time-based sequences with transformer models, via a CLI. Each of the three steps is configured through a yaml file: preprocessing takes a single- or multi-variable columnar data file and creates training, validation and test sequences; training fits a transformer model; and inference calculates model outputs for data (usually the test data from preprocessing).
Overview
The sequifier package enables:
- the extraction of sequences for training
- the configuration and training of a transformer classification or regression model
- using multiple input and output sequences
- inference on data with a trained model
Other materials
If you want to first get a more specific understanding of the transformer architecture, have a look at the Wikipedia article.
If you want to see a benchmark on a small synthetic dataset with 10k cases, against a random forest, an XGBoost model and a logistic regression, check out this notebook.
Complete example of how to build and apply a transformer sequence classifier with sequifier
- create a conda environment with python >=3.9, activate it, and run `pip install sequifier`
- create a new project folder (at a path referred to as `PROJECT PATH` later) and a `configs` subfolder
- copy the default configs for preprocessing, training and inference from the repository
- adapt the preprocess config to point to the data you want to preprocess and set `project_path` to `PROJECT PATH`
- run `sequifier preprocess --config_path=[PROJECT PATH]/configs/preprocess.yaml`
- the preprocessing step outputs a "data driven config" at `[PROJECT PATH]/configs/ddconfigs/[FILE NAME]`. It contains the number of classes found in the data, a map of classes to indices, and the paths to the train, validation and test splits of the data. Adapt the `dd_config` parameter in `train-on-preprocessed.yaml` and `infer.yaml` to the path `[PROJECT PATH]/configs/ddconfigs/[FILE NAME]`, and set `project_path` to `PROJECT PATH` in both configs
- run `sequifier train --config_path=[PROJECT PATH]/configs/train-on-preprocessed.yaml`
- adapt `data_path` in `infer.yaml`
- run `sequifier infer --config_path=[PROJECT PATH]/configs/infer.yaml`
- find your predictions at `[PROJECT PATH]/outputs/predictions/sequifier-default-best-predictions.csv`
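Once inference has run, the predictions file can be inspected with standard tools. A minimal sketch using pandas (replace [PROJECT PATH] with your project folder; the exact output columns depend on your model configuration):

```python
import pandas as pd

# Replace [PROJECT PATH] with your actual project folder.
predictions = pd.read_csv(
    "[PROJECT PATH]/outputs/predictions/sequifier-default-best-predictions.csv"
)

# Inspect the first few predictions; the exact columns depend on your model.
print(predictions.head())
```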
More detailed explanations of the three steps
Preprocessing of data into sequences for training
The preprocessing step is designed for scenarios where, for timeseries or timeseries-like data, the aim is to predict the next data point of one or more variables from prior values of these variables and (optionally) other variables.
This step presupposes input data with three columns: "sequenceId", "itemPosition", and a column with the variable that is the prediction target. "sequenceId" separates different sequences, and "itemPosition" provides values that enable sequential sorting; often this will simply be a timestamp. You can find an example of the preprocessing input data at documentation/example_inputs/preprocessing_input.csv
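As a rough illustration, a toy input file with these columns could be put together as follows (a minimal sketch with pandas; the target column name "itemValue" and the values are made up for the example):

```python
import pandas as pd

# Toy preprocessing input: two sequences, each sortable by "itemPosition".
# "itemValue" is a made-up target column name; any variable column works here.
data = pd.DataFrame(
    {
        "sequenceId": [1, 1, 1, 1, 2, 2, 2, 2],
        "itemPosition": [1, 2, 3, 4, 1, 2, 3, 4],
        "itemValue": [3, 7, 7, 1, 5, 2, 2, 8],
    }
)

# Write the file that the preprocess config points to.
data.to_csv("preprocessing_input.csv", index=False)
```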
The data can then be processed into all valid subsequences of the original data and split into training, validation and test datasets with the command:
sequifier preprocess --config_path=[CONFIG PATH]
The config path specifies the path to the preprocessing config; the project path is the (preferably empty) folder that the output files of the different steps are written to.
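Conceptually, "all valid subsequences" means every contiguous window of the configured length within each sequence. The sketch below illustrates only that idea, not sequifier's actual output format:

```python
# Illustration only: enumerate all contiguous windows of a fixed length,
# treating the last element of each window as the prediction target.
def valid_subsequences(values, window_length):
    for start in range(len(values) - window_length + 1):
        window = values[start : start + window_length]
        yield window[:-1], window[-1]  # (input subsequence, target)


sequence = [3, 7, 7, 1, 5]
for inputs, target in valid_subsequences(sequence, window_length=3):
    print(inputs, "->", target)
# [3, 7] -> 7
# [7, 7] -> 1
# [7, 1] -> 5
```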
The default config can be found on this path:
Configuring and training the sequence classification model
The training step is executed with the command:
sequifier train --config_path=[CONFIG PATH]
If the data on which the model is trained DOES NOT come from the preprocessing step, the flag `--on-unprocessed` should be added.
If the training data does not come from the preprocessing step, both train and validation data have to take the form of a csv file with the columns "sequenceId", "subsequenceId", "col_name", "[SEQ LENGTH]", "[SEQ LENGTH - 1]", ..., "1", "target". You can find an example of the training input data at documentation/example_inputs/training_input.csv
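For a sequence length of 3, a hand-built file in this format might look like the following sketch (the values are made up, and treating "col_name" as the name of the variable each row belongs to is an assumption; check your data against the example file above):

```python
import pandas as pd

# Sketch of the column layout for a sequence length of 3: columns "3", "2", "1"
# hold the lagged values and "target" holds the value to predict. Values are
# made up; interpreting "col_name" as the variable name is an assumption.
train = pd.DataFrame(
    {
        "sequenceId": [1, 1, 2],
        "subsequenceId": [1, 2, 1],
        "col_name": ["itemValue", "itemValue", "itemValue"],
        "3": [3, 7, 5],
        "2": [7, 7, 2],
        "1": [7, 1, 2],
        "target": [1, 5, 8],
    }
)

train.to_csv("training_input.csv", index=False)
```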
The training step is configured using the config. The two default configs, one for each case depending on whether the preprocessing step was executed, can be found here:
Inferring on test data using the trained model
Inference is done using the command:
sequifier infer --config_path=[CONFIG PATH]
and configured using a config file. The default version can be found here: