Scheduled training for machine translation systems

OpusTrainer

The trainer gives you a flexible way to schedule various sources of input data and to augment the training data with title casing, all caps, etc. This is particularly useful when you have multiple data sources and want to pretrain the model on backtranslated data first, gradually add other sources of data, and finally fine-tune, all in one go.

Alternatively, this tool is particularly suited to training multilingual models, as it provides an easy way to define the desired mixture of datasets from different language sources.

Installation

You have two options. Install directly from PyPI:

pip install opustrainer

or clone this repository, and install it in editable mode so you can change the source, but still use all the commands:

git clone git@github.com:hplt-project/opustrainer.git
cd opustrainer
pip install -e .

Usage

% ./trainer.py --help
usage: trainer.py [-h] --config CONFIG [--temporary-directory TEMPORARY_DIR] [--state STATE_FILE] [--do-not-resume] [--sync] [trainer-command [arguments]]

Feeds marian tsv data for training.

options:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
                        YML configuration input.
  --temporary-directory TEMPORARY_DIR, -t TEMPORARY_DIR
                        Temporary dir, used for shuffling and tracking state
  --state STATE_FILE    Path to trainer state file which stores how much of
                        each dataset has been read. Defaults to ${CONFIG}.state
  --sync                Do not shuffle in the background
  --do-not-resume, -d   Do not resume from the previous training state
  --no-shuffle, -n      Do not shuffle, for debugging

Once you have fixed the paths in the configuration file, train_config.yml, you can run a test case with:

./trainer.py -c train_config.yml /path/to/marian -c marian_config --any --other --flags

You can check the resulting mixed file in /tmp/test. If your neural network trainer doesn't support training from stdin, you can use this tool to generate a training dataset and then disable data reordering or shuffling in your trainer, as your training input should already be balanced.

At the start of training, all datasets are shuffled. Each time the end of a dataset is reached, it is re-shuffled. Shuffling happens in the system temp directory by default, but the location can be changed with --temporary-directory or the TMPDIR environment variable. By default, the training state is kept in the same place as the configuration file. If training is interrupted, re-running the trainer should resume from where it left off (depending on how much your neural network trainer had buffered, that part will be skipped).

Configuration file

Define your training process via a configuration file. You define the datasets at the top, then the stages, and then for each stage a mixing criterion and a stage termination criterion. An example configuration file is provided below. The path to the trainer is a path to any neural network trainer that accepts training input on stdin.

# Datasets are already TSV files. We support reading gzip'd files, as well as multiple dataset files per name
datasets:
  clean: test/data/clean
  medium: test/data/medium
  dirty: test/data/dirty

stages:
  - start
  - mid
  - end

start:
  - clean 0.8
  - medium 0.2
  - dirty 0
  - until clean 2 # Until two epochs of clean

mid:
  - clean 0.6
  - medium 0.3
  - dirty 0.1
  - until medium 1

end:
  - clean 0.4
  - medium 0.3
  - dirty 0.3
  - until dirty 5 # use `inf` to mean until forever

modifiers:
- UpperCase: 0.05 # Apply uppercase randomly to 5% of sentences. See below
- TitleCase: 0.05

seed: 1111
trainer: /path/to/trainer/run.py
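
To make the mixing weights and the `until` criterion concrete, here is a minimal Python sketch. It is not OpusTrainer's actual implementation: the helper names `lines_per_dataset` and `batches_until`, the batch size of 100, and the dataset size of 1000 lines are all assumptions chosen for illustration.

```python
import math


def lines_per_dataset(weights, batch_size):
    """Split one batch across datasets in proportion to the stage weights.

    Hypothetical helper for illustration; not part of OpusTrainer.
    """
    total = sum(weights.values())
    return {name: round(batch_size * w / total) for name, w in weights.items()}


def batches_until(dataset_lines, lines_per_batch, epochs):
    """Number of batches needed to read `epochs` passes over one dataset."""
    return math.ceil(epochs * dataset_lines / lines_per_batch)


# Weights of the `start` stage from the example configuration.
start = {"clean": 0.8, "medium": 0.2, "dirty": 0.0}
per_batch = lines_per_dataset(start, batch_size=100)  # batch size is an assumption
print(per_batch)

# "until clean 2": assuming `clean` holds 1000 lines, the stage ends once
# two full passes over it have been read.
print(batches_until(1000, per_batch["clean"], epochs=2))
```

With these assumed numbers, each batch holds 80 clean and 20 medium lines, and the stage would end after 25 batches.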

Extended stage configuration

If you want to change which modifiers are used for a specific stage, you can use the extended stage configuration format. Any modifiers specified here override the curriculum-wide modifiers for just this stage.

In the extended format, the list of datasets is defined in the mix key. You can optionally add a modifiers key. For example:

start:
  mix:
  - clean 0.8
  - medium 0.2
  - dirty 0
  - until clean 2 # Until two epochs of clean
  modifiers:
    - UpperCase: 0.05
    - TitleCase: 0.05

Note that you can use YAML references if you wish to extensively combine global and local modifiers.

Modifiers

Modifiers are randomly applied to the sentences that go into the trainer. Each modifier has an associated probability: the chance that a given sentence is modified by it. E.g. a modifier with a probability of 0.05 will affect about 1 in every 20 sentences.

Modifiers are applied one after another, in the order you defined them, each with its own probability regardless of which modifiers were applied before it. E.g. if you have the following configuration:

modifiers:
- UpperCase: 0.05
- TitleCase: 0.05

This means that 1 in 20 sentences will be uppercased and 1 in 20 will be titlecased. Because the two draws are independent, 0.05 * 0.05 = 0.0025 of all sentences, or 1 in 400, will first be uppercased and then titlecased.
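
The independence of the draws can be checked with a small simulation. This is a sketch, not OpusTrainer code: `apply_modifiers` is a hypothetical helper, and `str.upper` and `str.title` merely stand in for the real modifiers.

```python
import random


def apply_modifiers(sentence, modifiers, rng):
    """Run every (name, function, probability) entry in order.

    Each modifier gets its own independent random draw, regardless of
    whether an earlier modifier fired. Hypothetical helper, for illustration.
    """
    applied = []
    for name, func, probability in modifiers:
        if rng.random() < probability:
            sentence = func(sentence)
            applied.append(name)
    return sentence, applied


# str.upper and str.title are stand-ins for the real modifiers.
modifiers = [("UpperCase", str.upper, 0.05), ("TitleCase", str.title, 0.05)]

rng = random.Random(1111)
trials = 200_000
upper = title = both = 0
for _ in range(trials):
    _, applied = apply_modifiers("hello world", modifiers, rng)
    upper += "UpperCase" in applied
    title += "TitleCase" in applied
    both += len(applied) == 2

print(f"uppercased: {upper / trials:.4f}")  # close to 0.05
print(f"titlecased: {title / trials:.4f}")  # close to 0.05
print(f"both:       {both / trials:.4f}")   # close to 0.0025
```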

UpperCase

Turns the entire source and target sentence to upper case, e.g. 'heLLo' becomes 'HELLO'.

modifiers:
  - UpperCase: 0.05

TitleCase

Makes the first letter of every word uppercase, and the rest lowercase. Words are split by spaces. E.g. 'heLLo' becomes 'Hello'.

modifiers:
  - TitleCase: 0.05
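
As a rough illustration of the two casing modifiers, the following Python sketch applies them to a tab-separated sentence pair. The helpers `title_case` and `modify_pair` are hypothetical and not OpusTrainer's own code; the sketch also assumes that TitleCase, like UpperCase, is applied to both the source and the target side.

```python
def title_case(sentence):
    """Capitalise each space-separated word and lowercase the rest of it."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in sentence.split(" "))


def modify_pair(line, func):
    """Apply func to every tab-separated column of a training line."""
    return "\t".join(func(column) for column in line.split("\t"))


pair = "heLLo world\thaLLo welt"
print(modify_pair(pair, str.upper))
print(modify_pair(pair, title_case))
```

Splitting on single spaces, rather than using Python's built-in str.title(), keeps the original spacing intact and avoids capitalising letters after apostrophes.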

Typos

Introduce typos in the source side of the sentence pair.

The probability of the modifier itself is the chance that a sentence is affected. The probability of each type of typo is the chance that a word is affected. Each type of typo occurs at most once in a sentence.

You can specify a probability for each typo class individually. Any typo class you omit gets a probability of 0. Alternatively, you can omit all typo classes, in which case each of them defaults to a 10% probability.

modifiers:
- Typos: 0.05
  char_swap:     0.1 # Swaps two random consecutive word characters in the string.
  missing_char:  0.1 # Skips a random word character in the string.
  extra_char:    0.1 # Adds an extra, keyboard-neighbor, letter next to a random word character.
  nearby_char:   0.1 # Replaces a random word character with keyboard-neighbor letter.
  similar_char:  0.1 # Replaces a random word character with another visually similar character.
  skipped_space: 0.1 # Skips a random space from the string.
  random_space:  0.1 # Adds a random space in the string.
  repeated_char: 0.1 # Repeats a random word character.
  unichar:       0.1 # Replaces a random consecutive repeated letter with a single letter.
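
To give a feel for what one of these typo classes does, here is a minimal Python sketch of char_swap as described in the comment above. It is an independent re-implementation for illustration only, not the code OpusTrainer uses; the function name and the use of `str.isalnum` to decide what counts as a word character are assumptions.

```python
import random


def char_swap(sentence, rng):
    """Swap two random consecutive word characters in the string.

    Hypothetical re-implementation; "word character" is taken to mean
    anything for which str.isalnum() is true.
    """
    # Positions i where both i and i + 1 are word characters.
    candidates = [
        i
        for i in range(len(sentence) - 1)
        if sentence[i].isalnum() and sentence[i + 1].isalnum()
    ]
    if not candidates:
        return sentence  # nothing to swap, e.g. single-letter words only
    i = rng.choice(candidates)
    chars = list(sentence)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(1111)
print(char_swap("I like pie.", rng))
```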

Prefix

Prepends a random subsection of the target sentence before the source sentence.

This is useful for teaching the model to force-decode a specific string when the user is absolutely certain it has to appear in the output. For example, the pair 'I like pie.' (source) and 'Me gustan los pasteles.' (target) can become '__start__ los pasteles __end__ I like pie.' (source) and 'Me gustan los pasteles.' (target).

Note: The Prefix modifier must always be used as the last modifier, but ideally never used together with "Tags".

modifiers:
 - Prefix: 0.5
   min_words: 2
   max_words: 5
   template: "__start__ {trg} __end__ "
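
The behaviour described above can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not OpusTrainer's implementation: the function name `add_prefix` is hypothetical, the subsection is assumed to be a contiguous run of whitespace-separated target words, and min_words is assumed to be no larger than max_words.

```python
import random


def add_prefix(source, target, rng, min_words=2, max_words=5,
               template="__start__ {trg} __end__ "):
    """Prepend a random contiguous run of target words to the source.

    Hypothetical sketch; parameter names mirror the configuration keys.
    """
    words = target.split()
    if not words:
        return source
    # Clamp the requested length to what the target sentence can provide.
    length = rng.randint(min(min_words, len(words)), min(max_words, len(words)))
    start = rng.randint(0, len(words) - length)
    snippet = " ".join(words[start:start + length])
    return template.format(trg=snippet) + source


rng = random.Random(1111)
print(add_prefix("I like pie.", "Me gustan los pasteles.", rng))
```

Only the source side changes; the target sentence is passed to the trainer unmodified.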

Generating vocabulary and tags before training

In the future, this will be handled by a training pipeline, but until then here are the basic scripts used.

To produce an alignment-augmented corpus, use this script:

#!/bin/bash -v

# Usage: ./align_corpus.sh corpus.tsv output_dir src trg
# ($1 is a tab-separated source/target file, $2 is the directory the outputs are written to)

# install fast align
mkdir -p bin

# download and compile fast_align
if [ ! -e bin/fast_align ]; then
    git clone https://github.com/clab/fast_align
    mkdir -p fast_align/build
    cd fast_align/build
    cmake ..
    make -j4
    cp fast_align atools ../../bin
    cd ../../
fi

# Prepare the corpus for fast align
test -s $2/corpus.tmp.${3}-${4}.falign ||  cat $1 | sed 's/\t/ ||| /' > $2/corpus.tmp.${3}-${4}.falign

# Align it
test -s $2/align.${3}-${4}.s2t  || bin/fast_align -vod  -i $2/corpus.tmp.${3}-${4}.falign > $2/align.${3}-${4}.s2t
test -s $2/align.${3}-${4}.t2s  || bin/fast_align -vodr -i $2/corpus.tmp.${3}-${4}.falign > $2/align.${3}-${4}.t2s

test -s $2/corpus.${3}-${4}.aln || bin/atools -i $2/align.${3}-${4}.s2t -j $2/align.${3}-${4}.t2s -c grow-diag-final-and > $2/corpus.${3}-${4}.aln
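
For readers less familiar with the file formats involved, the following Python sketch mirrors the sed step of the script and shows how the resulting alignment lines can be read. The helper names `to_fast_align` and `parse_alignment` are hypothetical; the formats themselves are the `source ||| target` input and the `i-j` index pairs that fast_align works with.

```python
def to_fast_align(line):
    """Turn a 'source<TAB>target' line into fast_align's 'source ||| target'.

    Like the sed command in the script, only the first tab is replaced.
    """
    source, target = line.rstrip("\n").split("\t", 1)
    return f"{source} ||| {target}"


def parse_alignment(line):
    """Parse an alignment line such as '0-0 1-1 2-3' into index pairs."""
    return [tuple(int(index) for index in pair.split("-")) for pair in line.split()]


print(to_fast_align("I like pie.\tMe gustan los pasteles.\n"))
print(parse_alignment("0-0 1-1 2-3"))
```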

For creating vocabulary with tags support, use this script:

#!/usr/bin/env bash
#Usage ./vocab.sh en de path-to-corpora char-cov vocab_size

char_cov=${4:-'0.9995'} # Default char coverage
vocab_size=${5:-'32000'} # Default vocab size
# Set up some constants

# Language pairs
src=$1
trg=$2
prefix="--model_prefix=model.${src}-${trg}"

# Placeholders array
placeholders="--user_defined_symbols=__source__,__target__,__done__,__start__,__end__"

# Character coverage. 0.9995 is recommended for CJK; for languages with a small character set you probably want 1.
char_cov="--character_coverage=${char_cov}"

# First clone and compile SPM
spm_exec="sentencepiece/build/src/spm_train"
if [ ! -e ${spm_exec} ]; then
    git clone https://github.com/google/sentencepiece.git
    cd sentencepiece
    mkdir build
    cd build
    cmake ..
    make -j4
    cd ..
    cd ..
    if [ ! -e ${spm_exec} ]; then
        echo "Failed to compile sentencepiece"
        exit 1
    fi
fi

$spm_exec --bos_id=-1 --eos_id=0 --unk_id=1 ${placeholders} ${char_cov} ${prefix} --vocab_size=${vocab_size} --input=${3} --input_sentence_size=20000000 --byte_fallback #--input_format=tsv seems broken

Future work

Terminology support (using a dictionary), where augmentation takes values from a dictionary instead of using alignment scores.
One-click training runs.

Acknowledgements

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].

Release history

0.2: Jun 22, 2023
0.1: Mar 3, 2023
