A machine translation library built on M2M100 models, with support for generating diverse verb variants via VerbNet and Conditional Beam Search to enrich Virtual Assistant training sets.
Multiverb IVA MT
Generating diverse verb variants with VerbNet and Conditional Beam Search to improve translation of Intelligent Virtual Assistant (IVA) training sets.
Usage:
from iva_mt.iva_mt import IVAMT
translator = IVAMT(src_lang="en", tgt_lang="pl")
# for single-best translation
translator.translate("set the temperature on <a>my<a> thermostat")
# for multi-variant translation
translator.generate_alternative_translations("set the temperature on <a>my<a> thermostat")
Available languages (en2xx): pl, es, de, fr, pt, sv, zh, ja, tr, hi
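The multi-variant call is intended to produce several target-language paraphrases of a single utterance. Below is a minimal sketch of consuming its output; the return type (assumed here to be a list of candidate strings) may differ between versions, so adjust as needed.
# reusing the `translator` created above; assumes a list of candidate strings is returned
variants = translator.generate_alternative_translations("set the temperature on <a>my<a> thermostat")
for i, v in enumerate(variants, 1):
    print(f"{i}. {v}")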
To use a GPU and batching, specify the device and batch size:
IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)
On a V100 this translates roughly 100 sentences per minute.
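For illustration, the sketch below translates a small list of utterances with the GPU-enabled translator by looping over translate(); the example utterances are made up, and batch_size is passed through to the library's internal batching.
translator = IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)
utterances = [
    "set the temperature on <a>my<a> thermostat",
    "turn off the <a>kitchen<a> lights",
]
# loop over the public translate() call shown above
translations = [translator.translate(u) for u in utterances]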
To use the baseline M2M100 model:
IVAMT(src_lang="en", tgt_lang="pl", model_name="facebook/m2m100_418M")
Training the M2M100 Model
This repository provides a train.py script for fine-tuning M2M100 models on your chosen translation tasks. A GPU is recommended for training; on Google Colab, prefer an A100, as a V100 may not have enough memory.
Prerequisites
- Ensure that you have installed the necessary libraries by running the following command:
pip install transformers datasets sacrebleu
Usage
- Customize your training configuration by creating a JSON file (e.g., config/iva_mt_wslot-m2m100_418M-en-pl.json). In this file, specify the source language, target language, learning rate, weight decay, number of training epochs, and other relevant parameters.
- Execute the training script by running the following command:
python train.py --config config/iva_mt_wslot-m2m100_418M-en-pl.json
Configuration File
The configuration file should contain the following parameters:
- src_lang: Source language code (e.g., "en" for English).
- tgt_lang: Target language code (e.g., "pl" for Polish).
- learning_rate: Learning rate for the optimizer.
- weight_decay: Weight decay for the optimizer.
- num_train_epochs: Number of training epochs.
- model_space: The namespace for the model (e.g., "facebook").
- model_name: The name of the model (e.g., "m2m100_418M").
- dataset: The name of the dataset to be used for training.
Example Configuration:
{
"src_lang": "en",
"tgt_lang": "pl",
"learning_rate": 5e-5,
"weight_decay": 0.01,
"num_train_epochs": 3,
"model_space": "facebook",
"model_name": "m2m100_418M",
"dataset": "wmt16"
}
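As a rough, hypothetical sketch (the actual train.py may be structured differently), a config like the one above can be mapped onto standard Hugging Face objects along these lines:
# hypothetical illustration of consuming the config; not the actual train.py
import json
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer, Seq2SeqTrainingArguments

with open("config/iva_mt_wslot-m2m100_418M-en-pl.json") as f:
    cfg = json.load(f)

model_id = f"{cfg['model_space']}/{cfg['model_name']}"  # e.g. "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_id, src_lang=cfg["src_lang"], tgt_lang=cfg["tgt_lang"])
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    learning_rate=cfg["learning_rate"],
    weight_decay=cfg["weight_decay"],
    num_train_epochs=cfg["num_train_epochs"],
)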
Running on Google Colab
If you are running the script on Google Colab, make sure to switch to a GPU runtime. An A100 is recommended, as a V100 may run out of memory depending on the size of the model and the dataset.
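A quick way to confirm which GPU the Colab runtime assigned (generic PyTorch, not part of this library):
import torch
# prints e.g. the V100 or A100 device name when a GPU runtime is active
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU available")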