A machine translation library built on M2M100 models, with features for generating diverse verb variants via VerbNet and Conditional Beam Search to enrich Virtual Assistant training sets.
Multiverb IVA MT
Generating diverse verb variants with VerbNet and Conditional Beam Search to improve translation of Intelligent Virtual Assistant (IVA) training sets.
Installation
You can install multiverb_iva_mt from PyPI:
pip install iva_mt
This command will download and install the latest version of multiverb_iva_mt along with its required dependencies.
Usage
from iva_mt.iva_mt import IVAMT
translator = IVAMT(src_lang="en", tgt_lang="pl")
# for single-best translation
translator.translate("set the temperature on <a>my<a> thermostat")
# for multi-variant translation
translator.generate_alternative_translations("set the temperature on <a>my<a> thermostat")
Available languages (en2xx): pl, es, de, fr, pt, sv, zh, ja, tr, hi
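For example, the same API can be used to build a translator for each listed pair. A minimal sketch (it assumes the corresponding en-xx models are available on the Hugging Face Hub):
from iva_mt.iva_mt import IVAMT

utterance = "set the temperature on <a>my<a> thermostat"

# translate the same annotated utterance into several target languages
for tgt in ["pl", "es", "de"]:
    translator = IVAMT(src_lang="en", tgt_lang=tgt)
    print(tgt, translator.translate(utterance))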
To use a GPU and batching, specify the device and batch size:
IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)
On a V100, this translates roughly 100 sentences per minute.
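A rough sketch of GPU usage on a list of utterances (the sentences are hypothetical, and the loop simply calls translate() per sentence; check the library documentation for a dedicated batch API):
import time
from iva_mt.iva_mt import IVAMT

translator = IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)

# hypothetical utterances to translate
sentences = [
    "turn on the living room lights",
    "play some jazz in the kitchen",
    "set an alarm for 7 am",
]

start = time.time()
translations = [translator.translate(s) for s in sentences]
print(f"translated {len(sentences)} sentences in {time.time() - start:.1f}s")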
To use baseline M2M100:
IVAMT(src_lang="en", tgt_lang="pl", model_name="facebook/m2m100_418M")
To load a local model from an archive:
# If the model archive is located at /path/to/your/model.tgz, it will be automatically extracted
# to the ~/.cache/huggingface/hub directory. Specify this path using the `model_name` parameter.
IVAMT(src_lang="en", tgt_lang="pl", model_name="/path/to/your/model.tgz")
Note: when loading a local model, the tokenizer will still be cartesinus/iva_mt_wslot-m2m100_418M-{src_lang}-{tgt_lang}, to ensure compatibility.
Training the M2M100 Model
This repository provides a train.py script for training M2M100 models on your translation tasks. A GPU is recommended for training; on Google Colab, use an A100, as a V100 may not have sufficient memory.
Prerequisites
- Install the required libraries:
pip install transformers datasets sacrebleu
Usage
- Customize your training configuration by creating a JSON file (e.g., config/iva_mt_wslot-m2m100_418M-en-pl.json). In this file, specify the source language, target language, learning rate, weight decay, number of training epochs, and other relevant parameters.
- Execute the training script:
python train.py --config config/iva_mt_wslot-m2m100_418M-en-pl.json
Configuration File
The configuration file should contain the following parameters:
- src_lang: Source language code (e.g., "en" for English).
- tgt_lang: Target language code (e.g., "pl" for Polish).
- learning_rate: Learning rate for the optimizer.
- weight_decay: Weight decay for the optimizer.
- num_train_epochs: Number of training epochs.
- model_space: The namespace for the model.
- model_name: The name of the model.
- dataset: The name of the dataset to be used for training.
Example Configuration:
{
"src_lang": "en",
"tgt_lang": "pl",
"learning_rate": 5e-5,
"weight_decay": 0.01,
"num_train_epochs": 3,
"model_space": "facebook",
"model_name": "m2m100_418M",
"dataset": "wmt16"
}
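For reference, here is a minimal sketch of how such a configuration could be consumed. This is an illustration only, not the repository's train.py; it assumes model_space and model_name combine into a Hugging Face model id such as facebook/m2m100_418M:
import json
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

with open("config/iva_mt_wslot-m2m100_418M-en-pl.json") as f:
    cfg = json.load(f)

# model_space and model_name combine into a Hugging Face model id
model_id = f"{cfg['model_space']}/{cfg['model_name']}"

tokenizer = M2M100Tokenizer.from_pretrained(
    model_id, src_lang=cfg["src_lang"], tgt_lang=cfg["tgt_lang"]
)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

# learning_rate, weight_decay and num_train_epochs would then be passed to the
# training loop, e.g. via transformers.Seq2SeqTrainingArguments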
Running on Google Colab
If you are running the script on Google Colab, switch to a GPU runtime. An A100 is recommended, as a V100 may hit memory limits depending on the size of the model and the dataset.