End-to-end French-Wolof neural machine translation library with Bayesian hyperparameter optimization, custom tokenization, and support for T5, BART, and NLLB Transformer models.
translate-package
A Python library for French-Wolof neural machine translation using Transformer-based models (T5, BART, NLLB) with Bayesian hyperparameter optimization, custom tokenization, and a novel bucketing-truncation sampling strategy for low-resource languages.
Table of Contents
- Overview
- Key Contributions
- Installation
- Getting Started
- Workflow
- Supported Models
- Arguments Reference
- Results
- Citation
- License
Overview
translate-package implements the methodology described in the paper "Advancing Wolof-French Sentence Translation", which presents a comparative study of Transformer-based models for translating between French and Wolof — a low-resource language spoken in Senegal.
The library provides a full pipeline covering tokenizer training, Bayesian hyperparameter optimization, model fine-tuning, and evaluation, with a novel bucketing and truncation sampling strategy that reflects sentence length distribution during fine-tuning.
Key Contributions
- Novel Bucketing + Truncation Sampling: Groups sequences of similar lengths into buckets for improved computational efficiency and coherence, combined with per-language maximum length truncation optimized as a hyperparameter. This combined strategy reflects the sentence length distribution during fine-tuning, which is particularly effective for low-resource language pairs.
- Bayesian Hyperparameter Optimization: Uses a Gaussian Process framework with an Upper Confidence Bound (UCB) acquisition function, optimizing the BLEU score as the objective metric over the hyperparameter search space.
- Custom Tokenization: Supports Byte Pair Encoding (BPE) for BART/LSTM and SentencePiece for T5, with vocabulary sizes adapted to the Wolof-French corpus.
- Data Augmentation: Character-level substitutions and swaps via nlpaug to compensate for the scarcity of Wolof parallel corpora.
- Multi-model Support: Fine-tune T5, BART, NLLB, or an LSTM baseline within the same pipeline.
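To make the bucketing + truncation idea concrete, here is a minimal pure-Python sketch. It is an illustration only and not the library's sampler: the function name `bucketed_batches`, the `bucket_width` parameter, and the bucket key (the longer side of each pair) are assumptions. The default lengths 97 and 70 are the `--src_max_len` and `--tgt_max_len` values from the T5 fine-tuning example further down.

```python
import random
from collections import defaultdict


def bucketed_batches(pairs, batch_size, bucket_width=5,
                     src_max_len=97, tgt_max_len=70, seed=0):
    """Group (src_tokens, tgt_tokens) pairs into batches of similar length.

    Hypothetical helper: each side is truncated to its own maximum length,
    then pairs are bucketed by the length of their longer side.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for src, tgt in pairs:
        # Per-language truncation: each side has its own maximum length.
        src, tgt = src[:src_max_len], tgt[:tgt_max_len]
        # Pairs whose longer side falls in the same length range share a bucket.
        key = max(len(src), len(tgt)) // bucket_width
        buckets[key].append((src, tgt))

    batches = []
    for items in buckets.values():
        rng.shuffle(items)  # shuffle inside a bucket
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # shuffle batch order across buckets
    return batches
```

Because every batch comes from a single bucket, padding inside a batch is bounded by `bucket_width` tokens, which is where the efficiency gain comes from.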
Installation
pip install translate-package
After installation, extract the workflow scripts into your working directory:
translate-init --output-dir ./my_experiment
This copies the following scripts locally so you can inspect and run them:
my_experiment/
├── translate_tokenizer.py
├── translate_hyperparameter_tuning.py
├── translate_finetuning.py
├── translate_test.py
└── save_to_hub.py
Getting Started
All scripts use argparse and are run from the command line. The typical workflow follows four sequential steps:
Train Tokenizer → Hyperparameter Tuning → Fine-Tuning → Test
Optionally push the best model to the Hugging Face Hub with save_to_hub.py.
Workflow
1. Train the Tokenizer
Train a custom tokenizer on your parallel corpus before fine-tuning. This step is only required for T5 (SentencePiece) and BART/LSTM (BPE). NLLB uses its own built-in tokenizer and skips this step.
BPE tokenizer (for BART / LSTM):
python translate_tokenizer.py \
--file_name sent_tokenizer \
--dataset_file corpus35k.csv \
--src_label french \
--tgt_label wolof \
--vocab_size 15000 \
--name bpe
SentencePiece tokenizer (for T5):
python translate_tokenizer.py \
--file_name sent_tokenizer \
--dataset_file corpus35k.csv \
--src_label french \
--tgt_label wolof \
--vocab_size 15000 \
--name sp
| --name value | Tokenizer type | Used by |
|---|---|---|
| bpe | Byte Pair Encoding | BART, LSTM |
| sp | SentencePiece | T5 |
| (none needed) | Model's built-in tokenizer | NLLB |
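For readers unfamiliar with BPE, the sketch below shows how merge rules are learned from a corpus. It is a toy illustration of the algorithm, not the tokenizer script itself. The function name `learn_bpe` and the `</w>` end-of-word marker are assumptions, and `num_merges` plays roughly the role that `--vocab_size` plays above (the real vocabulary also includes the base characters).

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from an iterable of sentences (toy sketch)."""
    # Represent each word as a tuple of symbols ending with an end-of-word marker.
    words = Counter()
    for sentence in corpus:
        for word in sentence.split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the best pair.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges
```

Calling `learn_bpe(sentences, num_merges=1000)` on the French and Wolof columns together would give a shared subword inventory for both languages.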
2. Hyperparameter Tuning
Run Bayesian optimization to find the best learning rate, sequence lengths, augmentation probabilities, and other hyperparameters. Results are logged to Weights & Biases.
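The idea behind Gaussian Process optimization with a UCB acquisition function can be sketched in pure Python for a single hyperparameter. This is a simplified illustration and not the library's tuner: the RBF kernel, the `kappa` exploration weight, and all function names are assumptions, and the real search covers several hyperparameters at once.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def solve_lower_transposed(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, ys))  # K^-1 y
    k_star = [rbf(x, x_new, length_scale) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new, length_scale) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)


def suggest_next(xs, ys, candidates, kappa=2.0, length_scale=1.0):
    """Pick the candidate maximizing UCB = mean + kappa * std."""
    y_mean = sum(ys) / len(ys)
    centered = [y - y_mean for y in ys]  # the GP assumes a zero mean

    def ucb(c):
        mean, std = gp_posterior(xs, centered, c, length_scale)
        return mean + kappa * std

    return max(candidates, key=ucb)
```

Here `xs` would hold already-tried values (for example log10 of the learning rate) and `ys` the BLEU score measured for each trial. A larger `kappa` favors unexplored regions, while a smaller one favors values near the best trial so far.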
T5 example (French → Wolof):
python translate_hyperparameter_tuning.py \
--model_generation t5 \
--model_name google-t5/t5-base \
--tokenizer_name sp \
--use_bucketing \
--use_truncation \
--no-bidirectional \
--no-use_peft \
--no-save_model \
--save_artifact \
--file_name sent_tokenizer \
--data_path corpus35k.csv \
--src_label french \
--tgt_label wolof \
--num_workers 3 \
--min_lr 1e-4 --max_lr 1e-2 \
--min_src_max_len 72 --max_src_max_len 111 \
--min_tgt_max_len 67 --max_tgt_max_len 85 \
--epochs 1 \
--batch_size 64 \
--max_words 21 \
--min_nts 2245 --max_nts 2245 \
--project wolof-french-translation-p3-t5-truncation \
--key <YOUR_WANDB_KEY>
NLLB example (bidirectional):
python translate_hyperparameter_tuning.py \
--model_generation nllb \
--model_name facebook/nllb-200-distilled-600M \
--use_bucketing \
--use_truncation \
--no-bidirectional \
--bidirectional_tr \
--no-use_peft \
--no-save_artifact \
--no-save_model \
--data_path corpus35k.csv \
--src_label french \
--tgt_label wolof \
--num_workers 3 \
--min_lr 7e-6 --max_lr 1e-4 \
--min_src_max_len 76 --max_src_max_len 117 \
--min_tgt_max_len 76 --max_tgt_max_len 97 \
--epochs 1 \
--batch_size 64 \
--max_words 21 \
--min_nts 4485 --max_nts 4485 \
--project wolof-french-translation-p3-nllb-truncation-bid \
--key <YOUR_WANDB_KEY>
3. Fine-Tuning
Fine-tune the model using the best hyperparameters found in the previous step. Pass the artifact path from your W&B run.
T5 example:
python translate_finetuning.py \
--model_generation t5 \
--model_name google-t5/t5-base \
--tokenizer_name sp \
--use_bucketing \
--use_truncation \
--no-bidirectional \
--no-use_peft \
--file_name sent_tokenizer \
--data_path corpus35k.csv \
--src_label french \
--tgt_label wolof \
--num_workers 3 \
--lr 0.006254861077618995 \
--wd 0.004978258153761286 \
--src_max_len 97 \
--tgt_max_len 70 \
--p_word 0.010406709461818209 \
--p_char 0.9269797815116728 \
--max_epochs 5 \
--batch_size 64 \
--max_words 21 \
--nts 2245 \
--metric bleu --mode max \
--clean_ckpt_dir \
--run_name t5-fine-tuning \
--new_artifact_name machine_translation_best_model_t5_fr_wf \
--artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
--project wolof-french-translation-p3-t5-truncation \
--key <YOUR_WANDB_KEY>
NLLB example (bidirectional):
python translate_finetuning.py \
--model_generation nllb \
--model_name facebook/nllb-200-distilled-600M \
--use_bucketing \
--use_truncation \
--no-bidirectional \
--bidirectional_tr \
--no-use_peft \
--data_path corpus35k.csv \
--src_label french \
--tgt_label wolof \
--num_workers 3 \
--lr 0.00009877589029418412 \
--wd 0.003711074112216225 \
--src_max_len 96 \
--tgt_max_len 91 \
--p_word 0.06650106426899016 \
--p_char 0.7166733632090903 \
--max_epochs 5 \
--batch_size 64 \
--max_words 21 \
--nts 4485 \
--metric bleu --mode max \
--clean_ckpt_dir \
--run_name nllb-fine-tuning-fr-wf-bid \
--project wolof-french-translation-p3-nllb-truncation-bid \
--key <YOUR_WANDB_KEY>
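One plausible way to prepare data for training in both directions is to emit every sentence pair twice, once per direction, tagged with language codes. The sketch below assumes that approach. The README does not describe how `--bidirectional_tr` is implemented, so the function name and the record layout are hypothetical. `fra_Latn` and `wol_Latn` are the NLLB-200 language codes for French and Wolof.

```python
def make_bidirectional(rows, src_label="french", tgt_label="wolof", lang_codes=None):
    """Expand each parallel row into two training examples, one per direction.

    Hypothetical helper: `rows` is a list of dicts keyed by column name,
    e.g. the rows of the parallel corpus CSV.
    """
    lang_codes = lang_codes or {"french": "fra_Latn", "wolof": "wol_Latn"}
    examples = []
    for row in rows:
        # Forward direction first, then the reversed direction.
        for src, tgt in ((src_label, tgt_label), (tgt_label, src_label)):
            examples.append({
                "src_lang": lang_codes[src],
                "tgt_lang": lang_codes[tgt],
                "source": row[src],
                "target": row[tgt],
            })
    return examples
```

With this layout the training set doubles in size, and a single model sees both French → Wolof and Wolof → French examples in every epoch.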
4. Testing
Evaluate the best model checkpoint on the test set. Reports BLEU and ROUGE-L scores.
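As a reminder of what ROUGE-L measures, the sketch below computes a sentence-level ROUGE-L F1 score from the longest common subsequence (LCS) of whitespace tokens. It is a simplified illustration: the test script's actual metric implementation is not shown in this README and may tokenize or aggregate differently, and the function names are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 on whitespace tokens (simplified)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)
```

For the reference "the cat sat on the mat" and the candidate "the cat on mat", the LCS has 4 tokens, so precision is 4/4, recall is 4/6, and the F1 score is 0.8.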
T5 example:
python translate_test.py \
--model_generation t5 \
--model_name google-t5/t5-base \
--tokenizer_name sp \
--use_bucketing \
--use_truncation \
--no-bidirectional \
--no-use_peft \
--file_name sent_tokenizer \
--data_path corpus35k.csv \
--src_label french \
--tgt_label wolof \
--num_workers 3 \
--src_max_len 97 \
--tgt_max_len 70 \
--p_word 0.010406709461818209 \
--p_char 0.9269797815116728 \
--batch_size 64 \
--max_words 21 \
--run_name t5-test-fr-wf \
--artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
--project wolof-french-translation-p3-t5-truncation \
--key <YOUR_WANDB_KEY>
NLLB example:
python translate_test.py \
--model_generation nllb \
--model_name facebook/nllb-200-distilled-600M \
--use_bucketing \
--use_truncation \
--no-bidirectional \
--bidirectional_tr \
--no-use_peft \
--data_path corpus35k.csv \
--src_label french \
--tgt_label wolof \
--num_workers 3 \
--src_max_len 96 \
--tgt_max_len 91 \
--p_word 0.06650106426899016 \
--p_char 0.7166733632090903 \
--batch_size 64 \
--max_words 21 \
--run_name nllb-test-fr-wf-bid \
--artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
--project wolof-french-translation-p3-nllb-truncation-bid \
--key <YOUR_WANDB_KEY>
5. Save to Hugging Face Hub
Push your best fine-tuned model directly to the Hugging Face Hub:
python save_to_hub.py \
--model_generation nllb \
--model_name facebook/nllb-200-distilled-600M \
--no-bidirectional \
--no-use_peft \
--src_label french \
--tgt_label wolof \
--run_name nllb-save-to-hub-fr-wf-bid \
--artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
--project wolof-french-translation-p3-nllb-truncation-bid \
--key <YOUR_WANDB_KEY> \
--token <YOUR_HF_TOKEN> \
--username <YOUR_HF_USERNAME> \
--repo_name nllb_french_wolof_bilateral
Supported Models
| Model | --model_generation | --model_name example |
|---|---|---|
| T5 | t5 | google-t5/t5-base, google-t5/t5-small |
| BART | bart | facebook/bart-base |
| NLLB | nllb | facebook/nllb-200-distilled-600M |
| LSTM | lstm | n/a (trained from scratch) |
Arguments Reference
Shared arguments (all scripts)
| Argument | Type | Description |
|---|---|---|
| --model_generation | str | Model family: t5, bart, nllb, lstm |
| --model_name | str | Hugging Face model identifier |
| --data_path | str | Path to the parallel corpus CSV file |
| --src_label | str | Column name for source language (e.g. french) |
| --tgt_label | str | Column name for target language (e.g. wolof) |
| --batch_size | int | Batch size for training / evaluation |
| --num_workers | int | Number of DataLoader workers |
| --use_bucketing | flag | Enable bucketing sampling strategy |
| --use_truncation | flag | Enable truncation of sequences |
| --bidirectional_tr | flag | Train in both translation directions |
| --use_peft | flag | Enable parameter-efficient fine-tuning (PEFT) |
| --project | str | W&B project name |
| --key | str | W&B API key |
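The commands in the Workflow section pass flags in both positive and negated form (`--use_bucketing`, `--no-use_peft`). One standard way to get that behavior from argparse is `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below assumes that mechanism; the scripts may define their flags differently, and the defaults and the subset of arguments shown are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser covering a subset of the shared arguments."""
    parser = argparse.ArgumentParser(description="Shared arguments (illustrative subset)")
    parser.add_argument("--model_generation", choices=["t5", "bart", "nllb", "lstm"],
                        required=True, help="Model family")
    parser.add_argument("--model_name", type=str, help="Hugging Face model identifier")
    parser.add_argument("--data_path", type=str, required=True,
                        help="Path to the parallel corpus CSV file")
    parser.add_argument("--batch_size", type=int, default=64)  # default is an assumption
    # BooleanOptionalAction registers both --flag and --no-flag.
    parser.add_argument("--use_bucketing", action=argparse.BooleanOptionalAction, default=False)
    parser.add_argument("--use_peft", action=argparse.BooleanOptionalAction, default=False)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Parsing `--model_generation t5 --data_path corpus35k.csv --use_bucketing --no-use_peft` with this parser sets `use_bucketing` to `True` and `use_peft` to `False`.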
Hyperparameter tuning specific
| Argument | Type | Description |
|---|---|---|
| --min_lr / --max_lr | float | Learning rate search bounds |
| --min_src_max_len / --max_src_max_len | int | Source max length search bounds |
| --min_tgt_max_len / --max_tgt_max_len | int | Target max length search bounds |
| --min_nts / --max_nts | int | Number of training steps bounds |
| --epochs | int | Epochs per Bayesian trial |
| --save_artifact | flag | Save best trial model as W&B artifact |
Fine-tuning specific
| Argument | Type | Description |
|---|---|---|
| --lr | float | Learning rate (from tuning step) |
| --wd | float | Weight decay |
| --src_max_len | int | Source sequence max length |
| --tgt_max_len | int | Target sequence max length |
| --p_word | float | Word-level augmentation probability |
| --p_char | float | Character-level augmentation probability |
| --max_epochs | int | Maximum training epochs |
| --metric | str | Metric to monitor (bleu, loss) |
| --mode | str | Optimization direction (max or min) |
| --artifact_path | str | W&B artifact path from tuning step |
| --new_artifact_name | str | Name for the saved fine-tuned model artifact |
| --clean_ckpt_dir | flag | Remove checkpoint directory after saving |
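To illustrate what a character-level augmentation probability such as `--p_char` could control, here is a small stand-in written in pure Python. It does not use nlpaug and is not the library's augmenter. It assumes that the probability applies per sentence and that one edit (an adjacent swap or a substitution) is made when augmentation fires. Both points are assumptions.

```python
import random


def augment_chars(sentence, p_char, rng):
    """With probability p_char, apply one random character edit (toy stand-in)."""
    if len(sentence) < 2 or rng.random() >= p_char:
        return sentence  # leave the sentence untouched
    chars = list(sentence)
    i = rng.randrange(len(chars) - 1)
    if rng.random() < 0.5:
        # Swap two adjacent characters.
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        # Substitute one character with another drawn from the same sentence.
        chars[i] = rng.choice(chars)
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducible noise
    print(augment_chars("bonjour tout le monde", p_char=0.9, rng=rng))
```

Edits of this kind keep the sentence length unchanged while adding spelling noise, which is the kind of variation the augmentation step relies on to stretch a small parallel corpus.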
Results
All models were fine-tuned for 5 epochs on the French-Wolof parallel corpus using the bucketing-truncation sampling strategy and Bayesian hyperparameter optimization. Evaluation metrics include BLEU, ROUGE-1, ROUGE-2, and ROUGE-L.
French → Wolof
| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Train Loss | Eval Loss |
|---|---|---|---|---|---|---|
| NLLB-200-distilled-600M | 17.19 | 0.4414 | 0.2236 | 0.4051 | 1.0484 | 1.9932 |
| T5-base | 7.38 | 0.2729 | 0.1010 | 0.2454 | 1.2241 | 3.2776 |
| BART* | — | — | — | — | — | — |
| LSTM* | — | — | — | — | — | — |
Wolof → French
| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Train Loss | Eval Loss |
|---|---|---|---|---|---|---|
| NLLB-200-distilled-600M | 27.41 | 0.4833 | 0.3009 | 0.4518 | 0.7981 | 1.3009 |
| BART* | — | — | — | — | — | — |
| LSTM* | — | — | — | — | — | — |
NLLB-200 Bidirectional (Fr↔Wf joint training)
| Metric | Value |
|---|---|
| BLEU | 18.81 |
| ROUGE-1 | 0.4076 |
| ROUGE-2 | 0.2208 |
| ROUGE-L | 0.3750 |
| Train Loss | 0.9842 |
| Eval Loss | 1.7990 |
| Global Step | 4,485 |
* BART and LSTM results will be added upon completion. T5 fr→wf results correspond to global_step: 2,245 (5 epochs).
NLLB-200 outperforms T5 by a wide margin on French → Wolof (BLEU 17.19 vs. 7.38) and reaches a BLEU of 27.41 on Wolof → French, the highest score across all evaluated configurations. No T5 result is reported for Wolof → French. The bidirectional NLLB model (BLEU 18.81) provides a single-model solution for both translation directions. The combined bucketing-truncation strategy, Bayesian optimization, and character-level augmentation were used for all models and are credited as key contributors to these scores.
Citation
If you use this library or the methodology in your research, please cite:
@INPROCEEDINGS{10747017,
author = {Kane, Oumar and Bousso, Mamadou and Allaya, Mouhamad M. and Samb, Dame},
booktitle = {2024 Fifth International Conference on Intelligent Data Science Technologies and Applications (IDSTA)},
title = {Advancing Wolof-French Sentence Translation: Comparative Analysis of Transformer-Based Models and Methodological Insights},
year = {2024},
pages = {145--152},
doi = {10.1109/IDSTA62194.2024.10747017}
}
License
This project is licensed under the MIT License. See the LICENSE file for details.