End-to-end French-Wolof neural machine translation library with Bayesian hyperparameter optimization, custom tokenization, and support for T5, BART, and NLLB Transformer models.

translate-package

A Python library for French-Wolof neural machine translation using Transformer-based models (T5, BART, NLLB) with Bayesian hyperparameter optimization, custom tokenization, and a novel bucketing-truncation sampling strategy for low-resource languages.


Overview

translate-package implements the methodology described in the paper "Advancing Wolof-French Sentence Translation", which presents a comparative study of Transformer-based models for translating between French and Wolof — a low-resource language spoken in Senegal.

The library provides a full pipeline covering tokenizer training, Bayesian hyperparameter optimization, model fine-tuning, and evaluation, with a novel bucketing and truncation sampling strategy that reflects sentence length distribution during fine-tuning.


Key Contributions

  • Novel Bucketing + Truncation Sampling: Groups sequences of similar length into buckets, which reduces padding within each batch, and truncates each language to a maximum length that is itself tuned as a hyperparameter. Batches drawn this way reflect the sentence length distribution of the corpus during fine-tuning, which is particularly useful for low-resource language pairs.

  • Bayesian Hyperparameter Optimization: Uses a Gaussian Process framework with an Upper Confidence Bound (UCB) acquisition function, optimizing the BLEU score as the objective metric over the hyperparameter search space.

  • Custom Tokenization: Supports Byte Pair Encoding (BPE) for BART/LSTM and SentencePiece for T5, with vocabulary sizes adapted to the Wolof-French corpus.

  • Data Augmentation: Character-level substitutions and swaps via nlpaug to compensate for the scarcity of Wolof parallel corpora.

  • Multi-model Support: Fine-tune T5, BART, NLLB, or an LSTM baseline within the same pipeline.
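
The bucketing and truncation idea from the list above can be pictured with a small sketch. This is not the library's implementation: the function names, the `bucket_width` parameter, and the choice to bucket on source length only are assumptions made for illustration.

```python
import random
from collections import defaultdict


def truncate(tokens, max_len):
    """Keep at most max_len tokens of a sequence."""
    return tokens[:max_len]


def bucketed_batches(pairs, src_max_len, tgt_max_len, batch_size,
                     bucket_width=8, seed=0):
    """Return shuffled batches whose source sequences have similar lengths.

    pairs: list of (src_tokens, tgt_tokens).
    bucket_width is a hypothetical knob: sequences whose truncated source
    length falls in the same width-sized interval share a bucket.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for src, tgt in pairs:
        src = truncate(src, src_max_len)
        tgt = truncate(tgt, tgt_max_len)
        buckets[len(src) // bucket_width].append((src, tgt))

    batches = []
    for items in buckets.values():
        rng.shuffle(items)  # shuffle inside a bucket
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # shuffle the order in which batches are seen
    return batches


# Toy corpus: source lengths 1..40, target lengths 4..43.
pairs = [(["w"] * n, ["v"] * (n + 3)) for n in range(1, 41)]
batches = bucketed_batches(pairs, src_max_len=20, tgt_max_len=15, batch_size=4)
```

Because longer buckets hold more examples when long sentences are more common, the number of batches per bucket follows the length distribution of the corpus.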


Installation

pip install translate-package

After installation, extract the workflow scripts into your working directory:

translate-init --output-dir ./my_experiment

This copies the following scripts locally so you can inspect and run them:

my_experiment/
├── translate_tokenizer.py
├── translate_hyperparameter_tuning.py
├── translate_finetuning.py
├── translate_test.py
└── save_to_hub.py

Getting Started

All scripts use argparse and are run from the command line. The typical workflow follows four sequential steps:

Train Tokenizer → Hyperparameter Tuning → Fine-Tuning → Test

Optionally push the best model to the Hugging Face Hub with save_to_hub.py.
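
The commands in the Workflow section pass paired flags such as `--use_bucketing` and `--no-use_peft`. A minimal sketch of how such flags can be declared with argparse follows, assuming Python 3.9+ and `argparse.BooleanOptionalAction`; the defaults shown are assumptions, not the package's actual defaults, and only a subset of the flags is included.

```python
import argparse


def build_parser():
    """Illustrative parser covering a few of the flags used in this README."""
    parser = argparse.ArgumentParser(description="Illustrative subset of the CLI")
    parser.add_argument("--model_generation",
                        choices=["t5", "bart", "nllb", "lstm"], required=True)
    parser.add_argument("--batch_size", type=int, default=64)
    # BooleanOptionalAction registers both --use_bucketing and --no-use_bucketing.
    parser.add_argument("--use_bucketing",
                        action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--use_peft",
                        action=argparse.BooleanOptionalAction, default=False)
    return parser


args = build_parser().parse_args(
    ["--model_generation", "t5", "--use_bucketing", "--no-use_peft"]
)
print(args)
```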


Workflow

1. Train the Tokenizer

Train a custom tokenizer on your parallel corpus before fine-tuning. This step is only required for T5 (SentencePiece) and BART/LSTM (BPE). NLLB uses its own built-in tokenizer and skips this step.

BPE tokenizer (for BART / LSTM):

python translate_tokenizer.py \
  --file_name sent_tokenizer \
  --dataset_file corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --vocab_size 15000 \
  --name bpe

SentencePiece tokenizer (for T5):

python translate_tokenizer.py \
  --file_name sent_tokenizer \
  --dataset_file corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --vocab_size 15000 \
  --name sp

--name value    Tokenizer type               Used by
bpe             Byte Pair Encoding           BART, LSTM
sp              SentencePiece                T5
(none needed)   Model's built-in tokenizer   NLLB
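
For readers unfamiliar with Byte Pair Encoding, the merge-learning loop behind the `bpe` option can be sketched in a few lines of plain Python. This is a toy version for intuition only; the tokenizer script presumably relies on a dedicated tokenizer library, and `learn_bpe` is a hypothetical name.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["abab", "abab", "abc"], num_merges=2)
```

In this picture, `--vocab_size` bounds how many merges are learned: a larger vocabulary keeps more multi-character units.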

2. Hyperparameter Tuning

Run Bayesian optimization to find the best learning rate, sequence lengths, augmentation probabilities, and other hyperparameters. Results are logged to Weights & Biases.
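
As a rough picture of the search, the sketch below draws candidate configurations inside the `--min_*` / `--max_*` bounds and ranks them with an Upper Confidence Bound rule. It assumes the posterior mean and standard deviation of BLEU come from a Gaussian Process fitted elsewhere; the log-uniform learning-rate sampling, the `kappa` value, and all function names are assumptions rather than the library's code.

```python
import math
import random


def sample_config(rng, min_lr, max_lr, min_src_max_len, max_src_max_len):
    """Draw one candidate: log-uniform learning rate, uniform integer length."""
    lr = math.exp(rng.uniform(math.log(min_lr), math.log(max_lr)))
    src_max_len = rng.randint(min_src_max_len, max_src_max_len)
    return {"lr": lr, "src_max_len": src_max_len}


def ucb(mean, std, kappa=2.0):
    """Upper Confidence Bound: rewards high predicted BLEU and high uncertainty."""
    return mean + kappa * std


def pick_next(scored, kappa=2.0):
    """scored: list of (config, posterior_mean, posterior_std) tuples."""
    best = max(scored, key=lambda item: ucb(item[1], item[2], kappa))
    return best[0]


rng = random.Random(0)
candidate = sample_config(rng, 1e-4, 1e-2, 72, 111)

# Two hypothetical candidates with made-up posterior estimates.
choice = pick_next([("safe", 10.0, 0.5), ("uncertain", 9.0, 2.0)])
```

A larger `kappa` pushes the search toward unexplored regions; `kappa=0` reduces the rule to picking the highest predicted BLEU.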

T5 example (French → Wolof):

python translate_hyperparameter_tuning.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --no-save_model \
  --save_artifact \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --min_lr 1e-4 --max_lr 1e-2 \
  --min_src_max_len 72 --max_src_max_len 111 \
  --min_tgt_max_len 67 --max_tgt_max_len 85 \
  --epochs 1 \
  --batch_size 64 \
  --max_words 21 \
  --min_nts 2245 --max_nts 2245 \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>

NLLB example (bidirectional):

python translate_hyperparameter_tuning.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --no-save_artifact \
  --no-save_model \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --min_lr 7e-6 --max_lr 1e-4 \
  --min_src_max_len 76 --max_src_max_len 117 \
  --min_tgt_max_len 76 --max_tgt_max_len 97 \
  --epochs 1 \
  --batch_size 64 \
  --max_words 21 \
  --min_nts 4485 --max_nts 4485 \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
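
One way to picture `--bidirectional_tr`, assuming joint training simply presents every sentence pair in both directions, is the sketch below. How the library actually builds and tags bidirectional examples is not documented here, so `make_bidirectional` and the `direction` field are hypothetical.

```python
def make_bidirectional(pairs, src_label="french", tgt_label="wolof"):
    """Duplicate each (src, tgt) pair so both translation directions are seen."""
    examples = []
    for src, tgt in pairs:
        examples.append({"source": src, "target": tgt,
                         "direction": f"{src_label}->{tgt_label}"})
        examples.append({"source": tgt, "target": src,
                         "direction": f"{tgt_label}->{src_label}"})
    return examples


examples = make_bidirectional([("merci", "jërëjëf")])
```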

3. Fine-Tuning

Fine-tune the model using the best hyperparameters found in the previous step. If the tuning run saved a model artifact (--save_artifact), pass its W&B path with --artifact_path.

T5 example:

python translate_finetuning.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --lr 0.006254861077618995 \
  --wd 0.004978258153761286 \
  --src_max_len 97 \
  --tgt_max_len 70 \
  --p_word 0.010406709461818209 \
  --p_char 0.9269797815116728 \
  --max_epochs 5 \
  --batch_size 64 \
  --max_words 21 \
  --nts 2245 \
  --metric bleu --mode max \
  --clean_ckpt_dir \
  --run_name t5-fine-tuning \
  --new_artifact_name machine_translation_best_model_t5_fr_wf \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>

NLLB example (bidirectional):

python translate_finetuning.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --lr 0.00009877589029418412 \
  --wd 0.003711074112216225 \
  --src_max_len 96 \
  --tgt_max_len 91 \
  --p_word 0.06650106426899016 \
  --p_char 0.7166733632090903 \
  --max_epochs 5 \
  --batch_size 64 \
  --max_words 21 \
  --nts 4485 \
  --metric bleu --mode max \
  --clean_ckpt_dir \
  --run_name nllb-fine-tuning-fr-wf-bid \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>

4. Testing

Evaluate the best model checkpoint on the test set. Reports BLEU and ROUGE-L scores.

T5 example:

python translate_test.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --src_max_len 97 \
  --tgt_max_len 70 \
  --p_word 0.010406709461818209 \
  --p_char 0.9269797815116728 \
  --batch_size 64 \
  --max_words 21 \
  --run_name t5-test-fr-wf \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>

NLLB example:

python translate_test.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --src_max_len 96 \
  --tgt_max_len 91 \
  --p_word 0.06650106426899016 \
  --p_char 0.7166733632090903 \
  --batch_size 64 \
  --max_words 21 \
  --run_name nllb-test-fr-wf-bid \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>

5. Save to Hugging Face Hub

Push your best fine-tuned model directly to the Hugging Face Hub:

python save_to_hub.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --no-bidirectional \
  --no-use_peft \
  --src_label french \
  --tgt_label wolof \
  --run_name nllb-save-to-hub-fr-wf-bid \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY> \
  --token <YOUR_HF_TOKEN> \
  --username <YOUR_HF_USERNAME> \
  --repo_name nllb_french_wolof_bilateral

Supported Models

Model   --model_generation   --model_name example
T5      t5                   google-t5/t5-base, google-t5/t5-small
BART    bart                 facebook/bart-base
NLLB    nllb                 facebook/nllb-200-distilled-600M
LSTM    lstm                 none (trained from scratch)

Arguments Reference

Shared arguments (all scripts)

Argument             Type   Description
--model_generation   str    Model family: t5, bart, nllb, lstm
--model_name         str    Hugging Face model identifier
--data_path          str    Path to the parallel corpus CSV file
--src_label          str    Column name for source language (e.g. french)
--tgt_label          str    Column name for target language (e.g. wolof)
--batch_size         int    Batch size for training / evaluation
--num_workers        int    Number of DataLoader workers
--use_bucketing      flag   Enable bucketing sampling strategy
--use_truncation     flag   Enable truncation of sequences
--bidirectional_tr   flag   Train in both translation directions
--use_peft           flag   Enable parameter-efficient fine-tuning (PEFT)
--project            str    W&B project name
--key                str    W&B API key
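
To make the roles of `--data_path`, `--src_label`, and `--tgt_label` concrete, here is a minimal sketch of reading sentence pairs from a CSV whose header row names the two language columns. The `load_pairs` helper and the decision to skip rows with an empty cell are assumptions, not the package's loader.

```python
import csv
import io


def load_pairs(csv_file, src_label, tgt_label):
    """Read (source, target) pairs from an open CSV file, skipping empty rows."""
    pairs = []
    for row in csv.DictReader(csv_file):
        src = (row.get(src_label) or "").strip()
        tgt = (row.get(tgt_label) or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# In-memory stand-in for a file such as corpus35k.csv.
sample = io.StringIO("french,wolof\nmerci,jërëjëf\n,\n")
pairs = load_pairs(sample, "french", "wolof")
```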

Hyperparameter tuning specific

Argument                                Type    Description
--min_lr / --max_lr                     float   Learning rate search bounds
--min_src_max_len / --max_src_max_len   int     Source max length search bounds
--min_tgt_max_len / --max_tgt_max_len   int     Target max length search bounds
--min_nts / --max_nts                   int     Number of training steps bounds
--epochs                                int     Epochs per Bayesian trial
--save_artifact                         flag    Save best trial model as W&B artifact

Fine-tuning specific

Argument              Type    Description
--lr                  float   Learning rate (from tuning step)
--wd                  float   Weight decay
--src_max_len         int     Source sequence max length
--tgt_max_len         int     Target sequence max length
--p_word              float   Word-level augmentation probability
--p_char              float   Character-level augmentation probability
--max_epochs          int     Maximum training epochs
--metric              str     Metric to monitor (bleu, loss)
--mode                str     Optimization direction (max or min)
--artifact_path       str     W&B artifact path from tuning step
--new_artifact_name   str     Name for the saved fine-tuned model artifact
--clean_ckpt_dir      flag    Remove checkpoint directory after saving
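
The effect of `--p_char` can be illustrated with a small stand-in for character-level augmentation. The library uses nlpaug for this; the sketch below does not call nlpaug, and it assumes `p_char` is the probability that a sentence receives one character edit (a neighbor swap or a substitution), which may differ from the actual semantics.

```python
import random
import string


def augment_chars(sentence, p_char, rng, alphabet=string.ascii_lowercase):
    """With probability p_char, apply one character-level edit to the sentence."""
    if len(sentence) < 2 or rng.random() >= p_char:
        return sentence
    chars = list(sentence)
    i = rng.randrange(len(chars) - 1)
    if rng.random() < 0.5:
        # Swap two neighboring characters.
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        # Substitute one character with a random letter.
        chars[i] = rng.choice(alphabet)
    return "".join(chars)


rng = random.Random(0)
original = "bonjour tout le monde"
noisy = augment_chars(original, p_char=1.0, rng=rng)
```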

Results

All models were fine-tuned for 5 epochs on the French-Wolof parallel corpus using the bucketing-truncation sampling strategy and Bayesian hyperparameter optimization. Evaluation metrics include BLEU, ROUGE-1, ROUGE-2, and ROUGE-L.
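
As a reminder of what ROUGE-L measures, the sketch below computes an F1 score from the longest common subsequence (LCS) of whitespace-separated tokens. It is an illustration under simple assumptions (no stemming, equal weight on precision and recall); the scores reported here were presumably produced with a standard metric library and may use a different tokenization or weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a candidate and a reference sentence."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a b c d", "a c d e")
```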

French → Wolof

Model                     BLEU    ROUGE-1   ROUGE-2   ROUGE-L   Train Loss   Eval Loss
NLLB-200-distilled-600M   17.19   0.4414    0.2236    0.4051    1.0484       1.9932
T5-base                   7.38    0.2729    0.1010    0.2454    1.2241       3.2776
BART*                     pending
LSTM*                     pending

Wolof → French

Model                     BLEU    ROUGE-1   ROUGE-2   ROUGE-L   Train Loss   Eval Loss
NLLB-200-distilled-600M   27.41   0.4833    0.3009    0.4518    0.7981       1.3009
BART*                     pending
LSTM*                     pending

NLLB-200 Bidirectional (Fr↔Wf joint training)

Metric        Value
BLEU          18.81
ROUGE-1       0.4076
ROUGE-2       0.2208
ROUGE-L       0.3750
Train Loss    0.9842
Eval Loss     1.7990
Global Step   4,485

* BART and LSTM results will be added upon completion. T5 fr→wf results correspond to global_step: 2,245 (5 epochs).

NLLB-200 more than doubles T5's BLEU on French → Wolof (17.19 vs. 7.38) and reaches a BLEU of 27.41 on Wolof → French, the highest score across all evaluated configurations. T5 results are reported only for French → Wolof. The bidirectional NLLB model covers both translation directions with a single model and reaches a BLEU of 18.81. The combined bucketing-truncation strategy, Bayesian optimization, and character-level augmentation were key contributors across all models.


Citation

If you use this library or the methodology in your research, please cite:

@INPROCEEDINGS{10747017,
  author    = {Kane, Oumar and Bousso, Mamadou and Allaya, Mouhamad M. and Samb, Dame},
  booktitle = {2024 Fifth International Conference on Intelligent Data Science Technologies and Applications (IDSTA)},
  title     = {Advancing Wolof-French Sentence Translation: Comparative Analysis of Transformer-Based Models and Methodological Insights},
  year      = {2024},
  pages     = {145--152},
  doi       = {10.1109/IDSTA62194.2024.10747017}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
