End-to-end French-Wolof neural machine translation library with Bayesian hyperparameter optimization, custom tokenization, and support for T5, BART, and NLLB Transformer models.

translate-package

A Python library for French-Wolof neural machine translation using Transformer-based models (T5, BART, NLLB) with Bayesian hyperparameter optimization, custom tokenization, and a novel bucketing-truncation sampling strategy for low-resource languages.

Overview
Key Contributions
Installation
Getting Started
Workflow
Supported Models
Arguments Reference
Results
Citation
License

Overview

translate-package implements the methodology described in the paper "Advancing Wolof-French Sentence Translation", which presents a comparative study of Transformer-based models for translating between French and Wolof — a low-resource language spoken in Senegal.

The library provides a full pipeline covering tokenizer training, Bayesian hyperparameter optimization, model fine-tuning, and evaluation. Its bucketing and truncation sampling strategy batches sentences of similar length and caps sequence length per language, so that fine-tuning follows the sentence length distribution of the corpus.

Key Contributions

Novel Bucketing + Truncation Sampling: Groups sequences of similar length into buckets, which gives more homogeneous batches with less padding and improves computational efficiency, and truncates each language to a maximum length that is itself optimized as a hyperparameter. Together, the two steps keep fine-tuning batches aligned with the sentence length distribution of the corpus, which is particularly effective for low-resource language pairs.
Bayesian Hyperparameter Optimization: Uses a Gaussian Process framework with an Upper Confidence Bound (UCB) acquisition function, optimizing the BLEU score as the objective metric over the hyperparameter search space.
Custom Tokenization: Supports Byte Pair Encoding (BPE) for BART/LSTM and SentencePiece for T5, with vocabulary sizes adapted to the Wolof-French corpus.
Data Augmentation: Character-level substitutions and swaps via nlpaug to compensate for the scarcity of Wolof parallel corpora.
Multi-model Support: Fine-tune T5, BART, NLLB, or an LSTM baseline within the same pipeline.
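
The bucketing and truncation idea described in the first contribution can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's sampler: the function names and the `bucket_width` parameter are hypothetical, and sequences are represented as plain token lists.

```python
import random
from collections import defaultdict


def truncate(tokens, max_len):
    """Keep at most max_len tokens of a sequence."""
    return tokens[:max_len]


def bucket_batches(pairs, src_max_len, tgt_max_len, batch_size, bucket_width=8, seed=0):
    """Return batches whose source sequences have similar lengths.

    pairs: list of (src_tokens, tgt_tokens) token lists.
    bucket_width is a hypothetical parameter: sequences whose truncated
    source length falls in the same interval of that width share a bucket.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for src, tgt in pairs:
        # Truncate each side to its own per-language maximum length.
        src, tgt = truncate(src, src_max_len), truncate(tgt, tgt_max_len)
        buckets[len(src) // bucket_width].append((src, tgt))

    batches = []
    for bucket in buckets.values():
        rng.shuffle(bucket)  # shuffle inside a bucket, never across buckets
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)  # randomize the order in which batches are seen
    return batches


if __name__ == "__main__":
    data = [(["w"] * n, ["w"] * (n + 1)) for n in (3, 4, 5, 20, 21, 22, 120)]
    for batch in bucket_batches(data, src_max_len=97, tgt_max_len=70, batch_size=2):
        print([len(src) for src, _ in batch])
```

Because every batch comes from a single bucket, the longest and shortest sequence in a batch differ by less than `bucket_width` tokens, which limits padding.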

Installation

pip install translate-package

After installation, extract the workflow scripts into your working directory:

translate-init --output-dir ./my_experiment

This copies the following scripts locally so you can inspect and run them:

my_experiment/
├── translate_tokenizer.py
├── translate_hyperparameter_tuning.py
├── translate_finetuning.py
├── translate_test.py
└── save_to_hub.py

Getting Started

All scripts use argparse and are run from the command line. The typical workflow follows four sequential steps:

Train Tokenizer → Hyperparameter Tuning → Fine-Tuning → Test

Optionally push the best model to the Hugging Face Hub with save_to_hub.py.
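
The command examples in this document pair flags such as `--use_peft` with a negated form `--no-use_peft`. One way argparse produces such pairs is `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below assumes that mechanism purely for illustration; the parser, defaults, and argument subset are hypothetical and may differ from what the scripts actually define.

```python
import argparse


def build_parser():
    """Build a small parser with paired on/off flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Illustrative flag parsing")
    parser.add_argument("--model_generation", type=str, default="t5")
    # BooleanOptionalAction registers both --use_peft and --no-use_peft.
    parser.add_argument("--use_peft", action=argparse.BooleanOptionalAction, default=False)
    parser.add_argument("--use_bucketing", action=argparse.BooleanOptionalAction, default=True)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(
    ["--model_generation", "nllb", "--no-use_peft", "--use_bucketing"]
)
print(args.model_generation, args.use_peft, args.use_bucketing)
```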

Workflow

1. Train the Tokenizer

Train a custom tokenizer on your parallel corpus before fine-tuning. This step is only required for T5 (SentencePiece) and BART/LSTM (BPE). NLLB uses its own built-in tokenizer and skips this step.

BPE tokenizer (for BART / LSTM):

python translate_tokenizer.py \
  --file_name sent_tokenizer \
  --dataset_file corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --vocab_size 15000 \
  --name bpe

SentencePiece tokenizer (for T5):

python translate_tokenizer.py \
  --file_name sent_tokenizer \
  --dataset_file corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --vocab_size 15000 \
  --name sp

`--name` value	Tokenizer type	Used by
`bpe`	Byte Pair Encoding	BART, LSTM
`sp`	SentencePiece	T5
(none needed)	Model's built-in tokenizer	NLLB
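
For readers unfamiliar with Byte Pair Encoding, the merge-learning loop at its core can be sketched as follows. This is a textbook version written for illustration, not the trainer the library uses; `learn_bpe` and its arguments are hypothetical names, and `num_merges` only loosely corresponds to a target vocabulary size.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a tuple of characters plus an end-of-word marker;
    every iteration merges the most frequent adjacent symbol pair.
    """
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe(["abab", "abab", "abc"], num_merges=3))
```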

2. Hyperparameter Tuning

Run Bayesian optimization to find the best learning rate, sequence lengths, augmentation probabilities, and other hyperparameters. Results are logged to Weights & Biases.

T5 example (French → Wolof):

python translate_hyperparameter_tuning.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --no-save_model \
  --save_artifact \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --min_lr 1e-4 --max_lr 1e-2 \
  --min_src_max_len 72 --max_src_max_len 111 \
  --min_tgt_max_len 67 --max_tgt_max_len 85 \
  --epochs 1 \
  --batch_size 64 \
  --max_words 21 \
  --min_nts 2245 --max_nts 2245 \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>

NLLB example (bidirectional):

python translate_hyperparameter_tuning.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --no-save_artifact \
  --no-save_model \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --min_lr 7e-6 --max_lr 1e-4 \
  --min_src_max_len 76 --max_src_max_len 117 \
  --min_tgt_max_len 76 --max_tgt_max_len 97 \
  --epochs 1 \
  --batch_size 64 \
  --max_words 21 \
  --min_nts 4485 --max_nts 4485 \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
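
The Upper Confidence Bound rule used to choose the next trial can be sketched as below. The sketch covers only the acquisition step: it assumes a surrogate model (a Gaussian Process in this library) has already produced a predicted BLEU mean and standard deviation for each candidate. The function names, the `kappa` default, and the candidate values are hypothetical.

```python
def ucb(mean, std, kappa=2.0):
    """Upper Confidence Bound: rewards high predicted score and high uncertainty."""
    return mean + kappa * std


def pick_next(candidates, kappa=2.0):
    """Return the candidate with the largest UCB value.

    candidates: list of dicts holding a hyperparameter setting plus the
    surrogate's predicted BLEU mean ("mean") and standard deviation ("std").
    """
    return max(candidates, key=lambda c: ucb(c["mean"], c["std"], kappa))


if __name__ == "__main__":
    # Made-up surrogate predictions, for illustration only.
    candidates = [
        {"lr": 1e-4, "mean": 6.0, "std": 0.5},
        {"lr": 1e-3, "mean": 5.0, "std": 2.0},
    ]
    print(pick_next(candidates, kappa=2.0)["lr"])
```

A larger `kappa` favors exploration of uncertain regions of the search space; `kappa=0` reduces the rule to picking the highest predicted mean.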

3. Fine-Tuning

Fine-tune the model using the best hyperparameters found in the previous step. If the tuning run saved a W&B artifact, pass its path with --artifact_path.

T5 example:

python translate_finetuning.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --lr 0.006254861077618995 \
  --wd 0.004978258153761286 \
  --src_max_len 97 \
  --tgt_max_len 70 \
  --p_word 0.010406709461818209 \
  --p_char 0.9269797815116728 \
  --max_epochs 5 \
  --batch_size 64 \
  --max_words 21 \
  --nts 2245 \
  --metric bleu --mode max \
  --clean_ckpt_dir \
  --run_name t5-fine-tuning \
  --new_artifact_name machine_translation_best_model_t5_fr_wf \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>

NLLB example (bidirectional):

python translate_finetuning.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --lr 0.00009877589029418412 \
  --wd 0.003711074112216225 \
  --src_max_len 96 \
  --tgt_max_len 91 \
  --p_word 0.06650106426899016 \
  --p_char 0.7166733632090903 \
  --max_epochs 5 \
  --batch_size 64 \
  --max_words 21 \
  --nts 4485 \
  --metric bleu --mode max \
  --clean_ckpt_dir \
  --run_name nllb-fine-tuning-fr-wf-bid \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
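
The `--metric` and `--mode` pair decides which checkpoint counts as best. A minimal sketch of that comparison, assuming one metric value is recorded per epoch (the helper names are hypothetical and the library's checkpointing code may work differently):

```python
def is_improvement(current, best, mode):
    """Return True when `current` beats `best` under the given mode."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    if best is None:
        return True
    return current > best if mode == "max" else current < best


def best_epoch(scores, mode):
    """Return (epoch_index, score) of the best epoch in a list of scores."""
    best_idx, best = None, None
    for idx, score in enumerate(scores):
        if is_improvement(score, best, mode):
            best_idx, best = idx, score
    return best_idx, best


if __name__ == "__main__":
    # Hypothetical per-epoch values: BLEU is maximized, loss is minimized.
    print(best_epoch([4.1, 6.3, 5.9], mode="max"))
    print(best_epoch([2.4, 1.9, 2.1], mode="min"))
```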

4. Testing

Evaluate the best model checkpoint on the test set. The script reports BLEU and ROUGE-L scores.

T5 example:

python translate_test.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --src_max_len 97 \
  --tgt_max_len 70 \
  --p_word 0.010406709461818209 \
  --p_char 0.9269797815116728 \
  --batch_size 64 \
  --max_words 21 \
  --run_name t5-test-fr-wf \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>

NLLB example (bidirectional):

python translate_test.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --src_max_len 96 \
  --tgt_max_len 91 \
  --p_word 0.06650106426899016 \
  --p_char 0.7166733632090903 \
  --batch_size 64 \
  --max_words 21 \
  --run_name nllb-test-fr-wf-bid \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
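
To make the ROUGE-L score concrete, here is a minimal sentence-level sketch based on the longest common subsequence (LCS). It assumes whitespace tokenization and the balanced F1 variant; the evaluation code in the library may tokenize or aggregate differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis):
    """Sentence-level ROUGE-L F1 on whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(round(rouge_l("a b c d", "a c"), 4))
```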

5. Save to Hugging Face Hub

Push your best fine-tuned model directly to the Hugging Face Hub:

python save_to_hub.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --no-bidirectional \
  --no-use_peft \
  --src_label french \
  --tgt_label wolof \
  --run_name nllb-save-to-hub-fr-wf-bid \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY> \
  --token <YOUR_HF_TOKEN> \
  --username <YOUR_HF_USERNAME> \
  --repo_name nllb_french_wolof_bilateral

Supported Models

Model	`--model_generation`	`--model_name` example
T5	`t5`	`google-t5/t5-base`, `google-t5/t5-small`
BART	`bart`	`facebook/bart-base`
NLLB	`nllb`	`facebook/nllb-200-distilled-600M`
LSTM	`lstm`	— (trained from scratch)

Arguments Reference

Shared arguments (all scripts)

Argument	Type	Description
`--model_generation`	`str`	Model family: `t5`, `bart`, `nllb`, `lstm`
`--model_name`	`str`	Hugging Face model identifier
`--data_path`	`str`	Path to the parallel corpus CSV file
`--src_label`	`str`	Column name for source language (e.g. `french`)
`--tgt_label`	`str`	Column name for target language (e.g. `wolof`)
`--batch_size`	`int`	Batch size for training / evaluation
`--num_workers`	`int`	Number of DataLoader workers
`--use_bucketing`	`flag`	Enable bucketing sampling strategy
`--use_truncation`	`flag`	Enable truncation of sequences
`--bidirectional_tr`	`flag`	Train in both translation directions
`--use_peft`	`flag`	Enable parameter-efficient fine-tuning (PEFT)
`--project`	`str`	W&B project name
`--key`	`str`	W&B API key
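
The `--data_path`, `--src_label`, and `--tgt_label` arguments together describe a CSV file with one column per language. A minimal sketch of reading such a file with the standard library follows; `load_pairs` is a hypothetical helper, and the sample rows are placeholders rather than real corpus content.

```python
import csv
import io


def load_pairs(csv_file, src_label, tgt_label):
    """Read (source, target) sentence pairs from an open CSV file."""
    reader = csv.DictReader(csv_file)
    missing = {src_label, tgt_label} - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"missing column(s): {sorted(missing)}")
    # Treat absent cells as empty strings and trim surrounding whitespace.
    return [
        ((row[src_label] or "").strip(), (row[tgt_label] or "").strip())
        for row in reader
    ]


if __name__ == "__main__":
    # Placeholder content standing in for a file such as corpus35k.csv.
    sample = io.StringIO("french,wolof\nfr sentence 1,wo sentence 1\n")
    print(load_pairs(sample, src_label="french", tgt_label="wolof"))
```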

Hyperparameter tuning specific

Argument	Type	Description
`--min_lr` / `--max_lr`	`float`	Learning rate search bounds
`--min_src_max_len` / `--max_src_max_len`	`int`	Source max length search bounds
`--min_tgt_max_len` / `--max_tgt_max_len`	`int`	Target max length search bounds
`--min_nts` / `--max_nts`	`int`	Number of training steps bounds
`--epochs`	`int`	Epochs per Bayesian trial
`--save_artifact`	`flag`	Save best trial model as W&B artifact
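
The min/max pairs above define a box-shaped search space. The sketch below draws one configuration from such bounds, as is commonly done for the first trials before a surrogate model has data to fit. The bounds are copied from the T5 example in this document; the log-uniform draw for the learning rate is an assumption, and `sample_config` is a hypothetical helper.

```python
import math
import random


def sample_config(bounds, rng):
    """Draw one configuration from the search bounds.

    The learning rate is drawn log-uniformly (an assumption, common for
    learning rates); integer lengths are drawn uniformly, bounds inclusive.
    """
    lo, hi = bounds["lr"]
    lr = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return {
        "lr": lr,
        "src_max_len": rng.randint(*bounds["src_max_len"]),
        "tgt_max_len": rng.randint(*bounds["tgt_max_len"]),
    }


# Bounds taken from the T5 tuning example (--min_lr/--max_lr and so on).
bounds = {"lr": (1e-4, 1e-2), "src_max_len": (72, 111), "tgt_max_len": (67, 85)}
config = sample_config(bounds, random.Random(0))
print(config)
```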

Fine-tuning specific

Argument	Type	Description
`--lr`	`float`	Learning rate (from tuning step)
`--wd`	`float`	Weight decay
`--src_max_len`	`int`	Source sequence max length
`--tgt_max_len`	`int`	Target sequence max length
`--p_word`	`float`	Word-level augmentation probability
`--p_char`	`float`	Character-level augmentation probability
`--max_epochs`	`int`	Maximum training epochs
`--metric`	`str`	Metric to monitor (`bleu`, `loss`)
`--mode`	`str`	Optimization direction (`max` or `min`)
`--artifact_path`	`str`	W&B artifact path from tuning step
`--new_artifact_name`	`str`	Name for the saved fine-tuned model artifact
`--clean_ckpt_dir`	`flag`	Remove checkpoint directory after saving
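
One possible reading of `--p_word` and `--p_char` is sketched below: each word is selected with probability `p_word`, and each character of a selected word is substituted with probability `p_char`. This interpretation is an assumption made for illustration. The library performs augmentation through nlpaug, whose exact semantics may differ, and the sketch covers substitution only, not swaps.

```python
import random
import string


def augment(sentence, p_word, p_char, rng):
    """Randomly substitute characters in a sentence.

    A word is selected with probability p_word; inside a selected word,
    each character is replaced by a random lowercase ASCII letter with
    probability p_char.
    """
    out = []
    for word in sentence.split():
        if rng.random() < p_word:
            word = "".join(
                rng.choice(string.ascii_lowercase) if rng.random() < p_char else ch
                for ch in word
            )
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment("bonjour tout le monde", p_word=0.5, p_char=0.3, rng=rng))
```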

Results

All models were fine-tuned for 5 epochs on the French-Wolof parallel corpus using the bucketing-truncation sampling strategy and Bayesian hyperparameter optimization. Evaluation metrics include BLEU, ROUGE-1, ROUGE-2, and ROUGE-L.

French → Wolof

Model	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	Train Loss	Eval Loss
NLLB-200-distilled-600M	17.19	0.4414	0.2236	0.4051	1.0484	1.9932
T5-base	7.38	0.2729	0.1010	0.2454	1.2241	3.2776
BART*	—	—	—	—	—	—
LSTM*	—	—	—	—	—	—

Wolof → French

Model	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	Train Loss	Eval Loss
NLLB-200-distilled-600M	27.41	0.4833	0.3009	0.4518	0.7981	1.3009
BART*	—	—	—	—	—	—
LSTM*	—	—	—	—	—	—

NLLB-200 Bidirectional (Fr↔Wf joint training)

Metric	Value
BLEU	18.81
ROUGE-1	0.4076
ROUGE-2	0.2208
ROUGE-L	0.3750
Train Loss	0.9842
Eval Loss	1.7990
Global Step	4,485

* BART and LSTM results will be added once those runs are complete. The T5 French → Wolof results correspond to global step 2,245 (5 epochs).

NLLB-200 outperforms T5 on French → Wolof by a wide margin (17.19 vs. 7.38 BLEU); no T5 result is reported for Wolof → French. NLLB-200 reaches a BLEU of 27.41 on Wolof → French, the highest score across all evaluated configurations. The bidirectional NLLB training strategy provides a strong single-model solution for both translation directions simultaneously (BLEU 18.81). The combined bucketing-truncation strategy, Bayesian optimization, and character-level augmentation were key contributors across all models.

Citation

If you use this library or the methodology in your research, please cite:

@INPROCEEDINGS{10747017,
  author    = {Kane, Oumar and Bousso, Mamadou and Allaya, Mouhamad M. and Samb, Dame},
  booktitle = {2024 Fifth International Conference on Intelligent Data Science Technologies and Applications (IDSTA)},
  title     = {Advancing Wolof-French Sentence Translation: Comparative Analysis of Transformer-Based Models and Methodological Insights},
  year      = {2024},
  pages     = {145--152},
  doi       = {10.1109/IDSTA62194.2024.10747017}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
