Zelda Rose
A straightforward trainer for transformer-based models.
Installation
Simply install with pipx:

```bash
pipx install zeldarose
```
Train MLM models
Here is a short example of training first a tokenizer, then a transformer MLM model:
```bash
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```
The `.txt` files are meant to be raw text files, with one sample (e.g. sentence) per line.
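For illustration, such a file could be produced as below; the path and the sentences are placeholders, only the one-sample-per-line convention comes from the description above.

```bash
# Create a raw text training file with one sample (here, one sentence) per line.
# The path and the sentences are placeholders.
mkdir -p local
cat > local/train.txt <<'EOF'
This is the first training sample.
Each further sample goes on its own line.
EOF
```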
There are other parameters (see `zeldarose transformer --help` for a comprehensive list); the one you are probably most interested in is `--config`, which gives the path to a training config (for which we have examples in `examples/`).
The parameters `--pretrained-model`, `--tokenizer` and `--model-config` are all fed directly to Huggingface's `transformers` and can be pretrained model names or local paths.
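For instance, a run driven by a config file could look like the following. This is only a sketch: the config path is a placeholder (the repository's `examples/` directory contains actual configurations), and the other arguments are the same as in the example above.

```bash
# The --config path below is a placeholder; see examples/ for real configurations.
zeldarose transformer \
    --config local/my_training_config.toml \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```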
Distributed training
This is somewhat tricky; you have several options:

- If you are running in a SLURM cluster, use `--strategy ddp` and invoke via `srun` (a sketch of such a launch follows this list).
  - You might want to preprocess your data first outside of the main compute allocation. The `--profile` option might be abused for that purpose, since it won't run a full training, but will run any data preprocessing you ask for. It might also be beneficial at this step to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
- Otherwise you have two options:
  - Run with `--strategy ddp_spawn`, which uses `multiprocessing.spawn` to start the process swarm (tested, but possibly slower and more limited, see the `pytorch-lightning` doc).
  - Run with `--strategy ddp` and start with `torch.distributed.launch` with `--use_env` and `--no_python` (untested).
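As a rough illustration, a SLURM launch could look like the batch script below. This is a hedged sketch: the job name, resource numbers, and every path are placeholders, and only the `srun zeldarose transformer --strategy ddp` pattern and the flags shown earlier come from this document.

```bash
#!/bin/bash
# Hypothetical SLURM batch script: job name, resources and paths are placeholders.
#SBATCH --job-name=muppet-pretraining
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# srun starts one zeldarose process per task; --strategy ddp lets PyTorch
# Lightning coordinate them with DistributedDataParallel.
srun zeldarose transformer \
    --strategy ddp \
    --config local/my_training_config.toml \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```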
Other hints
- Data management relies on 🤗 datasets and uses their cache management system. To run in a clean environment, you might have to check the cache directory pointed to by the `HF_DATASETS_CACHE` environment variable, as in the example below.
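For example, the cache can be redirected for a given run by setting that variable before launching the training (the cache path below is a placeholder):

```bash
# Point the 🤗 datasets cache at a dedicated directory (placeholder path).
export HF_DATASETS_CACHE=local/hf_datasets_cache
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```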
Inspirations
- https://github.com/shoarora/lmtuners
- https://github.com/huggingface/transformers/blob/243e687be6cd701722cce050005a2181e78a08a8/examples/run_language_modeling.py
Citation
@inproceedings{grobol:hal-04262806,
TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
AUTHOR = {Grobol, Lo{\"i}c},
URL = {https://hal.science/hal-04262806},
BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
ADDRESS = {Singapore, Singapore},
YEAR = {2023},
MONTH = Dec,
PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
HAL_ID = {hal-04262806},
HAL_VERSION = {v1},
}