Zelda Rose
A straightforward trainer for transformer-based models.
Installation

Simply install with pipx:

```sh
pipx install zeldarose
```
Train MLM models
Here is a short example of training first a tokenizer, then a transformer MLM model:

```sh
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```
The `.txt` files are meant to be raw text files, with one sample (e.g. one sentence) per line.
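As an illustration, such a file could look like the following (the lines here are invented, not the actual contents of `tests/fixtures/raw.txt`):

```text
The quick brown fox jumps over the lazy dog.
Colorless green ideas sleep furiously.
Each further sample goes on its own line.
```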
There are other parameters (see `zeldarose transformer --help` for a comprehensive list); the one you are probably most interested in is `--config`, which gives the path to a training config (for which we have examples in `examples/`).
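Such a config might look like the following TOML sketch. The key names below are assumptions made for illustration, not the authoritative schema: check the files in `examples/` for the real format.

```toml
# Hypothetical MLM training config: every key here is illustrative,
# the authoritative examples live in examples/.
type = "mlm"

[task]
change_ratio = 0.15  # share of tokens selected for the MLM objective
mask_ratio = 0.8     # share of selected tokens replaced by the mask token
switch_ratio = 0.1   # share of selected tokens replaced by a random token

[tuning]
batch_size = 64
learning_rate = 1e-4
warmup_steps = 10000
```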
The parameters `--pretrained-model`, `--tokenizer` and `--model-config` are all fed directly to Huggingface's `transformers` and can be pretrained model names or local paths.
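Conversely, the trained model can be loaded back with the usual `transformers` API. This is a minimal sketch, assuming the out-dir from the example above holds a `transformers`-format checkpoint; check the actual layout of `local/muppet` and adjust the path if the model is saved in a subdirectory:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumption: local/muppet (the --out-dir above) contains a
# transformers-format checkpoint and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("local/muppet")
model = AutoModelForMaskedLM.from_pretrained("local/muppet")
```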
Distributed training
This is somewhat tricky; you have several options:

- If you are running in a SLURM cluster, use `--strategy ddp` and invoke via `srun`.
- You might want to preprocess your data first, outside of the main compute allocation. The `--profile` option can be abused for that purpose, since it won't run a full training, but will run any data preprocessing you ask for. It might also be beneficial at this step to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
- Otherwise, you have two options:
  - Run with `--strategy ddp_spawn`, which uses `multiprocessing.spawn` to start the process swarm (tested, but possibly slower and more limited, see the `pytorch-lightning` doc).
  - Run with `--strategy ddp` and start with `torch.distributed.launch`, with `--use_env` and `--no_python` (untested; see the sketch after this list).
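A minimal sketch of that second, untested option, assuming four GPUs on a single node (paths and counts are placeholders):

```sh
# Untested sketch: launch the zeldarose CLI through torch.distributed.launch.
# --no_python runs the command as-is (zeldarose must be on the PATH) and
# --use_env passes the local rank through environment variables.
python -m torch.distributed.launch --nproc_per_node 4 --use_env --no_python \
    zeldarose transformer --strategy ddp \
    --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```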
Other hints
- Data management relies on 🤗 datasets and uses their cache management system. To run in a clear environment, you might have to check the cache directory pointed to by the `HF_DATASETS_CACHE` environment variable.
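For instance, to redirect the cache to a scratch directory for a single run (the path is a placeholder):

```sh
# Placeholder path: point the 🤗 datasets cache somewhere disposable.
HF_DATASETS_CACHE=/scratch/hf_datasets zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```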
Inspirations
- https://github.com/shoarora/lmtuners
- https://github.com/huggingface/transformers/blob/243e687be6cd701722cce050005a2181e78a08a8/examples/run_language_modeling.py
Citation
```bibtex
@inproceedings{grobol:hal-04262806,
    TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
    AUTHOR = {Grobol, Lo{\"i}c},
    URL = {https://hal.science/hal-04262806},
    BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
    ADDRESS = {Singapore},
    YEAR = {2023},
    MONTH = Dec,
    PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
    HAL_ID = {hal-04262806},
    HAL_VERSION = {v1},
}
```