Train transformer-based models

Project description

Zelda Rose

A straightforward trainer for transformer-based models.

Installation

Simply install with pipx:

pipx install zeldarose

Train MLM models

Here is a short example of training first a tokenizer, then a transformer MLM model:

TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt

The .txt files are meant to be raw text files, with one sample (e.g. sentence) per line.
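As a quick illustration of that layout, the sketch below builds a tiny corpus and counts its samples. The path local/sample.txt and the sentences are placeholders chosen for this example; they are not part of the project.

```shell
# Build a tiny corpus: one sample (here, one sentence) per line.
# The path local/sample.txt is a placeholder for illustration.
mkdir -p local
printf '%s\n' \
  "The cat sat on the mat." \
  "Transformers are trained on raw text." \
  "Each line is one sample." > local/sample.txt

# The number of lines is the number of samples.
n_samples=$(wc -l < local/sample.txt | tr -d ' ')
echo "samples: $n_samples"

# Count empty lines, which are probably unintended in a one-sample-per-line file.
n_empty=$(grep -c '^[[:space:]]*$' local/sample.txt || true)
echo "empty lines: $n_empty"
```

A file prepared this way can be passed wherever the commands above take tests/fixtures/raw.txt.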

There are other parameters (see zeldarose transformer --help for a comprehensive list). The one you are probably most interested in is --config, which gives the path to a training config (see examples/ for samples).

The parameters --pretrained-model, --tokenizer and --model-config are all fed directly to Huggingface's transformers and can be pretrained model names or local paths.
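A minimal sketch of how these options might be combined follows. It reuses the values from the training example above and adds --config; the config path local/my-config.toml is a placeholder (write your own or copy one from examples/), and the command only runs if zeldarose is installed and that file exists.

```shell
# Sketch: assemble a `zeldarose transformer` call that uses a training config.
CONFIG="local/my-config.toml"           # hypothetical path, not shipped with the project
TOKENIZER="local/tokenizer"             # a local path (output of `zeldarose tokenizer`)
MODEL="flaubert/flaubert_small_cased"   # a pretrained model name

# Store the command in the positional parameters so it can be printed and run as-is.
set -- zeldarose transformer \
  --config "$CONFIG" \
  --tokenizer "$TOKENIZER" \
  --pretrained-model "$MODEL" \
  --out-dir local/muppet \
  --val-text tests/fixtures/raw.txt \
  tests/fixtures/raw.txt

# Print the command for review.
echo "$*"

# Run it only if zeldarose is installed and the config file exists.
if command -v zeldarose >/dev/null 2>&1 && [ -f "$CONFIG" ]; then
  "$@"
fi
```

Here the tokenizer is given as a local path and the model as a pretrained model name; either form should be accepted for both, as described above.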

Distributed training

This is somewhat tricky, and you have several options.

If you are running in a SLURM cluster, use --strategy ddp and invoke via srun.
- You might want to preprocess your data first, outside of the main compute allocation. The --profile option can be (ab)used for that purpose, since it won't run a full training but will run any data preprocessing you ask for. At this step, it might also be beneficial to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
Otherwise, you have two options:
- Run with --strategy ddp_spawn, which uses multiprocessing.spawn to start the process swarm (tested, but possibly slower and more limited, see the pytorch-lightning doc).
- Run with --strategy ddp and start with torch.distributed.launch with --use_env and --no_python (untested).
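The choice between the two tested modes can be sketched as a small wrapper. This is an illustration under assumptions, not a recommended launcher: it detects a SLURM allocation through the SLURM_JOB_ID environment variable, uses only flags that appear in this document, leaves out srun resource options (nodes, tasks, GPUs) because they depend on your cluster, and uses a hypothetical RUN variable to guard the actual launch.

```shell
# Pick the launch mode: srun + ddp inside a SLURM allocation, ddp_spawn elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  LAUNCHER="srun"        # inside a SLURM allocation: let srun start the processes
  STRATEGY="ddp"
else
  LAUNCHER=""            # elsewhere: let the trainer spawn the processes itself
  STRATEGY="ddp_spawn"
fi

# $LAUNCHER is left unquoted on purpose so that an empty value adds no argument.
set -- $LAUNCHER zeldarose transformer \
  --strategy "$STRATEGY" \
  --tokenizer local/tokenizer \
  --pretrained-model flaubert/flaubert_small_cased \
  --out-dir local/muppet \
  --val-text tests/fixtures/raw.txt \
  tests/fixtures/raw.txt

# Print the command for review.
echo "$*"

# RUN is a hypothetical guard variable: set RUN=1 to actually start training.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Printing the command first makes it easy to check which strategy was selected before committing cluster time.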

Other hints

Data management relies on 🤗 datasets and uses its cache management system. To run in a clean environment, you might have to check the cache directory pointed to by the HF_DATASETS_CACHE environment variable.
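One way to get a clean environment is to point HF_DATASETS_CACHE at a fresh, empty directory for the current shell session. This is a sketch; the zeldarose-cache prefix is an arbitrary name chosen for this example.

```shell
# Create a new, empty directory and use it as the 🤗 datasets cache.
# The "zeldarose-cache" prefix is arbitrary.
HF_DATASETS_CACHE="$(mktemp -d "${TMPDIR:-/tmp}/zeldarose-cache.XXXXXX")"
export HF_DATASETS_CACHE
echo "datasets cache: $HF_DATASETS_CACHE"

# Commands started from this shell should now use this directory
# instead of any previously cached data.
# Inspect it at any time (it is empty right after creation):
ls -A "$HF_DATASETS_CACHE"

# Remove it when you are done:
# rm -rf "$HF_DATASETS_CACHE"
```

Because the variable is only exported in the current shell, other sessions keep using their usual cache location.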

Inspirations

Citation

@inproceedings{grobol:hal-04262806,
    TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
    AUTHOR = {Grobol, Lo{\"i}c},
    URL = {https://hal.science/hal-04262806},
    BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
    ADDRESS = {Singapore},
    YEAR = {2023},
    MONTH = Dec,
    PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
    HAL_ID = {hal-04262806},
    HAL_VERSION = {v1},
}

Release history

Version	Release date
0.9.0 (this version)	Apr 17, 2024
0.8.0	Oct 6, 2023
0.7.3	Feb 27, 2023
0.7.2	Feb 26, 2023
0.7.1	Feb 25, 2023
0.7.0	Feb 25, 2023
0.6.0	Jul 28, 2022
0.5.0	Mar 31, 2022
0.4.0	Mar 18, 2022
0.3.4	Dec 21, 2021
0.3.3	Oct 2, 2021
0.3.2	May 31, 2021
0.3.0	May 11, 2021
0.2.0	Apr 23, 2021
0.1.1	Apr 6, 2021
0.1.0	Apr 6, 2021

Download files

Source Distribution

zeldarose-0.9.0.tar.gz (31.3 kB), uploaded Apr 17, 2024

Built Distribution

zeldarose-0.9.0-py3-none-any.whl (39.2 kB), uploaded Apr 17, 2024, Python 3

Hashes for zeldarose-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`0d178df962fe355d15d07414f86c1ff3a1d961f8a7ce4902e379c1498f156af6`
MD5	`72e3b0937d924b90b36275d48d1ce42c`
BLAKE2b-256	`92649b88ec1e113581da48b1b33ed3c26c1e98303ea7f95fcf475c8451493074`

Hashes for zeldarose-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f0aba9572d0c33f96dda8352ead474c056cb15e6dd3e895372dfe511c9c2de28`
MD5	`e16ecd0bf35fad47b3928ee2e5506256`
BLAKE2b-256	`784add9d54d0114299c3713b159550f4b015194df9be82a68e7fb830d9068969`