
Zelda Rose


A straightforward trainer for transformer-based models.

Installation

Simply install with pipx:

pipx install zeldarose
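
Assuming pipx has put the entry point on your PATH, you can sanity-check the install and list the available subcommands (tokenizer and transformer, both used below) with:

zeldarose --help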

Train MLM models

Here is a short example of training first a tokenizer, then a transformer MLM model:

TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt

The .txt files are meant to be raw text files, with one sample (e.g. sentence) per line.
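
For instance, a minimal input file could look like this (the two sentences are made up for illustration):

The little cat sleeps on the sofa.
Language models learn from raw text.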

There are other parameters (see zeldarose transformer --help for a comprehensive list); the one you are probably most interested in is --config, which gives the path to a training config (for which we have examples/).

The parameters --pretrained-model, --tokenizer and --model-config are all fed directly to Hugging Face's transformers and can be pretrained model names or local paths.
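
For example, a run could be pointed at a training config and at a local checkpoint instead of a hub name. Both paths below are hypothetical placeholders, and the TOML extension is an assumption; see examples/ for actual config files:

zeldarose transformer --config local/my-config.toml --tokenizer local/tokenizer --pretrained-model local/muppet --out-dir local/muppet-continued --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt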

Distributed training

This is somewhat tricky; you have several options:

  • If you are running in a SLURM cluster, use --strategy ddp and invoke via srun (see the sketch after this list)

    • You might want to preprocess your data first, outside of the main compute allocation. The --profile option might be abused for that purpose, since it won't run a full training, but will run any data preprocessing you ask for. It might also be beneficial at this step to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
  • Otherwise, you have two options:

    • Run with --strategy ddp_spawn, which uses multiprocessing.spawn to start the process swarm (tested, but possibly slower and more limited, see the pytorch-lightning docs)
    • Run with --strategy ddp and launch via torch.distributed.launch with --use_env and --no_python (untested)
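
A minimal sketch of the SLURM route (node and task counts, GPU counts and paths are placeholder values, not tested settings):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
srun zeldarose transformer --strategy ddp --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt

And of the (untested) launcher route, here for a single node with 4 GPUs:

python -m torch.distributed.launch --nproc_per_node 4 --use_env --no_python zeldarose transformer --strategy ddp --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt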

Other hints

  • Data management relies on 🤗 datasets and uses their cache management system. To run in a clean environment, you might have to check the cache directory pointed to by the HF_DATASETS_CACHE environment variable (see the example below).
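
For instance, to point the cache at a throwaway directory for a single run (the path is an arbitrary placeholder):

HF_DATASETS_CACHE=/tmp/zeldarose-cache zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt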

Inspirations

Citation

@inproceedings{grobol:hal-04262806,
    TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
    AUTHOR = {Grobol, Lo{\"i}c},
    URL = {https://hal.science/hal-04262806},
    BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
    ADDRESS = {Singapore},
    YEAR = {2023},
    MONTH = Dec,
    PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
    HAL_ID = {hal-04262806},
    HAL_VERSION = {v1},
}
