
Transformers trainer submodule of GT4SD.

Project description

GT4SD's trainer submodule for HF transformers and PyTorch Lightning

Train Language Models via HuggingFace transformers and PyTorch Lightning.

Development setup & installation

Create any virtual or conda environment compatible with the specs in setup.cfg. Then run:

pip install -e ".[dev]" 

Perform training via the CLI command

GT4SD provides a trainer client based on the gt4sd-trainer-lm CLI command.

$ gt4sd-trainer-lm --help
usage: gt4sd-trainer-lm [-h] [--configuration_file CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the training. It can be used
                        to completely bypass pipeline-specific arguments.
                        (default: None)

To launch a training, you have two options.

You can either specify the path of a configuration file that contains the needed training parameters:

gt4sd-trainer-lm --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}
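As an illustration, the configuration file could be a JSON document carrying the same parameters you would otherwise pass on the command line. Note the exact schema is an assumption here (the keys below mirror the CLI flags shown in this README; consult the pipeline documentation for the supported keys). A minimal sketch that writes such a file:

```python
import json

# Hypothetical training configuration: the keys below mirror the CLI flags
# used elsewhere in this README and are illustrative assumptions, not the
# pipeline's documented schema.
config = {
    "type": "mlm",
    "model_name_or_path": "mlm",
    "training_file": "/path/to/train_file.jsonl",
    "validation_file": "/path/to/valid_file.jsonl",
}

with open("config.json", "w") as fp:
    json.dump(config, fp, indent=2)
```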

Or you can provide the needed parameters directly as arguments:

gt4sd-trainer-lm --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl
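The training and validation files above are in JSONL format (one JSON record per line). As an illustrative assumption (the field names the pipeline actually expects may differ), each line could hold a record with a "text" field, here filled with SMILES strings as sample data:

```python
import json

# Write a tiny illustrative JSONL training file: one JSON object per line.
# The "text" field name and the SMILES sample data are assumptions; check
# the pipeline documentation for the fields it actually expects.
samples = [
    {"text": "CC(=O)OC1=CC=CC=C1C(=O)O"},
    {"text": "C1=CC=CC=C1"},
]

with open("train_file.jsonl", "w") as fp:
    for sample in samples:
        fp.write(json.dumps(sample) + "\n")
```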

Convert PyTorch Lightning checkpoints to HuggingFace model via the CLI command

Once a training pipeline has been run via gt4sd-trainer-lm, the PyTorch Lightning checkpoint can be converted to a HuggingFace model via gt4sd-pl-to-hf:

gt4sd-pl-to-hf --hf_model_path ${HF_MODEL_PATH} --training_type ${TRAINING_TYPE} --model_name_or_path ${MODEL_NAME_OR_PATH} --ckpt ${CKPT} --tokenizer_name_or_path ${TOKENIZER_NAME_OR_PATH}
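After conversion, ${HF_MODEL_PATH} should resemble a standard HuggingFace model directory as produced by save_pretrained (a config.json plus serialized weights). The exact file layout is an assumption about the converter's output, not a documented contract. A small sketch to sanity-check the export before loading it:

```python
from pathlib import Path

# File names follow the usual HuggingFace save_pretrained layout; this is an
# assumption about the converter's output, not a documented contract.
CONFIG_FILE = "config.json"
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def looks_like_hf_export(model_dir: str) -> bool:
    """Return True if model_dir resembles a HuggingFace model directory."""
    root = Path(model_dir)
    if not (root / CONFIG_FILE).is_file():
        return False
    # At least one serialized weight file should be present.
    return any((root / name).is_file() for name in WEIGHT_FILES)
```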

References

If you use gt4sd in your projects, please consider citing the following:

@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}

License

The gt4sd codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.

Download files

Download the file for your platform.

Source Distribution

gt4sd-trainer-hf-pl-0.0.2.tar.gz (17.2 kB)

Uploaded Source

Built Distribution

gt4sd_trainer_hf_pl-0.0.2-py3-none-any.whl (27.4 kB)

Uploaded Python 3

File details

Details for the file gt4sd-trainer-hf-pl-0.0.2.tar.gz.

File metadata

  • Download URL: gt4sd-trainer-hf-pl-0.0.2.tar.gz
  • Upload date:
  • Size: 17.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for gt4sd-trainer-hf-pl-0.0.2.tar.gz:

  • SHA256: 8fbd17fb7540658bcfe3ace27438d57ceebd56bf4df42fb2ec668053028e57fb
  • MD5: 0140bb197d08c796cc14a19acfbad443
  • BLAKE2b-256: 1a930bb059094b8d76f890eeaf5c96792acf41d5457711290cad21c9cbe750a1


File details

Details for the file gt4sd_trainer_hf_pl-0.0.2-py3-none-any.whl.

File hashes

Hashes for gt4sd_trainer_hf_pl-0.0.2-py3-none-any.whl:

  • SHA256: ce3fc03b922cb0296b7a0beda9ec5bf483221607ebfa7c961754a9b86234c2c6
  • MD5: 78edc2e0096eb31297f847cf7f6c4f7c
  • BLAKE2b-256: 96edcea9359e0e9c24c634eec8e20068caafa3383da4c66b85dd0c78b36e0556

