Transformers trainer submodule of GT4SD.

GT4SD's trainer submodule for HF transformers and PyTorch Lightning

Train Language Models via HuggingFace transformers and PyTorch Lightning.

Development setup & installation

Create any virtual or conda environment compatible with the specs in setup.cfg. Then run:

pip install -e ".[dev]" 

Perform training via the CLI command

GT4SD provides a trainer client based on the gt4sd-trainer-lm CLI command.

$ gt4sd-trainer-lm --help
usage: gt4sd-trainer-lm [-h] [--configuration_file CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the trainining. It can be used
                        to completely by-pass pipeline specific arguments.
                        (default: None)

To launch a training run, you have two options.

You can either specify the path of a configuration file that contains the needed training parameters:

gt4sd-trainer-lm --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}
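
As a rough illustration of the configuration-file option, the sketch below writes such a file from Python. It assumes the file is JSON and that its keys mirror the command-line flag names used elsewhere in this README (`type`, `model_name_or_path`, `training_file`, `validation_file`). Neither assumption is confirmed by the help output above, so check the package's argument definitions before relying on it. The helper `write_config`, the file name `lm_config.json`, and the data paths are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_config(path: Path, params: dict) -> Path:
    """Serialize training parameters to a JSON file and return its path."""
    path.write_text(json.dumps(params, indent=2) + "\n", encoding="utf-8")
    return path


# Assumed keys: copied from the CLI flag names shown in this README.
params = {
    "type": "mlm",
    "model_name_or_path": "mlm",
    "training_file": "/path/to/train_file.jsonl",
    "validation_file": "/path/to/valid_file.jsonl",
}

# Written to the system temp directory so the example leaves the project tree untouched.
config_path = write_config(Path(tempfile.gettempdir()) / "lm_config.json", params)
print(f"Wrote {config_path}")
```

The resulting path is what you would pass as `${CONFIGURATION_FILE}`.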

Or you can provide the needed parameters directly as arguments:

gt4sd-trainer-lm --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl
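
If you launch training from a script rather than a shell, the same command can be assembled programmatically. The following is a minimal sketch, not part of the package: `build_command` and `launch` are hypothetical helpers, and the flag names are simply the ones shown in the command above.

```python
import shlex
import shutil
import subprocess


def build_command(executable: str, params: dict) -> list:
    """Turn {"name": value} pairs into [executable, "--name", "value", ...]."""
    command = [executable]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


def launch(command: list) -> None:
    """Run the command, failing early if the executable is not on PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed in this environment")
    subprocess.run(command, check=True)


command = build_command(
    "gt4sd-trainer-lm",
    {
        "type": "mlm",
        "model_name_or_path": "mlm",
        "training_file": "/path/to/train_file.jsonl",
        "validation_file": "/path/to/valid_file.jsonl",
    },
)

# Print the shell-quoted form for inspection; call launch(command) to actually run it.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.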

Convert PyTorch Lightning checkpoints to a HuggingFace model via the CLI command

Once a training pipeline has been run via gt4sd-trainer-lm, you can convert the resulting PyTorch Lightning checkpoint to a HuggingFace model via gt4sd-pl-to-hf:

gt4sd-pl-to-hf --hf_model_path ${HF_MODEL_PATH} --training_type ${TRAINING_TYPE} --model_name_or_path ${MODEL_NAME_OR_PATH} --ckpt ${CKPT} --tokenizer_name_or_path ${TOKENIZER_NAME_OR_PATH}
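
The `--ckpt` argument needs a concrete checkpoint path. One possible way to pick it is sketched below: search an output directory for the most recently modified `.ckpt` file, the extension PyTorch Lightning uses for checkpoints by default. The helper `latest_checkpoint` is hypothetical, and where the trainer actually writes its checkpoints is an assumption you should verify for your run. The demo uses a throwaway directory with empty placeholder files.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the most recently modified *.ckpt file under root, or None."""
    checkpoints = sorted(root.rglob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return checkpoints[-1] if checkpoints else None


# Demo: two empty placeholder files with explicit, distinct modification times.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for offset, name in enumerate(("epoch=0.ckpt", "epoch=1.ckpt")):
        path = root / name
        path.touch()
        os.utime(path, (1_000 + offset, 1_000 + offset))

    found = latest_checkpoint(root)
    found_name = found.name if found is not None else None
    print(found_name)
```

The returned path can then be substituted for `${CKPT}` in the conversion command.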

References

If you use gt4sd in your projects, please consider citing the following:

@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}

License

The gt4sd codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.
