
Transformers trainer submodule of GT4SD.


GT4SD's trainer submodule for HF transformers and PyTorch Lightning

Train Language Models via HuggingFace transformers and PyTorch Lightning.

Development setup & installation

Create any virtual or conda environment compatible with the specs in setup.cfg. Then run:

pip install -e ".[dev]" 

Perform training via the CLI command

GT4SD provides a trainer client based on the gt4sd-trainer-lm CLI command.

$ gt4sd-trainer-lm --help
usage: gt4sd-trainer-lm [-h] [--configuration_file CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the training. It can be used
                        to completely bypass pipeline-specific arguments.
                        (default: None)

To launch a training run, you have two options.

You can either specify the path of a configuration file that contains the needed training parameters:

gt4sd-trainer-lm --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}
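
The configuration file format depends on the selected training pipeline. As an illustrative sketch (the JSON keys below simply mirror the CLI arguments and are assumptions, not the documented schema), you could write the parameters to a file and point the client at it:

```shell
# Write a hypothetical JSON configuration; the actual keys are defined
# by the chosen training pipeline, so check its documentation.
cat > mlm_config.json <<'EOF'
{
  "type": "mlm",
  "model_name_or_path": "bert-base-uncased",
  "training_file": "/path/to/train_file.jsonl",
  "validation_file": "/path/to/valid_file.jsonl"
}
EOF
# Then launch (commented out here; requires gt4sd-trainer-hf-pl installed):
# gt4sd-trainer-lm --training_pipeline_name ${TRAINING_PIPELINE_NAME} \
#   --configuration_file mlm_config.json
```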

Or you can provide the needed parameters directly as arguments:

gt4sd-trainer-lm --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl
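
The training and validation files are JSONL. As a minimal sketch, assuming one JSON object per line with a "text" field (the field name is an assumption; consult the pipeline documentation for the expected schema), toy files could be generated like this:

```shell
# Create toy JSONL data files; the "text" key is an assumed schema.
cat > train_file.jsonl <<'EOF'
{"text": "The reaction yields a stable intermediate."}
{"text": "Polymer chains fold into compact structures."}
EOF
cat > valid_file.jsonl <<'EOF'
{"text": "Catalysts lower the activation energy."}
EOF
# Launch (commented out; assumes the package is installed and
# bert-base-uncased is only an illustrative model choice):
# gt4sd-trainer-lm --type mlm --model_name_or_path bert-base-uncased \
#   --training_file train_file.jsonl --validation_file valid_file.jsonl
```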

Convert PyTorch Lightning checkpoints to HuggingFace model via the CLI command

Once a training pipeline has been run via gt4sd-trainer-lm, the resulting PyTorch Lightning checkpoint can be converted to a HuggingFace model via gt4sd-pl-to-hf:

gt4sd-pl-to-hf --hf_model_path ${HF_MODEL_PATH} --training_type ${TRAINING_TYPE} --model_name_or_path ${MODEL_NAME_OR_PATH} --ckpt ${CKPT} --tokenizer_name_or_path ${TOKENIZER_NAME_OR_PATH}
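
As a worked sketch with hypothetical values (every path and model name below is illustrative, not prescribed by the package):

```shell
# Illustrative values only; adapt to your own training run.
HF_MODEL_PATH=./exported_hf_model        # output directory for the HF model
TRAINING_TYPE=mlm                        # must match the pipeline you trained with
MODEL_NAME_OR_PATH=bert-base-uncased     # base architecture used during training
CKPT=./lightning_logs/version_0/checkpoints/last.ckpt  # PL checkpoint to convert
TOKENIZER_NAME_OR_PATH=bert-base-uncased # tokenizer to bundle with the export
# Conversion call (commented out; requires gt4sd-trainer-hf-pl installed):
# gt4sd-pl-to-hf --hf_model_path "$HF_MODEL_PATH" --training_type "$TRAINING_TYPE" \
#   --model_name_or_path "$MODEL_NAME_OR_PATH" --ckpt "$CKPT" \
#   --tokenizer_name_or_path "$TOKENIZER_NAME_OR_PATH"
```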

References

If you use gt4sd in your projects, please consider citing the following:

@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}

License

The gt4sd codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.
