General purpose model trainer for PyTorch that is more flexible than it should be, by 🐸Coqui.
👟 Trainer
An opinionated general purpose model trainer on PyTorch with a simple code base. Fork of the original, unmaintained repository. New PyPI package: coqui-tts-trainer
Installation
From PyPI:
pip install coqui-tts-trainer
From Github:
git clone https://github.com/idiap/coqui-ai-Trainer
cd coqui-ai-Trainer
pip install -e .
Implementing a model
Subclass TrainerModel() and overload its functions, as sketched below.
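A minimal sketch of such a subclass, loosely following the MNIST example, might look like this (the method names and signatures below, e.g. train_step and get_data_loader, are assumptions based on that example rather than the full interface):

import torch
from torch import nn

from trainer import TrainerModel


class MyModel(TrainerModel):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        return self.net(x)

    def train_step(self, batch, criterion):
        x, y = batch
        logits = self.forward(x)
        loss = criterion(logits, y)
        # Return the model outputs and a dict of named losses for logging.
        return {"model_outputs": logits}, {"loss": loss}

    def eval_step(self, batch, criterion):
        return self.train_step(batch, criterion)

    def get_criterion(self):
        return nn.CrossEntropyLoss()

    def get_optimizer(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def get_data_loader(self, config, assets, is_eval, samples, verbose, num_gpus, rank=0):
        # Build and return a torch.utils.data.DataLoader for training or evaluation.
        ...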
Training a model with auto-optimization
See the MNIST example.
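With auto-optimization, the trainer handles the optimizer steps, loss scaling, and backward passes for you. A minimal sketch (the TrainerConfig fields and Trainer signature here follow the MNIST example and are assumptions, not the definitive API) could look like this:

import torch
from torch.utils.data import TensorDataset

from trainer import Trainer, TrainerArgs, TrainerConfig

# Dummy datasets standing in for real data.
train_dataset = TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,)))
eval_dataset = TensorDataset(torch.randn(64, 784), torch.randint(0, 10, (64,)))

config = TrainerConfig(batch_size=64, epochs=5)
model = MyModel()  # the TrainerModel subclass sketched above

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path="runs/",
    model=model,
    train_samples=train_dataset,
    eval_samples=eval_dataset,
)
trainer.fit()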
Training a model with advanced optimization
With 👟 you can define the whole optimization cycle as you want, as in the GAN example below. It gives you more under-the-hood control and flexibility for advanced training loops. You just have to use the scaled_backward() function to handle mixed-precision training.
...

def optimize(self, batch, trainer):
    imgs, _ = batch

    # sample noise
    z = torch.randn(imgs.shape[0], 100)
    z = z.type_as(imgs)

    # train discriminator
    imgs_gen = self.generator(z)
    logits = self.discriminator(imgs_gen.detach())
    fake = torch.zeros(imgs.size(0), 1)
    fake = fake.type_as(imgs)
    loss_fake = trainer.criterion(logits, fake)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)
    logits = self.discriminator(imgs)
    loss_real = trainer.criterion(logits, valid)
    loss_disc = (loss_real + loss_fake) / 2

    # step discriminator
    _, _ = self.scaled_backward(loss_disc, None, trainer, trainer.optimizer[0])

    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[0].step()
        trainer.optimizer[0].zero_grad()

    # train generator
    imgs_gen = self.generator(z)
    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)
    logits = self.discriminator(imgs_gen)
    loss_gen = trainer.criterion(logits, valid)

    # step generator
    _, _ = self.scaled_backward(loss_gen, None, trainer, trainer.optimizer[1])

    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[1].step()
        trainer.optimizer[1].zero_grad()

    return {"model_outputs": logits}, {"loss_gen": loss_gen, "loss_disc": loss_disc}

...
See the GAN training example with Gradient Accumulation
Training with Batch Size Finder
See the test script here for training with the batch size finder.
The batch size finder starts at a default batch size (2048, but it can also be user defined) and searches for the largest batch size that fits on your hardware; expect it to run multiple trainings until it finds that size. To use it, call trainer.fit_with_largest_batch_size(starting_batch_size=2048) instead of trainer.fit(), where starting_batch_size is the batch size you want to start the search with. This is very useful if you want to use as much GPU memory as possible.
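For example, a minimal sketch (reusing a trainer object built as in the earlier examples):

# Search for the largest batch size that fits in memory, starting at 2048,
# then train with it; this replaces the usual trainer.fit() call.
trainer.fit_with_largest_batch_size(starting_batch_size=2048)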
Training with DDP
$ python -m trainer.distribute --script path/to/your/train.py --gpus "0,1"
We don't use .spawn() to initiate multi-GPU training since it causes certain limitations:
- Everything must be picklable.
- .spawn() trains the model in subprocesses, and the model in the main process is not updated.
- DataLoader with N processes gets really slow when N is large.
Training with Accelerate
Setting use_accelerate in TrainerArgs to True will enable training with Accelerate.
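For example, a minimal sketch (assuming the usual Trainer entry point, with config and model defined as in the earlier examples):

from trainer import Trainer, TrainerArgs

# Enable Hugging Face Accelerate via the trainer args.
args = TrainerArgs(use_accelerate=True)
trainer = Trainer(args, config, output_path="runs/", model=model)
trainer.fit()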
You can also use it for multi-gpu or distributed training.
CUDA_VISIBLE_DEVICES="0,1,2" accelerate launch --multi_gpu --num_processes 3 train_recipe_autoregressive_prompt.py
See the Accelerate docs.
Adding a callback
👟 supports callbacks to customize your runs. You can either set callbacks in your model implementations or pass them explicitly to the Trainer.
Please check trainer.utils.callbacks to see the available callbacks.
Here is how you provide an explicit callback to a 👟 Trainer object, for example for weight re-initialization at the end of initialization.
def my_callback(trainer):
    print(" > My callback was called.")

trainer = Trainer(..., callbacks={"on_init_end": my_callback})
trainer.fit()
Profiling example
- Create the torch profiler as you like and pass it to the trainer.
import torch

profiler = torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler/"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
prof = trainer.profile_fit(profiler, epochs=1, small_run=64)
- Run TensorBoard.
tensorboard --logdir="./profiler/"
Supported Experiment Loggers
- Tensorboard - actively maintained
- ClearML - actively maintained
- MLFlow
- Aim
- WandB
To add a new logger, you must subclass BaseDashboardLogger and overload its functions.
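As an illustration only, a subclass might look like the sketch below; the import path and method names are assumptions (check the trainer.logging module for the actual abstract interface):

from trainer.logging.base_dash_logger import BaseDashboardLogger


class PrintLogger(BaseDashboardLogger):
    """Toy logger that prints scalars instead of sending them to a dashboard."""

    def add_scalar(self, title, value, step):
        print(f"[step {step}] {title}: {value}")

    def add_text(self, title, text, step):
        print(f"[step {step}] {title}: {text}")

    def add_figure(self, title, figure, step):
        pass  # ignore figures in this toy logger

    def add_audio(self, title, audio, step, sample_rate):
        pass  # ignore audio

    def flush(self):
        pass

    def finish(self):
        pass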