
👟 Trainer


An opinionated, general-purpose model trainer for PyTorch with a simple code base. A fork of the original, now-unmaintained repository. New PyPI package: coqui-tts-trainer

Installation

From PyPI:

pip install coqui-tts-trainer

From Github:

git clone https://github.com/idiap/coqui-ai-Trainer
cd coqui-ai-Trainer
pip install -e .

Implementing a model

Subclass TrainerModel and overload its methods.
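
For example, here is a minimal sketch of a classifier model. The (outputs, losses) return convention follows the GAN example below; treat the exact hook names and signatures as assumptions and check TrainerModel and the MNIST example for the authoritative interface.

from torch import nn
from trainer import TrainerModel

class MyModel(TrainerModel):
    """Minimal sketch; see TrainerModel for the full set of overridable hooks."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

    def train_step(self, batch, criterion):
        x, y = batch
        logits = self.forward(x)
        loss = criterion(logits, y)
        # Each step returns (model outputs, losses), as in the GAN example below.
        return {"model_outputs": logits}, {"loss": loss}

    def eval_step(self, batch, criterion):
        return self.train_step(batch, criterion)

    @staticmethod
    def get_criterion():
        # Assumed hook: lets the trainer build the criterion it passes to *_step.
        return nn.CrossEntropyLoss()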

Training a model with auto-optimization

See the MNIST example.
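
In outline it looks like this (a hedged sketch: the TrainerConfig field names are assumptions, and the model is expected to provide its own data loaders; the MNIST example is the authoritative version):

from trainer import Trainer, TrainerArgs, TrainerConfig

config = TrainerConfig(epochs=5, batch_size=64, print_step=25)  # field names assumed
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path="runs",
    model=MyModel(),  # the TrainerModel subclass sketched above
)
trainer.fit()  # auto-optimization: backward, optimizer step, and zero_grad are handled for you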

Training a model with advanced optimization

With 👟 you can define the whole optimization cycle as you want, as in the GAN example below. It gives you more under-the-hood control and flexibility for advanced training loops.

You only have to use the scaled_backward() function to handle mixed-precision training.

...

def optimize(self, batch, trainer):
    imgs, _ = batch

    # sample noise
    z = torch.randn(imgs.shape[0], 100)
    z = z.type_as(imgs)

    # train discriminator
    imgs_gen = self.generator(z)
    logits = self.discriminator(imgs_gen.detach())
    fake = torch.zeros(imgs.size(0), 1)
    fake = fake.type_as(imgs)
    loss_fake = trainer.criterion(logits, fake)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)
    logits = self.discriminator(imgs)
    loss_real = trainer.criterion(logits, valid)
    loss_disc = (loss_real + loss_fake) / 2

    # step discriminator
    _, _ = self.scaled_backward(loss_disc, None, trainer, trainer.optimizer[0])

    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[0].step()
        trainer.optimizer[0].zero_grad()

    # train generator
    imgs_gen = self.generator(z)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)

    logits = self.discriminator(imgs_gen)
    loss_gen = trainer.criterion(logits, valid)

    # step generator
    _, _ = self.scaled_backward(loss_gen, None, trainer, trainer.optimizer[1])
    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[1].step()
        trainer.optimizer[1].zero_grad()
    return {"model_outputs": logits}, {"loss_gen": loss_gen, "loss_disc": loss_disc}

...

See the GAN training example with Gradient Accumulation.

Training with Batch Size Finder

See the test script here for training with the batch size finder.

The batch size finder starts at a default batch size (2048, but it can also be user-defined) and searches for the largest batch size that fits on your hardware, so expect it to run several trainings until it finds one. To use it, call trainer.fit_with_largest_batch_size(starting_batch_size=2048) instead of trainer.fit(), where starting_batch_size is the batch size the search starts from. This is very useful when you want to use as much GPU memory as possible.
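
In code, the only change is the entry point:

# Search for the largest batch size that fits, instead of calling trainer.fit()
trainer.fit_with_largest_batch_size(starting_batch_size=2048)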

Training with DDP

$ python -m trainer.distribute --script path/to/your/train.py --gpus "0,1"

We don't use .spawn() to initiate multi-GPU training because it has certain limitations:

  • Everything must be picklable.
  • .spawn() trains the model in subprocesses, so the model in the main process is not updated.
  • DataLoader with N processes gets really slow when N is large.

Training with Accelerate

Setting use_accelerate in TrainingArgs to True will enable training with Accelerate.

You can also use it for multi-gpu or distributed training.

CUDA_VISIBLE_DEVICES="0,1,2" accelerate launch --multi_gpu --num_processes 3 train_recipe_autoregressive_prompt.py

See the Accelerate docs.

Adding a callback

👟 supports callbacks to customize your runs. You can either set callbacks in your model implementation or pass them explicitly to the Trainer.

Please check trainer.utils.callbacks to see available callbacks.
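
For the model-side option, a minimal sketch (assuming that methods named after the callbacks in trainer.utils.callbacks are picked up from the model automatically; verify the hook names against that module):

class MyModel(TrainerModel):
    # ... model definition as above ...

    def on_epoch_end(self, trainer):
        # Hypothetical hook name; check trainer.utils.callbacks for the real ones.
        print(" > Epoch finished.")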

Here is how you provide an explicit callback to a 👟 Trainer object:

def my_callback(trainer):
    print(" > My callback was called.")

trainer = Trainer(..., callbacks={"on_init_end": my_callback})
trainer.fit()

Profiling example

  • Create the torch profiler as you like and pass it to the trainer:

    import torch

    profiler = torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler/"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    )
    prof = trainer.profile_fit(profiler, epochs=1, small_run=64)

  • Then run TensorBoard:

    tensorboard --logdir="./profiler/"

Supported Experiment Loggers

To add a new logger, you must subclass BaseDashboardLogger and overload its functions.
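
As a rough skeleton (the import path and the overridden method name here are assumptions for illustration; the real abstract interface is defined by BaseDashboardLogger):

from trainer.logging import BaseDashboardLogger  # import path assumed

class MyDashboardLogger(BaseDashboardLogger):
    def add_scalar(self, title, value, step):
        # Illustrative override: forward a metric to your logging backend.
        print(f"{step}: {title}={value}")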
