
General purpose model trainer for PyTorch that is more flexible than it should be, by 🐸Coqui.


👟 Trainer


An opinionated, general-purpose model trainer for PyTorch with a simple code base. This is a fork of the original, unmaintained repository. New PyPI package: coqui-tts-trainer

Installation

From PyPI:

pip install coqui-tts-trainer

From Github:

git clone https://github.com/idiap/coqui-ai-Trainer
cd coqui-ai-Trainer
pip install -e .

Implementing a model

Subclass TrainerModel and overload its methods as needed.
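A minimal sketch of the usual pattern, modeled on the MNIST example (the exact set of methods to overload, e.g. get_data_loader(), is defined by TrainerModel itself):

from torch import nn

from trainer import TrainerModel

class MyModel(TrainerModel):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

    def train_step(self, batch, criterion):
        x, y = batch
        logits = self.forward(x)
        # The trainer expects (outputs, losses); "loss" is the value it optimizes.
        return {"model_outputs": logits}, {"loss": criterion(logits, y)}

    def eval_step(self, batch, criterion):
        return self.train_step(batch, criterion)

    def get_criterion(self):
        return nn.CrossEntropyLoss()

    # The full MNIST example also overloads get_optimizer() and
    # get_data_loader() to supply the optimizer and the datasets.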

Training a model with auto-optimization

See the MNIST example.
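With auto-optimization you only construct the trainer and call fit(); forward/backward passes, optimizer steps, and mixed precision are handled for you. A sketch (configuration values here are placeholders; see the MNIST example for a complete recipe):

from trainer import Trainer, TrainerArgs, TrainerConfig

config = TrainerConfig(epochs=5, batch_size=64, lr=1e-3)

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path="runs/",
    model=MyModel(),  # the TrainerModel subclass sketched above
)
trainer.fit()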

Training a model with advanced optimization

With 👟 you can define the whole optimization cycle yourself, as in the GAN example below. This gives you more under-the-hood control and flexibility for advanced training loops.

You only need to use the scaled_backward() function to handle mixed-precision training.

...

def optimize(self, batch, trainer):
    imgs, _ = batch

    # sample noise
    z = torch.randn(imgs.shape[0], 100)
    z = z.type_as(imgs)

    # train discriminator
    imgs_gen = self.generator(z)
    logits = self.discriminator(imgs_gen.detach())
    fake = torch.zeros(imgs.size(0), 1)
    fake = fake.type_as(imgs)
    loss_fake = trainer.criterion(logits, fake)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)
    logits = self.discriminator(imgs)
    loss_real = trainer.criterion(logits, valid)
    loss_disc = (loss_real + loss_fake) / 2

    # step discriminator
    _, _ = self.scaled_backward(loss_disc, None, trainer, trainer.optimizer[0])

    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[0].step()
        trainer.optimizer[0].zero_grad()

    # train generator
    imgs_gen = self.generator(z)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)

    logits = self.discriminator(imgs_gen)
    loss_gen = trainer.criterion(logits, valid)

    # step generator
    _, _ = self.scaled_backward(loss_gen, None, trainer, trainer.optimizer[1])
    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[1].step()
        trainer.optimizer[1].zero_grad()
    return {"model_outputs": logits}, {"loss_gen": loss_gen, "loss_disc": loss_disc}

...

See the GAN training example with gradient accumulation.

Training with Batch Size Finder

See the test script here for training with the batch size finder.

The batch size finder starts at a default batch size (2048, but it can also be user-defined) and searches for the largest batch size that fits on your hardware; expect it to run multiple trainings until it finds one. To use it, call trainer.fit_with_largest_batch_size(starting_batch_size=2048) instead of trainer.fit(), where starting_batch_size is the batch size the search starts from. This is very useful if you want to use as much GPU memory as possible.
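For example, with a trainer constructed as in the examples above:

# Instead of trainer.fit(): search from 2048 downwards for the largest
# batch size that fits on the hardware, then train with it.
trainer.fit_with_largest_batch_size(starting_batch_size=2048)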

Training with DDP

$ python -m trainer.distribute --script path/to/your/train.py --gpus "0,1"

We don't use .spawn() to initiate multi-GPU training, since it causes certain limitations:

  • Everything must be picklable.
  • .spawn() trains the model in subprocesses, and the model in the main process is not updated.
  • DataLoader with N processes gets really slow when N is large.

Training with Accelerate

Setting use_accelerate in TrainerArgs to True will enable training with Accelerate.
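A minimal sketch, assuming a config and model defined as in the earlier examples:

from trainer import Trainer, TrainerArgs

args = TrainerArgs(use_accelerate=True)  # hand the training loop to Accelerate
trainer = Trainer(args, config, output_path="runs/", model=model)
trainer.fit()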

You can also use it for multi-GPU or distributed training.

CUDA_VISIBLE_DEVICES="0,1,2" accelerate launch --multi_gpu --num_processes 3 train_recipe_autoregressive_prompt.py

See the Accelerate docs.

Adding a callback

👟 supports callbacks to customize your runs. You can either set callbacks in your model implementation or pass them explicitly to the Trainer.

Please check trainer.utils.callbacks to see available callbacks.

Here is how you provide an explicit callback to a 👟 Trainer object, e.g. for weight reinitialization:

def my_callback(trainer):
    print(" > My callback was called.")

trainer = Trainer(..., callbacks={"on_init_end": my_callback})
trainer.fit()

Profiling example

  • Create the torch profiler as you like and pass it to the trainer.
    import torch
    profiler = torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler/"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    )
    prof = trainer.profile_fit(profiler, epochs=1, small_run=64)

  • Then run TensorBoard:
    tensorboard --logdir="./profiler/"

Supported Experiment Loggers

To add a new logger, you must subclass BaseDashboardLogger and overload its functions.
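A rough sketch only; the import path and the method name below are assumptions for illustration, so check the BaseDashboardLogger source for the actual abstract interface:

from trainer.logging.base_dash_logger import BaseDashboardLogger  # import path assumed

class PrintLogger(BaseDashboardLogger):
    """Toy logger that prints scalars. BaseDashboardLogger defines further
    abstract methods that a real subclass must implement as well."""

    def add_scalar(self, title, value, step):  # signature assumed for illustration
        print(f"[{step}] {title}: {value}")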
