General purpose model trainer for PyTorch that is more flexible than it should be, by 🐸Coqui.

These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language

Project description

👟 Trainer

An opinionated general purpose model trainer on PyTorch with a simple code base.

Installation

From GitHub:

git clone https://github.com/coqui-ai/Trainer
cd Trainer
make install

From PyPI:

pip install trainer

Prefer installing from GitHub, as it is more stable.

Implementing a model

Subclass TrainerModel and override its methods.

Training a model with auto-optimization

See the MNIST example.

Training a model with advanced optimization

With 👟 you can define the whole optimization cycle yourself, as in the GAN example below. This gives you more under-the-hood control and flexibility for advanced training loops.

The only requirement is to call the scaled_backward() function so that mixed precision training is handled.

...

def optimize(self, batch, trainer):
    imgs, _ = batch

    # sample noise
    z = torch.randn(imgs.shape[0], 100)
    z = z.type_as(imgs)

    # train discriminator
    imgs_gen = self.generator(z)
    logits = self.discriminator(imgs_gen.detach())
    fake = torch.zeros(imgs.size(0), 1)
    fake = fake.type_as(imgs)
    loss_fake = trainer.criterion(logits, fake)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)
    logits = self.discriminator(imgs)
    loss_real = trainer.criterion(logits, valid)
    loss_disc = (loss_real + loss_fake) / 2

    # step discriminator
    _, _ = self.scaled_backward(loss_disc, None, trainer, trainer.optimizer[0])

    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[0].step()
        trainer.optimizer[0].zero_grad()

    # train generator
    imgs_gen = self.generator(z)

    valid = torch.ones(imgs.size(0), 1)
    valid = valid.type_as(imgs)

    logits = self.discriminator(imgs_gen)
    loss_gen = trainer.criterion(logits, valid)

    # step generator
    _, _ = self.scaled_backward(loss_gen, None, trainer, trainer.optimizer[1])
    if trainer.total_steps_done % trainer.grad_accum_steps == 0:
        trainer.optimizer[1].step()
        trainer.optimizer[1].zero_grad()
    return {"model_outputs": logits}, {"loss_gen": loss_gen, "loss_disc": loss_disc}

...

See the GAN training example with Gradient Accumulation
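
The stepping condition used in the example above can be illustrated without PyTorch. The following is a minimal sketch in plain Python, not code from 👟: the one-parameter model, the `grad` helper, and the explicit division by `grad_accum_steps` are assumptions made for illustration only. It accumulates per-sample gradients and emits one averaged update every `grad_accum_steps` samples, which mirrors calling `step()` and `zero_grad()` only on every N-th step.

```python
def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_updates(w, samples, grad_accum_steps):
    """Accumulate gradients and emit one averaged value every grad_accum_steps samples."""
    updates = []
    accumulated = 0.0
    for step, (x, y) in enumerate(samples, start=1):
        # Each sample contributes its share of the averaged gradient.
        accumulated += grad(w, x, y) / grad_accum_steps
        if step % grad_accum_steps == 0:
            updates.append(accumulated)  # plays the role of optimizer.step()
            accumulated = 0.0            # plays the role of optimizer.zero_grad()
    return updates


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
updates = accumulated_updates(w=0.0, samples=samples, grad_accum_steps=2)
print(updates)
```

With `grad_accum_steps=2`, the four samples produce two updates, each equal to the mean gradient of its pair of samples.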

Training with Batch Size Finder

See the test script here for training with the batch size finder.

The batch size finder starts at a default batch size (2048, which can also be set by the user) and searches for the largest batch size that fits on your hardware. Expect it to run several training attempts before it finds one. To use it, call trainer.fit_with_largest_batch_size(starting_batch_size=2048) instead of trainer.fit(), where starting_batch_size is the batch size at which the search starts. This is very useful if you want to use as much GPU memory as possible.
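
The kind of search described above can be sketched in plain Python. This is an illustration under stated assumptions, not the 👟 implementation: the halving strategy, the `find_largest_batch_size` and `fake_fit` names, and the use of `MemoryError` as a stand-in for a GPU out-of-memory error are all hypothetical.

```python
def find_largest_batch_size(try_fit, starting_batch_size=2048):
    """Return the first batch size, halving from the start, for which try_fit succeeds.

    try_fit is a hypothetical callable that runs one training attempt and raises
    MemoryError when the batch does not fit (a stand-in for a GPU out-of-memory error).
    """
    batch_size = starting_batch_size
    while batch_size >= 1:
        try:
            try_fit(batch_size)
            return batch_size
        except MemoryError:
            # Assumed strategy: halve the batch size and retry.
            batch_size //= 2
    raise RuntimeError("No batch size fits in memory.")


attempts = []


def fake_fit(batch_size):
    """Pretend that anything above 300 samples per batch runs out of memory."""
    attempts.append(batch_size)
    if batch_size > 300:
        raise MemoryError(f"batch size {batch_size} does not fit")


largest = find_largest_batch_size(fake_fit, starting_batch_size=2048)
print(largest, attempts)
```

The `attempts` list shows why several training runs are to be expected before a working batch size is found.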

Training with DDP

$ python -m trainer.distribute --script path/to/your/train.py --gpus "0,1"

We don't use .spawn() to initiate multi-GPU training because it imposes several limitations:

- Everything must be picklable.
- .spawn() trains the model in subprocesses, and the model in the main process is not updated.
- A DataLoader with N processes gets really slow when N is large.
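
The first limitation can be checked with the standard library alone. The sketch below is only an illustration (the `is_picklable` helper is a hypothetical name, not part of 👟): objects handed to spawned processes have to survive `pickle.dumps`, which plain data does but a lambda does not.

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


config_ok = is_picklable({"lr": 0.001, "epochs": 1})  # plain data pickles fine
lambda_ok = is_picklable(lambda batch: batch)         # lambdas cannot be pickled
print(config_ok, lambda_ok)
```

Anything in a training script that behaves like the lambda here (closures, locally defined functions, open file handles) would have to be restructured before it could be used with a spawn-based launcher.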

Training with Accelerate

Setting use_accelerate to True in TrainerArgs enables training with Accelerate.

You can also use it for multi-gpu or distributed training.

CUDA_VISIBLE_DEVICES="0,1,2" accelerate launch --multi_gpu --num_processes 3 train_recipe_autoregressive_prompt.py

See the Accelerate docs.

Adding a callback

👟 supports callbacks to customize your runs. You can either set callbacks in your model implementation or pass them explicitly to the Trainer.

Please check trainer.utils.callbacks to see available callbacks.

Here is how you pass an explicit callback to a 👟Trainer object. The example only prints a message; the same hook could be used for tasks such as weight reinitialization.

def my_callback(trainer):
    print(" > My callback was called.")

trainer = Trainer(..., callbacks={"on_init_end": my_callback})
trainer.fit()
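
To make the callback contract concrete, here is a minimal sketch of how a dictionary of hooks might be dispatched. `MiniTrainer` and `_run_hook` are hypothetical stand-ins written for this illustration and are not the 👟 Trainer class; the only details taken from the example above are the `on_init_end` key and the fact that the callback receives the trainer as its argument.

```python
class MiniTrainer:
    """Hypothetical stand-in for a trainer; NOT the real Trainer class."""

    def __init__(self, callbacks=None):
        # Map of hook name -> function that takes the trainer as its only argument.
        self.callbacks = dict(callbacks or {})
        # Fire the hook once initialization is finished.
        self._run_hook("on_init_end")

    def _run_hook(self, name):
        """Call the callback registered under name, if there is one."""
        callback = self.callbacks.get(name)
        if callback is not None:
            callback(self)


events = []


def my_callback(trainer):
    """Record that the hook fired and which object it received."""
    events.append(("on_init_end", type(trainer).__name__))


mini = MiniTrainer(callbacks={"on_init_end": my_callback})
print(events)
```

Because the callback receives the trainer object, it can inspect or modify whatever state that object exposes at the time the hook fires.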

Profiling example

Create the torch profiler as you like and pass it to the trainer.

import torch
profiler = torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler/"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
prof = trainer.profile_fit(profiler, epochs=1, small_run=64)

Then run TensorBoard:
```
tensorboard --logdir="./profiler/"
```

Supported Experiment Loggers

- TensorBoard - actively maintained
- ClearML - actively maintained
- MLflow
- Aim
- Weights & Biases (wandb)

To add a new logger, subclass BaseDashboardLogger and override its methods.

Anonymized Telemetry

We constantly seek to improve 🐸 for the community. To understand the community's needs better and address them accordingly, we collect stripped-down anonymized usage stats when you run the trainer.

If you prefer not to share these stats, you can opt out by setting the environment variable TRAINER_TELEMETRY=0.
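
The variable can also be set from Python instead of the shell. This is a small sketch that assumes the variable is read when the trainer is imported or started, so it is set before anything else happens; that timing is an assumption, not something stated above.

```python
import os

# Opt out of telemetry. Assumption: this must run before the trainer package
# is imported or started, so that the value is already in place when it is read.
os.environ["TRAINER_TELEMETRY"] = "0"

print(os.environ["TRAINER_TELEMETRY"])
```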

Release history

0.0.36

Dec 13, 2023

0.0.35

Dec 12, 2023

0.0.34

Dec 7, 2023

0.0.33

Dec 5, 2023

0.0.32

Nov 16, 2023

This version

0.0.31

Aug 14, 2023

0.0.30

Jul 31, 2023

0.0.29

Jul 22, 2023

0.0.28

Jul 18, 2023

0.0.27

Jun 22, 2023

0.0.26

Jun 9, 2023

0.0.25

Apr 10, 2023

0.0.24

Mar 6, 2023

0.0.23

Mar 6, 2023

0.0.22

Jan 23, 2023

0.0.21

Jan 9, 2023

0.0.20

Dec 24, 2022

0.0.19

Dec 13, 2022

0.0.18

Dec 7, 2022

0.0.17

Nov 21, 2022

0.0.16

Oct 17, 2022

0.0.15

Sep 12, 2022

0.0.14

Aug 15, 2022

0.0.13

Jul 12, 2022

0.0.12

May 30, 2022

0.0.11

May 9, 2022

0.0.10

Apr 29, 2022

0.0.9

Apr 27, 2022

0.0.8

Apr 14, 2022

0.0.7

Apr 13, 2022

0.0.6

Apr 12, 2022

0.0.5

Apr 7, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

trainer-0.0.31.tar.gz (49.8 kB)

Uploaded Aug 14, 2023 Source

Built Distribution

trainer-0.0.31-py3-none-any.whl (50.6 kB)

Uploaded Aug 14, 2023 Python 3

Hashes for trainer-0.0.31.tar.gz

Algorithm	Hash digest
SHA256	`93ce184f39dfeb339f4a5fff610ace456f09d4779b4133d94d9546c7011bfeee`
MD5	`c1b0c4b63f3713acb75bc2ea3b66c8ea`
BLAKE2b-256	`e8eea57b5bb0b51fd53f5d7d3122205a428da518cf508d48ea70aee7c03ab4e4`

Hashes for trainer-0.0.31-py3-none-any.whl

Algorithm	Hash digest
SHA256	`c15d7cc5dc048a9004a7cebc4595639c290c76f8248b65b51ab7e534c1bcce10`
MD5	`c4f02282c040f7a516fdc03610145c0f`
BLAKE2b-256	`149332ab47a46633c889b5980a6525e4dd74e2bc71864d8498bd9c6e1233b8b0`