PyTorch interface for TrueGrad-AdamW

TrueGrad

PyTorch interface for TrueGrad Optimizers

Getting Started

Installation

python3 -m pip install truegrad

Examples

TrueGrad supports various backends, each with its own trade-offs:

Backend	Advantages	Disadvantages
truegrad.nn	What you see is what you get (modules not in truegrad.nn and truegrad.nn.functional are not supported); custom forward/backward for some fused functions; optimized backward passes	Limited applicability (custom modules can't be used); requires code modification
truegrad.utils.patch_torch	Uses truegrad.nn under the hood; works for many (off-the-shelf!) torch models; no code modification necessary	Uncertainty if model is compatible
backpack	Highest stability; loud warnings and errors; battle-tested; simple to extend further	High memory usage; high compute usage; sparse support for torch operations
truegrad.utils.patch_model	Works with custom models	Fails silently on fused functions; ~50% to 100% slower than truegrad.nn
patch_torch + patch_model	Best compatibility; reduced overheads compared to patch_model (by falling back to faster pre-patched patch_torch where available)	Fails silently on fused functions outside of torch.nn; slower than truegrad.nn when truegrad.nn would've been enough

Below, you'll find examples for each of these backends, as well as a general strategy allowing partial application of TrueGrad.

nn

The preferred way to use TrueGrad is to replace torch.nn modules with their performant truegrad.nn counterparts. While other methods add compute and memory overheads, truegrad.nn and truegrad.nn.functional have hand-crafted gradients. This is the most powerful method, although it requires code modifications.

import torch
from truegrad import nn
from truegrad.optim import TGAdamW

# define model by mixing truegrad.nn and torch.nn
model = torch.nn.Sequential(nn.Linear(1, 10),
                            nn.LayerNorm(10),
                            torch.nn.ReLU(),
                            nn.Linear(10, 1))
optim = TGAdamW(model.parameters())  # truegrad.optim.TGAdamW instead of torch.optim.AdamW

# standard training loop
while True:
    inp = torch.randn((16, 1))
    model(inp).mean().backward()
    optim.step()
    optim.zero_grad()

Patch Torch

In some cases, you can't modify the model's source, for example when importing models from torchvision. If that's the case, or if you simply want to try out TrueGrad, you can call truegrad.utils.patch_torch() to replace torch.nn modules with their truegrad.nn counterparts where possible. For example, the code below trains a ResNet-18:

import torch
from torchvision.models import resnet18

from truegrad.optim import TGAdamW
from truegrad.utils import patch_torch

patch_torch()  # must be called before the model is created; nothing else changes
model = resnet18().cuda()
optim = TGAdamW(model.parameters(), lr=1e-7, weight_decay=0)

# constant input/output to overfit
inp = torch.randn((2, 3, 224, 224)).cuda()
tgt = torch.randint(0, 1000, (2,)).cuda()

# standard training loop
i = 0
while True:
    loss = torch.nn.functional.cross_entropy(model(inp), tgt)
    loss.backward()
    optim.step()
    optim.zero_grad()
    i += 1
    if i % 5 == 0:
        print(i, loss.item())

Similarly, most HuggingFace transformers work out of the box:

import torch
import transformers
from torch.nn import functional as F

from truegrad.optim import TGAdamW
from truegrad.utils import patch_torch

patch_torch()  # only added line to get truegrad statistics for TGAdamW

model = transformers.BertModel.from_pretrained("google/bert_uncased_L-2_H-128_A-2")  # any existing model
tokenizer = transformers.BertTokenizer.from_pretrained("google/bert_uncased_L-2_H-128_A-2")

optim = TGAdamW(model.parameters())

# constant input to overfit
inputs = tokenizer(["Hello World!"], return_tensors="pt")

# training loop as normal
while True:
    out = model(**inputs)
    loss = F.l1_loss(out[0], torch.ones_like(out[0]))
    loss.backward()
    optim.step()
    optim.zero_grad()
    print(loss.item())

Note that this works even though transformers contain custom modules, which could cause issues. The key factor is that all parameters belong to torch.nn modules, which patch_torch() patches, so truegrad handles every parameter usage. Consequently, any composition of torch.nn modules makes for a truegrad-compatible model.
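
The requirement to call patch_torch() before creating the model follows from how this kind of patching generally works. The sketch below is a plain-Python illustration of the assumed mechanism (swapping the class that a namespace exposes), not TrueGrad's actual implementation; fake_nn, Linear, TrackedLinear, and patch are hypothetical stand-ins.

```python
from types import SimpleNamespace


class Linear:
    """Stand-in for an unpatched layer class."""
    tracks_statistics = False


class TrackedLinear(Linear):
    """Stand-in for a drop-in replacement that records extra statistics."""
    tracks_statistics = True


# Hypothetical namespace playing the role of a module such as torch.nn.
fake_nn = SimpleNamespace(Linear=Linear)


def patch(namespace):
    """Swap the class that later lookups of `namespace.Linear` resolve to."""
    namespace.Linear = TrackedLinear


early = fake_nn.Linear()  # created before patching: keeps the original class
patch(fake_nn)
late = fake_nn.Linear()   # created after patching: gets the replacement

print(type(early).__name__, early.tracks_statistics)
print(type(late).__name__, late.tracks_statistics)
```

Under this assumption, a model built before the patch still holds unpatched layers, which is why the examples above call patch_torch() first.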

BackPack

The most stable, though also memory-hungry, way to compute TrueGrad statistics is BackPack. BackPack is a third-party library that automatically computes the sum of gradient squares and works for most models by implementing custom backward rules for many torch.nn modules.

import backpack
import torch
from torch.nn import CrossEntropyLoss
from truegrad.optim import TGAdamW
from torchvision.models import alexnet

model = alexnet()  # BatchNorm and in-place ops (like ResNet's residual path) aren't supported
optim = TGAdamW(model.parameters(), lr=1e-7, weight_decay=0)

# replace inplace ops like nn.ReLU(inplace=True) where possible
for mod in model.modules():
    if hasattr(mod, "inplace"):
        mod.inplace = False

# backpack relies on module-level pytorch hooks
model = backpack.extend(model)
lossfunc = backpack.extend(CrossEntropyLoss())

# constant input/output to overfit
inp = torch.randn((2, 3, 224, 224))
tgt = torch.randint(0, 1000, (2,))

# standard training loop
i = 0
while True:
    # "SumGradSquared" computes the sum of the squared gradient
    with backpack.backpack(backpack.extensions.SumGradSquared()):
        loss = lossfunc(model(inp), tgt)
        loss.backward()
    optim.step()
    optim.zero_grad()
    i += 1
    if i % 5 == 0:
        print(i, loss.item())

If you're using custom modules with self-defined parameters, this method will not work. Additionally, if your model has any layer called .output or you're using PyTorch >= 1.13, you will need to install BackPack-HF via python3 -m pip install git+https://github.com/ClashLuke/backpack-hf.
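
To make "sum of gradient squares" concrete, the sketch below compares it with the square of the summed gradient for a one-parameter model, in plain Python. The model (y = w * x with squared error), the data, and the helper names are illustrative assumptions and are not part of BackPack or TrueGrad.

```python
# Illustrative only: a one-parameter model y = w * x with squared error.
# Helper names and data are hypothetical, not BackPack or TrueGrad APIs.

def per_sample_grads(w, xs, ys):
    """Gradient of (w * x - y) ** 2 with respect to w, one value per sample."""
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]


def sum_grad_squared(grads):
    """Sum over samples of each squared per-sample gradient."""
    return sum(g * g for g in grads)


def squared_sum_grad(grads):
    """Square of the summed (batch) gradient."""
    return sum(grads) ** 2


xs = [1.0, 1.0, 1.0]
ys = [2.0, 1.0, 0.0]
grads = per_sample_grads(1.0, xs, ys)

print(grads)                    # per-sample gradients
print(sum_grad_squared(grads))  # sum of squares
print(squared_sum_grad(grads))  # square of the sum
```

Here the per-sample gradients are -2, 0, and 2. They cancel in the batch gradient, so its square is 0, while the sum of their squares is 8. The two quantities can therefore differ substantially, and the second cannot be recovered from the batch gradient alone.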

Patch Custom Models

Another option for integrating TrueGrad into existing models is to patch them using truegrad.utils.patch_model(). patch_model() goes through all torch.nn.Module instances in a PyTorch model and converts their torch.nn.Parameter instances to truegrad.nn.TrueGradParameter instances. A TrueGradParameter acts largely the same as a torch.nn.Parameter, but adds the required operations to the model's backward pass. Note that this doesn't give the most efficient computation graph, but it works well for many custom models.
Importantly, be aware that this does not work for fused functions, such as torch.nn.LayerNorm and torch.nn.MultiheadAttention. However, unfused operations that directly access a parameter, such as multiplication, work well. Therefore, torch.nn.Linear and HuggingFace's attention work as expected.

import torch
from truegrad.optim import TGAdamW
from truegrad.utils import patch_model
from torchvision.models import alexnet

model = alexnet()  # patch_model can't handle fused ops like VGG's and ResNet's BatchNorm
optim = TGAdamW(model.parameters())

# replace inplace ops like nn.ReLU(inplace=True) where possible
for mod in model.modules():
    if hasattr(mod, "inplace"):
        mod.inplace = False

patch_model(model)  # replace torch.nn.Parameter with truegrad.nn.TrueGradParameter

# constant input/output to overfit
inp = torch.randn((2, 3, 224, 224))
tgt = torch.randint(0, 1000, (2,))

# standard training loop
i = 0
while True:
    loss = torch.nn.functional.cross_entropy(model(inp), tgt)
    loss.backward()
    optim.step()
    optim.zero_grad()
    i += 1
    if i % 5 == 0:
        print(i, loss.item())

Full Patching

One way to avoid the downsides of truegrad.utils.patch_model when working with off-the-shelf models that contain custom parameters, such as lucidrains' ViTs, is to also call patch_torch(). This takes care of many fused functions, such as LayerNorm, while still allowing full flexibility in model design.

import torch
from vit_pytorch.levit import LeViT
from truegrad.utils import patch_torch, patch_model
from truegrad.optim import TGAdamW

patch_torch()  # before model instantiation

levit = LeViT(
        image_size=224,
        num_classes=1000,
        stages=3,  # number of stages
        dim=(256, 384, 512),  # dimensions at each stage
        depth=4,  # transformer of depth 4 at each stage
        heads=(4, 6, 8),  # heads at each stage
        mlp_mult=2,
        dropout=0.1
        )

opt = TGAdamW(levit.parameters())

patch_model(levit)  # replace torch.nn.Parameter with truegrad.nn.TrueGradParameter

# constant input to overfit
img = torch.randn(1, 3, 224, 224)

# standard training loop
while True:
    loss = levit(img).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
    print(loss.item())

Partial TrueGrad

Unfortunately, applying TrueGrad is not always sensible: some backward passes are too slow, and sometimes a fused function is unavoidable. In such cases, you can use TGAdamW only on specific subsections of the model by passing default_to_adam=True to TGAdamW. With this option, TGAdamW falls back to AdamW if there is no sum_grad_squared attribute available. For example, the code from the nn section could be extended in the following way:

import torch
from truegrad import nn
from truegrad.optim import TGAdamW

model = torch.nn.Sequential(nn.Linear(1, 10),  # weights come from truegrad.nn
                            nn.LayerNorm(10),
                            torch.nn.ReLU(),
                            torch.nn.Linear(10, 1))  # weights come from torch.nn

optim = TGAdamW(model.parameters(), default_to_adam=True)

# standard training loop
i = 0
while True:
    inp = torch.randn((16, 1))
    loss = model(inp).mean()
    loss.backward()
    optim.step()
    optim.zero_grad()
    i += 1
    if i % 5 == 0:
        print(i, loss.item())
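
Before relying on default_to_adam=True, it can help to see which parameters would actually fall back. The helper below is a minimal sketch and not part of TrueGrad: it assumes the statistic is exposed as a sum_grad_squared attribute on each parameter after backward() (the attribute named above), and it only detects a missing attribute, not a silently wrong one. The function name is hypothetical.

```python
from types import SimpleNamespace


def split_by_truegrad_support(named_params):
    """Split parameter names by whether a `sum_grad_squared` statistic is present.

    `named_params` is any iterable of (name, parameter) pairs, for example
    `model.named_parameters()` after a backward pass.
    """
    supported, fallback = [], []
    for name, param in named_params:
        if getattr(param, "sum_grad_squared", None) is None:
            fallback.append(name)
        else:
            supported.append(name)
    return supported, fallback


# Stand-in objects so the sketch runs without torch; real parameters would
# come from model.named_parameters().
params = [
    ("0.weight", SimpleNamespace(sum_grad_squared=[0.5])),
    ("3.weight", SimpleNamespace()),
]
supported, fallback = split_by_truegrad_support(params)
print("TrueGrad statistics:", supported)
print("would fall back:", fallback)
```

With a real model, the same call would be made once after the first loss.backward() and before optim.zero_grad(), so that any statistics attached during the backward pass are still present.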

Release history

Version	Release date
5.0.0 (this version)	Feb 4, 2024
4.0.3	Jul 16, 2023
4.0.2	Apr 23, 2023
4.0.1	Apr 23, 2023
4.0.0	Apr 23, 2023
3.1.1	Mar 18, 2023
3.0.0	Mar 18, 2023
2.6.0	Mar 4, 2023
2.5.4	Feb 24, 2023
2.5.3	Feb 23, 2023
2.5.2	Feb 23, 2023
2.5.1	Feb 23, 2023
2.5.0	Feb 19, 2023
2.4.0	Jan 14, 2023
2.3.5	Jan 14, 2023
2.3.4	Jan 14, 2023
2.3.3	Jan 14, 2023
2.3.2	Jan 14, 2023
2.3.1	Jan 14, 2023
2.3.0	Jan 14, 2023
2.2.0	Jan 14, 2023
2.1.1	Nov 29, 2022
2.1.0	Nov 29, 2022
2.0.0	Nov 27, 2022
1.0.0	Nov 27, 2022
0.1.0	Nov 26, 2022
0.0.9	Nov 23, 2022
0.0.7	Nov 22, 2022
0.0.6	Nov 21, 2022
0.0.4	Nov 21, 2022
0.0.3	Nov 21, 2022
0.0.2	Nov 21, 2022
0.0.1	Nov 20, 2022

Download files

Source Distribution

truegrad-5.0.0.tar.gz (30.6 kB)

Uploaded Feb 4, 2024 Source

Built Distribution

truegrad-5.0.0-py3-none-any.whl (28.5 kB)

Uploaded Feb 4, 2024 Python 3

File details

Details for the file truegrad-5.0.0.tar.gz.

File metadata

Download URL: truegrad-5.0.0.tar.gz
Upload date: Feb 4, 2024
Size: 30.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for truegrad-5.0.0.tar.gz
Algorithm	Hash digest
SHA256	`affdcb98d34eed2c6374744a16516223236a7e30a591797e402c89585d6628de`
MD5	`7d4594d49f2b0a8a79115c1ffd429dd7`
BLAKE2b-256	`5f8f69cd0e67431005417b6da4c1b077c7eb75fae7566468f84051c852fc91d5`

File details

Details for the file truegrad-5.0.0-py3-none-any.whl.

File metadata

Download URL: truegrad-5.0.0-py3-none-any.whl
Upload date: Feb 4, 2024
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for truegrad-5.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d5ae61175b3ec8c95847ec13455dda91032d303374bcea1497e6be308b38d6b1`
MD5	`6b1395e99bfeedf247d6b1bdcee915a2`
BLAKE2b-256	`39acb07fb9b8e9d0a1bbbb09626050097c119096f4b0351964749fcf5ddf517c`
