Debug PyTorch code using PySnooper.

Project description

TorchSnooper

Do you want to look at the shape/dtype/etc. of every step of your model, but are tired of manually writing prints?

Are you bothered by errors like RuntimeError: Expected object of scalar type Double but got scalar type Float, and want to quickly figure out the problem?

TorchSnooper is a PySnooper extension that helps you debug these errors.

To use TorchSnooper, use it just like PySnooper: simply replace pysnooper.snoop with torchsnooper.snoop in your code.

To install:

pip install torchsnooper
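
Like pysnooper.snoop, torchsnooper.snoop works both as a decorator and as a context manager; both styles appear in the examples below. A minimal sketch (the function double is made up for illustration):

import torch
import torchsnooper

@torchsnooper.snoop()        # style 1: decorate a function to trace every line in it
def double(x):
    return 2 * x

y = double(torch.ones(3))

with torchsnooper.snoop():   # style 2: wrap an arbitrary block of code
    z = y + 1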

Example 1

We're writing a simple function:

def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

and use it as below:

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

The above code seems correct, but unfortunately, we get the following error:

RuntimeError: Expected object of backend CPU but got backend CUDA for argument #2 'mask'

What is the problem? Let's snoop it! Decorate our function with torchsnooper.snoop():

import torch
import torchsnooper

@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

Run our script, and we will see:

Starting var:.. mask = tensor<(6,), int64, cuda:0>
Starting var:.. x = tensor<(3,), float32, cuda:0>
21:41:42.941668 call         5 def myfunc(mask, x):
21:41:42.941834 line         6     y = torch.zeros(6)
New var:....... y = tensor<(6,), float32, cpu>
21:41:42.943443 line         7     y.masked_scatter_(mask, x)
21:41:42.944404 exception    7     y.masked_scatter_(mask, x)

Now pay attention to the devices of the tensors; we notice

New var:....... y = tensor<(6,), float32, cpu>

Now it's clear: the problem is that y is a tensor on the CPU, that is, we forgot to specify the device in y = torch.zeros(6). Changing it to y = torch.zeros(6, device='cuda') solves this problem.

But when we run the script again, we get another error:

RuntimeError: Expected object of scalar type Byte but got scalar type Long for argument #2 'mask'

Look at the trace above again and pay attention to the dtypes of the variables; we notice

Starting var:.. mask = tensor<(6,), int64, cuda:0>

OK, the problem is that we didn't make the input mask a byte tensor. Change the line to:

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)

Problem solved.
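
Putting both fixes together, the corrected script for this example reads as below. (Note that on newer PyTorch versions, masked_scatter_ expects a torch.bool mask rather than torch.uint8.)

import torch
import torchsnooper

@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6, device='cuda')  # fix 1: create y on the GPU, like mask and x
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)  # fix 2: byte mask
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)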

Example 2

We are building a linear model:

class Model(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

and we want to fit y = x1 + 2 * x2 + 3, so we create a dataset:

x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
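
As a quick sanity check, each target in y is the formula applied to the corresponding row of x:

# e.g. for x = [1.0, 1.0]: 1 + 2 * 1 + 3 = 6
assert torch.equal(y, x[:, 0] + 2 * x[:, 1] + 3)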

We train our model on this dataset using the SGD optimizer:

model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    pred = model(x)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()

But unfortunately, the loss does not go down to a low enough number.

What's wrong? Let's snoop it! Put the training loop inside snoop:

with torchsnooper.snoop():
    for _ in range(100):
        optimizer.zero_grad()
        pred = model(x)
        squared_diff = (y - pred) ** 2
        loss = squared_diff.mean()
        print(loss.item())
        loss.backward()
        optimizer.step()

Part of the trace looks like:

New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model(  (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0    dampening: 0    lr: 0....omentum: 0    nesterov: False    weight_decay: 0)
22:27:01.024233 line        21     for _ in range(100):
New var:....... _ = 0
22:27:01.024439 line        22         optimizer.zero_grad()
22:27:01.024574 line        23         pred = model(x)
New var:....... pred = tensor<(4, 1), float32, cpu, grad>
22:27:01.026442 line        24         squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4, 4), float32, cpu, grad>
22:27:01.027369 line        25         loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:27:01.027616 line        26         print(loss.item())
22:27:01.027793 line        27         loss.backward()
22:27:01.050189 line        28         optimizer.step()

We notice that y has shape (4,), but pred has shape (4, 1). As a result, squared_diff has shape (4, 4) due to broadcasting!
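
To see why, here is a minimal sketch of the broadcasting at work, using placeholder zero tensors of the same shapes:

import torch

y = torch.zeros(4)        # shape (4,)
pred = torch.zeros(4, 1)  # shape (4, 1)
diff = y - pred           # (4,) broadcasts against (4, 1), giving shape (4, 4)
print(diff.shape)         # torch.Size([4, 4])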

This is not the expected behavior. Let's fix it with pred = model(x).squeeze(), and now everything looks good:

New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model(  (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0    dampening: 0    lr: 0....omentum: 0    nesterov: False    weight_decay: 0)
22:28:19.778089 line        21     for _ in range(100):
New var:....... _ = 0
22:28:19.778293 line        22         optimizer.zero_grad()
22:28:19.778436 line        23         pred = model(x).squeeze()
New var:....... pred = tensor<(4,), float32, cpu, grad>
22:28:19.780250 line        24         squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4,), float32, cpu, grad>
22:28:19.781099 line        25         loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:28:19.781361 line        26         print(loss.item())
22:28:19.781537 line        27         loss.backward()
22:28:19.798983 line        28         optimizer.step()

And the final model converges to the desired values.
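
For reference, here is the complete corrected script for Example 2, assembled from the snippets above:

import torch

class Model(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])

model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    pred = model(x).squeeze()       # the fix: squeeze (4, 1) down to (4,)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()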

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

TorchSnooper-0.4.1.linux-x86_64.tar.gz (6.8 kB)

Uploaded Source

Built Distribution

TorchSnooper-0.4.1-py3-none-any.whl (5.6 kB)

Uploaded Python 3

File details

Details for the file TorchSnooper-0.4.1.linux-x86_64.tar.gz.

File metadata

  • Download URL: TorchSnooper-0.4.1.linux-x86_64.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.2

File hashes

Hashes for TorchSnooper-0.4.1.linux-x86_64.tar.gz

  • SHA256: ca20917a03d2dca23c4474d28a635981ec5917d80d0d34d5e4b72327ab8d40f5
  • MD5: 596f855ea38106c26b8847e330121d26
  • BLAKE2b-256: b2b5754db71878deeb8bb591b3becaa19ae19df86a4e3efdaf26c684302373a3

File details

Details for the file TorchSnooper-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: TorchSnooper-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 5.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.2

File hashes

Hashes for TorchSnooper-0.4.1-py3-none-any.whl

  • SHA256: 696e54b09cbc95d07b7c040f206fb0812308a37528c0dd77c1edb5cdd4c12480
  • MD5: 47950d50797b57227e74ab677b8a5fdc
  • BLAKE2b-256: e162bd535c0b1533be9e573cfd753a4b89c2e643c6716871798d44425d234885
