
TorchSnooper

Debug PyTorch code using PySnooper.


Do you want to look at the shape/dtype/etc. of every step of your model, but are tired of manually writing print statements?

Are you bothered by errors like RuntimeError: Expected object of scalar type Double but got scalar type Float, and want to quickly figure out the problem?

TorchSnooper is a PySnooper extension that helps you debug these errors.

To use TorchSnooper, use it just like PySnooper: simply replace pysnooper.snoop with torchsnooper.snoop in your code.

To install:

pip install torchsnooper

TorchSnooper also supports snoop. To use TorchSnooper with snoop, simply execute:

torchsnooper.register_snoop()

or

torchsnooper.register_snoop(verbose=True)

at the beginning of your script, and use snoop normally.

Example 1: Monitoring device and dtype

We're writing a simple function:

def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

and call it as follows:

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

The above code seems correct, but unfortunately, we get the following error:

RuntimeError: Expected object of backend CPU but got backend CUDA for argument #2 'mask'

What is the problem? Let's snoop it! Decorate our function with torchsnooper.snoop():

import torch
import torchsnooper

@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

Run our script, and we will see:

Starting var:.. mask = tensor<(6,), int64, cuda:0>
Starting var:.. x = tensor<(3,), float32, cuda:0>
21:41:42.941668 call         5 def myfunc(mask, x):
21:41:42.941834 line         6     y = torch.zeros(6)
New var:....... y = tensor<(6,), float32, cpu>
21:41:42.943443 line         7     y.masked_scatter_(mask, x)
21:41:42.944404 exception    7     y.masked_scatter_(mask, x)

Now pay attention to the devices of the tensors; we notice

New var:....... y = tensor<(6,), float32, cpu>

Now it's clear that the problem is that y is a tensor on the CPU; in other words, we forgot to specify the device in y = torch.zeros(6). Changing it to y = torch.zeros(6, device='cuda') solves this problem.

But when running the script again, we get another error:

RuntimeError: Expected object of scalar type Byte but got scalar type Long for argument #2 'mask'

Look at the trace above again and pay attention to the dtype of the variables; we notice

Starting var:.. mask = tensor<(6,), int64, cuda:0>

OK, the problem is that we didn't make the mask in the input a byte tensor. Changing the line to

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)

Problem solved.
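Putting both fixes together, the function and call site look like this (a sketch shown on CPU for portability; note that recent PyTorch versions expect a bool mask rather than a byte mask):

```python
import torch

def myfunc(mask, x):
    # allocate y on the same device as the inputs
    y = torch.zeros(6, device=mask.device)
    y.masked_scatter_(mask, x)
    return y

# bool masks work across PyTorch versions; uint8 masks are deprecated
mask = torch.tensor([False, True, False, True, True, False])
source = torch.tensor([1.0, 2.0, 3.0])
print(myfunc(mask, source))  # tensor([0., 1., 0., 2., 3., 0.])
```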

Example 2: Monitoring shape

We are building a linear model:

class Model(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

and we want to fit y = x1 + 2 * x2 + 3, so we create a dataset:

x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])

We train our model on this dataset using SGD optimizer:

model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    optimizer.zero_grad()
    pred = model(x)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()

But unfortunately, the loss does not go down to a low enough number.

What's wrong? Let's snoop it! Put the training loop inside snoop:

with torchsnooper.snoop():
    for _ in range(100):
        optimizer.zero_grad()
        pred = model(x)
        squared_diff = (y - pred) ** 2
        loss = squared_diff.mean()
        print(loss.item())
        loss.backward()
        optimizer.step()

Part of the trace looks like:

New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model(  (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0    dampening: 0    lr: 0....omentum: 0    nesterov: False    weight_decay: 0)
22:27:01.024233 line        21     for _ in range(100):
New var:....... _ = 0
22:27:01.024439 line        22         optimizer.zero_grad()
22:27:01.024574 line        23         pred = model(x)
New var:....... pred = tensor<(4, 1), float32, cpu, grad>
22:27:01.026442 line        24         squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4, 4), float32, cpu, grad>
22:27:01.027369 line        25         loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:27:01.027616 line        26         print(loss.item())
22:27:01.027793 line        27         loss.backward()
22:27:01.050189 line        28         optimizer.step()

We notice that y has shape (4,), but pred has shape (4, 1). As a result, squared_diff has shape (4, 4) due to broadcasting!
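The broadcast can be reproduced in isolation (a minimal sketch):

```python
import torch

y = torch.zeros(4)        # shape (4,)
pred = torch.zeros(4, 1)  # shape (4, 1)

# (4,) and (4, 1) broadcast to (4, 4): every y is paired with every pred
print((y - pred).shape)            # torch.Size([4, 4])
print((y - pred.squeeze()).shape)  # torch.Size([4])
```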

This is not the expected behavior. Let's fix it: pred = model(x).squeeze(). Now everything looks good:

New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model(  (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0    dampening: 0    lr: 0....omentum: 0    nesterov: False    weight_decay: 0)
22:28:19.778089 line        21     for _ in range(100):
New var:....... _ = 0
22:28:19.778293 line        22         optimizer.zero_grad()
22:28:19.778436 line        23         pred = model(x).squeeze()
New var:....... pred = tensor<(4,), float32, cpu, grad>
22:28:19.780250 line        24         squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4,), float32, cpu, grad>
22:28:19.781099 line        25         loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:28:19.781361 line        26         print(loss.item())
22:28:19.781537 line        27         loss.backward()
22:28:19.798983 line        28         optimizer.step()

And the final model converges to the desired values.
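For reference, the full corrected training script might look like this (a sketch using the same model, data, optimizer, and step count as above):

```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

# fit y = x1 + 2 * x2 + 3
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])

model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    pred = model(x).squeeze()  # the fix: (4, 1) -> (4,) to match y
    loss = ((y - pred) ** 2).mean()
    loss.backward()
    optimizer.step()

print(loss.item())  # the loss should now be small
```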

Example 3: Monitoring nan and inf

Let's say we have a model that outputs the likelihood of something. For this example, we will just use a mock:

class MockModel(torch.nn.Module):

    def __init__(self):
        super(MockModel, self).__init__()
        self.unused = torch.nn.Linear(6, 4)

    def forward(self, x):
        return torch.tensor([0.0, 0.25, 0.9, 0.75]) + self.unused(x) * 0.0

model = MockModel()

During training, we want to minimize the negative log likelihood, so we write:

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    batch_input = torch.randn(6, 6)
    likelihood = model(batch_input)
    log_likelihood = likelihood.log()
    target = -log_likelihood.mean()
    print(target.item())

    optimizer.zero_grad()
    target.backward()
    optimizer.step()

Unfortunately, we first get inf and then nan for our target during training. What's wrong? Let's snoop it:

with torchsnooper.snoop():
    for epoch in range(100):
        batch_input = torch.randn(6, 6)
        likelihood = model(batch_input)
        log_likelihood = likelihood.log()
        target = -log_likelihood.mean()
        print(target.item())

        optimizer.zero_grad()
        target.backward()
        optimizer.step()

Part of the snoop output looks like:

19:30:20.928316 line        18     for epoch in range(100):
New var:....... epoch = 0
19:30:20.928575 line        19         batch_input = torch.randn(6, 6)
New var:....... batch_input = tensor<(6, 6), float32, cpu>
19:30:20.929671 line        20         likelihood = model(batch_input)
New var:....... likelihood = tensor<(6, 4), float32, cpu, grad>
19:30:20.930284 line        21         log_likelihood = likelihood.log()
New var:....... log_likelihood = tensor<(6, 4), float32, cpu, grad, has_inf>
19:30:20.930672 line        22         target = -log_likelihood.mean()
New var:....... target = tensor<(), float32, cpu, grad, has_inf>
19:30:20.931136 line        23         print(target.item())
19:30:20.931508 line        25         optimizer.zero_grad()
19:30:20.931871 line        26         target.backward()
inf
19:30:20.960028 line        27         optimizer.step()
19:30:20.960673 line        18     for epoch in range(100):
Modified var:.. epoch = 1
19:30:20.961043 line        19         batch_input = torch.randn(6, 6)
19:30:20.961423 line        20         likelihood = model(batch_input)
Modified var:.. likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.961910 line        21         log_likelihood = likelihood.log()
Modified var:.. log_likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.962302 line        22         target = -log_likelihood.mean()
Modified var:.. target = tensor<(), float32, cpu, grad, has_nan>
19:30:20.962715 line        23         print(target.item())
19:30:20.963089 line        25         optimizer.zero_grad()
19:30:20.963464 line        26         target.backward()
19:30:20.964051 line        27         optimizer.step()

Reading the output, we find that at the first epoch (epoch = 0), log_likelihood carries a has_inf flag, which means the tensor contains inf values. The same flag appears for target. At the second epoch, starting from likelihood, all tensors carry a has_nan flag.

From our experience with deep learning, we would guess that the inf in the first epoch makes the gradient nan; when the parameters are updated, these nan values propagate into the parameters, causing all subsequent steps to produce nan results.

Taking a deeper look, we find that likelihood contains a 0, which leads to log(0) = -inf. Changing the line

log_likelihood = likelihood.log()

into

log_likelihood = likelihood.clamp(min=1e-8).log()

Problem solved.
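In isolation, the clamp guard behaves as follows (a minimal sketch using the mock likelihood values from above):

```python
import torch

likelihood = torch.tensor([0.0, 0.25, 0.9, 0.75])
print(likelihood.log())  # first entry is -inf, since log(0) = -inf

# clamping to a small positive floor keeps the log finite
log_likelihood = likelihood.clamp(min=1e-8).log()
print(torch.isfinite(log_likelihood).all())  # tensor(True)
```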
