TorchSnooper

Debug PyTorch code using PySnooper.
Do you want to look at the shape/dtype/etc. of every step of your model, but are tired of manually writing prints? Are you bothered by errors like `RuntimeError: Expected object of scalar type Double but got scalar type Float`, and want to quickly figure out the problem?
TorchSnooper is a PySnooper extension that helps you debug these errors.
To use TorchSnooper, use it just like PySnooper, but replace `pysnooper.snoop` with `torchsnooper.snoop` in your code.
To install:

```
pip install torchsnooper
```
TorchSnooper also supports snoop. To use TorchSnooper with snoop, simply execute:

```python
torchsnooper.register_snoop()
```

or

```python
torchsnooper.register_snoop(verbose=True)
```

at the beginning, and then use snoop normally.
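For instance, a minimal sketch of this workflow (the `double` function and its input are made up for illustration):

```python
import torch
import snoop
import torchsnooper

torchsnooper.register_snoop()  # register once, before using snoop

@snoop  # snoop now formats tensors as tensor<shape, dtype, device>
def double(x):
    return x * 2

double(torch.ones(3))
```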
Example 1: Monitoring device and dtype
We're writing a simple function:

```python
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y
```
and use it like below:

```python
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
```
The above code seems correct, but unfortunately we are getting the following error:

```
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #2 'mask'
```
What is the problem? Let's snoop it! Decorate our function with `torchsnooper.snoop()`:
```python
import torch
import torchsnooper

@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
```
Run our script, and we will see:
```
Starting var:.. mask = tensor<(6,), int64, cuda:0>
Starting var:.. x = tensor<(3,), float32, cuda:0>
21:41:42.941668 call 5 def myfunc(mask, x):
21:41:42.941834 line 6 y = torch.zeros(6)
New var:....... y = tensor<(6,), float32, cpu>
21:41:42.943443 line 7 y.masked_scatter_(mask, x)
21:41:42.944404 exception 7 y.masked_scatter_(mask, x)
```
Now pay attention to the devices of the tensors; we notice

```
New var:....... y = tensor<(6,), float32, cpu>
```
Now it's clear that the problem is that `y` is a tensor on the CPU; that is, we forgot to specify the device in `y = torch.zeros(6)`. Changing it to `y = torch.zeros(6, device='cuda')` solves this problem.
But when we run the script again, we get another error:

```
RuntimeError: Expected object of scalar type Byte but got scalar type Long for argument #2 'mask'
```
Look at the trace above again and pay attention to the dtype of the variables; we notice

```
Starting var:.. mask = tensor<(6,), int64, cuda:0>
```
OK, the problem is that we didn't make the `mask` in the input a byte tensor. Changing the line into

```python
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)
```

Problem solved.
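Putting both fixes together, the working script would look like this (the commented output is what we would expect, assuming a CUDA device is available):

```python
import torch

def myfunc(mask, x):
    y = torch.zeros(6, device='cuda')  # allocate y on the same device as mask and x
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
print(y)  # expected: tensor([0., 1., 0., 2., 3., 0.], device='cuda:0')
```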
Example 2: Monitoring shape
We are building a linear model:

```python
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)
```
and we want to fit `y = x1 + 2 * x2 + 3`, so we create a dataset where each target is `x1 + 2 * x2 + 3` for the corresponding row:

```python
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
```
We train our model on this dataset using the SGD optimizer:

```python
model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    optimizer.zero_grad()
    pred = model(x)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()
```
But unfortunately, the loss does not go down to a low enough number.
What's wrong? Let's snoop it! Putting the training loop inside snoop:
```python
with torchsnooper.snoop():
    for _ in range(100):
        optimizer.zero_grad()
        pred = model(x)
        squared_diff = (y - pred) ** 2
        loss = squared_diff.mean()
        print(loss.item())
        loss.backward()
        optimizer.step()
```
Part of the trace looks like:
```
New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model( (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0 dampening: 0 lr: 0....omentum: 0 nesterov: False weight_decay: 0)
22:27:01.024233 line 21 for _ in range(100):
New var:....... _ = 0
22:27:01.024439 line 22 optimizer.zero_grad()
22:27:01.024574 line 23 pred = model(x)
New var:....... pred = tensor<(4, 1), float32, cpu, grad>
22:27:01.026442 line 24 squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4, 4), float32, cpu, grad>
22:27:01.027369 line 25 loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:27:01.027616 line 26 print(loss.item())
22:27:01.027793 line 27 loss.backward()
22:27:01.050189 line 28 optimizer.step()
```
We notice that `y` has shape `(4,)`, but `pred` has shape `(4, 1)`. As a result, `squared_diff` has shape `(4, 4)` due to broadcasting!
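To see the broadcasting at work in isolation, here is a quick standalone sketch (the zero tensors are made up for illustration):

```python
import torch

y = torch.zeros(4)        # shape (4,)
pred = torch.zeros(4, 1)  # shape (4, 1)

# Broadcasting aligns trailing dimensions: (4,) is treated as (1, 4),
# so (1, 4) - (4, 1) expands to a full (4, 4) matrix of pairwise differences.
diff = y - pred
print(diff.shape)  # torch.Size([4, 4])
```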
This is not the expected behavior. Let's fix it with `pred = model(x).squeeze()`; now everything looks good:
```
New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model( (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0 dampening: 0 lr: 0....omentum: 0 nesterov: False weight_decay: 0)
22:28:19.778089 line 21 for _ in range(100):
New var:....... _ = 0
22:28:19.778293 line 22 optimizer.zero_grad()
22:28:19.778436 line 23 pred = model(x).squeeze()
New var:....... pred = tensor<(4,), float32, cpu, grad>
22:28:19.780250 line 24 squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4,), float32, cpu, grad>
22:28:19.781099 line 25 loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:28:19.781361 line 26 print(loss.item())
22:28:19.781537 line 27 loss.backward()
22:28:19.798983 line 28 optimizer.step()
```
And the final model converges to the desired values.
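As a sanity check, we can print the learned parameters after training; since the dataset follows `y = x1 + 2 * x2 + 3` exactly, they should approach these coefficients (how closely depends on the number of steps):

```python
print(model.layer.weight)  # should approach tensor([[1., 2.]])
print(model.layer.bias)    # should approach tensor([3.])
```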
Example 3: Monitoring nan and inf
Let's say we have a model that outputs the likelihood of something. For this example, we will just use a mock:

```python
class MockModel(torch.nn.Module):
    def __init__(self):
        super(MockModel, self).__init__()
        # The unused layer gives the mock trainable parameters; the `* 0.0`
        # in forward keeps the output constant while staying connected to
        # those parameters in the autograd graph.
        self.unused = torch.nn.Linear(6, 4)

    def forward(self, x):
        return torch.tensor([0.0, 0.25, 0.9, 0.75]) + self.unused(x) * 0.0

model = MockModel()
```
During training, we want to minimize the negative log likelihood, so we have the following code:
```python
for epoch in range(100):
    batch_input = torch.randn(6, 6)
    likelihood = model(batch_input)
    log_likelihood = likelihood.log()
    target = -log_likelihood.mean()
    print(target.item())
    optimizer.zero_grad()
    target.backward()
    optimizer.step()
```
Unfortunately, we first get `inf` and then `nan` for our target during training. What's wrong? Let's snoop it:
```python
with torchsnooper.snoop():
    for epoch in range(100):
        batch_input = torch.randn(6, 6)
        likelihood = model(batch_input)
        log_likelihood = likelihood.log()
        target = -log_likelihood.mean()
        print(target.item())
        optimizer.zero_grad()
        target.backward()
        optimizer.step()
```
Part of the snoop output looks like:
```
19:30:20.928316 line 18 for epoch in range(100):
New var:....... epoch = 0
19:30:20.928575 line 19 batch_input = torch.randn(6, 6)
New var:....... batch_input = tensor<(6, 6), float32, cpu>
19:30:20.929671 line 20 likelihood = model(batch_input)
New var:....... likelihood = tensor<(6, 4), float32, cpu, grad>
19:30:20.930284 line 21 log_likelihood = likelihood.log()
New var:....... log_likelihood = tensor<(6, 4), float32, cpu, grad, has_inf>
19:30:20.930672 line 22 target = -log_likelihood.mean()
New var:....... target = tensor<(), float32, cpu, grad, has_inf>
19:30:20.931136 line 23 print(target.item())
19:30:20.931508 line 25 optimizer.zero_grad()
19:30:20.931871 line 26 target.backward()
inf
19:30:20.960028 line 27 optimizer.step()
19:30:20.960673 line 18 for epoch in range(100):
Modified var:.. epoch = 1
19:30:20.961043 line 19 batch_input = torch.randn(6, 6)
19:30:20.961423 line 20 likelihood = model(batch_input)
Modified var:.. likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.961910 line 21 log_likelihood = likelihood.log()
Modified var:.. log_likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.962302 line 22 target = -log_likelihood.mean()
Modified var:.. target = tensor<(), float32, cpu, grad, has_nan>
19:30:20.962715 line 23 print(target.item())
19:30:20.963089 line 25 optimizer.zero_grad()
19:30:20.963464 line 26 target.backward()
19:30:20.964051 line 27 optimizer.step()
```
Reading the output, we find that at the first epoch (`epoch = 0`), `log_likelihood` has the `has_inf` flag. The `has_inf` flag means that your tensor contains `inf` in its values. The same flag appears for `target`.
And at the second epoch, starting from `likelihood`, all the tensors have the `has_nan` flag.
From our experience with deep learning, we would guess this is because the first epoch produces `inf`, which causes the gradients to be `nan`; when the parameters are updated, these `nan` values propagate to the parameters, causing all future steps to produce `nan` results.
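That mechanism can be reproduced in isolation. Below is a minimal standalone sketch (a toy parameter, not the model above): the backward pass multiplies the `inf` gradient of `log` at zero by the `0.0` factor, and `inf * 0.0` is `nan`:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
out = (w * 0.0).log()  # log(0) = -inf in the forward pass
out.backward()
print(out)     # -inf
print(w.grad)  # nan, because (1 / 0) * 0.0 = inf * 0.0 = nan
```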
Taking a deeper look, we figure out that `likelihood` contains 0 in it, which leads to `log(0) = -inf`. Changing the line

```python
log_likelihood = likelihood.log()
```

into

```python
log_likelihood = likelihood.clamp(min=1e-8).log()
```

Problem solved.
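To see why the clamp helps, here is a quick check on the mock's constant likelihood values (the printed numbers below are what we would expect from `log`, rounded to four decimals):

```python
import torch

likelihood = torch.tensor([0.0, 0.25, 0.9, 0.75])
print(likelihood.log())
# tensor([   -inf, -1.3863, -0.1054, -0.2877])
print(likelihood.clamp(min=1e-8).log())
# tensor([-18.4207,  -1.3863,  -0.1054,  -0.2877])
```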