PyTorch 101 Part 1: Understanding Graphs, Automatic Differentiation and Autograd


PyTorch is one of the foremost Python deep learning libraries out there. It's the go-to choice for deep learning research, and as each day passes, more and more companies and research labs are adopting this library.

In this series of tutorials, we will be introducing you to PyTorch, and how to make the best use of the library as well as the ecosystem of tools built around it. We'll first cover the basic building blocks, and then move on to how you can quickly prototype custom architectures. We will finally conclude with a couple of posts on how to scale your code, and how to debug your code if things go awry.

Before we begin, let me remind you that this is part 1 of our PyTorch series.

You can get all the code in this post (and other posts as well) in the GitHub repo here.

A lot of tutorial series on PyTorch begin with a rudimentary discussion of what the basic structures are. However, I'd like to instead start by discussing automatic differentiation first.

Automatic differentiation is a building block of not only PyTorch, but every DL library out there. In my opinion, PyTorch's automatic differentiation engine, called Autograd, is a brilliant tool for understanding how automatic differentiation works. This will not only help you understand PyTorch better, but also other DL libraries.

Modern neural network architectures can have millions of learnable parameters. From a computational point of view, training a neural network consists of two phases: the forward pass and the backward pass.

The forward pass is pretty straightforward: the output of one layer is the input to the next, and so forth.

The backward pass is a bit more complicated, since it requires us to use the chain rule to compute the gradients of the weights with respect to the loss function.

Let us take a very simple neural network consisting of just 5 neurons. Our neural network looks like the following.

*A Very Simple Neural Network*

The following equations describe our neural network.
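The equations appeared as images in the original post; reconstructed from the computation graph discussed below (the constant 10 in the loss is an assumption on my part), they read:

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d

Here a is the input, w1 through w4 are the learnable weights, and L is the loss.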

Let us compute the gradients for each of the learnable parameters.
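These gradient equations were also images in the original; applying the chain rule to the equations above gives:

∂L/∂w4 = (∂L/∂d) · (∂d/∂w4)
∂L/∂w3 = (∂L/∂d) · (∂d/∂w3)
∂L/∂w2 = (∂L/∂d) · (∂d/∂c) · (∂c/∂w2)
∂L/∂w1 = (∂L/∂d) · (∂d/∂b) · (∂b/∂w1)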

All these gradients have been computed by applying the chain rule. Note that all the individual gradients on the right-hand side of the equations above can be computed directly, since the *numerators* of the gradients are explicit functions of the denominators.

We could manually compute the gradients of our network because it was very simple. But imagine what you would do if your network had 152 layers, or multiple branches.

When we design software to implement neural networks, we want a way to seamlessly compute the gradients regardless of the architecture, so that the programmer doesn't have to manually compute gradients whenever the network changes.

We galvanise this idea in the form of a data structure called a computation graph. A computation graph looks very similar to the diagram of the network in the image above. However, the nodes in a computation graph are basically operators. These operators are ordinary mathematical operators, except for one case where we need to represent the creation of a user-defined variable.

Notice that we have also denoted the leaf variables in the graph for the sake of clarity. However, it should be noted that they are not a part of the computation graph. What they represent in our graph is the special case of user-defined variables, which we just covered as an exception.

*Computation Graph for our very simple Neural Network*

The variables *b*, *c* and *d* are created as a result of mathematical operations, whereas the variables *a*, *w1*, *w2*, *w3* and *w4* are initialised by the user. Since they are not created by any mathematical operator, the nodes corresponding to their creation are represented by their names themselves. This is true for all the *leaf* nodes in the graph.

Now, we are ready to describe how we will compute gradients using a computation graph.

Each node of the computation graph, with the exception of leaf nodes, can be considered as a function which takes some inputs and produces an output. Consider the node of the graph which produces variable d from w3*b and w4*c. Therefore we can write,

*d is the output of the function f(x, y) = x + y*

Now, we can easily compute the gradient of f with respect to its inputs, ∂f/∂x and ∂f/∂y (which are both 1). Now, label the edges coming into the nodes with their respective gradients, as in the following image.

*Local Gradients*

We do this for the entire graph. The graph then looks like this.

*Backpropagation in a Computational Graph*

Next, we describe the algorithm for computing the derivative of any node in this graph with respect to the loss L. Let's say we want to compute ∂L/∂w4: we simply trace the path from L to w4 and multiply the edge gradients along the way.

If you look closely, this product is precisely the same expression we derived using the chain rule. If there is more than one path from *L* to a variable, then we multiply the edges along each path and add the products together. For example, ∂L/∂a is computed as

∂L/∂a = (∂L/∂d)(∂d/∂b)(∂b/∂a) + (∂L/∂d)(∂d/∂c)(∂c/∂a)

Now that we understand what a computation graph is, let's get back to PyTorch and see how the above is implemented there.

Tensor is the data structure that is the fundamental building block of PyTorch. Tensors are pretty much like numpy arrays, except that, unlike numpy, tensors are designed to take advantage of the parallel computation capabilities of a GPU. A lot of Tensor syntax is similar to that of numpy arrays.

On its own, a Tensor is just like a numpy ndarray: a data structure that lets you do fast linear algebra operations. If you want PyTorch to create a graph corresponding to these operations, you will have to set the requires_grad attribute of the Tensor to True.

The API can be a bit confusing here. There are multiple ways to initialise tensors in PyTorch. While some let you explicitly define requires_grad in the constructor itself, others require you to set it manually after creating the Tensor. For example:
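Here is a minimal sketch of both styles (the tensor shapes are arbitrary):

import torch

# requires_grad can be passed to the constructor...
t1 = torch.randn(3, 4, requires_grad=True)

# ...or set manually after the Tensor has been created.
t2 = torch.randn(3, 4)
t2.requires_grad = True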

requires_grad is contagious. It means that when a Tensor is created by operating on other Tensors, the requires_grad of the resultant Tensor is set to True if at least one of the tensors used in its creation has requires_grad set to True.

Each Tensor has an attribute called grad_fn, which refers to the mathematical operator that created the variable. If requires_grad is set to False, grad_fn is None.

In our example, where d = f(w3·b, w4·c), d's grad function would be the addition operator, since *f* adds its two inputs together. Notice that the addition operator is also the node in our graph that outputs d. If our Tensor is a leaf node (initialised by the user), then the grad_fn is also None. Consider the snippet below.
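This is a minimal sketch that builds the graph of our example network and inspects grad_fn and is_leaf (the initial values are random; the exact snippet from the original post may differ):

import torch

a  = torch.randn(1, requires_grad=True)
w1 = torch.randn(1, requires_grad=True)
w2 = torch.randn(1, requires_grad=True)
w3 = torch.randn(1, requires_grad=True)
w4 = torch.randn(1, requires_grad=True)

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d

print("grad_fn of d:", d.grad_fn)        # an AddBackward object
print("grad_fn of a:", a.grad_fn)        # None, since a is a leaf
print("is_leaf:", a.is_leaf, d.is_leaf)  # True False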

If you run the code above, you will see that d's grad_fn is an AddBackward object, while the grad_fn of the user-created Tensor a is None.

One can use the is_leaf attribute to determine whether a Tensor is a leaf or not.

All mathematical operations in PyTorch are implemented by the torch.autograd.Function class. This class has two important member functions we need to look at.

The first is its *forward* function, which simply computes the output from its inputs.

The *backward* function takes the incoming gradient from the part of the network in front of it. As you can see, the gradient to be backpropagated from a function *f* is basically the **gradient that is backpropagated to f from the layers in front of it**, multiplied by **the local gradient of the output of f with respect to its inputs**. This is exactly what the backward function does.

Let's again understand this with our example of d = f(w3·b, w4·c).

Algorithmically, here's how backpropagation happens with a computation graph. (Not the actual implementation, only representative.)
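A representative pseudocode sketch of such a backward function (self.inputs and local_grad are illustrative placeholders, not real PyTorch API):

def backward(incoming_gradients):
    # Store the gradient of the loss w.r.t. the Tensor this Function created.
    self.Tensor.grad = incoming_gradients

    for inp in self.inputs:
        if inp.grad_fn is not None:
            # Chain rule: incoming gradient times the local gradient of
            # this Function's output w.r.t. the input.
            new_incoming_gradients = incoming_gradients * local_grad(self.Tensor, inp)
            inp.grad_fn.backward(new_incoming_gradients)
        else:
            # Leaf node: nothing to backtrack through.
            pass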

Here, self.Tensor is basically the Tensor created by the autograd Function, which was d in our example.

Incoming gradients and local gradients have been described above.

In order to compute derivatives in our neural network, we generally call backward on the Tensor representing our loss. Then we backtrack through the graph, starting from the node representing the grad_fn of our loss.

As described above, the backward function is recursively called throughout the graph as we backtrack. Once we reach a leaf node, we stop backtracking along that path, since its grad_fn is None.

One thing to note here is that PyTorch gives an error if you call backward() on a vector-valued Tensor; you can only call backward on a scalar-valued Tensor. In our example, if we assume a to be a vector-valued Tensor and call backward on L, it will throw an error. For instance:
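A sketch of the failing case, with a made a vector of 10 elements:

import torch

a  = torch.randn(10, requires_grad=True)
w1 = torch.randn(1, requires_grad=True)
w2 = torch.randn(1, requires_grad=True)
w3 = torch.randn(1, requires_grad=True)
w4 = torch.randn(1, requires_grad=True)

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d       # L is now vector-valued

L.backward()     # raises a RuntimeError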

Running the above snippet results in an error along these lines: RuntimeError: grad can be implicitly created only for scalar outputs.

This is because, by definition, gradients can be computed only with respect to scalar values. You can't exactly differentiate a vector with respect to another vector. The mathematical entity used for such cases is called a **Jacobian**, the discussion of which is beyond the scope of this article.

There are two ways to overcome this.

If you make a small change to the above code, setting L to be the sum of all the errors, our problem is solved.

Once that's done, you can access the gradients through the grad attribute of each Tensor.
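Continuing the sketch above, one way to write this:

L = (10 - d).sum()   # summing the errors makes L a scalar
L.backward()

print(w1.grad)       # gradient of L with respect to w1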

The second way: if, for some reason, you absolutely have to call backward on a vector-valued Tensor, you can pass a tensor of ones with the same shape as the tensor you are calling backward on.

Notice how backward takes incoming gradients as its input. Doing the above makes backward treat the incoming gradient as a Tensor of ones of the same size as L, so it is able to backpropagate.
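Continuing the same hypothetical example, this would look like:

L = 10 - d                       # vector-valued, from a fresh forward pass
L.backward(torch.ones_like(L))   # incoming gradients are explicitly all ones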

In this way, we can have gradients for every Tensor, and we can update them using the optimisation algorithm of our choice. A hand-rolled update might look like the snippet below.
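For example, a simple gradient-descent step (learning_rate is an assumed hyperparameter):

learning_rate = 0.01

# In-place updates on .data keep the update itself out of the graph.
w1.data -= learning_rate * w1.grad.data
w2.data -= learning_rate * w2.grad.data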

And so on.

PyTorch creates something called a **Dynamic Computation Graph**, which means that the graph is generated on the fly.

Until the forward function is called, there exists no node for a Tensor (its grad_fn) in the graph.

The graph is created as a result of the forward functions of many Tensors being invoked. Only then are the buffers for the non-leaf nodes and intermediate values allocated (to be used for computing gradients later). When you call backward, as the gradients are computed, these buffers (for non-leaf variables) are essentially freed, and the graph is *destroyed* (in the sense that you can't backpropagate through it again, since the buffers holding the values needed to compute the gradients are gone).

The next time you call forward on the same set of tensors, the leaf node buffers from the previous run will be shared, while the non-leaf node buffers will be created again.

If you call backward more than once on a graph with non-leaf nodes, you'll be met with an error along these lines: RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

This is because the non-leaf buffers get destroyed the first time backward() is called, and hence there's no path to navigate to the leaves when backward is invoked a second time. You can undo this buffer-destroying behaviour by passing the retain_graph=True argument to the backward function.
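Continuing our hypothetical example, a sketch with retain_graph:

b = w1 * a                       # a fresh forward pass
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d).sum()

L.backward(retain_graph=True)    # buffers are kept, the graph survives
L.backward()                     # a second call now works; gradients accumulate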

If you do the above, you will be able to backpropagate again through the same graph, and the gradients will be accumulated; i.e. the next time you backpropagate, the gradients will be added to those already stored from the previous backward pass.

This is in contrast to the **Static Computation Graphs** used by TensorFlow, where the graph is declared *before* running the program. The graph is then "run" by feeding values to the predefined graph.

The dynamic graph paradigm allows you to make changes to your network architecture *during* runtime, as the graph is created only when a piece of code is run.

This means a graph may be redefined during the lifetime of a program, since you don't have to define it beforehand.

This, however, is not possible with static graphs, where graphs are created before running the program and merely executed later.

Dynamic graphs also make debugging way easier, since it is easier to locate the source of your error.

requires_grad is an attribute of the Tensor class. By default, it's False. It comes in handy when you have to freeze some layers and stop them from updating parameters while training: you can simply set their requires_grad to False, and these Tensors won't participate in the computation graph.

Thus, no gradient would be propagated to them, or to those layers which depend upon these layers for gradient flow. In contrast, when set to True, requires_grad is contagious, meaning that even if only one operand of an operation has requires_grad set to True, so will the result. For example:
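A minimal sketch of freezing (the two-layer model here is purely illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))

# Freeze the first linear layer: its parameters will receive no gradients.
for param in model[0].parameters():
    param.requires_grad = False

out = model(torch.randn(1, 10)).sum()
out.backward()

print(model[0].weight.grad)              # None: the frozen layer got no gradient
print(model[2].weight.grad is not None)  # True: the unfrozen layer did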

When we are computing gradients, we need to cache input values and intermediate features, as they may be required to compute the gradients later.

For example, the gradient of b = w1·a with respect to its inputs w1 and a is a and w1 respectively. We need to store these values for gradient computation during the backward pass. This affects the memory footprint of the network.

While we are performing inference, we don't compute gradients, and thus don't need to store these values. In fact, no graph needs to be created during inference, as it would lead to needless memory consumption.

PyTorch offers a context manager called torch.no_grad for this purpose.

No graph is defined for operations executed under this context manager. For example:
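A minimal sketch:

import torch

x = torch.randn(5, requires_grad=True)

with torch.no_grad():
    y = x * 2           # no graph is recorded for this operation

print(y.requires_grad)  # False
print(y.grad_fn)        # None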

Understanding how Autograd and computation graphs work can make life with PyTorch a whole lot easier. With our foundations rock solid, the next posts will detail how to create custom complex architectures, how to create custom data pipelines, and more interesting stuff.

https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/

Deep Learning with PyTorch: A 60 Minute Blitz, Part 3: Neural Networks

Neural networks can be constructed using the torch.nn package.

Now that you have had a glimpse of autograd: nn depends on autograd to define models and differentiate them. An nn.Module contains layers, and a method forward(input) that returns the output.

For example, look at this network that classifies digit images:

It is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.
A typical training procedure for a neural network is as follows:

  • Define the neural network that has some learnable parameters (or weights)
  • Iterate over a dataset of inputs
  • Process the input through the network
  • Compute the loss (how far the output is from being correct)
  • Propagate gradients back into the network's parameters
  • Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient

0x01 Defining the Network

Below, we define this network:

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

The output is:

Net(
  (conv1): Conv2d (1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d (6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120)
  (fc2): Linear(in_features=120, out_features=84)
  (fc3): Linear(in_features=84, out_features=10)
)

You only need to define the forward function; the backward function (where the gradients are computed) is automatically defined for you using autograd. You can use any of the Tensor operations in the forward function.

The learnable parameters of a model are returned by net.parameters().

params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

Output:

10
torch.Size([6, 1, 5, 5])

The input to the forward pass is an autograd.Variable, and so is the output. Note: the expected input size for this network (LeNet) is 32x32. To use this network on the MNIST dataset, resize the images from that dataset to 32x32.

input = Variable(torch.randn(1, 1, 32, 32))
out = net(input)
print(out)

Output:

Variable containing:
 0.0023 -0.0613 -0.0397 -0.1123 -0.0397  0.0330 -0.0656 -0.1231  0.0412  0.0162
[torch.FloatTensor of size 1x10]

Zero the gradient buffers of all parameters, and backprop with random gradients:

net.zero_grad()
out.backward(torch.randn(1, 10))

Note: torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, not a single sample. For example, nn.Conv2d takes a 4D Tensor of nSamples x nChannels x Height x Width. If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension, as shown below.
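For instance (a hypothetical single-sample input for the network defined above):

single = torch.randn(1, 32, 32)          # 3D: nChannels x Height x Width
batched = Variable(single.unsqueeze(0))  # 4D: 1 x 1 x 32 x 32, a batch of one
out = net(batched)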

Before proceeding further, let's recap all the classes you've seen so far.

Recap:

  • torch.Tensor: a multi-dimensional array.
  • autograd.Variable: wraps a Tensor and records the history of operations applied to it. Has the same API as a Tensor, with some additions like backward(). Also holds the gradient with respect to the tensor.
  • nn.Module: a neural network module. A convenient way of encapsulating parameters, with helpers for moving them to the GPU, exporting, loading, etc.
  • nn.Parameter: a kind of Variable that is automatically registered as a parameter when assigned as an attribute to a Module.
  • autograd.Function: implements the forward and backward definitions of an autograd operation. Every Variable operation creates at least one Function node that connects to the functions which created the Variable and encodes its history.

In this section, we covered:

  • Defining a neural network
  • Processing inputs and calling backward

Still left:

  • Computing the loss
  • Updating the weights of the network

0x02 Loss Function

A loss function takes an (output, target) pair of inputs and computes a value that estimates how far the output is from the target.

There are several different loss functions in the nn package. A simple one is nn.MSELoss, which computes the mean squared error between the input and the target.

For example:

output = net(input)
target = Variable(torch.arange(1, 11))  # a dummy target, for example
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

Output:

Variable containing:
 38.8243
[torch.FloatTensor of size 1]

Now, if you follow loss in the backward direction using its .grad_fn attribute, you will see a graph of computations that looks like this:

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> view -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss

So, when we call loss.backward(), the whole graph is differentiated with respect to the loss, and all Variables in the graph will have their .grad attribute accumulated with the gradient.

For illustration, let us follow a few steps backward:

print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # ReLU

Output:

<MseLossBackward object at 0x7fe4c18539e8>
<AddmmBackward object at 0x7fe3f5498550>
<ExpandBackward object at 0x7fe4c18539e8>

0x03 Backpropagation (Backprop)

To backpropagate the error, all we have to do is call loss.backward(). You need to clear the existing gradients first, though, or else the new gradients will be accumulated onto the existing ones.

Now we call loss.backward(), and take a look at conv1's bias gradients before and after the backward pass.

net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

Output:

conv1.bias.grad before backward
Variable containing:
 0
 0
 0
 0
 0
 0
[torch.FloatTensor of size 6]

conv1.bias.grad after backward
Variable containing:
1.00000e-02 *
  7.4571
 -0.4714
 -5.5774
 -6.2058
  6.6810
  3.1632
[torch.FloatTensor of size 6]

Now we have seen how to use a loss function.

Further reading:
The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is available here.

The only thing left to learn is:

  • Updating the weights of the network

0x04 Updating the Weights

The simplest update rule used in practice is Stochastic Gradient Descent (SGD):
weight = weight - learning_rate * gradient
We can implement this using simple Python code:

learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)

However, as you use neural networks, you will want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, a small package, torch.optim, implements all of these methods. Using it is very simple:

import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

Note:
Observe how the gradient buffers had to be manually set to zero using optimizer.zero_grad(). This is because gradients are accumulated, as explained in the Backpropagation section.

