understanding backpropagation

Posted 2020-08-20 cslxiao


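A few web pages that help build an intuitive understanding of the backpropagation algorithm, covering ordinary feedforward neural networks, convolutional neural networks, and the use of backpropagation to differentiate general functions (UFLDL).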

A Visual Explanation of the Back Propagation Algorithm for Neural Networks

By Sebastian Raschka, Michigan State University.

Let's assume we are really into mountain climbing and, to add a little extra challenge, we cover our eyes this time so that we can't see where we are or when we have accomplished our "objective," that is, reaching the top of the mountain.

Since we can't see the path up front, we let our intuition guide us: assuming that the mountain top is the "highest" point of the mountain, we think that the steepest path leads us to the top most efficiently.

We approach this challenge by iteratively "feeling" around us and taking a step in the direction of the steepest ascent; let's call it "gradient ascent." But what do we do if we reach a point where we can't ascend any further, i.e., where each direction leads downwards? At this point, we may have already reached the mountain's top, but we could just as well have reached a smaller plateau ... we don't know. Essentially, this is just an analogy for gradient ascent optimization (basically the counterpart of minimizing a cost function via gradient descent). However, this is not specific to backpropagation; it is just one way to minimize a convex cost function (which has only a global minimum) or a non-convex cost function (which has local minima, like the "plateaus" that let us think we reached the mountain's top). Using a little visual aid, we could picture a non-convex cost function with only one parameter, with a blue ball marking our current location.
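To make the analogy concrete, here is a minimal sketch of gradient descent on a one-parameter non-convex cost. The cost `w**4 - 3*w**2 + w`, the learning rate, and the starting points are all assumptions chosen for illustration; none of them come from the original post.

```python
def cost(w):
    # Hypothetical non-convex cost with two minima (illustration only).
    return w**4 - 3 * w**2 + w

def grad(w):
    # Analytic derivative of the cost above.
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    # Repeatedly step against the gradient, i.e. "downhill".
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# The minimum we end up in depends on where we start.
w_left = gradient_descent(-2.0)   # starts on the left slope
w_right = gradient_descent(2.0)   # starts on the right slope
print(w_left, cost(w_left))
print(w_right, cost(w_right))
```

Both runs stop where the gradient vanishes, but only the left one reaches the lower of the two minima. This is the "smaller plateau" problem from the analogy, seen from the descent side.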

Now, backpropagation is just back-propagating the cost over multiple "levels" (or layers). E.g., if we have a multi-layer perceptron, we can picture forward propagation (passing the input signal through a network while multiplying it by the respective weights to compute an output) as follows:

And in backpropagation, we "simply" backpropagate the error (the "cost" that we compute by comparing the calculated output and the known, correct target output, which we then use to update the model parameters):

It may have been some time since pre-calc, but it's essentially all based on the simple chain rule that we use for nested functions:

$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$

Instead of doing this "manually," we can use computational tools (called "automatic differentiation"), and backpropagation is basically the "reverse" mode of this auto-differentiation. Why reverse and not forward? Because it is computationally cheaper! If we did it forward-wise, we'd successively multiply large matrices for each layer until we multiply a large matrix by a vector in the output layer. However, if we start backwards, that is, we start by multiplying a matrix by a vector, we get another vector, and so forth. So, I'd say the beauty in backpropagation is that we are doing more efficient matrix-vector multiplications instead of matrix-matrix multiplications.
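A rough sketch of that cost argument, assuming a hypothetical chain of two layer Jacobians `J1` and `J2` and a scalar loss (the shapes are made up for illustration). Both orderings give the same gradient, but the multiply-add counts differ widely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Jacobians for a chain x -> h1 -> h2 -> scalar loss.
J1 = rng.standard_normal((200, 300))   # d h1 / d x
J2 = rng.standard_normal((100, 200))   # d h2 / d h1
g = rng.standard_normal(100)           # d loss / d h2 (a vector: the loss is scalar)

# Reverse order: start at the output, so every product is vector-matrix.
reverse = (g @ J2) @ J1

# Forward order: multiply the Jacobians first (matrix-matrix), then apply g.
forward = g @ (J2 @ J1)

def matmul_cost(m, k, n):
    # Multiply-add count of an (m x k) by (k x n) product.
    return m * k * n

reverse_cost = matmul_cost(1, 100, 200) + matmul_cost(1, 200, 300)
forward_cost = matmul_cost(100, 200, 300) + matmul_cost(1, 100, 300)
print(reverse_cost, forward_cost)
```

The gap grows with layer width, which is why reverse mode is the natural choice when a scalar cost depends on many parameters.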

Deriving gradients using the backpropagation idea

From UFLDL tutorial

Introduction

In the section on the backpropagation algorithm, you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from $\mathbb{R}^{r \times c} \rightarrow \mathbb{R}$).

First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:

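1. For each output unit $i$ in layer $n_l$ (the final layer), set
   $\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} J(z^{(n_l)})$
   where $J(z)$ is our "objective function" (explained below).
2. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$
   For each node $i$ in layer $l$, set
   $\delta^{(l)}_i = \left( \sum_j W^{(l)}_{ji} \delta^{(l+1)}_j \right) \frac{\partial}{\partial z^{(l)}_i} f^{(l)}(z^{(l)}_i)$
3. Compute the desired partial derivatives,
   $\nabla_{W^{(l)}} J = \delta^{(l+1)} (a^{(l)})^T$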

Quick notation recap:

$n_l$ is the number of layers in the neural network, so layer $n_l$ is the final layer
$s_l$ is the number of neurons in the $l$th layer
$W^{(l)}_{ji}$ is the weight from the $i$th unit in the $l$th layer to the $j$th unit in the $(l + 1)$th layer
$z^{(l)}_i$ is the input to the $i$th unit in the $l$th layer
$a^{(l)}_i$ is the activation of the $i$th unit in the $l$th layer
$A \bullet B$ is the Hadamard or element-wise product, which for $r \times c$ matrices $A$ and $B$ yields the $r \times c$ matrix $C = A \bullet B$ such that $C_{ij} = A_{ij} \cdot B_{ij}$
$f^{(l)}$ is the activation function for units in the $l$th layer

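Let's say we have a function $F$ that takes a matrix $X$ and yields a real number. We would like to use the backpropagation idea to compute the gradient of $F$ with respect to $X$, that is, $\nabla_X F$. The general idea is to see the function $F$ as a multi-layer neural network, and to derive the gradients using the backpropagation idea.

To do this, we will set our "objective function" to be the function $J(z)$ that, when applied to the outputs of the neurons in the last layer, yields the value $F(X)$. For the intermediate layers, we will also choose our activation functions $f^{(l)}$ to this end.

Using this method, we can easily compute derivatives with respect to the inputs $X$, as well as derivatives with respect to any of the weights in the network, as we shall see later.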
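The recipe can be sketched generically in NumPy. This is only an illustration under stated assumptions: the helper name `input_gradient`, the list-based layout, and the toy function $F(x) = \sum (W_2 \tanh(W_1 x))^2$ are hypothetical, and the objective is taken to be the sum of the last layer's activations. The delta recursion is run down to layer 1 so that the result is the gradient with respect to the input.

```python
import numpy as np

def input_gradient(x, weights, acts, act_grads):
    """Gradient of sum(acts[-1](z_last)) with respect to the input x.

    weights[l] maps the activation of layer l to the input z of layer l + 1;
    acts[l] and act_grads[l] are layer l's activation and its derivative.
    """
    # Forward pass: store the input z to every layer.
    zs = [x]
    for W, f in zip(weights, acts[:-1]):
        zs.append(W @ f(zs[-1]))
    # Delta of the final layer, with J(z) = sum_k f(z_k).
    delta = act_grads[-1](zs[-1])
    # Propagate deltas back: delta_l = (W_l^T delta_{l+1}) * f_l'(z_l).
    for W, fprime, z in zip(reversed(weights),
                            reversed(act_grads[:-1]),
                            reversed(zs[:-1])):
        delta = (W.T @ delta) * fprime(z)
    return delta

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
weights = [W1, W2]
acts = [lambda z: z, np.tanh, np.square]
act_grads = [np.ones_like, lambda z: 1 - np.tanh(z) ** 2, lambda z: 2 * z]

def F(x):
    # The toy function written out directly, for comparison.
    return np.sum(np.square(W2 @ np.tanh(W1 @ x)))

x0 = rng.standard_normal(3)
grad = input_gradient(x0, weights, acts, act_grads)

# Central finite differences as an independent check.
h = 1e-6
numeric = np.array([(F(x0 + h * e) - F(x0 - h * e)) / (2 * h) for e in np.eye(3)])
print(np.max(np.abs(grad - numeric)))
```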

Examples

To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on sparse coding, in examples 1 and 2. In example 3, we use a function from independent component analysis to illustrate the use of this idea to compute derivatives with respect to weights, and in this specific case, what to do in the case of tied or repeated weights.

Example 1: Objective for weight matrix in sparse coding

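Recall for sparse coding, the objective function for the weight matrix $A$, given the feature matrix $s$:

$F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2$

We would like to find the gradient of $F$. Since the objective is a sum of two terms, its gradient is the sum of the gradients of the individual terms. The second term is straightforward, so we consider the first term here.

The first term, $\lVert As - x \rVert_2^2$, can be seen as an instantiation of a neural network taking $s$ as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below: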

1. Apply $A$ as the weights from the first layer to the second layer.
2. Subtract $x$ from the activation of the second layer, which uses the identity activation function.
3. Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.
4. Sum all the activations of the third layer.

The weights and activation functions of this network are as follows:

Layer	Weight	Activation function $f$
1	$A$	$f(z_i) = z_i$ (identity)
2	$I$ (identity)	$f(z_i) = z_i - x_i$
3	N/A	$f(z_i) = z_i^2$

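To have $J(z^{(3)})$ equal this first term, we can set $J(z^{(3)}) = \sum_k (z^{(3)}_k)^2$, the sum of the third layer's activations.

Once we see $F$ as a neural network, the gradient with respect to the input $s$ becomes easy to compute. Applying backpropagation yields: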

Layer	Derivative of activation function $f'$	Delta	Input $z$ to this layer
3	$f'(z_i) = 2 z_i$	$f'(z_i) = 2 z_i$	$As - x$
2	$f'(z_i) = 1$	$(I^T \delta^{(3)}) \bullet 1$	$As$
1	$f'(z_i) = 1$	$(A^T \delta^{(2)}) \bullet 1$	$s$

Hence,

$\nabla_s F = A^T I^T 2(As - x) = A^T 2(As - x)$
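As a sanity check on Example 1, the deltas can be computed layer by layer and compared with finite differences. This sketch assumes $s$ is a single feature vector and uses made-up random values for $A$, $s$ and $x$. It confirms that the backpropagated result equals $A^T 2(As - x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # weights from layer 1 to layer 2
s = rng.standard_normal(3)        # network input (one feature vector, by assumption)
x = rng.standard_normal(4)        # the vector subtracted by layer 2's activation

# Forward pass: the input z to each layer.
z2 = A @ s        # input to layer 2
z3 = z2 - x       # input to layer 3 (identity weights)

# Backward pass: deltas for layers 3, 2 and 1.
delta3 = 2 * z3                        # f'(z) = 2z for the square activation
delta2 = (np.eye(4).T @ delta3) * 1    # identity weights, f' = 1
delta1 = (A.T @ delta2) * 1            # gradient with respect to the input s

def first_term(s_vec):
    return np.sum((A @ s_vec - x) ** 2)

# Central finite differences as an independent check.
h = 1e-6
numeric = np.array([
    (first_term(s + h * e) - first_term(s - h * e)) / (2 * h)
    for e in np.eye(3)
])
print(np.max(np.abs(delta1 - numeric)))
```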

Example 2: Smoothed topographic L1 sparsity penalty in sparse coding

Recall the smoothed topographic L1 sparsity penalty on $s$ in sparse coding:

$\sum \sqrt{V s s^T + \epsilon}$

where $V$ is the grouping matrix, $s$ is the feature matrix and $\epsilon$ is a constant.

We would like to find $\nabla_s \sum \sqrt{V s s^T + \epsilon}$. As above, let's see this term as an instantiation of a neural network:

The weights and activation functions of this network are as follows:

Layer	Weight	Activation function $f$
1	$I$	$f(z_i) = z_i^2$
2	$V$	$f(z_i) = z_i$
3	$I$	$f(z_i) = z_i + \epsilon$
4	N/A	$f(z_i) = z_i^{1/2}$

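To have $J(z^{(4)})$ equal the penalty, we can set $J(z^{(4)}) = \sum_k (z^{(4)}_k)^{1/2}$, the sum of the fourth layer's activations.

Once we see the penalty as a neural network, the gradient with respect to the input $s$ becomes easy to compute. Applying backpropagation yields: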

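Layer	Derivative of activation function $f'$	Delta	Input $z$ to this layer
4	$f'(z_i) = \frac{1}{2} z_i^{-1/2}$	$f'(z_i) = \frac{1}{2} z_i^{-1/2}$	$(V s s^T + \epsilon)$
3	$f'(z_i) = 1$	$(I^T \delta^{(4)}) \bullet 1$	$V s s^T$
2	$f'(z_i) = 1$	$(V^T \delta^{(3)}) \bullet 1$	$s s^T$
1	$f'(z_i) = 2 z_i$	$(I^T \delta^{(2)}) \bullet 2s$	$s$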

Hence,

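$\nabla_s \sum \sqrt{V s s^T + \epsilon} = I^T V^T I^T \frac{1}{2} (V s s^T + \epsilon)^{-1/2} \bullet 2s = V^T (V s s^T + \epsilon)^{-1/2} \bullet s$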
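The same kind of check works for Example 2. This sketch assumes $s$ is a single feature vector, reads "$s s^T$" as the element-wise square of $s$ (which is what the layer-1 square activation produces), and uses a made-up non-negative grouping matrix. The backpropagated result equals $(V^T (V s^2 + \epsilon)^{-1/2}) \bullet s$.

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((3, 5))        # hypothetical non-negative grouping matrix
s = rng.standard_normal(5)    # one feature vector (assumption)
eps_const = 0.1               # the smoothing constant epsilon

def penalty(s_vec):
    # Element-wise square, grouping by V, add epsilon, square root, sum.
    return np.sum(np.sqrt(V @ (s_vec * s_vec) + eps_const))

# Forward pass: input to layer 4.
z4 = V @ (s * s) + eps_const

# Backward pass: deltas for layers 4 down to 1.
delta4 = 0.5 * z4 ** -0.5        # derivative of the square root
delta3 = delta4 * 1              # identity weights, f' = 1
delta2 = (V.T @ delta3) * 1      # weights V, f' = 1
delta1 = delta2 * (2 * s)        # identity weights, f'(z) = 2z with z = s

# Central finite differences as an independent check.
h = 1e-6
numeric = np.array([
    (penalty(s + h * e) - penalty(s - h * e)) / (2 * h)
    for e in np.eye(5)
])
print(np.max(np.abs(delta1 - numeric)))
```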

Example 3: ICA reconstruction cost

Recall the independent component analysis (ICA) reconstruction cost term: $\lVert W^T W x - x \rVert_2^2$, where $W$ is the weight matrix and $x$ is the input.

We would like to find $\nabla_W \lVert W^T W x - x \rVert_2^2$, the derivative of the term with respect to the weight matrix, rather than the input as in the earlier two examples. We will still proceed similarly though, seeing this term as an instantiation of a neural network:

The weights and activation functions of this network are as follows:

Layer	Weight	Activation function $f$
1	$W$	$f(z_i) = z_i$
2	$W^T$	$f(z_i) = z_i$
3	$I$	$f(z_i) = z_i - x_i$
4	N/A	$f(z_i) = z_i^2$

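To have $J(z^{(4)})$ equal the cost term, we can set $J(z^{(4)}) = \sum_k (z^{(4)}_k)^2$, the sum of the fourth layer's activations.

Now that we can see the cost term as a neural network, we can try to compute its gradient with respect to $W$. The difficulty is that $W$ appears twice in the network. When a weight matrix is tied (repeated) like this, the gradient with respect to $W$ is the sum of the gradients with respect to each instance of $W$ in the network. With this in mind, we first work out the deltas: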

Layer	Derivative of activation function $f'$	Delta	Input $z$ to this layer
4	$f'(z_i) = 2 z_i$	$f'(z_i) = 2 z_i$	$(W^T W x - x)$
3	$f'(z_i) = 1$	$(I^T \delta^{(4)}) \bullet 1$	$W^T W x$
2	$f'(z_i) = 1$	$((W^T)^T \delta^{(3)}) \bullet 1$	$W x$
1	$f'(z_i) = 1$	$(W^T \delta^{(2)}) \bullet 1$	$x$

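To find the gradients of the cost term, which we write as $F$, with respect to $W$, first we find the gradients with respect to each instance of $W$ in the network.

With respect to $W^T$:

$\nabla_{W^T} F = \delta^{(3)} a^{(2)T} = 2(W^T W x - x)(W x)^T$

With respect to $W$:

$\nabla_W F = \delta^{(2)} a^{(1)T} = W \, 2(W^T W x - x) \, x^T$

Taking sums, noting that we need to transpose the gradient with respect to $W^T$ to get the gradient with respect to $W$, yields the final gradient with respect to $W$ (pardon the slight abuse of notation here):

$\nabla_W F = \nabla_W F + (\nabla_{W^T} F)^T = W \, 2(W^T W x - x) \, x^T + 2(W x)(W^T W x - x)^T$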
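The tied-weights rule can be checked numerically. In this sketch the shapes and random values are assumptions for illustration. It computes the gradient for each instance of $W$, sums them (transposing the $W^T$ part), and compares the result against finite differences of $\lVert W^T W x - x \rVert_2^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

def cost(W_mat):
    r = W_mat.T @ W_mat @ x - x
    return np.sum(r ** 2)

# Forward pass: activations and the input to the last layer.
a1 = x                  # activation of layer 1 (identity)
a2 = W @ x              # activation of layer 2 (identity)
z4 = W.T @ a2 - x       # input to layer 4

# Backward pass.
delta4 = 2 * z4
delta3 = delta4         # identity weights, f' = 1
delta2 = W @ delta3     # (W^T)^T delta3, f' = 1

# Gradient for each instance of W, then the sum.
grad_WT = np.outer(delta3, a2)          # with respect to the W^T instance
grad_W_first = np.outer(delta2, a1)     # with respect to the first W instance
grad_W = grad_W_first + grad_WT.T

# Central finite differences over every entry of W.
h = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = h
        numeric[i, j] = (cost(W + E) - cost(W - E)) / (2 * h)
print(np.max(np.abs(grad_W - numeric)))
```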

Convolutional Neural Networks backpropagation: from intuition to derivation

Disclaimer: It is assumed that the reader is familiar with terms such as Multilayer Perceptron, delta errors or backpropagation. If not, it is recommended to read, for example, chapter 2 of the free online book ‘Neural Networks and Deep Learning’ by Michael Nielsen.

Convolutional Neural Networks (CNN) are now a standard way of image classification: there are publicly accessible deep learning frameworks, trained models and services. It’s more time consuming to install stuff like caffe than to perform state-of-the-art object classification or detection. We also have many ways of gaining knowledge: there is a large number of deep learning courses/MOOCs, free e-books, and even direct ways of reaching the strongest Deep/Machine Learning minds such as Yoshua Bengio, Andrew Ng or Yann LeCun via Quora, Facebook or G+.

Nevertheless, when I wanted to get deeper insight into CNNs, I could not find a “CNN backpropagation for dummies”. I kept running into statements like: “If you understand backpropagation in standard neural networks, there should not be a problem with understanding it in CNN” or “All things are nearly the same, except matrix multiplications are replaced by convolutions”. And of course I saw tons of ready-made equations.

It was a little consoling when I found out that I am not alone, for example: “Hello, when computing the gradients CNN, the weights need to be rotated, Why ?”

The answer to the question above, about why the weights need to be rotated when computing gradients, is the outcome of this long post.

We start from a multilayer perceptron and count the delta errors by hand:

[Figure: delta errors in a multilayer perceptron]

We see in the picture above that the delta error of a neuron is proportional to the deltas from the next layer, scaled by the connecting weights.

But how do we connect the concept of an MLP with a Convolutional Neural Network? Let’s play with the MLP:

[Figure: Transforming Multilayer Perceptron to Convolutional Neural Network]

If you are not sure that after cutting connections and sharing weights we get a one-layer Convolutional Neural Network, I hope that the picture below will convince you:

[Figure: Feedforward in CNN is identical with convolution operation]

The idea behind this figure is to show that such a neural network configuration is identical to a 2D convolution operation, and that the weights are just filters (also called kernels, convolution matrices, or masks).

Now we can come back to computing gradients by hand, but from now on we will focus only on the CNN. Let’s begin:

[Figure: Backpropagation also results with convolution]

No magic here: in the “blue” layer we have just summed the gradients from the “orange” layer, scaled by the weights. It is the same process as in MLP backpropagation. However, in the standard approach we talk about dot products, and here we have … yup, convolution again:

[Figure: backpropagation expressed as a convolution]

Yeah, it is a slightly different convolution than in the previous (forward) case. There we did a so-called valid convolution, while here we do a full convolution (more about nomenclature here). What is more, we rotate our kernel by 180 degrees. But still, we are talking about convolution!

Now, I have some good news and some bad news:

You see (BTW, sorry for the pictures' aesthetics :) ) that matrix dot products are replaced by convolution operations both in feedforward and in backpropagation.
You know that seeing something and understanding something … yup, we are now going to get our hands dirty and prove the above statement.

Before going further, I recommend reading chapter 2 of M. Nielsen's book, already mentioned in the disclaimer. I tried to make all quantities consistent with Michael's work.

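In the standard MLP, we can define an error of neuron $j$ as:

$\delta^l_j = \frac{\partial C}{\partial z^l_j}$

where $z^l_j$ is just:

$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$

and for clarity, $a^l_j = \sigma(z^l_j)$, where $\sigma$ is an activation function such as sigmoid, hyperbolic tangent or relu.

But here we do not have an MLP but a CNN, and matrix multiplications are replaced by convolutions, as we discussed before. So instead of $z_j$ we have $z_{x,y}$:

$z^{l+1}_{x,y} = w^{l+1} * \sigma(z^l_{x,y}) + b^{l+1}_{x,y} = \sum_a \sum_b w^{l+1}_{a,b} \sigma(z^l_{x-a,y-b}) + b^{l+1}_{x,y}$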

The above equation is just the convolution operation of the feedforward phase, illustrated in the picture above titled ‘Feedforward in CNN is identical with convolution operation’.
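A minimal NumPy sketch of this feedforward convolution. The helper names `sigmoid` and `conv2d_valid`, the 4x4 input, the 2x2 kernel and the scalar bias are all assumptions for illustration. Indices are shifted so that only positions where the kernel fully overlaps the input are kept (the "valid" case).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_valid(a, w):
    """True 2D convolution of a with kernel w, keeping fully overlapping positions."""
    kh, kw = w.shape
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for p in range(kh):
                for q in range(kw):
                    # w[p, q] multiplies the input shifted back by (p, q),
                    # mirroring the index pattern (x - a, y - b) in the sum.
                    out[x, y] += w[p, q] * a[x + kh - 1 - p, y + kw - 1 - q]
    return out

# Hypothetical layer: 4x4 pre-activations, 2x2 kernel, scalar bias.
z_l = np.arange(16, dtype=float).reshape(4, 4) / 10.0
w = np.array([[1.0, 0.0], [0.0, -1.0]])
b = 0.5
z_next = conv2d_valid(sigmoid(z_l), w) + b
print(z_next.shape)
```

Because the kernel index runs against the input index, this is a convolution rather than a cross-correlation. Many deep learning libraries implement the latter and still call it convolution.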

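Now we can get to the point and answer the question “Hello, when computing the gradients CNN, the weights need to be rotated, Why ?”

We start from the statement:

$\delta^l_{x,y} = \frac{\partial C}{\partial z^l_{x,y}} = \sum_{x'} \sum_{y'} \frac{\partial C}{\partial z^{l+1}_{x',y'}} \frac{\partial z^{l+1}_{x',y'}}{\partial z^l_{x,y}}$

We know that $z^l_{x,y}$ is related to $z^{l+1}_{x',y'}$, which is indirectly shown in the picture above titled ‘Backpropagation also results with convolution’. So the sums are the result of the chain rule. Let’s move on:

$\frac{\partial C}{\partial z^l_{x,y}} = \sum_{x'} \sum_{y'} \delta^{l+1}_{x',y'} \frac{\partial \left( \sum_a \sum_b w^{l+1}_{a,b} \sigma(z^l_{x'-a,y'-b}) + b^{l+1}_{x',y'} \right)}{\partial z^l_{x,y}}$

The first term is replaced by the definition of the error, while the second has become large because we substituted the expression for $z^{l+1}_{x',y'}$. However, we do not have to fear this big monster: all components of the sums equal 0, except the ones indexed by $x = x' - a$ and $y = y' - b$. So:

$\sum_{x'} \sum_{y'} \delta^{l+1}_{x',y'} w^{l+1}_{a,b} \sigma'(z^l_{x,y})$

If $x = x' - a$ and $y = y' - b$ then it is obvious that $a = x' - x$ and $b = y' - y$, so we can reformulate the above equation to:

$\sum_{x'} \sum_{y'} \delta^{l+1}_{x',y'} w^{l+1}_{x'-x,y'-y} \sigma'(z^l_{x,y})$

OK, our last equation is just …

$\delta^{l+1} * w^{l+1}_{-x,-y} \sigma'(z^l_{x,y})$

Where is the rotation of weights? Actually $ROT180(w^{l+1}_{x,y}) = w^{l+1}_{-x,-y}$.

So the answer to the question “Hello, when computing the gradients CNN, the weights need to be rotated, Why ?” is simple: the rotation of the weights just results from the derivation of the delta error in a Convolutional Neural Network.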
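The claim can be checked numerically with a small sketch. Everything here is an assumption for illustration: the helper names, the sigmoid activation, the 5x5 and 3x3 sizes, and a made-up cost $C = \sum c \bullet z^{l+1}$ chosen so that $\delta^{l+1} = c$. The delta for layer $l$ is computed as a full convolution of $\delta^{l+1}$ with the 180-degree rotated kernel, times $\sigma'(z^l)$, and compared against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_valid(a, k):
    """True 2D convolution (flipped kernel) over fully overlapping positions."""
    kh, kw = k.shape
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Flipping the kernel turns the sliding-window product into a convolution.
            out[x, y] = np.sum(a[x:x + kh, y:y + kw] * k[::-1, ::-1])
    return out

def conv2d_full(d, k):
    """Full convolution: zero-pad so every partial overlap is included."""
    kh, kw = k.shape
    padded = np.pad(d, ((kh - 1, kh - 1), (kw - 1, kw - 1)), mode="constant")
    return conv2d_valid(padded, k)

rng = np.random.default_rng(3)
z = rng.standard_normal((5, 5))   # z^l
w = rng.standard_normal((3, 3))   # w^{l+1}
c = rng.standard_normal((3, 3))   # hypothetical readout weights defining the cost

def cost(z_l):
    # C = sum(c * z^{l+1}), so delta^{l+1} = dC/dz^{l+1} = c.
    return np.sum(c * conv2d_valid(sigmoid(z_l), w))

delta_next = c
sig = sigmoid(z)
# Full convolution with the rotated kernel, times sigma'(z^l).
delta = conv2d_full(delta_next, np.rot90(w, 2)) * sig * (1 - sig)

# Central finite differences over every entry of z^l.
h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        E = np.zeros_like(z)
        E[i, j] = h
        numeric[i, j] = (cost(z + E) - cost(z - E)) / (2 * h)
print(np.max(np.abs(delta - numeric)))
```

The bias is left out because it does not affect the derivative with respect to $z^l$.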

OK, we are really close to the end. One more ingredient of the backpropagation algorithm is the update of the weights, $\frac{\partial C}{\partial w^l_{a,b}}$: