2. MLP: Construction, Forward Pass, Backward Pass
%matplotlib inline
Lab1 - Multilayer Perceptrons
In this lab, we go through three MLP examples, covering both an implementation from scratch and the standard library.
Before you get started, please install numpy, torch, and torchvision in advance.
We suggest you run the following cells and study the internal mechanism of the neural networks. Moreover, it is also highly recommended that you tune the hyper-parameters to gain better results.
Some insights on dropout and Xavier initialization have been adapted from Mu Li's course Dive into Deep Learning.
First of all, we use the MNIST dataset as our example.
For simplicity, we use the premade dataset powered by torchvision, therefore we don't have to worry about data preprocessing : )
Before moving on, please check the basic concepts of PyTorch's Dataset and DataLoader; a minimal loading sketch follows.
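The sketch below shows one common way to set this up; the ./data root and the batch sizes are illustrative choices, not fixed by the lab.

```python
# A minimal sketch, assuming torchvision is installed and ./data is writable.
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # PIL image -> float tensor of shape (1, 28, 28) in [0, 1]

train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)
```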
Warm-up: numpy
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x using cross-entropy loss.
This implementation uses numpy to manually compute the forward pass, loss, and
backward pass.
A numpy array is a generic n-dimensional array; it does not know anything about
deep learning or gradients or computational graphs, and is just a way to perform
generic numeric computations.
Calculating gradients manually is prone to error; be careful when doing it yourself.
If you find it difficult, please refer to these sites (link1, link2).
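As a concrete sketch of what the warm-up cells compute (this is not the lab's original code; the dimensions, learning rate, and random data below are illustrative assumptions), a one-hidden-layer, bias-free ReLU network trained with softmax cross-entropy might look like this:

```python
# A minimal numpy sketch: manual forward pass, loss, and backward pass.
import numpy as np

N, D_in, H, D_out = 64, 784, 100, 10   # batch size, input dim, hidden dim, classes

rng = np.random.default_rng(0)
x = rng.standard_normal((N, D_in))
y = rng.integers(0, D_out, size=N)      # integer class labels

w1 = rng.standard_normal((D_in, H)) * 0.01
w2 = rng.standard_normal((H, D_out)) * 0.01
lr = 1e-1                               # illustrative learning rate

for t in range(100):
    # Forward pass: affine -> ReLU -> affine -> softmax cross-entropy.
    h = x @ w1
    h_relu = np.maximum(h, 0)
    scores = h_relu @ w2
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()

    # Backward pass: d(loss)/d(scores) = (probs - one_hot) / N.
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1.0
    dscores /= N
    grad_w2 = h_relu.T @ dscores
    dh_relu = dscores @ w2.T
    dh = dh_relu * (h > 0)                                # ReLU gate
    grad_w1 = x.T @ dh

    # Plain SGD update.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
```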
PyTorch: Tensors and autograd
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing cross-entropy loss.
This implementation computes the forward pass using operations on PyTorch
Tensors, and uses PyTorch autograd to compute gradients.
A PyTorch Tensor represents a node in a computational graph. If x is a
Tensor with x.requires_grad=True, then x.grad is another Tensor
holding the gradient of some scalar value (e.g., the loss) with respect to x.
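A hedged sketch of the same network with autograd (again, not the lab's original cells; the sizes, random data, and learning rate are made up for illustration):

```python
# A minimal sketch: the same network, but gradients come from autograd
# instead of hand-derived formulas.
import torch

N, D_in, H, D_out = 64, 784, 100, 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(N, D_in, device=device)
y = torch.randint(0, D_out, (N,), device=device)

# Leaf tensors with requires_grad=True so autograd tracks operations on them.
w1 = (0.01 * torch.randn(D_in, H, device=device)).requires_grad_()
w2 = (0.01 * torch.randn(H, D_out, device=device)).requires_grad_()
lr = 0.1  # illustrative

for t in range(100):
    # Forward pass: affine -> ReLU -> affine; autograd records the graph.
    scores = x.mm(w1).clamp(min=0).mm(w2)
    loss = torch.nn.functional.cross_entropy(scores, y)

    # Backward pass: populates w1.grad and w2.grad.
    loss.backward()

    # SGD update, done outside the graph; then reset accumulated gradients.
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
```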
Let's think briefly about what we expect from a good statistical model. Obviously we want it to do well on unseen test data. One way we can accomplish this is by asking for what amounts to a 'simple' model. Simplicity can come in the form of a small number of dimensions, which is what we did when discussing fitting a function with monomial basis functions. Simplicity can also come in the form of a small norm for the basis functions. This is what led to weight decay and regularization. Yet a third way to impose some notion of simplicity is that the function should be robust under modest changes in the input. For instance, when we classify images, we would expect that alterations of a few pixels are mostly harmless.
In fact, this notion was formalized by Bishop in 1995, when he proved that Training with Input Noise is Equivalent to Tikhonov Regularization. That is, he connected the notion of having a smooth (and thus simple) function with one that is resilient to perturbations in the input. Fast forward to 2014. Given the complexity of deep networks with many layers, enforcing smoothness just on the input misses out on what is happening in subsequent layers. The ingenious idea of Srivastava et al., 2014 was to apply Bishop's idea to the internal layers of the network, too, namely to inject noise into the computational path of the network while it's training.
A key challenge in this context is how to add noise without introducing undue bias. In terms of inputs $\mathbf{x}$, this is relatively easy to accomplish: simply add some zero-mean noise $\epsilon$ and use the perturbed data $\mathbf{x} + \epsilon$ during training. A key property is that in expectation $E[\mathbf{x} + \epsilon] = \mathbf{x}$. For intermediate layers, though, this might not be quite so desirable since the scale of the noise might not be appropriate. The alternative is to perturb coordinates as follows:
$$
\begin{aligned}
h' =
\begin{cases}
0 & \text{with probability } p \\
\frac{h}{1-p} & \text{otherwise}
\end{cases}
\end{aligned}
$$
By design, the expectation remains unchanged, i.e. $E[h'] = h$. This idea is at the heart of dropout, where intermediate activations are replaced by a random variable with matching expectation. The name 'dropout' arises from the notion that some neurons 'drop out' of the computation for the purpose of computing the final result. During training, we replace each intermediate activation $h$ with the random variable $h'$ defined above.
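As a sketch of this rule (assuming PyTorch tensors; the function name dropout_layer is our own, not part of the lab), dropout with the rescaling above can be written as:

```python
# Zero each activation with probability p and rescale survivors by 1/(1-p),
# so that the expectation of the output matches the input.
import torch

def dropout_layer(h: torch.Tensor, p: float) -> torch.Tensor:
    assert 0.0 <= p <= 1.0
    if p == 1.0:
        return torch.zeros_like(h)
    if p == 0.0:
        return h
    mask = (torch.rand_like(h) > p).float()   # keep each entry with probability 1-p
    return mask * h / (1.0 - p)

# Sanity check: roughly half the entries are zeroed, the rest doubled.
h = torch.ones(4, 8)
print(dropout_layer(h, p=0.5))
```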
PyTorch: Standard APIs
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing cross-entropy loss.
This implementation uses the nn package from PyTorch to build the network.
PyTorch autograd makes it easy to define computational graphs and take gradients,
but raw autograd can be a bit too low-level for defining complex neural networks;
this is where the nn package can help. The nn package defines a set of Modules,
which you can think of as a neural network layer that produces output from
its input and may have some trainable weights.
NOTICE:
In this section, we use the built-in SGD optimizer with an additional hyper-parameter, momentum.
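A minimal sketch of this setup with the standard nn and optim APIs; the architecture and hyper-parameters (256 hidden units, p=0.5, lr=0.1, momentum=0.9) are illustrative choices, not the lab's exact values.

```python
# Build the MLP from nn Modules and train it with SGD plus momentum.
import torch
from torch import nn

model = nn.Sequential(
    nn.Flatten(),            # MNIST images (1, 28, 28) -> 784-dim vectors
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),       # dropout as discussed above
    nn.Linear(256, 10),
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One optimization step on a mini-batch from the MNIST DataLoader."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```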
In the previous sections, e.g. in “Concise Implementation of Linear Regression”, we used net.initialize(init.Normal(sigma=0.01)) as a way to pick normally distributed random numbers as initial values for the weights. If the initialization method is not specified, such as net.initialize(), MXNet will use the default random initialization method: each element of the weight parameter is randomly sampled from a uniform distribution and the bias parameters are all set to zero. Both choices tend to work quite well in practice for moderate problem sizes.
Let's look at the scale distribution of the activations of the hidden units $h_i$ for some layer. They are given by $h_i = \sum_{j=1}^{n_\mathrm{in}} W_{ij} x_j$.
The weights $W_{ij}$ are all drawn independently from the same distribution. Let's furthermore assume that this distribution has zero mean and variance $\sigma^2$ (this doesn't mean that the distribution has to be Gaussian, just that mean and variance need to exist). We don't really have much control over the inputs $x_j$ into the layer, but let's proceed with the somewhat unrealistic assumption that they also have zero mean and variance $\gamma^2$ and that they're independent of $W_{ij}$. In this case we can compute the mean and variance of $h_i$ as follows:
$$
\begin{aligned}
\mathbf{E}[h_i] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W_{ij} x_j] = 0 \\
\mathbf{E}[h_i^2] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij} x^2_j] \\
& = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij}] \mathbf{E}[x^2_j] \\
& = n_\mathrm{in} \sigma^2 \gamma^2
\end{aligned}
$$
One way to keep the variance fixed is to set $n_\mathrm{in} \sigma^2 = 1$. Now consider backpropagation. There we face a similar problem, albeit with gradients being propagated from the top layers. That is, instead of $\mathbf{W} \mathbf{x}$ we need to deal with $\mathbf{W}^\top \mathbf{g}$, where $\mathbf{g}$ is the incoming gradient from the layer above. Using the same reasoning as for forward propagation, we see that the gradients' variance can blow up unless $n_\mathrm{out} \sigma^2 = 1$. This leaves us in a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to satisfy
$$
\begin{aligned}
\frac{1}{2} (n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \quad \text{or equivalently} \quad
\sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}
\end{aligned}
$$
This is the reasoning underlying the eponymous Xavier initialization, proposed by Xavier Glorot and Yoshua Bengio in 2010. It works well enough in practice. For Gaussian random variables the Xavier initialization picks a normal distribution with zero mean and variance $\sigma^2 = 2/(n_\mathrm{in} + n_\mathrm{out})$.
For uniformly distributed random variables $U[-a, a]$, note that their variance is given by $a^2/3$. Plugging $a^2/3$ into the condition on $\sigma^2$ yields that we should initialize uniformly with $U\left[-\sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}, \sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}\right]$.
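In PyTorch, a sketch of applying this initialization with the built-in helpers (init_xavier is our own helper name, and the layer sizes are illustrative) could look like:

```python
# Xavier-uniform weights, zero biases, for every nn.Linear module.
import math
import torch
from torch import nn

def init_xavier(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        # U[-a, a] with a = sqrt(6 / (n_in + n_out)), as derived above.
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

layer = nn.Linear(784, 256)
init_xavier(layer)

# The empirical std should be close to sqrt(2 / (n_in + n_out)).
print(layer.weight.std().item(), math.sqrt(2 / (784 + 256)))
```

With the nn.Sequential model from the previous section, model.apply(init_xavier) would initialize every linear layer the same way.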