05 Neural Networks
Neural Networks
The ‘one learning algorithm’ hypothesis
- Neuron-rewiring experiments
Model Representation
Define
- Sigmoid (logistic) activation function
- bias unit
- input layer
- output layer
- hidden layer
- \(a_i^{(j)}\): ‘activation’ of unit \(i\) in layer \(j\)
- \(\Theta^{(j)}\): matrix of weights controlling the function mapping from layer \(j\) to layer \(j + 1\).
Calculate
\[a^{(j)} = g(z^{(j)})\]
\[g(x) = \frac{1}{1 + e^{-x}}\]
\[z^{(j+1)} = \Theta^{(j)} a^{(j)}\]
\[h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})\]
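A minimal vectorized sketch of these forward-propagation formulas in Octave, assuming a 3-layer network and illustrative names X, Theta1, Theta2:

% Forward propagation for a 3-layer network (input -> hidden -> output), vectorized over examples.
% X is an m x n matrix of examples; Theta1, Theta2 map layer 1 -> 2 and layer 2 -> 3.
g  = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation
m  = size(X, 1);
a1 = [ones(m, 1) X];            % add the bias unit to the input layer
z2 = a1 * Theta1';              % z^(2) = Theta^(1) a^(1)
a2 = [ones(m, 1) g(z2)];        % hidden-layer activations, plus bias unit
z3 = a2 * Theta2';
a3 = g(z3);                     % h_Theta(x): one row of outputs per example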
Cost Function
\[
J(\Theta) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big((h_\Theta (x^{(i)}))_k\big) + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( \Theta_{j,i}^{(l)} \big)^2
\]
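A vectorized sketch of this cost in Octave, assuming a3 from the forward-propagation sketch above holds \(h_\Theta(x^{(i)})\) row by row and Y is an m x K one-hot label matrix (illustrative names):

% Regularized cost: a3 (m x K) holds h_Theta(x), Y (m x K) is the one-hot labels.
cost_term = -sum(sum(Y .* log(a3) + (1 - Y) .* log(1 - a3))) / m;
% Regularization skips the first column of each Theta (the bias weights).
reg_term  = (lambda / (2 * m)) * ...
            (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));
J = cost_term + reg_term;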
Back-propagation Algorithm
Algorithm
- Assume we have already computed all the \(a^{(l)}\) and \(z^{(l)}\) via forward propagation
- Set \(\Delta^{(l)}_{i,j} := 0\) for all \((l, i, j)\)
- Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(t)}\), where \(y^{(t)}_k \in \{0, 1\}\) indicates whether the current training example belongs to class \(k\) (\(y^{(t)}_k = 1\)) or to a different class (\(y^{(t)}_k = 0\))
- For the hidden layers \(l = L - 1\) down to \(l = 2\), set
\[
\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) .* g'(z^{(l)})
\]
- Remember to remove the bias error term \(\delta_0^{(l)}\) with:
delta(2:end)
\[
\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T
\]
- Gradient:
\[
\frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta) = D^{(l)}_{i,j} = \frac{1}{m}\Delta^{(l)}_{i,j} +
\begin{cases} \frac{\lambda}{m}\Theta^{(l)}_{i,j}, & \text{if } j \geq 1 \\ 0, & \text{if } j = 0 \end{cases}
\]
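A minimal sketch of these steps in Octave for a 3-layer network, vectorized over all examples and reusing a1, a2, a3, z2 from the forward-propagation sketch above (the names and shapes are assumptions, not exact course starter code):

% Backpropagation, vectorized over all m examples; Y is the m x K one-hot label matrix.
g = @(z) 1 ./ (1 + exp(-z));
sigmoidGradient = @(z) g(z) .* (1 - g(z));   % g'(z) = g(z)(1 - g(z))
delta3 = a3 - Y;                             % delta^(L) = a^(L) - y
delta2 = (delta3 * Theta2) .* [ones(m, 1) sigmoidGradient(z2)];
delta2 = delta2(:, 2:end);                   % remove delta_0^(2) (the bias term)
Delta1 = delta2' * a1;                       % Delta^(l) += delta^(l+1) (a^(l))^T
Delta2 = delta3' * a2;
% Gradients; the bias column (j = 0) is not regularized.
Theta1_grad = Delta1 / m + (lambda / m) * [zeros(size(Theta1, 1), 1) Theta1(:, 2:end)];
Theta2_grad = Delta2 / m + (lambda / m) * [zeros(size(Theta2, 1), 1) Theta2(:, 2:end)];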
Gradient Checking
- \[
\frac{d}{d\Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}
\]
- Use a small value for \(\epsilon\), such as \(\epsilon = 10^{-4}\)
- Check that gradApprox \(\approx\) deltaVector (see the comparison sketch after the code below)
epsilon = 1e-4;
for i = 1 : n
  % Perturb the i-th parameter up and down by epsilon.
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  % Two-sided numerical estimate of the i-th partial derivative.
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;
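One way to compare the two gradients (a suggestion, with deltaVector as an assumed name for the unrolled back-propagation gradient) is the relative difference; values around \(10^{-9}\) or smaller suggest the two implementations agree:

% Relative difference between the numerical and back-propagation gradients.
relDiff = norm(gradApprox - deltaVector) / norm(gradApprox + deltaVector);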
Rolling and Unrolling
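Advanced optimizers expect a single parameter vector, so the weight matrices are unrolled into one vector and reshaped back when needed. A minimal Octave sketch, assuming a 3-layer network with illustrative size names input_layer_size, hidden_layer_size, and num_labels:

% Unroll: concatenate all weight matrices into one column vector.
thetaVector = [Theta1(:); Theta2(:)];
% Roll back: reshape the vector into the original weight matrices.
Theta1 = reshape(thetaVector(1 : hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(thetaVector(hidden_layer_size * (input_layer_size + 1) + 1 : end), ...
                 num_labels, hidden_layer_size + 1);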
Random Initialization
Theta = rand(n, m) * (2 * INIT_EPSILON) - INIT_EPSILON;
- Initialize \( \Theta^{(l)}_{ij} \in [-\epsilon, \epsilon] \)
- Otherwise, if we initialize all theta weights to zero, all nodes will update to the same value repeatedly when we back-propagate (symmetry is never broken).
- One effective strategy for choosing \(\epsilon_{init}\) is to base it on the number of units in the network. A good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}\), where \(L_{in}\) and \(L_{out}\) are the numbers of units in the layers on either side of \(\Theta^{(l)}\) (a sketch implementing this follows).
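A sketch combining the two points above for one layer:

% Randomly initialize one layer's weights in [-epsilon_init, epsilon_init] to break symmetry.
% L_in and L_out are the numbers of units feeding into and out of this layer.
epsilon_init = sqrt(6) / sqrt(L_in + L_out);
Theta = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;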
Training a Neural Network
- Randomly initialize weights
Theta = rand(n, m) * (2 * epsilon) - epsilon;
- Implement forward propagation to get \(h_\Theta(x^{(i)})\) for any \(x^{(i)}\)
- Implement code to compute cost function \(J(\Theta)\)
- Implement back-prop to compute the partial derivatives \( \frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(l)}} \)
- \( g'(z) = \frac{d}{dz} g(z) = g(z)(1 - g(z)) \)
- \( \mathrm{sigmoid}(z) = g(z) = \frac{1}{1 + e^{-z}} \)
- Use gradient checking to compare \( \frac{\partial J(\Theta)}{\partial \Theta_{jk}^{(l)}} \) computed using back-propagation with the numerical estimate of the gradient of \(J(\Theta)\); then disable the gradient checking code
- Use gradient descent or an advanced optimization method with back-propagation to try to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\) (a minimal sketch using fminunc follows this list)
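A minimal sketch of the optimization step, assuming nnCostFunction is a user-written function that returns \(J(\Theta)\) and the unrolled gradient for an unrolled parameter vector (the name and signature follow the style of the course exercises, not a built-in API):

% Minimize J(Theta) with an advanced optimizer over the unrolled parameters.
options = optimset('GradObj', 'on', 'MaxIter', 50);
costFunction = @(t) nnCostFunction(t, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);
[optTheta, cost] = fminunc(costFunction, initial_nn_params, options);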