05 Neural Networks


Neural Networks

The ‘one learning algorithm’ hypothesis

  1. Neuron-rewiring experiments

Model Representation

Define
  1. Sigmoid (logistic) activation function
  2. bias unit
  3. input layer
  4. output layer
  5. hidden layer
  6. \(a_i^{(j)}\): ‘activation’ of unit \(i\) in layer \(j\)
  7. \(\Theta^{(j)}\): matrix of weights controlling the function mapping from layer \(j\) to layer \(j + 1\)
Calculate

\[a^{(j)} = g(z^{(j)})\]
\[g(z) = \frac{1}{1 + e^{-z}}\]
\[z^{(j + 1)} = \Theta^{(j)} a^{(j)}\]
\[h_\Theta(x) = a^{(j + 1)} = g(z^{(j + 1)})\]
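
A minimal Octave sketch of these forward-propagation equations for a three-layer network; the weight matrices Theta1 and Theta2 and the example matrix X (one training example per row) are assumed inputs, with names borrowed from the course exercises:

function h = forwardProp(Theta1, Theta2, X)
  m = size(X, 1);
  g = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation

  a1 = [ones(m, 1) X];           % add bias unit to the input layer
  z2 = a1 * Theta1';
  a2 = [ones(m, 1) g(z2)];       % add bias unit to the hidden layer
  z3 = a2 * Theta2';
  h  = g(z3);                    % h_Theta(x) = a^(3)
end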

Cost Function

\[
J(\Theta) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y^{(i)}_k \log \left( (h_\Theta (x^{(i)}))_k \right) + (1 - y^{(i)}_k) \log \left( 1 - (h_\Theta(x^{(i)}))_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{j,i}^{(l)} \right)^2
\]
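
A hedged Octave sketch of this cost, assuming h (m x K) holds the outputs of forward propagation, Y (m x K) is a one-hot encoding of the labels, Theta1/Theta2 are the weight matrices, and m and lambda are as in the formula:

% Unregularized cross-entropy term, summed over examples and classes.
unreg = -(1 / m) * sum(sum(Y .* log(h) + (1 - Y) .* log(1 - h)));

% Regularization skips the bias column (first column) of each weight matrix.
reg = (lambda / (2 * m)) * ...
      (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));

J = unreg + reg;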

Back-propagation Algorithm
Algorithm
  1. Assume we have already computed all the \(a^{(l)}\) and \(z^{(l)}\) via forward propagation
  2. Set \(\Delta^{(l)}_{i,j} := 0\) for all \((l, i, j)\)
  3. Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(t)}\), where \(y^{(t)}_k \in \{0, 1\}\) indicates whether the current training example belongs to class \(k\) (\(y^{(t)}_k = 1\)) or to a different class (\(y^{(t)}_k = 0\))
  4. For the hidden layers \(l = L - 1\) down to \(2\), set
    \[
    \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l + 1)} \; .* \; g'(z^{(l)})
    \]
  5. Remember to remove the bias error term \(\delta_0^{(l)}\) (e.g. delta = delta(2:end)), then accumulate
    \[
    \Delta^{(l)} = \Delta^{(l)} + \delta^{(l + 1)} (a^{(l)})^T
    \]
  6. Gradient
    \[
    \frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta) = D^{(l)}_{i,j} = \frac{1}{m} \Delta^{(l)}_{i,j} +
    \begin{cases} \frac{\lambda}{m} \Theta^{(l)}_{i,j}, & \text{if } j \geq 1 \\ 0, & \text{if } j = 0 \end{cases}
    \]
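
A sketch of the whole loop above for a three-layer network, under the same assumptions as the forward-propagation and cost sketches (X, Y one-hot, Theta1, Theta2, m, lambda); sigmoid and sigmoidGradient are assumed helper functions, as in the course exercises:

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for t = 1:m
  % Forward pass for example t.
  a1 = [1; X(t, :)'];
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);

  % Backward pass: output-layer error, then hidden-layer error.
  delta3 = a3 - Y(t, :)';
  delta2 = (Theta2' * delta3) .* [1; sigmoidGradient(z2)];
  delta2 = delta2(2:end);          % drop the bias error delta_0

  % Accumulate.
  Delta2 = Delta2 + delta3 * a2';
  Delta1 = Delta1 + delta2 * a1';
end

% Gradients, regularizing all but the bias column.
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;
Theta1_grad(:, 2:end) += (lambda / m) * Theta1(:, 2:end);
Theta2_grad(:, 2:end) += (lambda / m) * Theta2(:, 2:end);
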
Gradient Checking
  1. \[
    \frac{d}{d\Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}
    \]
  2. A small value for \(\epsilon\) such as \(\epsilon = 10^{-4}\)
  3. Check that gradApprox \(\approx\) deltaVector (the gradient computed by back-propagation)

  4. Octave implementation:

epsilon = 1e-4;
for i = 1 : n
    % Perturb only the i-th parameter in each direction.
    thetaPlus = theta;
    thetaPlus(i) += epsilon;
    thetaMinus = theta;
    thetaMinus(i) -= epsilon;
    % Two-sided numerical estimate of the i-th partial derivative.
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;
Rolling and Unrolling
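
A minimal sketch of unrolling the weight matrices into one parameter vector for an optimizer, and reshaping them back inside the cost function; the 25 x 401 and 10 x 26 sizes are only illustrative:

% Unroll into a single column vector.
nn_params = [Theta1(:); Theta2(:)];

% Roll back up into matrices.
Theta1 = reshape(nn_params(1:25*401), 25, 401);
Theta2 = reshape(nn_params(25*401+1:end), 10, 26);
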
Random Initialization
Theta = rand(n, m) * (2 * INIT_EPSILON) - INIT_EPSILON;
  1. Initialize \( \Theta^{(l)}_{ij} \in [-\epsilon, \epsilon] \)
  2. Otherwise, if we initialize all theta weights to zero, all nodes will update to the same value repeatedly when we back-propagate (symmetry is never broken).
  3. One effective strategy for choosing \(\epsilon_{init}\) is to base it on the number of units in the network. A good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}\).
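
A sketch of this heuristic as an initialization routine; the function name follows the course exercise, with L_in and L_out the fan-in and fan-out of the layer:

function W = randInitializeWeights(L_in, L_out)
  epsilon_init = sqrt(6) / sqrt(L_in + L_out);
  % One row per unit in the next layer, one column per input plus the bias.
  W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
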
Training a Neural Network
  1. Randomly initialize weights
    Theta = rand(n, m) * (2 * epsilon) - epsilon;
  2. Implement forward propagation to get \(h_\Theta(x^{(i)})\) for any \(x^{(i)}\)
  3. Implement code to compute the cost function \(J(\Theta)\)
  4. Implement back-propagation to compute the partial derivatives \( \frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta) \)

    • \( g'(z) = \frac{d}{dz} g(z) = g(z)(1 - g(z)) \)
    • \( \mathrm{sigmoid}(z) = g(z) = \frac{1}{1 + e^{-z}} \)
  5. Use gradient checking to compare \( \frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta) \) computed using back-propagation vs. a numerical estimate of the gradient of \(J(\Theta)\)
    Then disable the gradient checking code

  6. Use gradient descent or an advanced optimization method with back-propagation to try to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\)
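
A hedged sketch of step 6 using Octave's fminunc on the unrolled parameters; nnCostFunction is assumed to return both \(J(\Theta)\) and the unrolled gradient (as in the course exercises), and the layer sizes, lambda, and initial_nn_params are illustrative names:

% Tell the optimizer that the cost function also returns the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 400);

costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);

[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);

% Reshape nn_params back into Theta1 and Theta2 before making predictions.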
