A review of gradient descent optimization methods

Posted by gaoqichao


Suppose we are going to optimize a parameterized function \(J(\theta)\), where \(\theta \in \mathbb{R}^d\); for example, \(\theta\) could be the parameters of a neural network.

More specifically, we want to \(\mbox{minimize } J(\theta; \mathcal{D})\) on a dataset \(\mathcal{D}\), where each point in \(\mathcal{D}\) is a pair \((x_i, y_i)\).
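
For concreteness, one standard instance of such an objective (my own illustration, not something the post specifies) is the mean squared error of a linear model:

\[ J(\theta; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} \left( \theta^\top x_i - y_i \right)^2 \]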

There are different ways to apply gradient descent; a small runnable sketch covering all three variants follows the list below.

Let \(\eta\) be the learning rate.

  1. Vanilla batch update
    \(\theta \gets \theta - \eta \nabla J(\theta; \mathcal{D})\)
    Note that \(\nabla J(\theta; \mathcal{D})\) computes the gradient over the whole dataset \(\mathcal{D}\).
    ```python
    for i in range(n_epochs):
        # gradient of J over the entire dataset D
        gradient = compute_gradient(J, theta, D)
        theta = theta - eta * gradient
        # decay the learning rate after each epoch
        eta = eta * 0.95
    ```
It is obvious that when \(\mathcal{D}\) is too large, this approach is infeasible: every single update requires a full pass over the data.

  2. Stochastic Gradient Descent
    Stochastic gradient descent, on the other hand, updates the parameters one example at a time:
    \(\theta \gets \theta - \eta \nabla J(\theta; x_i, y_i)\), where \((x_i, y_i) \in \mathcal{D}\).

    ```python
    for n in range(n_epochs):
        for x_i, y_i in D:
            # gradient of J at a single example (x_i, y_i)
            gradient = compute_gradient(J, theta, x_i, y_i)
            theta = theta - eta * gradient
        # decay the learning rate once per epoch
        eta = eta * 0.95
    ```
  3. Mini-batch Stochastic Gradient Descent
    Updating \(\theta\) example by example can lead to high-variance updates; the alternative is to update \(\theta\) on mini-batches \(M\), where \(|M| \ll |\mathcal{D}|\).

    ```python
    for n in range(n_epochs):
        for M in D:  # D is assumed to be pre-split into mini-batches M
            # gradient of J averaged over the mini-batch M
            gradient = compute_gradient(J, theta, M)
            theta = theta - eta * gradient
        # decay the learning rate once per epoch
        eta = eta * 0.95
    ```
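
To make the pseudocode above concrete, here is a minimal runnable sketch of all three variants on least-squares linear regression. The `compute_gradient` helper, the synthetic data, and the per-epoch shuffling are my own assumptions for illustration; the original pseudocode leaves \(J\) and `compute_gradient` abstract. All three variants reduce to the same loop with different batch sizes.

```python
import numpy as np

def compute_gradient(theta, X, y):
    """Gradient of the mean squared error J(theta) = mean((X @ theta - y)**2)."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def gradient_descent(X, y, batch_size, eta=0.1, n_epochs=100, decay=0.95, seed=0):
    """Gradient descent with a given batch size.

    batch_size == len(y)      -> vanilla batch update
    batch_size == 1           -> stochastic gradient descent
    1 < batch_size < len(y)   -> mini-batch SGD
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        perm = rng.permutation(len(y))        # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            gradient = compute_gradient(theta, X[idx], y[idx])
            theta = theta - eta * gradient
        eta = eta * decay                     # per-epoch learning-rate decay
    return theta

# Synthetic data: y = X @ true_theta + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=200)

for bs in (len(y), 1, 32):  # batch, SGD, mini-batch
    print(bs, gradient_descent(X, y, batch_size=bs))
```

Shuffling each epoch is standard practice for the stochastic variants, though the pseudocode above simply iterates over \(\mathcal{D}\) in order.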

Question: why does decaying the learning rate lead to convergence?
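
A brief sketch of the standard answer: the stochastic gradient is a noisy estimate of the true gradient, so with a fixed \(\eta\) the iterates keep fluctuating around a minimum instead of settling into it. A decaying schedule shrinks these fluctuations while still letting the iterates travel arbitrarily far; the classical Robbins-Monro conditions on the schedule \(\eta_t\) are

\[ \sum_{t=1}^{\infty} \eta_t = \infty \quad \mbox{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty. \]

Note that the geometric decay \(\eta \gets 0.95\,\eta\) used above violates the first condition (its sum is finite); it is a common practical heuristic rather than a schedule carrying this guarantee.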
