A review of gradient descent optimization methods

Posted by gaoqichao


Suppose we are going to optimize a parameterized function \(J(\theta)\), where \(\theta \in \mathbb{R}^d\); for example, \(\theta\) could be the parameters of a neural network.

More specifically, we want to \(\mbox{minimize } J(\theta; \mathcal{D})\) on a dataset \(\mathcal{D}\), where each point in \(\mathcal{D}\) is a pair \((x_i, y_i)\).
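
For concreteness, one standard instance of such an objective (my own illustration, not something the post specifies) is the mean squared error of a linear model:

\[ J(\theta; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} \left( \theta^\top x_i - y_i \right)^2 \]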

There are different ways to apply gradient descent; a small runnable sketch covering all three variants follows the list below.

Let \(\eta\) be the learning rate.

  1. Vanilla batch update
    \(\theta \gets \theta - \eta \nabla J(\theta; \mathcal{D})\)
    Note that \(\nabla J(\theta; \mathcal{D})\) computes the gradient over the whole dataset \(\mathcal{D}\).
    ```python
    for i in range(n_epochs):
        # gradient of J over the entire dataset D
        gradient = compute_gradient(J, theta, D)
        theta = theta - eta * gradient
        # decay the learning rate after each epoch
        eta = eta * 0.95
    ```
It is obvious that when \(\mathcal{D}\) is too large, this approach is infeasible: every single update requires a full pass over the data.

  2. Stochastic Gradient Descent
    Stochastic gradient descent, on the other hand, updates the parameters one example at a time:
    \(\theta \gets \theta - \eta \nabla J(\theta; x_i, y_i)\), where \((x_i, y_i) \in \mathcal{D}\).

    ```python
    for n in range(n_epochs):
        for x_i, y_i in D:
            # gradient of J at a single example (x_i, y_i)
            gradient = compute_gradient(J, theta, x_i, y_i)
            theta = theta - eta * gradient
        # decay the learning rate once per epoch
        eta = eta * 0.95
    ```
  3. Mini-batch Stochastic Gradient Descent
    Updating \(\theta\) example by example can lead to high-variance updates; the alternative is to update \(\theta\) on mini-batches \(M\), where \(|M| \ll |\mathcal{D}|\).

    ```python
    for n in range(n_epochs):
        for M in D:  # D is assumed to be pre-split into mini-batches M
            # gradient of J averaged over the mini-batch M
            gradient = compute_gradient(J, theta, M)
            theta = theta - eta * gradient
        # decay the learning rate once per epoch
        eta = eta * 0.95
    ```
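
To make the pseudocode above concrete, here is a minimal runnable sketch of all three variants on least-squares linear regression. The `compute_gradient` helper, the synthetic data, and the per-epoch shuffling are my own assumptions for illustration; the original pseudocode leaves \(J\) and `compute_gradient` abstract. All three variants reduce to the same loop with different batch sizes.

```python
import numpy as np

def compute_gradient(theta, X, y):
    """Gradient of the mean squared error J(theta) = mean((X @ theta - y)**2)."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def gradient_descent(X, y, batch_size, eta=0.1, n_epochs=100, decay=0.95, seed=0):
    """Gradient descent with a given batch size.

    batch_size == len(y)      -> vanilla batch update
    batch_size == 1           -> stochastic gradient descent
    1 < batch_size < len(y)   -> mini-batch SGD
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        perm = rng.permutation(len(y))        # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            gradient = compute_gradient(theta, X[idx], y[idx])
            theta = theta - eta * gradient
        eta = eta * decay                     # per-epoch learning-rate decay
    return theta

# Synthetic data: y = X @ true_theta + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=200)

for bs in (len(y), 1, 32):  # batch, SGD, mini-batch
    print(bs, gradient_descent(X, y, batch_size=bs))
```

Shuffling each epoch is standard practice for the stochastic variants, though the pseudocode above simply iterates over \(\mathcal{D}\) in order.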

Question: why does decaying the learning rate lead to convergence?
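
A brief sketch of the standard answer: the stochastic gradient is a noisy estimate of the true gradient, so with a fixed \(\eta\) the iterates keep fluctuating around a minimum instead of settling into it. A decaying schedule shrinks these fluctuations while still letting the iterates travel arbitrarily far; the classical Robbins-Monro conditions on the schedule \(\eta_t\) are

\[ \sum_{t=1}^{\infty} \eta_t = \infty \quad \mbox{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty. \]

Note that the geometric decay \(\eta \gets 0.95\,\eta\) used above violates the first condition (its sum is finite); it is a common practical heuristic rather than a schedule carrying this guarantee.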
