Optimizing deeper networks with KFAC in PyTorch

Posted by the fastai Chinese community




Optimization with first-order methods like Adam becomes less effective as batch size and depth increase. Second-order methods like KFAC (an approximate natural gradient method) are a bit more expensive per step, but are much less affected by depth. For a difficult problem this translates into savings in wall-clock time.

 



I’ve recently experimented with KFAC in PyTorch. Its imperative style of programming made it easier to prototype optimization algorithms than the graph-based approach of TensorFlow. For a fully connected network, an existing optimizer can be augmented with KFAC preconditioning in just a few lines of PyTorch; see “Implementation” below.

Consider the following autoencoder on MNIST.

 



(Figure: the autoencoder architecture)
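For concreteness, here is a sketch of one plausible such autoencoder in PyTorch. The layer sizes (the classic 784-1000-500-250-30 bottleneck with sigmoid activations) are an assumption for illustration, not taken from the original experiment:

import torch.nn as nn

# Hypothetical MNIST autoencoder: 784-1000-500-250-30 encoder, mirrored decoder,
# sigmoid activations between layers. Sizes are assumed for illustration only.
sizes = [784, 1000, 500, 250, 30, 250, 500, 1000, 784]
layers = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
autoencoder = nn.Sequential(*layers[:-1])   # drop the activation after the last layer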


Optimizing this architecture with a batch size of 10k, the advantage of KFAC is stark: it takes 100x fewer iterations and 25x less wall-clock time than Adam to reach the test-loss minimum.

 



Graph 1: Test loss of three different methods


Derivation

The traditional derivation of KFAC (Martens, Grosse, Ba) is motivated by information geometry. Below I give an alternative derivation: the KFAC-style update is simply the Newton step for a deep linear neural network.



To make the derivation concrete, consider optimizing a deep fully-connected linear autoencoder. Without loss of generality, we can write our predictions Y as a function of a parameter matrix W as follows:

 


Y = B'WA

where A is the matrix of activations flowing into the layer that holds W, and B' is the product of the layer matrices after it.

Given labels \hat{Y} we can write our prediction error e and loss J:

 



e = Y - \hat{Y}

J = (1/2) ||e||^2

To minimize J, we differentiate with respect to W and get the following for our gradient and SGD update rule:

 



G = dJ/dW = BeA'

W ← W - α BeA'


Note that the quantity “Be” in the equation above is equal to the backprop matrix you get in a reverse-mode AD algorithm. It is the “grad_output” quantity passed into PyTorch’s backward() method.
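As a quick toy check of this point (not from the original post), a full backward hook on a linear layer exposes the same grad_output quantity:

import torch

x = torch.randn(4, 3, requires_grad=True)
layer = torch.nn.Linear(3, 2)

def show_grad_output(module, grad_input, grad_output):
    # grad_output[0] is the backpropagated error at the layer's output: shape (4, 2)
    print(grad_output[0].shape)

layer.register_full_backward_hook(show_grad_output)
loss = 0.5 * (layer(x) ** 2).sum()
loss.backward()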

To get the Hessian, we differentiate our gradient G again and obtain the result in terms of a Kronecker product:

 



H = d^2J/d(vec W)^2 = AA' ⊗ BB'


Dividing by the Hessian and rearranging, our Newton update step becomes the following:

 



W ← W - α (BB')^{-1} G (AA')^{-1} = W - α (BB')^{-1} BeA' (AA')^{-1}


Matrices on each side of G are known as whitening matrices. The first matrix is the backprop whitening matrix, while the second matrix is the activation whitening matrix.

 



Note that the matrix B is not directly available during backprop, and using “grad_output” in its place will give us Bee'B' instead of BB'. That’s not a problem, since we can generate any “e” by selecting the target labels accordingly. Some choices for e (a small sketch follows the list):

Padded identity matrix so that ee’ is identity. Then Bee’B’=BB’ exactly

 IID Gaussian values. Then Bee'B' = BB' in expectation

Detailed derivation is here
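A minimal sketch of the second option (shapes and names are assumed): with a squared-error loss, choosing the labels as the current output minus a drawn e pushes exactly that e into the backward pass.

import torch

batch, out_dim = 64, 10
Y = torch.randn(batch, out_dim, requires_grad=True)   # stand-in for the network output

e = torch.randn_like(Y)                 # IID Gaussian entries, identity covariance in expectation
Y_target = (Y - e).detach()             # pick labels so the output error is exactly e

loss = 0.5 * ((Y - Y_target) ** 2).sum()
loss.backward()
print(torch.allclose(Y.grad, e))        # True: backprop starts from the chosen e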

 


Implementation

The basic implementation has three parts:

      capture: compute gradients and save forward/backprop values

      invert: compute whitening matrices

      kfac: apply whitening matrices during gradient computation

The “capture” and “kfac” steps can be accomplished with a version of Addmm that has a custom “backward” method:



Table 2: kfac.py (gist, linked in the appendix)
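A minimal sketch of the idea, using my own names (KfacAddmm, a per-layer stats dict) rather than the gist’s actual code: the forward pass behaves like torch.addmm, while the backward pass captures the activations and grad_output and, once whitening matrices exist, applies them on either side of the weight gradient.

import torch

class KfacAddmm(torch.autograd.Function):
    """Sketch of an Addmm with a custom backward: the 'capture' and 'kfac' steps."""

    @staticmethod
    def forward(ctx, bias, activations, weight, stats):
        # activations: (batch, d_in), weight: (d_in, d_out), bias: (d_out,)
        ctx.save_for_backward(activations, weight)
        ctx.stats = stats                      # per-layer dict for captured statistics
        return torch.addmm(bias, activations, weight)

    @staticmethod
    def backward(ctx, grad_output):
        activations, weight = ctx.saved_tensors
        stats = ctx.stats

        # "capture": stash forward activations and backprop values
        stats['A'] = activations.detach()
        stats['B'] = grad_output.detach()

        grad_weight = activations.t() @ grad_output
        # "kfac": precondition once the "invert" step has produced whitening matrices
        if 'whiten_A' in stats and 'whiten_B' in stats:
            # whiten_A: inverse activation covariance; whiten_B: inverse backprop covariance
            grad_weight = stats['whiten_A'] @ grad_weight @ stats['whiten_B']

        grad_bias = grad_output.sum(0)
        grad_activations = grad_output @ weight.t()
        return grad_bias, grad_activations, grad_weight, None

The “invert” step then only has to turn the captured A and B statistics into whiten_A and whiten_B, as in the loop sketch below.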


The body of the training loop then looks like this:




Table 3: optimizer_step.py (gist, linked in the appendix)
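A rough sketch of what such a loop body could look like, wiring two of the KfacAddmm layers sketched above into a toy linear autoencoder on random data; the sizes, learning rate, and damping term are placeholder assumptions, not values from the original experiment.

import torch

torch.manual_seed(0)
X = torch.randn(64, 20)                                 # stand-in data

# two KfacAddmm layers, each with its own stats dict
params, layer_stats = [], []
for d_in, d_out in [(20, 5), (5, 20)]:
    W = (0.1 * torch.randn(d_in, d_out)).requires_grad_()
    b = torch.zeros(d_out, requires_grad=True)
    params.append((W, b))
    layer_stats.append({})

optimizer = torch.optim.SGD([p for pair in params for p in pair], lr=0.01)

for step in range(100):
    optimizer.zero_grad()

    # "capture" + "kfac" happen inside KfacAddmm.backward
    h = X
    for (W, b), stats in zip(params, layer_stats):
        h = KfacAddmm.apply(b, h, W, stats)
    loss = 0.5 * ((h - X) ** 2).sum() / X.shape[0]
    loss.backward()

    # "invert": periodically rebuild the whitening matrices from captured stats
    if step % 10 == 0:
        for stats in layer_stats:
            n = stats['A'].shape[0]
            cov_A = stats['A'].t() @ stats['A'] / n     # activation covariance
            cov_B = stats['B'].t() @ stats['B'] / n     # backprop covariance
            damping = 0.1                               # placeholder damping term
            stats['whiten_A'] = torch.inverse(cov_A + damping * torch.eye(cov_A.shape[0]))
            stats['whiten_B'] = torch.inverse(cov_B + damping * torch.eye(cov_B.shape[0]))

    optimizer.step()

This sketch captures statistics from the ordinary reconstruction error; estimating BB' exactly would additionally use the synthetic-label trick for e described in the derivation section.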


The full implementation can be seen in the appendix.




Note

The difference between Adam and KFAC shrank to about a 5x improvement in wall-clock time when I tweaked the experiment to make it more amenable to SGD (a sketch of the tweaks follows the list):

      Replace sigmoid activations with ReLU

      Add weight normalization

      Use batch size 128 for Adam
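For reference, a rough sketch of those tweaks; the layer sizes, learning rate, and stand-in data are assumptions:

import torch
import torch.nn as nn

# Replace sigmoid with ReLU and add weight normalization to a linear layer
block = nn.Sequential(
    nn.utils.weight_norm(nn.Linear(784, 1000)),
    nn.ReLU(),
)

# Use Adam with a batch size of 128 (learning rate is a placeholder)
optimizer = torch.optim.Adam(block.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(512, 784))   # stand-in for MNIST
loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)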


Appendix

KFAC

https://arxiv.org/abs/1503.05671

Description of experiment is here

https://github.com/yaroslavvb/kfac_pytorch/blob/master/deep_autoencoder.ipynb

Table 2

https://gist.github.com/yaroslavvb2/2d92df19af84298c87416fc6b510d88a/raw/a701c6af981deffd5e973241c305efdd0772f0ca/kfac.py

Table 3

https://gist.github.com/yaroslavvb2/383dd9620476733de6eef5762282e4d4/raw/7f76fcff3c17311c89339e7bc435fedd0aed8378/optimizer_step.py

Full implementation is here

 https://github.com/yaroslavvb/kfac_pytorch/blob/master/kfac_pytorch.py
