What does the copy_initial_weights documentation mean in the higher library for PyTorch?

Posted 2023-03-12


【Title】: What does the copy_initial_weights documentation mean in the higher library for PyTorch? 【Posted】: 2020-06-04 06:17:21 【Question】:

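I am trying to use the higher library for meta-learning, and I am having trouble understanding what copy_initial_weights means. The docs say:

copy_initial_weights – if true, the weights of the patched module are copied to form the initial weights of the patched module, and thus are not part of the gradient tape when unrolling the patched module. If this is set to False, the actual module weights will be the initial weights of the patched module. This is useful when doing MAML, for example.

But that does not make much sense to me, for the following reasons:

For instance, "the weights of the patched module are copied to form the initial weights of the patched module" does not make sense to me, because when the context manager is entered the patched module does not exist yet. So it is unclear what is being copied from where to where (and why copying is something we want to do).

Also, "unrolling the patched module" does not make sense to me. We usually unroll a computation graph produced by a for loop. A patched module is just a neural network that has been modified by this library. "Unrolling" is ambiguous here.

Also, there is no technical definition of "gradient tape".

Also, when describing what False does, saying that it is useful for MAML is not actually helpful, because it does not even hint at why it is useful for MAML.

Overall, this makes the context manager impossible to use with confidence.

Any explanation of the flag in more precise terms, with examples, would be really valuable.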

Related:

git issue: https://github.com/facebookresearch/higher/issues/30
new git issue: https://github.com/facebookresearch/higher/issues/54
pytorch forum: https://discuss.pytorch.org/t/why-does-maml-need-copy-initial-weights-false/70387
pytorch forum: https://discuss.pytorch.org/t/what-does-copy-initial-weights-do-in-the-higher-library/70384
An important related question, about how the fmodel parameters are copied so that the optimizers work (and the use of deep copy): Why does higher need to deep copy the parameters of the base model to create a functional model?

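【Comments】:

Time to answer your own question github.com/facebookresearch/higher/issues/…

@eusoubrasileiro I am trying, but it is still not clear to me. For my latest discussion of a new wording for the docs, see: github.com/facebookresearch/higher/issues/54

@eusoubrasileiro My current answer (once I am 100% sure it is correct I will accept an answer and post my own) is as follows: 1) When copy_initial_weights is True, the higher library makes a .clone() and detach() copy of the parameters of the module given at the beginning of the inner context. This means that no gradient flows back because of the detach (so the original is not trained), and the clone makes the copies safer to use with in-place operations.

@eusoubrasileiro When copy_initial_weights=False it only does .clone(). Cloning keeps the gradient history, so the original parameters are trainable. But the clone prevents direct changes to the original parameters, which makes it safer to use with in-place operations (for example, you cannot overwrite the memory of a tensor that autograd will need for a backward pass). Gradients flow to the original parameters because that is how clone works, so you are training the initialization of the base model (i.e. the initial weights of the base model).

@eusoubrasileiro Note that you still need to set track_higher_grads=True, because the docs say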

if True, during unrolled optimization the graph be retained, and the fast weights will bear grad funcs, so as to permit backpropagation through the optimization process

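which basically means that the fast weights (i.e. fnet.parameters(T=t)) will have grad functions, so the forward passes stay connected to a computation graph through which backpropagation can be done correctly.

【Answer 1】: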

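Short version

Calling higher.innerloop_ctx with model as an argument creates a temporary patched model and an unrolled optimizer for that model: (fmodel, diffopt). The expectation is that in the inner loop fmodel will iteratively receive some input, compute the output and the loss, and then diffopt.step(loss) will be called. Each time diffopt.step is called, fmodel creates the next version of its parameters, fmodel.parameters(time=T), which is a new tensor computed from the previous ones (with the full graph kept, so gradients can be computed through the process). If at any point the user calls backward on any tensor, the regular pytorch gradient computation/accumulation starts, and gradients can propagate to e.g. the optimizer's parameters (such as lr, momentum - if they were passed to higher.innerloop_ctx via override as tensors requiring gradients).

The creation-time version of fmodel's parameters, fmodel.parameters(time=0), is a copy of the original model's parameters. If copy_initial_weights=True is provided (the default), then fmodel.parameters(time=0) will be a clone'd and detach'ed version of model's parameters (i.e. the values are preserved, but all connections to the original model are severed). If copy_initial_weights=False is provided, then fmodel.parameters(time=0) will be a clone'd version of model's parameters, which allows gradients to propagate to the original model's parameters (see the pytorch doc on clone).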
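The difference between the two settings can be reproduced with plain PyTorch, without higher. The following is only a minimal sketch: the tensor w is a hypothetical stand-in for one parameter of the original model, and the two copies mimic what the answer describes for copy_initial_weights=False (clone only) and copy_initial_weights=True (clone plus detach).

```python
import torch

# Hypothetical stand-in for one parameter of the original model (a leaf tensor).
w = torch.tensor([2.0], requires_grad=True)

# Analogue of copy_initial_weights=False: clone() stays connected to w in the graph.
w_cloned = w.clone()
(w_cloned ** 2).sum().backward()
grad_via_clone = w.grad.clone()      # d(w^2)/dw = 2 * w

# Analogue of copy_initial_weights=True: detach() cuts the link to w,
# so the copy becomes a new leaf tensor of its own.
w.grad = None
w_copied = w.clone().detach().requires_grad_(True)
(w_copied ** 2).sum().backward()
grad_via_detached = w.grad           # nothing reached w this time

print(grad_via_clone)       # gradient arrived at the original tensor
print(grad_via_detached)    # None
print(w_copied.grad)        # gradient stopped at the detached copy
```

In the second case the gradient still exists, but it lands on the detached copy rather than on the original tensor, which is the behaviour the answer attributes to the default setting.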

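Terminology clarifications

gradient tape here refers to the graph pytorch uses to go through the computations and propagate gradients to all leaf tensors requiring gradients. If at some point you cut the link to some leaf tensor requiring gradients (e.g. the way it is done for fnet.parameters() in the copy_initial_weights=True case), then the original model.parameters() will no longer be "on the gradient tape" for your meta_loss.backward() computation.

unrolling the patched module here refers to the part of the meta_loss.backward() computation in which pytorch goes through all fnet.parameters(time=T), starting from the latest and ending with the earliest (higher does not control this process - it is just the regular pytorch gradient computation; higher is only in charge of how the new time=T parameters are created from the previous ones each time diffopt.step is called, and of making fnet always use the latest ones for the forward computation).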
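As a rough illustration of what walking back through the parameter versions means, here is a pure-Python sketch that uses neither higher nor PyTorch. It assumes plain SGD without momentum on a one-weight model w * x with a mean squared error loss; the helper names inner_loop and meta_gradient are hypothetical and exist only for this illustration.

```python
def inner_loop(w0, xs, ys, lr, steps):
    """Plain SGD on loss = mean((w*x - y)^2); returns every version of w."""
    versions = [w0]
    for _ in range(steps):
        w = versions[-1]
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        versions.append(w - lr * grad)   # a new version; older ones are kept
    return versions


def meta_gradient(versions, xs, lr, target):
    """d(meta_loss)/d(w at time 0), going from the latest version to the earliest."""
    mean_x2 = sum(x * x for x in xs) / len(xs)
    grad = 2 * (versions[-1] - target)   # d(meta_loss)/d(w at the last step)
    for _ in range(len(versions) - 1):
        grad *= 1 - 2 * lr * mean_x2     # d(w at t+1)/d(w at t) for one SGD step
    return grad


xs = [0.1, 0.4, 0.7, 1.0]
target = 3.5
ys = [target * x for x in xs]
lr, steps, w0 = 0.3, 5, -1.0

versions = inner_loop(w0, xs, ys, lr, steps)
analytic = meta_gradient(versions, xs, lr, target)


def meta_loss(w_init):
    return (inner_loop(w_init, xs, ys, lr, steps)[-1] - target) ** 2


# Central finite difference as an independent check of the chain rule above.
eps = 1e-6
numeric = (meta_loss(w0 + eps) - meta_loss(w0 - eps)) / (2 * eps)
print(analytic, numeric)
```

The list versions plays the role of fmodel.parameters(time=i), and the backward loop in meta_gradient is the hand-written counterpart of what autograd does automatically when the graph is kept.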

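Long version

Let's start from the beginning. The main functionality (the only functionality, really) of the higher library is unrolling a model's parameter optimization in a differentiable manner. It can be used either through a differentiable optimizer directly, e.g. via higher.get_diff_optim, or through higher.innerloop_ctx.

The higher.innerloop_ctx option wraps the creation of a "stateless" model fmodel from an existing model for you, and gives you an "optimizer" diffopt for that fmodel. So, as summarized in the README.md of higher, it allows you to switch from: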

model = MyModel()
opt = torch.optim.Adam(model.parameters())

for xs, ys in data:
    opt.zero_grad()
    logits = model(xs)
    loss = loss_function(logits, ys)
    loss.backward()
    opt.step()

to

model = MyModel()
opt = torch.optim.Adam(model.parameters())

with higher.innerloop_ctx(model, opt) as (fmodel, diffopt):
    for xs, ys in data:
        logits = fmodel(xs)  # modified `params` can also be passed as a kwarg
        loss = loss_function(logits, ys)  # no need to call loss.backwards()
        diffopt.step(loss)  # note that `step` must take `loss` as an argument!

    # At the end of your inner loop you can obtain these e.g. ...
    grad_of_grads = torch.autograd.grad(
        meta_loss_fn(fmodel.parameters()), fmodel.parameters(time=0))

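The difference between training model and calling diffopt.step to update fmodel is that fmodel does not update its parameters in place the way opt.step() does in the first snippet. Instead, each time diffopt.step is called, new versions of the parameters are created in such a way that fmodel uses the new ones for the next step while all previous ones are still preserved.

That is, fmodel starts with only fmodel.parameters(time=0) available, but after you have called diffopt.step N times you can ask fmodel for fmodel.parameters(time=i) for any i up to N inclusive. Notice that fmodel.parameters(time=0) does not change at all in this process; it is just that every time fmodel is applied to some input, it uses the latest version of the parameters it currently has.

Now, what exactly is fmodel.parameters(time=0)? It is created inside higher and depends on copy_initial_weights. If copy_initial_weights==True, then fmodel.parameters(time=0) are clone'd and detach'ed parameters of model. Otherwise they are only clone'd, not detach'ed!

That means that when we do the meta-optimization step, the original model's parameters will actually accumulate gradients if and only if copy_initial_weights==False. In MAML we want to optimize model's starting weights, so we do need to get gradients from the meta-optimization step.

I think one of the issues here is that higher lacks simple toy examples that demonstrate what is going on, and instead rushes to show more serious things as examples. So let me try to fill that gap and demonstrate what is going on using the simplest toy example I could come up with (a model with 1 weight that multiplies the input by that weight):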

import torch
import torch.nn as nn
import torch.optim as optim
import higher
import numpy as np

np.random.seed(1)
torch.manual_seed(3)
N = 100
actual_multiplier = 3.5
meta_lr = 0.00001
loops = 5 # how many iterations in the inner loop we want to do

x = torch.tensor(np.random.random((N,1)), dtype=torch.float64) # features for inner training loop
y = x * actual_multiplier # target for inner training loop
model = nn.Linear(1, 1, bias=False).double() # simplest possible model - multiply input x by weight w without bias
meta_opt = optim.SGD(model.parameters(), lr=meta_lr, momentum=0.)


def run_inner_loop_once(model, verbose, copy_initial_weights):
    lr_tensor = torch.tensor([0.3], requires_grad=True)
    momentum_tensor = torch.tensor([0.5], requires_grad=True)
    opt = optim.SGD(model.parameters(), lr=0.3, momentum=0.5)
    with higher.innerloop_ctx(model, opt, copy_initial_weights=copy_initial_weights, override={'lr': lr_tensor, 'momentum': momentum_tensor}) as (fmodel, diffopt):
        for j in range(loops):
            if verbose:
                print('Starting inner loop step j=={0}'.format(j))
                print('    Representation of fmodel.parameters(time={0}): {1}'.format(j, str(list(fmodel.parameters(time=j)))))
                print('    Notice that fmodel.parameters() is same as fmodel.parameters(time={0}): {1}'.format(j, (list(fmodel.parameters())[0] is list(fmodel.parameters(time=j))[0])))
            out = fmodel(x)
            if verbose:
                print('    Notice how `out` is `x` multiplied by the latest version of weight: {0:.4} * {1:.4} == {2:.4}'.format(x[0,0].item(), list(fmodel.parameters())[0].item(), out[0].item()))
            loss = ((out - y)**2).mean()
            diffopt.step(loss)

        if verbose:
            # after all inner training let's see all steps' parameter tensors
            print()
            print("Let's print all intermediate parameters versions after inner loop is done:")
            for j in range(loops+1):
                print('    For j=={0} parameter is: {1}'.format(j, str(list(fmodel.parameters(time=j)))))
            print()

        # let's imagine now that our meta-learning optimization is trying to check how far we got in the end from the actual_multiplier
        weight_learned_after_full_inner_loop = list(fmodel.parameters())[0]
        meta_loss = (weight_learned_after_full_inner_loop - actual_multiplier)**2
        print('  Final meta-loss: {0}'.format(meta_loss.item()))
        meta_loss.backward() # will only propagate gradient to original model parameter's `grad` if copy_initial_weights=False
        if verbose:
            print('  Gradient of final loss we got for lr and momentum: {0} and {1}'.format(lr_tensor.grad, momentum_tensor.grad))
            print('  If you change number of iterations "loops" to much larger number final loss will be stable and the values above will be smaller')
        return meta_loss.item()

print('=================== Run Inner Loop First Time (copy_initial_weights=True) =================\n')
meta_loss_val1 = run_inner_loop_once(model, verbose=True, copy_initial_weights=True)
print("\nLet's see if we got any gradient for initial model parameters: {0}\n".format(list(model.parameters())[0].grad))

print('=================== Run Inner Loop Second Time (copy_initial_weights=False) =================\n')
meta_loss_val2 = run_inner_loop_once(model, verbose=False, copy_initial_weights=False)
print("\nLet's see if we got any gradient for initial model parameters: {0}\n".format(list(model.parameters())[0].grad))

print('=================== Run Inner Loop Third Time (copy_initial_weights=False) =================\n')
final_meta_gradient = list(model.parameters())[0].grad.item()
# Now let's double-check `higher` library is actually doing what it promised to do, not just giving us
# a bunch of hand-wavy statements and difficult to read code.
# We will do a simple SGD step using meta_opt changing initial weight for the training and see how meta loss changed
meta_opt.step()
meta_opt.zero_grad()
meta_step = - meta_lr * final_meta_gradient # how much meta_opt actually shifted initial weight value
meta_loss_val3 = run_inner_loop_once(model, verbose=False, copy_initial_weights=False)

meta_loss_gradient_approximation = (meta_loss_val3 - meta_loss_val2) / meta_step

print()
print('Side-by-side meta_loss_gradient_approximation and gradient computed by `higher` lib: {0:.4} VS {1:.4}'.format(meta_loss_gradient_approximation, final_meta_gradient))

This produces the following output:

=================== Run Inner Loop First Time (copy_initial_weights=True) =================

Starting inner loop step j==0
    Representation of fmodel.parameters(time=0): [tensor([[-0.9915]], dtype=torch.float64, requires_grad=True)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=0): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * -0.9915 == -0.4135
Starting inner loop step j==1
    Representation of fmodel.parameters(time=1): [tensor([[-0.1217]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=1): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * -0.1217 == -0.05075
Starting inner loop step j==2
    Representation of fmodel.parameters(time=2): [tensor([[1.0145]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=2): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 1.015 == 0.4231
Starting inner loop step j==3
    Representation of fmodel.parameters(time=3): [tensor([[2.0640]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=3): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 2.064 == 0.8607
Starting inner loop step j==4
    Representation of fmodel.parameters(time=4): [tensor([[2.8668]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=4): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 2.867 == 1.196

Let's print all intermediate parameters versions after inner loop is done:
    For j==0 parameter is: [tensor([[-0.9915]], dtype=torch.float64, requires_grad=True)]
    For j==1 parameter is: [tensor([[-0.1217]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==2 parameter is: [tensor([[1.0145]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==3 parameter is: [tensor([[2.0640]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==4 parameter is: [tensor([[2.8668]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==5 parameter is: [tensor([[3.3908]], dtype=torch.float64, grad_fn=<AddBackward0>)]

  Final meta-loss: 0.011927987982895929
  Gradient of final loss we got for lr and momentum: tensor([-1.6295]) and tensor([-0.9496])
  If you change number of iterations "loops" to much larger number final loss will be stable and the values above will be smaller

Let's see if we got any gradient for initial model parameters: None

=================== Run Inner Loop Second Time (copy_initial_weights=False) =================

  Final meta-loss: 0.011927987982895929

Let's see if we got any gradient for initial model parameters: tensor([[-0.0053]], dtype=torch.float64)

=================== Run Inner Loop Third Time (copy_initial_weights=False) =================

  Final meta-loss: 0.01192798770078706

Side-by-side meta_loss_gradient_approximation and gradient computed by `higher` lib: -0.005311 VS -0.005311

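【Comments】:

Just to make it 100% clear - the meta_opt step in the toy example here is not what MAML does; it only uses the same mechanism of obtaining the gradient through the meta-loss.

Hey Alex, sorry for not saying anything. I was trying to understand your answer, but when I found that higher does p.clone() when copy_initial_weights = False, it confused me a lot. I think my main confusion right now is that I do not understand .clone(), and I have been reading about it all day. Perhaps you can help. Why is that the right thing to do? I would have assumed that if it is not copying then it is not cloning, but that is not what is happening, which ends up confusing me. Do you know what is going on?

Alex, I think in your sentence "the original model's parameters will actually accumulate gradients if and only if copy_initial_weights==True." you meant copy_initial_weights==False, because if it were True there would be a .detach() cutting the gradient flow to the original parameters. Isn't that right? So MAML needs copy_initial_weights==False (in fact, anything that trains the initialization of the base model needs it to be False).

Exactly: "this can be useful when running the training loop at test time". If you use track_higher_grads=False, you just tell higher that you do not plan to use meta_loss.backward, and it does not have to preserve all fnet.parameters(time=T) except for the last one.

I have not looked into this library for a while (over a year), so I could easily be missing something. I think you would want to set track_higher_grads to False for meta-testing, and in that case the setting of copy_initial_weights should not matter. Overall, setting copy_initial_weights=True seems fine - it ensures that the gradient never reaches the initial weights of the model, which is what meta-testing needs.

【Answer 2】: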

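I think it is now more or less clear to me what this means.

First I would like to make some notation clear, especially the indices for the inner time step and the outer time step (also known as episodes):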

W^<inner_i, outer_i> = denotes the value a tensor has at time step inner_i, outer_i.

At the beginning of training, a neural network has parameters:

W^<0,0>

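which are held inside its module. For the sake of explanation, the specific tensor (for the base model) will be denoted:

W = the tensor holding the weights of the model. This can be thought of as the initialization of the model.

and it will be updated by the outer optimizer with an in-place operation (this is important, since W is the placeholder for all W^<0,outer_i> over all outer step values during "normal" meta-learning). I want to emphasize that W is the tensor of the ordinary Pytorch neural network base model. By changing it in place with an outer optimizer (like Adam) we are effectively training the initialization. The outer optimizer uses the gradients with respect to this tensor, obtained through the whole unrolled inner loop process, to do the update.

When we say copy_initial_weights=False, we mean that there will be a gradient path directly to W, with whatever value it currently has. Usually the context manager is entered before an inner loop, after an outer step has been done, so W will hold W^<0,outer_i> for the current step. In particular, the code that does this for copy_initial_weights=False is: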
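The role of W as a placeholder that the outer optimizer overwrites in place can be sketched in plain PyTorch, without higher. This is only an illustration under simplifying assumptions: a single scalar weight, one hand-written differentiable inner SGD step, and a quadratic loss with a hypothetical target of 3.0. None of this is higher's actual code.

```python
import torch

W = torch.nn.Parameter(torch.tensor([1.0]))   # the base model's tensor
outer_opt = torch.optim.SGD([W], lr=0.1)
ptr_before = W.data_ptr()

for outer_i in range(3):
    # Start the inner loop from a clone of W, so gradients can flow back to W
    # (the analogue of copy_initial_weights=False).
    fast = W.clone()
    inner_loss = (fast - 3.0).pow(2).sum()
    (g,) = torch.autograd.grad(inner_loss, fast, create_graph=True)
    fast = fast - 0.25 * g                    # one differentiable inner step

    meta_loss = (fast - 3.0).pow(2).sum()
    outer_opt.zero_grad()
    meta_loss.backward()                      # gradient reaches W through the clone
    outer_opt.step()                          # W is overwritten in place

same_tensor = W.data_ptr() == ptr_before
print(W.item(), same_tensor)
```

After every outer step W is still the same tensor object with the same storage; only its value has changed, which is why the next inner loop starts from the updated initialization.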

params = [ p.clone() if device is None else p.clone().to(device) for p in module.parameters() ]

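This might look confusing if you are not familiar with clone, but what it does is make a copy of the current weights of W. The unusual thing is that the clone also remembers the gradient history of the tensor it came from (for gradients, .clone() acts as the identity). Its main use here is to add an extra layer of safety against the user doing dangerous in-place operations in their differentiable optimizer. Assuming the user never did anything crazy with in-place operations, one could in theory remove the .clone(). The reason this is confusing, in my opinion, is that "copying in Pytorch" (cloning) does not automatically block gradient flow, which is what a "real" copy would do (i.e. create a 100% separate tensor). That is not what clone does, and it is not what copy_initial_weights does.

When copy_initial_weights=True, what really happens is that the weights are cloned and detached. See the code it eventually runs: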

params = [_copy_tensor(p, safe_copy, device) for p in module.parameters()]

which runs the tensor copy (assuming a safe copy is being done, i.e. the extra clone is performed):

 t = t.clone().detach().requires_grad_(t.requires_grad)

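Note that .detach() does not allocate new memory. It shares memory with the original tensor, which is why the .clone() is needed to make this operation "safe" (usually with respect to in-place operations).

So when copy_initial_weights=True, they are copying and detaching the current value of W. This is usually W^<0,outer_i> if ordinary meta-learning is being done in the inner adaptation loop. So the intended semantics of copy_initial_weights is exactly that, and by initial_weights they simply mean W. The important thing to note is that the intermediate tensors of the network in the inner loop are not denoted in my notation; they are fmodel.parameters(t=inner_i). Also, in ordinary meta-learning we have fmodel.parameters(t=0) = W, and W gets updated in place by the outer optimizer.

Note that because of the outer optimizer's in-place operation and the freeing of the graphs, we never take the derivative Grad_W^<0,0> with respect to the initial value of W, which is something I initially thought we were doing.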
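The memory-sharing point can be checked directly in PyTorch. The snippet below is a minimal sketch with a hypothetical tensor t; it compares storage addresses with data_ptr() and shows that an in-place write through a detached tensor is visible in the original, while a write to the cloned copy is not.

```python
import torch

t = torch.ones(3, requires_grad=True)

shared = t.detach()            # same storage as t, no gradient history
copied = t.clone().detach()    # fresh storage, no gradient history

same_memory_detach = shared.data_ptr() == t.data_ptr()
same_memory_clone = copied.data_ptr() == t.data_ptr()

# An in-place write through the detached tensor changes the original values.
shared[0] = 5.0
# An in-place write to the cloned copy leaves the original untouched.
copied[1] = 7.0

print(same_memory_detach, same_memory_clone)
print(t)
```

This is the reason a bare detach() would not be a safe copy: any in-place update of the copy would silently modify the base model's weights as well.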
