How to get mini-batches in pytorch in a clean and efficient way?

Posted 2023-02-23


【Title】: How to get mini-batches in pytorch in a clean and efficient way? 【Posted】: 2017-12-20 03:52:37 【Question】:

I am trying to do something simple: train a linear model with stochastic gradient descent (SGD) using torch:

import numpy as np

import torch
from torch.autograd import Variable

import pdb

def get_batch2(X,Y,M,dtype):
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    batch_ys = torch.FloatTensor(Y[batch_indices]).type(dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

def poly_kernel_matrix( x,D ):
    N = len(x)
    Kern = np.zeros( (N,D+1) )
    for n in range(N):
        for d in range(D+1):
            Kern[n,d] = x[n]**d;
    return Kern

## data params
N=5 # data set size
Degree=4 # number dimensions/features
D_sgd = Degree+1
##
x_true = np.linspace(0,1,N) # the real data points
y = np.sin(2*np.pi*x_true)
y.shape = (N,1)
## TORCH
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
X_mdl = poly_kernel_matrix( x_true,Degree )
X_mdl = Variable(torch.FloatTensor(X_mdl).type(dtype), requires_grad=False)
y = Variable(torch.FloatTensor(y).type(dtype), requires_grad=False)
## SGD mdl
w_init = torch.zeros(D_sgd,1).type(dtype)
W = Variable(w_init, requires_grad=True)
M = 5 # mini-batch size
eta = 0.1 # step size
for i in range(500):
    batch_xs, batch_ys = get_batch2(X_mdl,y,M,dtype)
    # Forward pass: compute predicted y using operations on Variables
    y_pred = batch_xs.mm(W)
    # Compute and print loss using operations on Variables. Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape (1,); loss.data[0] is a scalar value holding the loss.
    loss = (1/N)*(y_pred - batch_ys).pow(2).sum()
    # Use autograd to compute the backward pass. Now w will have gradients
    loss.backward()
    # Update weights using gradient descent; w1.data are Tensors,
    # w.grad are Variables and w.grad.data are Tensors.
    W.data -= eta * W.grad.data
    # Manually zero the gradients after updating weights
    W.grad.data.zero_()

#
c_sgd = W.data.numpy()
X_mdl = X_mdl.data.numpy()
y = y.data.numpy()
#
Xc_pinv = np.dot(X_mdl,c_sgd)
print('J(c_sgd) = ', (1/N)*(np.linalg.norm(y-Xc_pinv)**2) )
print('loss = ',loss.data[0])

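The code runs fine, though my get_batch2 method seems really dumb/naive. That is probably because I am new to pytorch, but I have not found a good place that discusses how to retrieve batches of data. I went through their tutorial (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html) and the data loading tutorial (http://pytorch.org/tutorials/beginner/data_loading_tutorial.html) with no luck. The tutorials all seem to assume that one already has the batch and the batch size at the start, and then train on that data without ever changing it (see specifically http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-variables-and-autograd).

So my question is: do I really need to turn my data back into numpy so that I can draw a random sample from it, and then turn it back into pytorch with Variable in order to train in memory? Is there no way to get mini-batches with torch itself?

I looked at a few functions torch provides, but with no luck: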

#pdb.set_trace()
#valid_indices = torch.arange(0,N).numpy()
#valid_indices = np.array( range(N) )
#batch_indices = np.random.choice(valid_indices,size=M,replace=False)
#indices = torch.LongTensor(batch_indices)
#batch_xs, batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)
#batch_xs,batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)

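Even though the code I provided works fine, I am worried that it is not an efficient implementation, and that it would slow down even further on a GPU (my guess is that putting things in host memory and then fetching them back to put them on the GPU is silly).

I implemented a new version based on the answer that suggested using torch.index_select():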

def get_batch2(X,Y,M):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    #X,Y = X.data.numpy(), Y.data.numpy()
    X,Y = X, Y
    N = X.size()[0]
    batch_indices = torch.LongTensor( np.random.randint(0,N+1,size=M) )
    pdb.set_trace()
    batch_xs = torch.index_select(X,0,batch_indices)
    batch_ys = torch.index_select(Y,0,batch_indices)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

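However, this seems to have issues, because it does not work if X,Y are not Variables... which is really odd. I added this to the pytorch forum: https://discuss.pytorch.org/t/how-to-get-mini-batches-in-pytorch-in-a-clean-and-efficient-way/10322

Right now I am struggling to make this work on the GPU. My most recent version: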

def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    #X,Y = X.data.numpy(), Y.data.numpy()
    X,Y = X, Y
    N = X.size()[0]
    if dtype ==  torch.cuda.FloatTensor:
        batch_indices = torch.cuda.LongTensor( np.random.randint(0,N,size=M) )# without replacement
    else:
        batch_indices = torch.LongTensor( np.random.randint(0,N,size=M) ).type(dtype)  # without replacement
    pdb.set_trace()
    batch_xs = torch.index_select(X,0,batch_indices)
    batch_ys = torch.index_select(Y,0,batch_indices)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

The error:

RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)

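I don't get it, do I really have to do:

ints = [ random.randint(0,N) for i in range(M)]

to get the integers?

It would also be ideal if the data could be a Variable. It seems that torch.index_select does not work on Variable type data.

This list of integers still does not work: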

TypeError: torch.addmm received an invalid combination of arguments - got (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor), but expected one of:
 * (torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (float beta, torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (float beta, torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
      didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)
 * (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
      didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)

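【Comments】:

On the pytorch forum: discuss.pytorch.org/t/…

A note that may be useful: make sure you call index_select with arguments of the same type, i.e. both Tensors or both Variables. Wrap your batch_indices in a Variable, or just use X[batch_indices, :].

Also related: discuss.pytorch.org/t/…

【Answer 1】: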

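Use data loaders.

Data set

First you define a dataset. You can use one of the packaged datasets in torchvision.datasets, or the ImageFolder dataset class, which expects the ImageNet directory structure.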

trainset=torchvision.datasets.ImageFolder(root='/path/to/your/data/trn', transform=generic_transform)
testset=torchvision.datasets.ImageFolder(root='/path/to/your/data/val', transform=generic_transform)

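Transforms

Transforms are very useful for preprocessing loaded data on the fly. If you are working with images, you have to use the ToTensor() transform to convert the loaded images from PIL to torch.tensor. Several transforms can be packed into a composite transform as follows.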

generic_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.ToPILImage(),
    #transforms.CenterCrop(size=128),
    transforms.Lambda(lambda x: myimresize(x, (128, 128))),
    transforms.ToTensor(),
    transforms.Normalize((0., 0., 0.), (6, 6, 6))
])

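Data loader

Then you define a data loader, which prepares the next batch while training. You can set the number of worker processes used for data loading (num_workers).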

trainloader=torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8)
testloader=torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False, num_workers=8)

For training, you just enumerate on the data loader.

for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
    # continue training...
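If the data is already in memory as NumPy arrays rather than an image folder, the same loop can be driven by torch.utils.data.TensorDataset. The following is a minimal sketch under that assumption; X_np, y_np and the toy shapes are hypothetical and not part of the original answer.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples, 3 features, one target per sample.
X_np = np.arange(30, dtype=np.float32).reshape(10, 3)
y_np = np.arange(10, dtype=np.float32).reshape(10, 1)

# torch.from_numpy shares memory with the arrays; TensorDataset pairs up rows.
dataset = TensorDataset(torch.from_numpy(X_np), torch.from_numpy(y_np))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
seen_targets = []
for inputs, labels in loader:
    batch_shapes.append(tuple(inputs.shape))
    seen_targets.extend(labels.reshape(-1).tolist())
    # a training step on (inputs, labels) would go here
```

With shuffle=True the loader reshuffles at the start of every pass, so each sample is visited exactly once per epoch and the last batch is simply smaller when the data set size is not a multiple of batch_size.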

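NumPy stuff

Yes. You have to convert a torch.tensor to numpy with the .numpy() method to work on it. If you are using CUDA, you have to download the data from the GPU to the CPU first with the .cpu() method before calling .numpy(). Personally, coming from a MATLAB background, I prefer to do most of the work with torch tensors and convert the data to numpy only for visualisation. Also bear in mind that torch stores data in channel-first mode, while numpy and PIL work in channel-last mode. This means you need to use np.rollaxis to move the channel axis to the last position. Sample code is below.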

np.rollaxis(make_grid(mynet.ftrextractor(inputs).data, nrow=8, padding=1).cpu().numpy(), 0, 3)
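To make the axis move concrete, here is a small sketch on a made-up tensor (the 3x4x5 shape is an assumption for illustration only). It shows np.rollaxis next to two equivalent spellings, np.transpose and Tensor.permute.

```python
import numpy as np
import torch

# Hypothetical image tensor: 3 channels, 4 rows, 5 columns (C, H, W).
img = torch.arange(3 * 4 * 5, dtype=torch.float32).reshape(3, 4, 5)

# .cpu() is a no-op for CPU tensors and copies GPU tensors to host memory.
img_np = img.cpu().numpy()

# Three equivalent ways to go from (C, H, W) to (H, W, C).
hwc_rollaxis = np.rollaxis(img_np, 0, 3)
hwc_transpose = np.transpose(img_np, (1, 2, 0))
hwc_permute = img.permute(1, 2, 0).cpu().numpy()
```

For a tensor that is part of an autograd graph, .detach() would be needed before .numpy().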

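Logging

The best method I found to visualise the feature maps is using tensor board. Code is available at yunjey/pytorch-tutorial.

【Comments】:

The tensor board link is broken. Also, you can use np.transpose() to convert from the channel-first to the channel-last representation.

If my data set is just a numpy array, how do I use your solution? Sorry for the noob question, I am confused.

【Answer 2】: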

Not sure what you were trying to do. W.r.t. batching, you do not have to convert to numpy. You can just use index_select(), e.g.:

for epoch in range(500):
    k = 0
    loss = 0
    while k < X_mdl.size(0):
        # draw M random row indices in [0, N); rows may repeat
        random_batch = torch.randint(0, N, (M,))
        batch_xs = X_mdl.index_select(0, random_batch)
        batch_ys = y.index_select(0, random_batch)

        # Forward pass: compute predicted y using operations on Variables
        y_pred = batch_xs.mm(W)
        # etc..
        k += M  # advance, otherwise the while loop never ends

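The rest of the code would have to be changed as well, though.

My guess is that you want to create a get_batch function that concatenates your X tensors and Y tensors. Something like: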

def make_batch(list_of_tensors):
    X, y = list_of_tensors[0]
    # may need to unsqueeze X and y to get right dimensions
    for i, (sample, label) in enumerate(list_of_tensors[1:]):
        X = torch.cat((X, sample), dim=0)
        y = torch.cat((y, label), dim=0)
    return X, y

Then during training you pick a batch size, e.g. max_batch_size = 32, and slice the examples out.

for epoch in range(n_epochs):  # n_epochs: number of passes over the data, chosen by you
    X, y = make_batch(list_of_tensors)
    X = Variable(X, requires_grad=False)
    y = Variable(y, requires_grad=False)

    k = 0
    while k < X.size(0):
        inputs = X[k:k+max_batch_size, :]
        labels = y[k:k+max_batch_size, :]
        # some computation
        k += max_batch_size
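This answer uses both index_select() and plain slicing, so it may help to see how the two differ. The sketch below uses a made-up 6x2 tensor (all names are illustrative): slicing takes a contiguous block of rows and returns a view, whereas index_select (or indexing with a LongTensor) takes arbitrary rows and returns a copy.

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape(6, 2)

# Basic slicing picks a contiguous block of rows and returns a view of X.
block = X[2:4, :]

# index_select (or indexing with a LongTensor) picks arbitrary rows, in any
# order and possibly repeated, and returns a new tensor with its own memory.
idx = torch.tensor([5, 0, 0])
picked = torch.index_select(X, 0, idx)
picked_fancy = X[idx]

# Writing into the view changes X; writing into the copy does not.
block[0, 0] = -1.0
picked[0, 0] = 100.0
```

In practice this suggests slicing for sequential batches over already shuffled data, and index-based selection when the batch rows are chosen at random.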

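【Comments】:

Really annoying, index_select() requires that the data not be a Variable... why?

Maybe because changing only part of the data inside a Variable would not allow gradients to be computed.

But the data always has requires_grad=False... so why does that matter?

You are right, requires_grad is just a boolean indicating whether the Variable was created by a subgraph. The data Variable should not need a grad, because you overwrite the original content anyway. Apparently you can index_select a Variable with a Variable: discuss.pytorch.org/t/indexing-a-variable-with-a-variable/2111

I am confused about one thing: what is the difference between index_select() and direct indexing X[k1:k2,:]? When do we use one versus the other?

【Answer 3】: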

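If I am understanding your code correctly, your get_batch2 function appears to take random mini-batches from your data set without tracking which indices you have already used in an epoch. The issue with this implementation is that it likely will not make use of all of your data.

The way I usually do batching is to create a random permutation of all the possible indices using torch.randperm(N) and loop through them in batches. For example: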

n_epochs = 100 # or whatever
batch_size = 128 # or whatever

for epoch in range(n_epochs):

    # X is a torch Variable
    permutation = torch.randperm(X.size()[0])

    for i in range(0,X.size()[0], batch_size):
        optimizer.zero_grad()

        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X[indices], Y[indices]

        # in case you wanted a semi-full example
        outputs = model.forward(batch_x)
        loss = lossfunction(outputs,batch_y)

        loss.backward()
        optimizer.step()

If you like to copy and paste, make sure you define your optimizer, model, and loss function somewhere before the start of the epoch loop.
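For a version that runs as is, the loop can be filled in as follows. This is only a sketch: the toy regression data, the torch.nn.Linear model, the MSE loss and the SGD settings are assumptions chosen for illustration, not part of the original answer.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy regression problem: y = 2*x0 - 3*x1 + 1.
N, D = 64, 2
X = torch.randn(N, D)
Y = X @ torch.tensor([[2.0], [-3.0]]) + 1.0

model = torch.nn.Linear(D, 1)
lossfunction = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

n_epochs = 50
batch_size = 16

for epoch in range(n_epochs):
    permutation = torch.randperm(N)  # fresh shuffle every epoch
    visited = []                     # indices used in this epoch
    for i in range(0, N, batch_size):
        optimizer.zero_grad()

        indices = permutation[i:i + batch_size]
        batch_x, batch_y = X[indices], Y[indices]

        loss = lossfunction(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

        visited.extend(indices.tolist())

final_loss = lossfunction(model(X), Y).item()
```

The visited list is there only to show that every index is used exactly once per epoch; it can be dropped in real training code.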

Regarding your error, try using torch.from_numpy(np.random.randint(0,N,size=M)).long() instead of torch.LongTensor(np.random.randint(0,N,size=M)). I am not sure whether this will fix the error you are getting, but it will avoid a future one.
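As a small illustration of that suggestion, the sketch below builds the index tensor from a NumPy array and uses it with index_select. The sizes N and M and the tensor X are made up; torch.randint is shown as a torch-only alternative that avoids NumPy altogether.

```python
import numpy as np
import torch

N, M = 10, 4
X = torch.arange(N * 2, dtype=torch.float32).reshape(N, 2)

# np.random.randint excludes the upper bound and samples with replacement.
np_indices = np.random.randint(0, N, size=M)
batch_indices = torch.from_numpy(np_indices).long()

batch_xs = torch.index_select(X, 0, batch_indices)

# Torch-only alternative: random int64 indices in [0, N).
torch_indices = torch.randint(0, N, (M,))
```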

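【Comments】:

How does torch.randperm(N) help?

It helps in two ways. The first is that it makes sure every data point in X is sampled within a single epoch. It is usually good to use all of your data to help your model generalize. The second is that it is relatively simple to implement. You do not have to write an entire function like get_batch2().

I did not know people actually keep track of the indices they have seen; is that standard practice? I thought the common practice was just to fetch data without replacement, at least for neural networks, no?

Yes, the important part is making sure that data is not repeated within an epoch and that all of the data is used in each epoch. Otherwise the model might overfit to some specific data and generalize worse to unseen test data. Keeping track of indices is just one easy way to achieve this. Another way is to shuffle the data at the start of each epoch. Whatever works. It looks like your example code might reuse some data and ignore other data within an epoch. Sorry if I misunderstood your code.

One benefit of using a permutation of indices is that it works regardless of which framework you use. Numpy has np.random.permutation(), so it is easy to do if you are using tensorflow.

【Answer 4】:

Create a class that is a subclass of torch.utils.data.Dataset and pass it to a torch.utils.data.DataLoader. Below is an example from my project.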

class CandidateDataset(Dataset):
    def __init__(self, x, y):
        self.len = x.shape[0]
        if torch.cuda.is_available():
            device = 'cuda'
        else:
            device = 'cpu'
        self.x_data = torch.as_tensor(x, device=device, dtype=torch.float)
        self.y_data = torch.as_tensor(y, device=device, dtype=torch.long)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

def fit(self, candidate_count):
        feature_matrix = np.empty(shape=(candidate_count, 600))
        target_matrix = np.empty(shape=(candidate_count, 1))
        fill_matrices(feature_matrix, target_matrix)
        candidate_ds = CandidateDataset(feature_matrix, target_matrix)
        train_loader = DataLoader(dataset = candidate_ds, batch_size = self.BATCH_SIZE, shuffle = True)
        for epoch in range(self.N_EPOCHS):
            print('starting epoch ' + str(epoch))
            for batch_idx, (inputs, labels) in enumerate(train_loader):
                print('starting batch ' + str(batch_idx) + ' epoch ' + str(epoch))
                inputs, labels = Variable(inputs), Variable(labels)
                self.optimizer.zero_grad()
                inputs = inputs.view(1, inputs.size()[0], 600)
                # init hidden with number of rows in input
                y_pred = self.model(inputs, self.model.initHidden(inputs.size()[1]))
                labels.squeeze_()
                # labels should be tensor with batch_size rows. Column the index of the class (0 or 1)
                loss = self.loss_f(y_pred, labels)
                loss.backward()
                self.optimizer.step()
                print('done batch ' + str(batch_idx) + ' epoch ' + str(epoch))

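【Comments】:

【Answer 5】:

You can use torch.utils.data

Assuming you have already loaded the data from the directory into train and test numpy arrays, you can inherit from the torch.utils.data.Dataset class to create your dataset object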

class MyDataset(Dataset):
    def __init__(self, x, y):
        super(MyDataset, self).__init__()
        assert x.shape[0] == y.shape[0] # assuming shape[0] = dataset size
        self.x = x
        self.y = y


    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

Then, create your dataset object

traindata = MyDataset(train_x, train_y)

Finally, use DataLoader to create your mini-batches

trainloader = torch.utils.data.DataLoader(traindata, batch_size=64, shuffle=True)
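To show what iterating over such a loader yields, here is a self-contained sketch. It repeats the MyDataset class so that it runs on its own; the random train_x and train_y arrays and their shapes are hypothetical stand-ins for real data.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, x, y):
        super().__init__()
        assert x.shape[0] == y.shape[0]  # shape[0] is the data set size
        self.x = x
        self.y = y

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

# Hypothetical stand-ins for train_x / train_y: 100 samples, 8 features.
train_x = np.random.rand(100, 8).astype(np.float32)
train_y = np.random.randint(0, 2, size=(100, 1))

trainloader = DataLoader(MyDataset(train_x, train_y), batch_size=64, shuffle=True)

batch_sizes = []
for inputs, labels in trainloader:
    # The default collate function stacks the NumPy rows into torch tensors.
    batch_sizes.append(inputs.shape[0])
```

Even though __getitem__ returns NumPy rows, each batch arrives as a pair of torch tensors, so no manual conversion is needed inside the training loop.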

【Comments】:

【Answer 6】:

Another approach is to use pd.DataFrame.sample

train = pd.read_csv(TrainSetPath)
test = pd.read_csv(TestSetPath)

# use df.sample() to shuffle the data frame 
train = train.sample(frac=1)
test = test.sample(frac=1)

for i in range(epochs):
        for j in range(batch_per_epoch):
            train_batch = train.sample(n=BatchSize, axis='index',replace=True)
            y_train = train_batch['Target']
            X_train = train_batch.drop(['Target'], axis=1)
            
            # convert data frames to tensors and send them to GPU (if used)
            X_train = torch.tensor(np.mat(X_train)).float().to(device)
            y_train = torch.tensor(np.mat(y_train)).float().to(device)
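Note that the loop above samples with replace=True, so rows can repeat within an epoch and some may never be drawn. A possible variant that visits each row once per epoch is sketched below: shuffle once with sample(frac=1) and slice with iloc. The small in-memory frame and its column names f1 and f2 are hypothetical stand-ins for the CSV data.

```python
import numpy as np
import pandas as pd
import torch

# Hypothetical frame standing in for the CSV: two features and a 'Target' column.
train = pd.DataFrame({
    "f1": np.arange(10, dtype=np.float32),
    "f2": np.arange(10, dtype=np.float32) * 2,
    "Target": np.arange(10, dtype=np.float32),
})
batch_size = 4
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

seen = []
shuffled = train.sample(frac=1)  # one shuffle per epoch
for start in range(0, len(shuffled), batch_size):
    batch = shuffled.iloc[start:start + batch_size]
    features = batch.drop(columns=["Target"]).to_numpy()
    targets = batch["Target"].to_numpy()

    # convert to tensors and send them to the GPU (if used)
    X_train = torch.tensor(features).float().to(device)
    y_train = torch.tensor(targets).float().to(device)
    seen.extend(y_train.tolist())
```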

【Comments】:
