How to split into train_loader and test_loader using PyTorch?

【Posted】: 2022-01-07 09:37:28 【Question】:

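I have a project that uses PyTorch, but I know nothing about PyTorch.

I have a CSV with 7 columns: the first 6 are features and the last one is the label.

My project says to randomly split the data into a train_loader and a test_loader. I have already set up the labels, but I am fairly sure they are incorrect.

My code looks like this: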

import torch
import torch.utils.data as data
import torch.utils.data.dataset as dataset
import numpy as np
import pickle
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class Nav_Dataset(dataset.Dataset):
    def __init__(self):
        self.data = np.genfromtxt('saved/training_data.csv', delimiter=',')
        self.x = self.data[:, 0:5]
        self.y = self.data[:, [6]]
        # it may be helpful for the final part to balance the distribution of your collected data

        # normalize data and save scaler for inference
        self.scaler = MinMaxScaler()
        self.normalized_data = self.scaler.fit_transform(self.data) #fits and transforms
        pickle.dump(self.scaler, open("saved/scaler.pkl", "wb")) #save to normalize at inference

    def __len__(self):
        # __len__() returns the length of the dataset
        return len(self.y)

    def __getitem__(self, idx):
        if not isinstance(idx, int):
            idx = idx.item()
        # for this example, __getitem__() must return a dict with entries {'input': x, 'label': y}
        # x and y should both be of type float32. There are many other ways to do this, but to work with autograding
        # please do not deviate from these specifications.
        y = self.y[idx]
        x = self.x[idx]
        sample = {'input': x, 'label': y}
        return sample


class Data_Loaders():
    def __init__(self, batch_size):
        self.nav_dataset = Nav_Dataset()
        # randomly split dataset into two data.DataLoaders, self.train_loader and self.test_loader
        # make sure your split can handle an arbitrary number of samples in the dataset as this may vary


def main():
    batch_size = 16
    data_loaders = Data_Loaders(batch_size)
    # note this is how the dataloaders will be iterated over, and cannot be deviated from
    for idx, sample in enumerate(data_loaders.train_loader):
        _, _ = sample['input'], sample['label']
    for idx, sample in enumerate(data_loaders.test_loader):
        _, _ = sample['input'], sample['label']

if __name__ == '__main__':
    main()

I can't figure out how to split the dataset and build the train and test data loaders. Is there a way to do this?

【Answer 1】:

PyTorch provides a convenient utility function for exactly this purpose, called random_split:

from torch.utils.data import random_split, DataLoader

class Data_Loaders():
    def __init__(self, batch_size, split_prop=0.8):
        self.nav_dataset = Nav_Dataset()

        # compute the number of samples in each subset from split_prop
        self.N_train = int(len(self.nav_dataset) * split_prop)
        self.N_test = len(self.nav_dataset) - self.N_train

        # randomly assign samples to the two subsets
        self.train_set, self.test_set = random_split(
            self.nav_dataset, [self.N_train, self.N_test])

        # wrap each subset in a DataLoader (the original answer elided any further arguments with "..")
        self.train_loader = DataLoader(self.train_set, batch_size=batch_size)
        self.test_loader = DataLoader(self.test_set, batch_size=batch_size)

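First compute the number of samples in each of the two subsets, then split the dataset with random_split. Finally, create a data loader from each resulting subset.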
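The same steps can be tried end to end without the CSV. The following is only a minimal sketch under stated assumptions: ToyNavDataset, make_loaders, the seed argument, and the synthetic 103-row table are hypothetical stand-ins, not part of the question or the answer. It takes columns 0 to 5 as features, as the question describes (note that the slice 0:5 in the question's code yields only five columns), returns float32 tensors in the required dict form, and passes a seeded generator to random_split so the split is repeatable.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class ToyNavDataset(Dataset):
    """Hypothetical stand-in for Nav_Dataset: 6 feature columns, 1 label column."""

    def __init__(self, table):
        # table: tensor of shape (n_samples, 7)
        self.x = table[:, 0:6].to(torch.float32)  # columns 0..5 are the features
        self.y = table[:, 6:7].to(torch.float32)  # column 6 is the label, kept 2-D

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {'input': self.x[idx], 'label': self.y[idx]}


def make_loaders(dataset, batch_size, split_prop=0.8, seed=0):
    # Derive both sizes from len(dataset) so any sample count works.
    n_train = int(len(dataset) * split_prop)
    n_test = len(dataset) - n_train
    # A seeded generator makes the random split repeatable (an assumption,
    # the original answer does not seed the split).
    generator = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(
        dataset, [n_train, n_test], generator=generator)
    # Shuffling the training batches is a common choice, not a requirement.
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader


table = torch.rand(103, 7)  # synthetic data in place of the CSV
train_loader, test_loader = make_loaders(ToyNavDataset(table), batch_size=16)
first_batch = next(iter(train_loader))
```

Because the test size is computed as the remainder, the two subsets always cover the whole dataset, whatever its length.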
