How to split into train_loader and test_loader using PyTorch?
Posted: 2022-01-07 09:37:28
Question: I have a project that uses PyTorch, but I know nothing about PyTorch.
I have a CSV with 7 columns; the last one is the label and the first 6 are the features.
The project asks me to randomly split the data into a train_loader and a test_loader. I have already built the labels, but I am fairly sure they are not correct.
My code looks like this:
import torch
import torch.utils.data as data
import torch.utils.data.dataset as dataset
import numpy as np
import pickle
from sklearn.preprocessing import MinMaxScaler, StandardScaler

class Nav_Dataset(dataset.Dataset):
    def __init__(self):
        self.data = np.genfromtxt('saved/training_data.csv', delimiter=',')
        self.x = self.data[:, 0:6]   # first 6 columns are the features
        self.y = self.data[:, [6]]   # last column is the label (kept 2-D)
        # it may be helpful for the final part to balance the distribution of your collected data
        # normalize data and save scaler for inference
        self.scaler = MinMaxScaler()
        self.normalized_data = self.scaler.fit_transform(self.data) #fits and transforms
        pickle.dump(self.scaler, open("saved/scaler.pkl", "wb")) #save to normalize at inference

    def __len__(self):
        # __len__() returns the length of the dataset
        return len(self.y)

    def __getitem__(self, idx):
        if not isinstance(idx, int):
            idx = idx.item()
        # for this example, __getitem__() must return a dict with entries {'input': x, 'label': y}
        # x and y should both be of type float32. There are many other ways to do this, but to work with autograding
        # please do not deviate from these specifications.
        y = self.y[idx]
        x = self.x[idx]
        sample = {'input': x, 'label': y}
        return sample

class Data_Loaders():
    def __init__(self, batch_size):
        self.nav_dataset = Nav_Dataset()
        # randomly split dataset into two data.DataLoaders, self.train_loader and self.test_loader
        # make sure your split can handle an arbitrary number of samples in the dataset as this may vary

def main():
    batch_size = 16
    data_loaders = Data_Loaders(batch_size)
    # note this is how the dataloaders will be iterated over, and cannot be deviated from
    for idx, sample in enumerate(data_loaders.train_loader):
        _, _ = sample['input'], sample['label']
    for idx, sample in enumerate(data_loaders.test_loader):
        _, _ = sample['input'], sample['label']

if __name__ == '__main__':
    main()
I can't work out how to split the dataset and build the train and test data loaders. Is there a way to do this?
Answer 1: PyTorch provides a convenient utility function for exactly this purpose, called random_split.
from torch.utils.data import random_split, DataLoader

class Data_Loaders():
    def __init__(self, batch_size, split_prop=0.8):
        self.nav_dataset = Nav_Dataset()
        # compute number of samples in each subset; the test set takes the
        # remainder, so the two sizes always sum to len(self.nav_dataset)
        self.N_train = int(len(self.nav_dataset) * split_prop)
        self.N_test = len(self.nav_dataset) - self.N_train
        # randomly assign samples to the two non-overlapping subsets
        self.train_set, self.test_set = random_split(self.nav_dataset,
                                                     [self.N_train, self.N_test])
        # .. pass any further DataLoader arguments here as needed
        self.train_loader = DataLoader(self.train_set, batch_size=batch_size)
        self.test_loader = DataLoader(self.test_set, batch_size=batch_size)
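First compute the number of samples in each of the two subsets, then use random_split to split the dataset accordingly. Finally, create a data loader from each subset.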
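The steps above can be tried end to end without the CSV file. The following is a minimal sketch, not the asker's actual setup: ToyNavDataset, make_loaders, and the seed parameter are hypothetical names introduced here, the data is random, and shuffle=True on the training loader is an assumption rather than something the answer specifies.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class ToyNavDataset(Dataset):
    """Hypothetical stand-in for Nav_Dataset: 6 feature columns plus 1 label column."""

    def __init__(self, n_samples=103, seed=0):
        rng = np.random.default_rng(seed)
        # random float32 data in place of saved/training_data.csv (assumption)
        data = rng.random((n_samples, 7)).astype(np.float32)
        self.x = data[:, 0:6]   # first 6 columns are the features
        self.y = data[:, [6]]   # last column is the label, kept 2-D

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # dict-style sample, matching the format required in the question
        return {'input': self.x[idx], 'label': self.y[idx]}


def make_loaders(dataset, batch_size, split_prop=0.8, seed=0):
    # the test set takes the remainder, so any dataset length works
    n_train = int(len(dataset) * split_prop)
    n_test = len(dataset) - n_train
    # a seeded generator makes the random split reproducible
    generator = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(dataset, [n_train, n_test],
                                       generator=generator)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader


train_loader, test_loader = make_loaders(ToyNavDataset(), batch_size=16)

# iterate exactly as in the question's main()
for idx, sample in enumerate(train_loader):
    inputs, labels = sample['input'], sample['label']
```

The default collate function of DataLoader stacks the dict entries per key, so each batch is itself a dict of tensors. Because the test size is computed as the remainder, the split does not depend on the dataset length being divisible by anything.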