How to create a data preprocessing pipeline in PyTorch outside the Dataloader class?
Posted: 2021-07-10 13:03:45

I am trying to create a model for data that has 40 features which have to be classified into 10 classes. I am new to PyTorch and this is my first project in it.
I have been given a custom Dataset class (which I am not allowed to change), shown below:
import pickle
import torch
from torch.utils.data import Dataset

class MyData(Dataset):
    def __init__(self, mode):
        with open(mode + '.pkl', 'rb') as handle:
            data = pickle.load(handle)
        self.X = data['x'].astype('float')
        self.y = data['y'].astype('long')

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        sample = (self.X[idx], self.y[idx])
        return sample
I have done some preprocessing on the data, such as normalization, and then trained and saved the model. Since I am not allowed to change the Dataset class, I did the preprocessing outside of it and then used a DataLoader. The preprocessing is as follows:
train_data=MyData("train")
features, labels = train_data[:]
df = pd.DataFrame(features)
x = df.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
input_array = x_scaled
output_array = labels
inputs = torch.Tensor(input_array)
targets = torch.Tensor(output_array).type(torch.LongTensor)
dataset = TensorDataset(inputs, targets)
train_ds, val_ds = random_split(dataset, [3300, 300])
batch_size = 300
n_epochs = 200
log_interval = 10
train_losses = []
train_counter = []
test_losses = []
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]
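As a quick sanity check (my own sketch, not part of the original post), one batch can be pulled from the loader defined above to confirm the shapes and dtypes the model will actually receive:

xb, yb = next(iter(train_loader))
print(xb.shape, xb.dtype)   # expected: torch.Size([300, 40]) torch.float32
print(yb.shape, yb.dtype)   # expected: torch.Size([300]) torch.int64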
After this, I define the train and test functions (and removed the print statements, since the autograder will not grade my assignment if I leave them in), as follows:
def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data.double())
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            train_losses.append(loss.item())
            train_counter.append(
                (batch_idx*32) + ((epoch-1)*len(train_loader.dataset)))
            save_model(model)

def test():
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data.double())
            test_loss += criterion(output, target).item()
            pred = output.data.max(1, keepdim=True)[1]
            correct += pred.eq(target.data.view_as(pred)).sum()
    test_loss /= len(val_loader.dataset)
    test_losses.append(test_loss)

test()
for epoch in range(1, n_epochs + 1):
    train(epoch)
    test()
Even after doing this, the autograder still fails to grade my code. I mostly suspect the problem is in the way I feed the data into the model, but I cannot narrow down exactly where it goes wrong or how to correct it. Since I am new to PyTorch, I have been reading up on how to do preprocessing, but everything I find involves the Dataset class, so I am not sure how to proceed.
My model is as follows:
from torch import nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # self.flatten = nn.Flatten()
        self.net_stack = nn.Sequential(
            nn.Conv1d(in_channels=40, out_channels=256, kernel_size=1, stride=2),  # applying batch norm
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
            nn.Dropout(p=0.1),
            nn.BatchNorm1d(256, affine=True),
            nn.Conv1d(in_channels=256, out_channels=128, kernel_size=1, stride=2),  # applying batch norm
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
            nn.Dropout(p=0.1),
            nn.BatchNorm1d(128, affine=True),
            nn.Conv1d(in_channels=128, out_channels=64, kernel_size=1, stride=2),  # applying batch norm
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
            nn.Dropout(p=0.1),
            nn.BatchNorm1d(64, affine=True),
            nn.Conv1d(in_channels=64, out_channels=32, kernel_size=1, stride=2),  # applying batch norm
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
            nn.Dropout(p=0.1),
            nn.BatchNorm1d(32, affine=True),
            nn.Flatten(),
            nn.Linear(32, 10),
            nn.Softmax(dim=1)).double()

    def forward(self, x):
        # result = self.net_stack(x[None])
        x = x.double()
        result = self.net_stack(x[:, :, None]).double()
        print(result.size())
        return result
One of the instructions I was given reads:
# Please make sure we can load your model with:
# model = MyModel()
# This means you must give default values to all parameters you may wish to set, such as output size.
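A minimal check of that requirement (my own sketch, not from the assignment) is to construct the model with no arguments and push a dummy batch through it; with the forward() above, a (batch, 40) float tensor should come out as (batch, 10):

model = MyModel()             # must work with no arguments
dummy = torch.randn(4, 40)    # 4 samples with 40 features, as in the data above
out = model(dummy)            # forward() adds the trailing length-1 dimension itself
# out.shape should be torch.Size([4, 10]): one row of class probabilities per sample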
Answer 1:

You can try doing it inside the training loop, like this:
for batch_idx, (data, target) in enumerate(train_loader):
    # you can do something here to manipulate your input
    data = transform(data)
    data = data.to('cuda')  # Move to gpu, I noticed you didn't do it in your training loop
    # Forward pass
    output = model(data)
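A concrete version of that idea (my own sketch, not the answerer's code) would be to move the min-max scaling into the loop: compute the per-feature minima and maxima once from the raw training features (assumed here to sit in a hypothetical (N, 40) tensor raw_inputs), then rescale each batch on the fly:

# Hypothetical sketch: the MinMaxScaler step done inside the training loop.
x_min = raw_inputs.min(dim=0).values    # per-feature minimum over the training set
x_max = raw_inputs.max(dim=0).values    # per-feature maximum over the training set

def transform(batch):
    # scale every feature to [0, 1], mirroring preprocessing.MinMaxScaler above
    return (batch - x_min) / (x_max - x_min + 1e-8)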