How do I split a training dataset into training, validation, and test sets?

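I have a custom image dataset with its targets, and I have created a training dataset from it in PyTorch. I want to split it into 3 parts: training, validation, and test. How do I do that?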

Answer

Once you have the "master" dataset, you can split it with data.Subset. Here is an example of a random split:

import torch
from torch.utils import data
import random

master = data.Dataset( ... )  # your "master" dataset
n = len(master)  # how many total elements you have
n_test = int( n * .05 )  # number of test/val elements
n_train = n - 2 * n_test
idx = list(range(n))  # indices to all elements
random.shuffle(idx)  # in-place shuffle the indices to facilitate random splitting
train_idx = idx[:n_train]
val_idx = idx[n_train:(n_train + n_test)]
test_idx = idx[(n_train + n_test):]

train_set = data.Subset(master, train_idx)
val_set = data.Subset(master, val_idx)
test_set = data.Subset(master, test_idx)
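
The index bookkeeping above can be checked without PyTorch. The following is a minimal sketch, not part of the original answer: the helper name split_indices and its val_frac, test_frac, and seed parameters are hypothetical. It uses a private random.Random instance so that the same seed always yields the same split.

```python
import random


def split_indices(n, val_frac=0.05, test_frac=0.05, seed=0):
    """Return three disjoint, shuffled index lists: (train, val, test)."""
    n_val = int(n * val_frac)        # floor, as in the answer above
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test     # the training set takes whatever is left
    idx = list(range(n))
    # A private RNG keeps the split reproducible and leaves the
    # global random state untouched.
    random.Random(seed).shuffle(idx)
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 900 50 50
```

Each returned list can be passed to data.Subset in the same way as train_idx, val_idx, and test_idx above.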

The same split can be done with data.random_split:

# the validation and test sets both contain n_test elements
train_set, val_set, test_set = data.random_split(master, (n_train, n_test, n_test))
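
If the split needs to be reproducible between runs, random_split accepts a generator argument. The sketch below assumes a small TensorDataset as a stand-in for the real "master" dataset; the helper seeded_split and the seed value are illustrative only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for the "master" dataset: 100 samples, 3 features each.
features = torch.arange(300, dtype=torch.float32).reshape(100, 3)
targets = torch.arange(100)
master = TensorDataset(features, targets)

n = len(master)
n_test = int(n * 0.05)           # size of the test set
n_val = n_test                   # validation set has the same size
n_train = n - n_val - n_test


def seeded_split(seed):
    """Split master into (train, val, test) with a fixed random seed."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(master, [n_train, n_val, n_test], generator=generator)


train_set, val_set, test_set = seeded_split(0)
print(len(train_set), len(val_set), len(test_set))  # 90 5 5
```

Calling seeded_split with the same seed again returns subsets with the same indices, which makes it possible to keep the test set fixed across experiments.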

Another answer

Given train_frac=0.8, this function splits the dataset into 80%, 10%, and 10%:

from torch.utils.data import random_split

def dataset_split(dataset, train_frac):
    '''
    param dataset:    Dataset object to be split
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset into a dictionary with keys, based on these ratios:
        'train': train_frac
        'valid': (1 - train_frac) / 2
        'test':  (1 - train_frac) / 2
    '''
    assert 0 <= train_frac <= 1, "Invalid training set fraction"

    length = len(dataset)

    # int() floors the train size, so any rounding remainder goes to the smaller valid and test sets
    train_length = int(length * train_frac)
    valid_length = (length - train_length) // 2
    test_length = length - train_length - valid_length

    # random_split returns one Subset per requested length, in the same order
    subsets = random_split(dataset, (train_length, valid_length, test_length))
    return dict(zip(('train', 'valid', 'test'), subsets))
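
To see how the rounding behaves when the dataset size does not divide evenly, the length arithmetic can be run on its own. This is a small sketch that mirrors the three length computations from the function above; the helper name split_lengths is hypothetical.

```python
def split_lengths(length, train_frac):
    """Return (train, valid, test) sizes using the same rounding as above."""
    train_length = int(length * train_frac)            # floored
    valid_length = (length - train_length) // 2        # floored half of the rest
    test_length = length - train_length - valid_length # absorbs any remainder
    return train_length, valid_length, test_length


for length in (10, 101, 1003):
    print(length, split_lengths(length, 0.8))
# 10 (8, 1, 1)
# 101 (80, 10, 11)
# 1003 (802, 100, 101)
```

The three sizes always add up to the dataset length, which random_split requires when it is given integer lengths.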
