How to get a validation set that has an equal number of images for each class using TensorFlow?

Posted 2023-02-16

Asked: 2022-01-19 06:15:15

Question:

I am using the CIFAR-100 dataset to train a model, and I want to use 10% of the training data as validation data. At first I used the code below.

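import tensorflow as tf
from tensorflow.keras import datasets
from sklearn.model_selection import train_test_split

(train_images, train_labels), (test_images, test_labels) = datasets.cifar100.load_data()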
train_images, val_images, train_labels, val_labels = train_test_split(train_images, train_labels, test_size=0.1)

train_db = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
train_db = train_db.map(train_prep).shuffle(5000).repeat().batch(128).prefetch(-1)

val_db = tf.data.Dataset.from_tensor_slices((val_images, val_labels))
val_db = val_db.map(valid_prep).batch(512).prefetch(-1)

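This works well for some models. For other models, however, the validation accuracy can be much higher than the test accuracy. I think the reason may be that train_test_split does not guarantee that the validation set has the same number of images for each class. So I tried to build the validation set "manually". My code is shown below.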

(train_images, train_labels), (test_images, test_labels) = datasets.cifar100.load_data()

def get_index(y):
  index = [[] for i in range(100)]
  for i in range(len(y)):
      for j in range(100):
          if y[i][0] == j:
              index[j].append(i)
  return index

index = get_index(train_labels)

index_train = []
index_val = []
for i in range(100):
  index1, index2 = train_test_split(index[i], test_size=0.1)
  index_train.extend(index1)
  index_val.extend(index2)

val_images = train_images[index_val]
train_images_1 = train_images[index_train]

val_labels = train_labels[index_val]
train_labels_1 = train_labels[index_train]

train_db = tf.data.Dataset.from_tensor_slices((train_images_1, train_labels_1))
train_db = train_db.map(train_prep).shuffle(5000).repeat().batch(128).prefetch(-1)

val_db = tf.data.Dataset.from_tensor_slices((val_images, val_labels))
val_db = val_db.map(valid_prep).batch(512).prefetch(-1)

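But when I train my model with this training set and validation set, the accuracy is quite low. So there must be something wrong with this way of splitting, but I don't know what it is. I would appreciate it if anyone could help me solve this problem.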

Answer 1:

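train_test_split has a parameter called stratify that can help you. In the example below, assume the dataframe df has 2 columns. The first holds file paths: each row contains the full path to an image file. The second is called labels: each row contains text identifying the class of the image in that row. For example, if you are classifying images of dogs and cats, the label would be "dog" or "cat". Suppose 80% of the images are cats and 20% are dogs. When you split the dataset, stratify ensures that each resulting dataframe also has 80% cat images and 20% dog images. The code is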

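from sklearn.model_selection import train_test_split

train_split = .8  # fraction of images to use for training
validation_split = .1  # fraction of images to use for validation
# fraction of the held-out data (dummy_df) that goes to validation
dsplit = validation_split / (1 - train_split)
train_df, dummy_df = train_test_split(df, train_size=train_split, shuffle=True, random_state=123, stratify=df['labels'])
# stratify on dummy_df's own labels, because dummy_df is the frame being split here
valid_df, test_df = train_test_split(dummy_df, train_size=dsplit, shuffle=True, random_state=123, stratify=dummy_df['labels'])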

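The result is 3 dataframes: train_df, test_df and valid_df. Each has the same ratio of cat and dog labels as the original dataframe df. To instead get a balanced dataset in which 50% of the labels are cats and 50% are dogs, you need undersampling, image augmentation, or a combination of both.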
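The same stratify parameter can be applied to array data such as CIFAR-100. Below is a minimal sketch, not the answerer's code. It assumes labels shaped (N, 1) as in the question and uses a small synthetic stand-in (100 classes, 500 samples each, tiny 2x2 "images") so that it runs without downloading anything. The names all_idx, train_idx and val_idx are illustrative. Because the classes are balanced, a stratified 10% split should give every class the same number of validation images.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CIFAR-100 training set (assumption, not real data):
# 100 classes x 500 samples, labels shaped (N, 1), tiny 2x2 RGB "images".
rng = np.random.default_rng(0)
num_classes, per_class = 100, 500
train_labels = rng.permutation(
    np.repeat(np.arange(num_classes), per_class)
).reshape(-1, 1)
train_images = rng.integers(
    0, 256, size=(len(train_labels), 2, 2, 3), dtype=np.uint8
)

# Split sample indices instead of the arrays themselves.
# stratify keeps the class proportions identical in both parts.
all_idx = np.arange(len(train_labels))
train_idx, val_idx = train_test_split(
    all_idx, test_size=0.1, stratify=train_labels.ravel(), random_state=123
)

train_images_1, train_labels_1 = train_images[train_idx], train_labels[train_idx]
val_images, val_labels = train_images[val_idx], train_labels[val_idx]

# Count how many samples of each class ended up in each split.
val_counts = np.bincount(val_labels.ravel(), minlength=num_classes)
train_counts = np.bincount(train_labels_1.ravel(), minlength=num_classes)
print("validation per class (min, max):", val_counts.min(), val_counts.max())
print("training per class (min, max):", train_counts.min(), train_counts.max())
```

With the real data, train_images and train_labels would come from datasets.cifar100.load_data() as in the question. Note that train_test_split shuffles by default, so train_idx is not grouped by class. Concatenating per-class index lists, as the manual split in the question does, leaves the training set ordered by class, and shuffle(5000) only mixes samples within a 5000-element buffer. That may be one reason the manual split trained poorly, although the original answer does not say so.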
