How to select almost equally distributed classes in training, validation, and test samples?

Posted 2023-03-12

【Title】: How to select almost equally distributed classes in training, validation, test samples? 【Posted】: 2020-09-20 12:32:27 【Question】:

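I am working on the Sign Language MNIST dataset to classify images with Keras. The dataset has 24 different classes, but the class distribution is quite uneven.

I used sklearn.model_selection.train_test_split with stratify=df['label'], but some classes still make up 5% of the data while others make up only 3%. How can I select the data so that each class accounts for roughly 4% of it?

My test_df has 7172 rows and 785 columns: one is the label column and the remaining 784 are grayscale pixel values (28*28).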

import pandas as pd
from sklearn.model_selection import train_test_split

test_df = pd.read_csv(TEST_PATH)

# shuffle and split validation,test data
test_df = test_df.sample(frac=1.0,random_state=SEED).iloc[:2000,:] # shuffle the whole data, get first 2000 rows
val_df,test_df = train_test_split(test_df,test_size=0.5,random_state=SEED,stratify=test_df['label'])
# stratify the labels so that distribution of classes is almost same

# extract pixels and labels for both validation,test data
X_val = val_df.drop('label',axis=1).values.reshape((val_df.shape[0],28,28))/255.0 # validation images
y_val = val_df['label'].ravel() # validation label

X_test = test_df.drop('label',axis=1).values.reshape((test_df.shape[0],28,28))/255.0 # test images
y_test = test_df['label'].ravel() # test label
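One way to see why stratify leaves the 3% to 5% spread in place: a stratified split keeps each class's share from the input frame, it does not even the shares out. The following is a minimal sketch on synthetic data, not the Sign Language MNIST file itself. The three-class frame, the `pixel0` column, and `SEED = 42` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # assumed seed, used only for reproducibility

# Synthetic stand-in for the label column: 3 deliberately imbalanced classes
# (50%, 30%, 20% of 1000 rows). "pixel0" is a dummy feature column.
labels = np.repeat([0, 1, 2], [500, 300, 200])
df = pd.DataFrame({"label": labels, "pixel0": np.arange(labels.size)})

# Same kind of call as in the question: stratified 50/50 split.
val_df, test_df = train_test_split(
    df, test_size=0.5, random_state=SEED, stratify=df["label"]
)

# Compare class shares in the source frame and in both halves.
shares = pd.DataFrame({
    "source": df["label"].value_counts(normalize=True),
    "val": val_df["label"].value_counts(normalize=True),
    "test": test_df["label"].value_counts(normalize=True),
}).sort_index()
print(shares)
```

Both halves reproduce the imbalanced source shares, so equal shares have to be enforced before the split, for example by sampling the same number of rows from every class.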

【Comments】:

【Answer 1】:

The per-class sampling line below gives you an equal class distribution in both val and test. You can also adjust the number of samples.

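import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42
n_classes = 24

test_df = pd.read_csv(TEST_PATH)

# draw the same number of rows from every class (2000 // 24 = 83 per class)
test_df = [test_df.loc[test_df.label==i].sample(n=2000//n_classes,random_state=SEED) for i in test_df.label.unique()]
test_df = pd.concat(test_df, axis=0, ignore_index=True)

# stratified 50/50 split of the balanced pool into validation and test
val_df,test_df = train_test_split(test_df,test_size=0.5,random_state=SEED,stratify=test_df['label'])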
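The same idea can be sketched with `DataFrameGroupBy.sample` (available in pandas 1.1 and later), drawing an even number of rows per class so that the 50/50 split can give every class exactly the same count in both halves. This is an illustrative variant, not the answerer's code. The helper name `balanced_val_test`, its `per_class_each` parameter, and the synthetic frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42


def balanced_val_test(df, per_class_each, label_col="label", seed=SEED):
    """Return (val_df, test_df), each holding `per_class_each` rows of every class."""
    # Draw 2 * per_class_each rows from every class: a perfectly balanced pool.
    pool = df.groupby(label_col).sample(n=2 * per_class_each, random_state=seed)
    # A stratified 50/50 split of a balanced pool keeps both halves balanced.
    return train_test_split(
        pool, test_size=0.5, random_state=seed, stratify=pool[label_col]
    )


# Synthetic stand-in with the same shape idea as the question's test_df:
# 7172 rows, 24 classes, one dummy pixel column.
demo = pd.DataFrame({
    "label": np.arange(7172) % 24,
    "pixel1": np.zeros(7172, dtype=int),
})

val_df, test_df = balanced_val_test(demo, per_class_each=41)
print(val_df["label"].value_counts().unique())   # one distinct count per class
print(len(val_df), len(test_df))
```

The call fails with a `ValueError` if any class has fewer than `2 * per_class_each` rows, so the per-class count is bounded by the rarest class.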

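【Discussion】:

This guarantees an equal distribution of labels. Check pd.value_counts(y_val) ...

If you select 2000 samples and have 24 classes, you cannot end up with 1000. Whatever n I try, it always comes out 4 short: 2500 gives 2496 images, and 2000 gives 996. I will try changing n.

That is what happens when you enforce an equal distribution.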
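The shortfall reported above follows from integer division: the requested total is rounded down to a multiple of the number of classes, and the result is then halved by the split. The sketch below works through that arithmetic. The helper names `balanced_sizes` and `even_per_class` are hypothetical and not part of pandas or scikit-learn.

```python
def balanced_sizes(target_total, n_classes, n_splits=2):
    """Return (rows per class, total rows kept, rows per split)."""
    per_class = target_total // n_classes   # same count for every class
    total = per_class * n_classes           # rounded down to a multiple of n_classes
    return per_class, total, total // n_splits


def even_per_class(target_total, n_classes, n_splits=2):
    """Largest per-class count that also divides evenly across the splits."""
    per_class = target_total // n_classes
    return per_class - per_class % n_splits


print(balanced_sizes(2000, 24))   # (83, 1992, 996)
print(balanced_sizes(2500, 24))   # (104, 2496, 1248)
print(even_per_class(2000, 24))   # 82
```

With 83 rows per class, a 50/50 split cannot give every class the same count in both halves, since 83 is odd. Using 82 per class (1968 rows in total) yields 41 rows of every class in each half.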
