Stratified Cross Validation or Sampling for train-test split based on multiple features in python

Posted 2023-03-12


【Title】: Stratified Cross Validation or Sampling for train-test split based on multiple features in python 【Posted】: 2021-09-10 05:30:33 【Question】:

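sklearn's train_test_split, StratifiedShuffleSplit and StratifiedKFold all stratify based on the class labels (the y variable or target_column). What if we want to sample based on feature columns (x variables) rather than the target column? With a single feature it is easy to stratify on that one column, but what if there are many feature columns and we want the selected sample to preserve the population's proportions?

Below I create a df with a skewed population: more low-income people, more women, the fewest people from CA and the most from MA. I want the selected sample to share these characteristics, i.e. more low-income people, more women, the fewest from CA and the most from MA.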

import random
import string
import pandas as pd
N = 20000 # Total rows in data
names    = [''.join(random.choices(string.ascii_uppercase, k = 5)) for _ in range(N)]
incomes  = [random.choices(['High','Low'], weights=(30, 70))[0] for _ in range(N)]
genders  = [random.choices(['M','F'], weights=(40, 60))[0] for _ in range(N)]
states   = [random.choices(['CA','IL','FL','MA'], weights=(10,20,30,40))[0] for _ in range(N)]
targets_y= [random.choice([0,1]) for _ in range(N)]

df = pd.DataFrame(dict(
        name     = names,
        income   = incomes,
        gender   = genders,
        state    = states,
        target_y = targets_y
    ))

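More complexity arises when, for some feature values, we have very few examples and we want at least n of them included in the selected sample. Consider this example: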

single_row = {'name' : 'ABC',
'income' : 'High',
'gender' : 'F',
'state' : 'NY',
'target_y' : 1}

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame([single_row])], ignore_index=True)

df

I want this added row to always be included in the test split (n=1 here).


【Answer 1】:

This can be achieved with pandas groupby:

Let's first check the population characteristics:

grps = df.groupby(['state','income','gender'], group_keys=False)
grps.count()

Next, let's create a test set from 20% of the original data, taking at least one row from every group:

test_proportion = 0.2
at_least = 1
test = grps.apply(lambda x: x.sample(max(round(len(x)*test_proportion), at_least)))
test
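One caveat with the sampling line above: `DataFrame.sample` raises a `ValueError` when asked for more rows than a group contains (without replacement), which can happen if `at_least` is larger than the smallest group. A minimal sketch of a guarded variant follows. The helper name `sample_group`, the `random_state` argument and the toy data are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

def sample_group(group, test_proportion=0.2, at_least=1, random_state=None):
    """Hypothetical helper: sample a share of one group, but never
    fewer than `at_least` rows and never more than the group holds."""
    n = max(round(len(group) * test_proportion), at_least)
    n = min(n, len(group))  # guard: cannot draw more rows than exist
    return group.sample(n, random_state=random_state)

# Toy data (illustrative only): two larger groups and one single-row group.
toy = pd.DataFrame({
    "state": ["MA"] * 10 + ["CA"] * 5 + ["NY"],
    "value": range(16),
})

# Sample each group separately, then stitch the pieces back together.
test_toy = pd.concat(
    [sample_group(g, at_least=2, random_state=0) for _, g in toy.groupby("state")]
)
print(test_toy["state"].value_counts())
```

With `at_least=2`, the single NY row is still selected instead of triggering an error, because the requested size is capped at the group size.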

Test set characteristics:

test.groupby(['state','income','gender']).count()
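Raw counts are hard to compare by eye, so it may help to put the share of each feature combination in the population next to its share in the test sample. The sketch below rebuilds a smaller population with the same weights as the question; the names `demo`, `demo_test` and `comparison`, and the fixed seed, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
n = 5000
demo = pd.DataFrame({
    "income": rng.choice(["High", "Low"], size=n, p=[0.3, 0.7]),
    "gender": rng.choice(["M", "F"], size=n, p=[0.4, 0.6]),
    "state": rng.choice(["CA", "IL", "FL", "MA"], size=n, p=[0.1, 0.2, 0.3, 0.4]),
})
cols = ["state", "income", "gender"]

# Same idea as the answer: draw about 20% from every feature combination.
parts = [
    g.sample(max(round(len(g) * 0.2), 1), random_state=0)
    for _, g in demo.groupby(cols)
]
demo_test = pd.concat(parts)

# Share of each combination in the population vs. in the test sample.
comparison = pd.DataFrame({
    "population": demo.groupby(cols).size() / len(demo),
    "test": demo_test.groupby(cols).size() / len(demo_test),
})
comparison["abs_diff"] = (comparison["population"] - comparison["test"]).abs()
print(comparison)
```

The remaining differences come only from rounding each group's sample size to a whole number of rows, so they shrink as the groups get larger.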

Next we create the training set as the difference between the original df and the test set:

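print('No. of samples in test  =', len(test))
# Drop the sampled rows by index: names are random and may repeat,
# so a set difference on the name column can miscount rows
train = df.drop(test.index)
print('No. of samples in train =', len(train))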

No. of samples in test  = 4000

No. of samples in train = 16001
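Since the question mentions sklearn's `train_test_split`, here is an alternative sketch that is not the approach of the answer above: the feature columns are combined into a single key, and that key is passed as `stratify`. `train_test_split` refuses to stratify on a class with fewer than two members, so this sketch sets such rare rows aside and appends them to the test set, as the question asks. The names `demo`, `key` and `rare`, and the fixed seed, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
n = 2000
demo = pd.DataFrame({
    "income": rng.choice(["High", "Low"], size=n, p=[0.3, 0.7]),
    "gender": rng.choice(["M", "F"], size=n, p=[0.4, 0.6]),
    "state": rng.choice(["CA", "IL", "FL", "MA"], size=n, p=[0.1, 0.2, 0.3, 0.4]),
})
# One rare row, mirroring the single NY example from the question.
rare_row = pd.DataFrame([{"income": "High", "gender": "F", "state": "NY"}])
demo = pd.concat([demo, rare_row], ignore_index=True)

# Combine the feature columns into one stratification key per row.
key = demo["state"] + "|" + demo["income"] + "|" + demo["gender"]

# Rows whose key occurs fewer than 2 times cannot be stratified by sklearn.
rare = key.map(key.value_counts()) < 2

train, test = train_test_split(
    demo[~rare],
    test_size=0.2,
    stratify=key[~rare],
    random_state=0,
)
# Force the rare rows into the test split.
test = pd.concat([test, demo[rare]])

print("No. of samples in test  =", len(test))
print("No. of samples in train =", len(train))
```

The same combined key could also be passed as the `y` argument of `StratifiedKFold.split` to get folds stratified on the features, again after handling keys that have fewer rows than the number of folds.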

