Train Test Split with same index

Posted 2023-03-12


【Title】Train Test Split with same index 【Asked】2018-12-14 17:12:01 【Question】:

I want all rows that share the same index value to end up in the same set: either train or test, but never both. How can I do this in sklearn? For example:

import random
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6], 'B': random.sample(range(10, 100), 12)})
df.set_index('A', inplace=True)  # column A becomes the (non-unique) index

I would like to get:

A training set with indices 1, 3, 5, 6
A test set with indices 2, 4

How can I make sure this also holds when using GridSearchCV?


【Answer 1】:

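Treat the index values as groups. The split() method of sklearn's splitters takes an argument called groups, and the group-aware splitters use it to keep every row with the same group label on the same side of a split, which is what you want.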

Example:

You can use GroupKFold or GroupShuffleSplit:

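from sklearn.model_selection import GroupKFold
group_kfold = GroupKFold(n_splits=3)
# The index values act as group labels, so rows sharing an index stay together
for train_index, test_index in group_kfold.split(df, groups=df.index):
    print("Train", df.iloc[train_index].index)
    print("Test", df.iloc[test_index].index)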

Output (as printed under Python 2):
('Train', Int64Index([1, 1, 1, 2, 2, 3, 4, 4], dtype='int64', name=u'A'))
('Test', Int64Index([5, 6, 6, 6], dtype='int64', name=u'A'))

('Train', Int64Index([2, 2, 4, 4, 5, 6, 6, 6], dtype='int64', name=u'A'))
('Test', Int64Index([1, 1, 1, 3], dtype='int64', name=u'A'))

('Train', Int64Index([1, 1, 1, 3, 5, 6, 6, 6], dtype='int64', name=u'A'))
('Test', Int64Index([2, 2, 4, 4], dtype='int64', name=u'A'))

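You can see that the last train/test split matches what you asked for. In every fold, the rows of a given group land entirely in train or entirely in test, never in both.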
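The answer also mentions GroupShuffleSplit without showing it. For a single train/test split rather than several folds, a minimal sketch could look like the following. The seed, `random_state`, and `test_size=2` (hold out two index values) are assumptions chosen for illustration, and which index values land in the test set depends on the random state, so the exact 2/4 split from the question is not guaranteed.

```python
import random

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Same toy frame as in the question; the seed is an assumption for repeatability.
random.seed(0)
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6],
                   'B': random.sample(range(10, 100), 12)})
df.set_index('A', inplace=True)

# Use the index values as group labels.
groups = df.index.to_numpy()

# One split; an integer test_size counts *groups* held out, not rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_pos, test_pos = next(splitter.split(df, groups=groups))

train_df = df.iloc[train_pos]
test_df = df.iloc[test_pos]

# No index value should appear on both sides of the split.
shared = set(train_df.index) & set(test_df.index)
print("Train groups:", sorted(set(train_df.index)))
print("Test groups:", sorted(set(test_df.index)))
print("Shared groups:", shared)
```

If a specific assignment such as "2 and 4 in test" is required, selecting rows directly with a boolean mask on the index is simpler than relying on a random splitter.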

【Comments】:

Does this also work with GridSearchCV?
@infinite-rotations Yes, you can pass it to GridSearchCV as the cv parameter.
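A minimal sketch of the GridSearchCV usage described in the comment above. The Ridge estimator, the alpha grid, and the random feature and target arrays are placeholders that do not come from the original post. Note that the splitter passed as cv still needs the group labels, which are supplied through the groups argument of fit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical data: 12 rows whose group labels mirror the question's index.
groups = np.array([1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6])
rng = np.random.RandomState(0)
X = rng.rand(12, 2)
y = rng.rand(12)

search = GridSearchCV(
    estimator=Ridge(),                        # placeholder model
    param_grid={"alpha": [0.1, 1.0, 10.0]},   # placeholder grid
    cv=GroupKFold(n_splits=3),                # group-aware splitter
)

# The group labels are forwarded to the splitter's split() call.
search.fit(X, y, groups=groups)
print(search.best_params_)
```

Without groups in the fit call, GroupKFold has no labels to work with and raises an error, so the argument should not be left out.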
