How to obtain reproducible but distinct instances of GroupKFold

Posted: 2017-06-11 02:33:54

Question:

In the GroupKFold source, random_state is set to None:
def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)
So, when run multiple times (code taken from here):
import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)

    print
    print
Output:
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
...and so on.

The splits are identical.

How do I set a random_state for GroupKFold so that I get a different (but reproducible) set of splits over a few different trials of cross-validation?

For example, I want
GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]),
'TEST:', array([2, 3]))
('TRAIN:', array([2, 3]),
'TEST:', array([0, 1]))
GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]),
'TEST:', array([1, 3]))
('TRAIN:', array([1, 3]),
'TEST:', array([0, 2]))
So far, it seems that one strategy might be to use sklearn.utils.shuffle first, as suggested in this post. However, that actually just rearranges the elements of each fold; it does not give us new splits.
from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold
import numpy as np
import sys
import pdb

random_state = int(sys.argv[1])

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def cv(X, y, groups, random_state):
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    cv_out = GroupKFold(n_splits=2)
    cv_out_splits = cv_out.split(X_s, y_s, groups_s)
    for train, test in cv_out_splits:
        print "---"
        print X_s[test]
        print y_s[test]
        print "test groups", groups_s[test]
        print "train groups", groups_s[train]
        pdb.set_trace()

print "***"
cv(X, y, groups, random_state)
Output:
>python sshuf.py 32
***
---
[[ 2 3]
[ 4 5]
[ 0 1]
[ 8 9]
[12 13]]
[1 2 0 4 6]
test groups [0 0 0 2 4]
train groups [7 6 1 3 5]
---
[[18 19]
[16 17]
[ 6 7]
[10 11]
[14 15]]
[9 8 3 5 7]
test groups [7 6 1 3 5]
train groups [0 0 0 2 4]
>python sshuf.py 234
***
---
[[12 13]
[ 4 5]
[ 0 1]
[ 2 3]
[ 8 9]]
[6 2 0 1 4]
test groups [4 0 0 0 2]
train groups [7 3 1 5 6]
---
[[18 19]
[10 11]
[ 6 7]
[14 15]
[16 17]]
[9 5 3 7 8]
test groups [7 3 1 5 6]
train groups [4 0 0 0 2]
Comments:

I think this is a bug. I opened a bug report. If I have time after work I may get around to fixing it myself. github.com/scikit-learn/scikit-learn/issues/9323

Answer 1:

Subclass and implement a random_state-dependent _iter_test_masks( ... random_state = None ) method, as documented in the scikit-learn super(...)'s source. The random_state parameter used at instantiation (.__init__()) is "just" stored and left to the user's creativity, whether or not it will be used in any customised manner for the test_mask generation (as literally expressed in the scikit-learn source comments):
(cit.:)
# Since subclasses must implement either _iter_test_masks or
# _iter_test_indices, neither can be abstract.
def _iter_test_masks(self, X=None, y=None, groups=None):
    """Generates boolean masks corresponding to test sets.

    By default, delegates to _iter_test_indices(X, y, groups)
    """
    for test_index in self._iter_test_indices(X, y, groups):
        test_mask = np.zeros(_num_samples(X), dtype=np.bool)
        test_mask[test_index] = True
        yield test_mask
A process that is made dependent on an externally supplied random_state != None should also follow the fair practice of protecting the RNG: save the RNG's actual current state (RNG_stateTUPLE = numpy.random.get_state()), set the state supplied via the .__init__() calling interface, and, once finished, restore the RNG state from the saved tuple (numpy.random.set_state( RNG_stateTUPLE )).

This way such a customised process gains both the required dependence on the random_state value and reproducibility. Q.E.D.
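A rough toy sketch of that subclassing idea (this is my own illustration, not scikit-learn code: the class name RandomizedGroupKFold and the strategy of shuffling the group labels and dealing them into folds with np.array_split are assumptions, and unlike the real GroupKFold no attempt is made to balance fold sizes):

import numpy as np
from sklearn.model_selection import GroupKFold

class RandomizedGroupKFold(GroupKFold):
    """Illustrative GroupKFold-like splitter whose folds depend on random_state."""

    def __init__(self, n_splits=3, random_state=None):
        super(RandomizedGroupKFold, self).__init__(n_splits)
        self.random_state = random_state

    def _iter_test_indices(self, X, y, groups):
        groups = np.asarray(groups)
        # save the global RNG state, seed it from self.random_state, then restore it
        saved_state = np.random.get_state()
        np.random.seed(self.random_state)
        unique_groups = np.unique(groups)
        np.random.shuffle(unique_groups)
        np.random.set_state(saved_state)
        # deal the shuffled groups into n_splits folds and yield each fold's sample indices
        for fold_groups in np.array_split(unique_groups, self.n_splits):
            yield np.flatnonzero(np.in1d(groups, fold_groups))

# usage: the same seed reproduces the same splits; a different seed gives different splits
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for train, test in RandomizedGroupKFold(n_splits=2, random_state=42).split(X, y, groups):
    print("TRAIN groups:", np.unique(groups[train]), "TEST groups:", np.unique(groups[test]))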
Comments:

I'm not sure I follow; could you show how it is done in a toy example?

Answer 2:

KFold is only randomized when shuffle=True. Some datasets should not be shuffled.

GroupKFold is not randomized at all. Hence random_state=None.

GroupShuffleSplit may be closer to what you're looking for.

A comparison of the group-based splitters:
In GroupKFold, the test sets form a complete partition of all the data.
LeavePGroupsOut leaves out all possible subsets of P groups, combinatorially; test sets will overlap for P > 1. Since this means P ** n_groups splits altogether, you often want a small P, and most often LeaveOneGroupOut, which is basically the same as GroupKFold with k=1.
GroupShuffleSplit declares nothing about the relationship between successive test sets; each train/test split is performed independently.
As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
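As a small illustration of the GroupShuffleSplit suggestion (the toy data and the test_size value here are my own, chosen arbitrarily; note that its test sets are not guaranteed to partition the data):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

# different seeds give different, but individually reproducible, group-wise splits
for seed in (42, 13):
    gss = GroupShuffleSplit(n_splits=2, test_size=0.5, random_state=seed)
    for train, test in gss.split(X, y, groups):
        print("seed", seed,
              "TRAIN groups:", np.unique(groups[train]),
              "TEST groups:", np.unique(groups[test]))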
Comments:

I see, regarding GroupKFold: I misunderstood the randomization via random_state. I'm confused about the difference between GroupShuffleSplit and GroupKFold. For example, with 3 splits, GroupKFold yields 3 unique test sets. GroupShuffleSplit, however, could (with low probability) generate 3 identical test sets?

Does GroupShuffleSplit guarantee that the same groups are not represented in both the test and train sets? It would be nice to be able to specify replacement=False.

Actually, for that matter, what is the difference between LeavePGroupsOut and GroupKFold?

Thanks for the edit, but I still don't see how to use this to generate test sets that form a complete partition of the data while also taking some random_state so that I can run it several times without getting multiple identical CV results. The best option seems to be (as suggested above) shuffling and then using GroupKFold, but I have found that this doesn't necessarily behave well when GroupKFold is wrapped in a function. GroupShuffleSplit is not really k-fold cross-validation and I'm not sure it benefits from the same properties...

...but I have read (e.g., in The Elements of Statistical Learning) that splitting the data into random test/train splits is inferior to cross-validation.

Answer 3:
My solution so far has been to simply split the groups randomly. This can lead to very unbalanced folds (which I think GroupKFold was designed to avoid), but the hope is that the number of observations per group is small.
from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold
from numpy.random import RandomState
import numpy as np
import sys
import pdb

random_state = int(sys.argv[1])

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for el in zip(range(len(y)), X, y, groups):
    print "ix, X, y, groups", el

def RandGroupKfold(groups, n_splits, random_state=None, shuffle_groups=False):
    ix = np.array(range(len(groups)))
    unique_groups = np.unique(groups)
    if shuffle_groups:
        prng = RandomState(random_state)
        prng.shuffle(unique_groups)
    splits = np.array_split(unique_groups, n_splits)
    train_test_indices = []
    for split in splits:
        mask = [el in split for el in groups]
        train = ix[np.invert(mask)]
        test = ix[mask]
        train_test_indices.append((train, test))
    return train_test_indices

splits = RandGroupKfold(groups, n_splits=3, random_state=random_state, shuffle_groups=True)

for train, test in splits:
    print "---"
    for el in zip(train, X[train], y[train], groups[train]):
        print "train ix, X, y, groups", el
    for el in zip(test, X[test], y[test], groups[test]):
        print "test ix, X, y, groups", el
Data:
ix, X, y, groups (0, array([0, 1]), 0, 0)
ix, X, y, groups (1, array([2, 3]), 1, 0)
ix, X, y, groups (2, array([4, 5]), 2, 0)
ix, X, y, groups (3, array([6, 7]), 3, 1)
ix, X, y, groups (4, array([8, 9]), 4, 2)
ix, X, y, groups (5, array([10, 11]), 5, 3)
ix, X, y, groups (6, array([12, 13]), 6, 4)
ix, X, y, groups (7, array([14, 15]), 7, 5)
ix, X, y, groups (8, array([16, 17]), 8, 6)
ix, X, y, groups (9, array([18, 19]), 9, 7)
With random state 4:
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
With random state 5:
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
Answer 4:

Inspired by user0's answer (I can't comment), but faster:
import numpy as np
import pandas as pd

def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Random analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    np.random.RandomState(seed).shuffle(unique)
    result = []
    for split in np.array_split(unique, n):
        mask = groups.isin(split)
        train, test = ix[~mask], ix[mask]
        result.append((train, test))
    return result
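A possible usage sketch, assuming the toy groups array from the question:

import numpy as np

groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for train, test in RandomGroupKFold_split(groups, n=3, seed=7):
    print("train groups:", np.unique(groups[train]),
          "test groups:", np.unique(groups[test]))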
Answer 5:

I wanted to combine the code for k-fold over groups, and I also wanted the same class proportions in the training and test sets. So I ran stratified k-fold over the groups, so that the same class ratio is kept across the folds, and then used the groups to place the samples in the folds. I also included the random seed in the stratification to address the different-splits issue.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def Stratified_Group_KFold(Y, groups, n, seed=None):
    unique = np.unique(groups)
    group_Y = []
    for group in unique:
        # class label of the first sample in this group (groups is a plain Python list here)
        y = Y[groups.index(group)]
        group_Y.append(y)
    group_X = np.zeros_like(unique)
    skf_group = StratifiedKFold(n_splits=n, random_state=seed, shuffle=True)
    result = []
    for train_index, test_index in skf_group.split(group_X, group_Y):
        train_groups_in_fold = unique[train_index]
        test_groups_in_fold = unique[test_index]
        train = np.in1d(groups, train_groups_in_fold).nonzero()[0]
        test = np.in1d(groups, test_groups_in_fold).nonzero()[0]
        result.append((train, test))
    return result
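A possible usage sketch; because of the groups.index(...) call above, groups is assumed to be a plain Python list, and the toy labels below are my own:

import numpy as np

Y = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # class labels, constant within each group
groups = [0, 0, 1, 1, 2, 2, 3, 3]        # two samples per group
for train, test in Stratified_Group_KFold(Y, groups, n=2, seed=0):
    print("train groups:", sorted(set(groups[i] for i in train)),
          "test groups:", sorted(set(groups[i] for i in test)))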
Answer 6:

@user0

For example, I want
GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]), 'TEST:', array([1, 3]))
('TRAIN:', array([1, 3]), 'TEST:', array([0, 2]))

The second set of splits would split a group across the training and test sets, which is exactly what GroupKFold is supposed to avoid. For example, in that second split, the elements of group 0 (indices 0 and 1 in the dataset) land in the training set and the test set respectively, as indices 0 and 1.

For the example you gave, there is no more than one way to do a grouped 2-fold split, because you only have 2 groups.

Answer 7:

GroupKFold is deterministic given the group labels. So the solution is to assign new labels. I addressed this by shuffling the list of unique group identifiers and assigning new labels from 0 to n_groups - 1.
import numpy as np
from sklearn.model_selection import GroupKFold

def get_random_labels(labels, random_state):
    labels_shuffled = np.unique(labels)
    # shuffle works in place
    random_state.shuffle(labels_shuffled)
    new_labels_mapping = {k: i for i, k in enumerate(labels_shuffled)}
    new_labels = np.array([new_labels_mapping[label] for label in labels])
    reverse_dict = {v: k for k, v in new_labels_mapping.items()}
    return new_labels, reverse_dict

random_state = np.random.RandomState(41)

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for _ in range(0, 5):
    group_kfold = GroupKFold(n_splits=2)
    new_labels, reverse_dict = get_random_labels(groups, random_state)
    print(group_kfold)
    for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, new_labels)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        groups_train, groups_test = groups[train_index], groups[test_index]
        print("Split no.", i + 1, "Training y:", y_train, "Testing y:", y_test)
    print()
Output:
GroupKFold(n_splits=2)
Split no. 1 Training y: [3 4 5 6 8] Testing y: [0 1 2 7 9]
Split no. 2 Training y: [0 1 2 7 9] Testing y: [3 4 5 6 8]
GroupKFold(n_splits=2)
Split no. 1 Training y: [3 4 7 8 9] Testing y: [0 1 2 5 6]
Split no. 2 Training y: [0 1 2 5 6] Testing y: [3 4 7 8 9]
GroupKFold(n_splits=2)
Split no. 1 Training y: [3 6 7 8 9] Testing y: [0 1 2 4 5]
Split no. 2 Training y: [0 1 2 4 5] Testing y: [3 6 7 8 9]
GroupKFold(n_splits=2)
Split no. 1 Training y: [5 6 7 8 9] Testing y: [0 1 2 3 4]
Split no. 2 Training y: [0 1 2 3 4] Testing y: [5 6 7 8 9]
GroupKFold(n_splits=2)
Split no. 1 Training y: [3 4 6 7 9] Testing y: [0 1 2 5 8]
Split no. 2 Training y: [0 1 2 5 8] Testing y: [3 4 6 7 9]
Of the 10 samples, I made the first three belong to group 0 and gave every other sample its own unique group. As a result, the splits differ on every iteration.

The reverse_dict object can be used to recover the identity of the original labels.