Stratified 5-fold cross-validation for a continuous-valued target: "The least populated class in y has only 1 member, which is too few"
【Posted】: 2022-01-09 10:38:38
【Question】: For this code:
#x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
train = [x_train, y_train]
I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_28063/1294340868.py in <module>
1 #x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
----> 2 x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
3 train = [x_train, y_train]
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2441 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
2442
-> 2443 train, test = next(cv.split(X=arrays[0], y=stratify))
2444
2445 return list(
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
1598 """
1599 X, y, groups = indexable(X, y, groups)
-> 1600 for train, test in self._iter_indices(X, y, groups):
1601 yield train, test
1602
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
1938 class_counts = np.bincount(y_indices)
1939 if np.min(class_counts) < 2:
-> 1940 raise ValueError(
1941 "The least populated class in y has only 1"
1942 " member, which is too few. The minimum"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
If I use the following line instead, I do not get the error:
x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
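However, my goal is to do stratified 5-fold cross-validation. How should I do that? I know that some of my target values occur only once, and stratification needs more than 1 item per class. How can I group these bins together?
This is what the normalized histogram of my target y looks like:
This is the unnormalized plot of y:
This is a snippet of the distribution of y. As you can see, many target values have only 1 item in their bin.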
Update: Note that I found this code from the verstack package, but I do not know how to use it for 5-fold cross-validation.
x_train, x_val, y_train, y_val = scsplit(x, y, stratify = y, test_size=0.3, random_state=42)
train = [x_train, y_train]
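【Answer 1】: You cannot perform a stratified split because some values occur only once, so they cannot be distributed between both the training set and the test set.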
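A quick way to see this, as a minimal sketch on synthetic data (the `make_regression` data and the variable names are illustrative assumptions, not the asker's dataset): `stratify=y` treats every distinct value of `y` as its own class, and with a continuous target almost every value is distinct.

```python
import numpy as np
from sklearn.datasets import make_regression

# Synthetic continuous target standing in for the asker's y (assumption).
X, y = make_regression(n_samples=100, random_state=0)

# Stratification treats each distinct value of y as a separate class,
# so count how many samples each "class" contains.
values, counts = np.unique(y, return_counts=True)
n_singletons = int((counts == 1).sum())
print(f"{len(values)} distinct values, {n_singletons} of them occur only once")
```

Any class with a single member triggers the `ValueError` shown in the question.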
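One solution is to use KBinsDiscretizer to bin this continuous variable into intervals and perform the stratified split on the binned values, as follows: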
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
X, y = make_regression()
y_discretized = KBinsDiscretizer(n_bins=10,
encode='ordinal',
strategy='uniform').fit_transform(y.reshape(-1, 1))
X_train, X_val, y_train, y_val = train_test_split(X, y,
test_size=0.3,
random_state=42,
stratify=y_discretized)
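The same idea can be carried over to the 5-fold case the question asks about. The following is only a sketch under stated assumptions: the data is synthetic, the choice of 10 bins is arbitrary, and `strategy='quantile'` is swapped in for `'uniform'` because quantile bins hold roughly equal numbers of samples, whereas uniform-width bins can still leave a tail bin with fewer members than there are folds. `StratifiedKFold` then stratifies on the bin labels while the models still train on the original continuous `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data standing in for the asker's x and y (assumption).
X, y = make_regression(n_samples=200, random_state=0)

# Bin the continuous target; the bin index is used only for stratification.
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_binned)):
    # Index the original (continuous) target, not the binned one.
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    fold_sizes.append(len(val_idx))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

If a bin still ends up with fewer members than `n_splits`, reducing `n_bins` merges sparse bins into larger ones.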