Stratified 5-fold cross-validation for a continuous-valued target: "The least populated class in y has only 1 member, which is too few"

【Title】: Stratified 5-fold cross-validation for a continuous-valued target: "The least populated class in y has only 1 member, which is too few" 【Posted】: 2022-01-09 10:38:38 【Question】:

For this code:

#x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
train = [x_train, y_train] 

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_28063/1294340868.py in <module>
      1 #x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
----> 2 x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
      3 train = [x_train, y_train]

/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2441         cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
   2442 
-> 2443         train, test = next(cv.split(X=arrays[0], y=stratify))
   2444 
   2445     return list(

/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
   1598         """
   1599         X, y, groups = indexable(X, y, groups)
-> 1600         for train, test in self._iter_indices(X, y, groups):
   1601             yield train, test
   1602 

/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
   1938         class_counts = np.bincount(y_indices)
   1939         if np.min(class_counts) < 2:
-> 1940             raise ValueError(
   1941                 "The least populated class in y has only 1"
   1942                 " member, which is too few. The minimum"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

If I use the following line instead, I don't get the error:

x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)

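However, my goal is stratified 5-fold cross-validation. How should I do that? I know that some of my target values occur only once, and stratification needs more than 1 item per class. How can I group these bins together?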

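This is what the normalized histogram of my target y looks like: (image not preserved)

This is the unnormalized plot of y: (image not preserved)

And this is a snippet of the distribution of y (image not preserved). As it shows, many target values fall in bins that contain only 1 item.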

Update: I found the following code from the verstack package, but I don't know how to use it for 5-fold cross-validation.

x_train, x_val, y_train, y_val = scsplit(x, y, stratify = y, test_size=0.3, random_state=42)
train = [x_train, y_train] 

【Answer 1】:

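You cannot perform a stratified split on y directly: with stratify=y, every distinct value of y is treated as its own class, and a value that occurs only once cannot be placed in both the training set and the test set.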
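To make the cause concrete, here is a minimal sketch that reproduces the error. It is not from the original post: it assumes synthetic data from `make_regression` as a stand-in for the asker's x and y. Because the target is continuous, every value is unique, so every "class" has exactly one member.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic continuous target (assumption: stands in for the real y).
X, y = make_regression(n_samples=100, n_features=3, random_state=0)

# Count how often each distinct target value occurs.
values, counts = np.unique(y, return_counts=True)
print("distinct values:", len(values), "smallest class size:", counts.min())

# Stratifying on the raw continuous target fails for that reason.
try:
    train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```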

One solution is to use KBinsDiscretizer to split the continuous variable into intervals and stratify on those bins, as follows:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_regression()
y_discretized = KBinsDiscretizer(n_bins=10,
                                 encode='ordinal',
                                 strategy='uniform').fit_transform(y.reshape(-1, 1))

X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                  test_size=0.3,
                                                  random_state=42,
                                                  stratify=y_discretized)
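The snippet above produces a single stratified train/validation split. For the 5-fold case the question asks about, the same binning idea can be combined with `StratifiedKFold`. The following is a minimal sketch, not part of the original answer. It assumes synthetic data from `make_regression`, and it swaps `KBinsDiscretizer` for quantile bins computed with `np.quantile` and `np.digitize`, so that every bin holds roughly the same number of samples. The bin count of 10 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real x and y (assumption for illustration only).
X, y = make_regression(n_samples=200, n_features=5, random_state=42)

# Quantile bin edges: 9 interior cut points give 10 bins of roughly equal size.
n_bins = 10  # arbitrary choice; use fewer bins for small datasets
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
y_binned = np.digitize(y, edges)  # integer bin label in 0..n_bins-1

# Stratify on the bin labels, but keep the continuous y as the target.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_binned)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    fold_sizes.append(len(val_idx))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

Quantile bins are used here because uniform-width bins can leave the tails of a skewed distribution with only one or two samples, which brings back the original problem. As a rule of thumb, each bin should hold at least as many samples as there are folds.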
