sklearn train_test_split on pandas stratify by multiple columns

Posted 2023-02-23

【Title】: sklearn train_test_split on pandas stratify by multiple columns 【Posted】: 2018-01-12 23:02:50 【Question】:

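I'm a relatively new user of sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to split into a training and a test set. I would like to stratify my data by at least 2, but ideally 4, columns in my dataframe.

sklearn gave no warnings when I tried to do this, but I later found that there were repeated rows in my final data set. I created a sample test to show this behavior: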

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

a = np.array([i for i in range(1000000)])
b = [i%10 for i in a]
c = [i%5 for i in a]
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

If I stratify by either column on its own, it seems to work as expected:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

But when I try to stratify by both columns, I get repeated values:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 640000

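【Answer 1】:

You need to split the data iteratively. scikit-multilearn has a class for that. It is a little annoying that it only works with NumPy arrays, but what can you do?

Here is a function that does what you asked for: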

import pandas as pd
from skmultilearn.model_selection import IterativeStratification

def iterative_split(df, test_size, stratify_columns):
    """Custom iterative train test split which
    'maintains balanced representation with respect
    to order-th label combinations.'

    From https://madewithml.com/courses/mlops/splitting/#stratified-split
    """
    # One-hot encode the stratify columns and concatenate them
    one_hot_cols = [pd.get_dummies(df[col]) for col in stratify_columns]
    one_hot_cols = pd.concat(one_hot_cols, axis=1).to_numpy()
    stratifier = IterativeStratification(
        n_splits=2, order=len(stratify_columns), sample_distribution_per_fold=[test_size, 1-test_size])
    train_indices, test_indices = next(stratifier.split(df.to_numpy(), one_hot_cols))
    # Return the train and test set dataframes
    train, test = df.iloc[train_indices], df.iloc[test_indices]
    return train, test

example = pd.DataFrame({'a': [1, 2, 3]*8*2, 'b': [4, 5, 6, 7]*6*2, 'c': [7, 8]*12*2})
train, test = iterative_split(example, 0.4, ['a', 'b'])
# print(f'{train = }')
# print(f'{test = }')

print(f'{train[["a"]].value_counts() = }')
print(f'{test[["a"]].value_counts()  = }')
print(f'{train[["b"]].value_counts() = }')
print(f'{test[["b"]].value_counts()  = }')

Output:

train[["a"]].value_counts() =a
1    10
2    10
3    10
dtype: int64
test[["a"]].value_counts()  =a
1    6
2    6
3    6
dtype: int64
train[["b"]].value_counts() =b
5    8
6    8
4    7
7    7
dtype: int64
test[["b"]].value_counts()  =b
4    5
7    5
5    4
6    4
dtype: int64

For your example, we can add the following code:

import numpy as np

a = np.array([i for i in range(10_000)])
b = [i%10 for i in a]
c = [i%5 for i in a]
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

train, test = iterative_split(df, test_size=0.2, stratify_columns=['b', 'c'])

print(len(train.a.values))  # prints 8000
print(len(set(train.a.values)))  # prints 8000

In your example, one_hot_cols becomes a 1e6 x 3e5 matrix, which is a bit much. If anyone comes up with a better approach, I'm all ears.

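【Answer 2】:

If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is the concatenation of the values in the other columns and stratify on the new column.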

df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])

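If you're worried about collisions, where values like 11 and 3 and values like 1 and 13 both produce the concatenated value 113, you can add an arbitrary separator string in the middle: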

df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
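
A possible variation on the same idea, sketched under the assumption that pandas' groupby(...).ngroup() is available: instead of concatenating strings, label each distinct combination of values with an integer and stratify on that label, which sidesteps the separator question. The helper name split_by_combined_key and the toy data are hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_combined_key(df, columns, test_size, random_state=None):
    """Stratify on one integer label per distinct combination of `columns`."""
    # ngroup() numbers each distinct combination 0..k-1, so no string
    # concatenation (and no separator collision) is involved.
    key = df.groupby(columns).ngroup()
    return train_test_split(
        df, test_size=test_size, random_state=random_state, stratify=key
    )


# Toy data (hypothetical): 4 x 3 = 12 combinations, 100 rows each.
a = np.arange(1200)
toy = pd.DataFrame({'a': a, 'b': a % 4, 'c': a % 3})

train, test = split_by_combined_key(toy, ['b', 'c'], test_size=0.25, random_state=0)
print(len(train), train['a'].nunique())
print(train.groupby(['b', 'c']).size())
```

Because the key is a single 1-D column, train_test_split sees one class per combination, which is the nested behavior the question was after.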

【Comments】:

This is a brutal hack and has the problem you describe (possible collisions...), but for now it seems to be the only way. Using e.g. df[['a', 'b']].apply(tuple) raises ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.

【Answer 3】:

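What version of scikit-learn are you using? You can check it with sklearn.__version__.

Prior to version 0.19.0, scikit-learn did not handle 2-dimensional stratification correctly. This was patched in 0.19.0.

It is described in issue #9044.

Updating scikit-learn should fix the problem. If you cannot update scikit-learn, see the commit history of the fix.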
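
To check which version is installed and whether a given split actually contains repeated rows, something like the following could be used. It is only a sketch: the helper count_problems and the toy frame are hypothetical, and what the two-column call returns depends on the installed scikit-learn version.

```python
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

print(sklearn.__version__)


def count_problems(train, test):
    """Return (rows repeated within train, rows shared by train and test)."""
    repeated = int(train.index.duplicated().sum())
    shared = len(train.index.intersection(test.index))
    return repeated, shared


# Toy data (hypothetical): 12 (b, c) combinations, 100 rows each.
a = np.arange(1200)
toy = pd.DataFrame({'a': a, 'b': a % 4, 'c': a % 3})
train, test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy[['b', 'c']]
)
print(count_problems(train, test))
```

A result other than two zeros would indicate the duplicated-row behavior described in the question.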

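【Comments】:

What does "correctly" mean here? Does it mean it performs the nested stratified sampling that andrew_reece mentions?

I just tested it, and it does seem to do nested stratified sampling. Thanks for the answer, very helpful!

Sklearn does not handle 2-dimensional stratification at all (0.22 here). See: ***.com/questions/48508036/… - so this only works for one-hot vectors, not for tuples of features! Probably the string-concat hack above is the only way for now.

【Answer 4】: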

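The reason you're getting duplicates is that train_test_split() eventually defines the strata as the set of unique values of whatever you pass to the stratify argument. Since the strata are defined from two columns, one row of data may represent more than one stratum, so sampling may choose the same row twice because it thinks it is sampling from different classes.

The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code: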

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]

Here is a simplified sample case, a variation on the example you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

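The stratification function thinks there are four classes to split on: foo, bar, y, and z. But these classes are essentially nested, meaning that y and z both show up in b == foo and in b == bar, so we get duplicates when the splitter tries to sample from each class.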

train, test = train_test_split(df, test_size=0.2, random_state=0, 
                               stratify=df[['b', 'c']])
print(len(train.a.values))  # 16
print(len(set(train.a.values)))  # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`, 
#  I'm just illustrating the idea here.
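
The difference between the two views of the strata can be seen directly with NumPy and pandas. The following is only an illustration with made-up rows: calling np.unique() without an axis argument flattens the two columns into one pool of labels, whereas the distinct rows give the nested combinations.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same shape as the example above.
toy = pd.DataFrame({
    'b': ['bar', 'foo', 'bar', 'bar', 'foo', 'bar'],
    'c': ['y', 'y', 'z', 'y', 'z', 'y'],
})

# Without an axis argument, np.unique flattens the 2-D array first,
# so the labels of both columns end up in a single set of "classes".
flat_classes = np.unique(toy[['b', 'c']].to_numpy().astype(str))
print(flat_classes)

# The nested strata are the distinct (b, c) rows instead.
nested = toy[['b', 'c']].drop_duplicates()
print(nested)
```

In the flattened view every row belongs to two classes at once (one from b and one from c), while in the nested view every row belongs to exactly one (b, c) stratum.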

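There is a larger design question here: do you want nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that is what you are already getting. The former is more complicated, and it is not what train_test_split is set up to do.

You might find this discussion of nested stratified sampling useful.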
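
If nested stratification is what you want, one way to sketch it by hand, assuming pandas 1.1 or later for DataFrameGroupBy.sample, is to draw the test fraction inside every (b, c) group and use the remaining rows for training. The names and toy data here are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 12 (b, c) combinations, 100 rows each.
a = np.arange(1200)
toy = pd.DataFrame({'a': a, 'b': a % 4, 'c': a % 3})

# Sample 20% of the rows within each (b, c) combination for the test set.
test = toy.groupby(['b', 'c']).sample(frac=0.2, random_state=0)
# Everything not drawn for the test set becomes the training set.
train = toy.drop(test.index)

print(len(train), len(test))
print(test.groupby(['b', 'c']).size())
```

Since each row is drawn at most once and the training set is defined as the complement, the two sets cannot overlap or contain repeated rows.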
