Changing the type of values in arrays resulting from sklearn.model_selection.train_test_split

Posted 2023-03-12


Asked: 2019-03-23 06:25:28

Question:

I am following this tutorial on machine learning, which uses the following code:

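import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer-wisconsin.data.csv')
df.replace('?', -99999, inplace=True)     # mark missing values with a sentinel
df.drop(columns=['id'], inplace=True)     # drop the id column
X = np.array(df.drop(columns=['class']))  # features
y = np.array(df['class'])                 # labels

# train_test_split returns X_train, X_test, y_train, y_test (in that order)
X_train, X_test, y_train, y_test = train_test_split(X, y)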

Here is a sample from the CSV file:

    id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
    1000025,5,1,1,1,2,1,3,1,1,2
    1002945,5,4,4,5,7,10,3,2,1,2
    1015425,3,1,1,1,2,2,3,1,1,2
    1016277,6,8,8,1,3,4,3,7,1,2
    1017023,4,1,1,3,2,1,3,1,1,2
    1017122,8,10,10,8,7,10,9,7,1,4
    1018099,1,1,1,1,2,10,3,1,1,2
    1018561,2,1,2,1,2,1,3,1,1,2
    1033078,2,1,1,1,2,1,1,1,5,2
    1033078,4,2,1,1,2,1,2,1,1,2
    1035283,1,1,1,1,1,1,3,1,1,2
    1036172,2,1,1,1,2,1,2,1,1,2
    1041801,5,3,3,3,2,3,4,4,1,4
    1043999,1,1,1,1,2,3,3,1,1,2
    1044572,8,7,5,10,7,9,5,5,4,4
    1047630,7,4,6,4,6,1,4,3,1,4
    1048672,4,1,1,1,2,1,2,1,1,2
    1049815,4,1,1,1,2,1,3,1,1,2
    1050670,10,7,7,6,4,10,4,1,2,4
    1050718,6,1,1,1,2,1,3,1,1,2
    1054590,7,3,2,10,5,10,5,4,4,4
    1054593,10,5,5,3,6,7,7,10,1,4
    1056784,3,1,1,1,2,1,2,1,1,2
    1057013,8,4,5,1,2,?,7,3,1,4
    1059552,1,1,1,1,2,1,3,1,1,2
    1065726,5,2,3,4,2,7,3,6,1,4
    1066373,3,2,1,1,1,1,2,1,1,2

When looking at the results of sklearn.model_selection.train_test_split, I found something strange (at least to me). If I run

    print(type(y_test[0]))
    print()
    print(type(X_train[:,1][0]))

I get the following output:

<class 'numpy.int64'>
<class 'int'>

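Somehow, the values in X_train are of type int, while the values in y_test are of type numpy.int64. I don't know why train_test_split does this (I assumed it had nothing to do with the data being split), and the documentation doesn't seem to mention it either.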
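As a rough illustration of why the split itself is unlikely to be the cause: a random split boils down to indexing both arrays with shuffled row indices, and NumPy indexing keeps the dtype of the source array. The sketch below uses a hypothetical helper, simple_split, as a stand-in. It is not scikit-learn's implementation, and the sample values are made up.

```python
import numpy as np

def simple_split(X, y, test_fraction=0.25, seed=0):
    """Hypothetical stand-in for a random train/test split (not sklearn's code)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # shuffled row indices
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Object-dtype features (made-up values mixing ints and strings)
X = np.array([[5, '1'], [3, '2'], [8, -99999], [1, '1']], dtype=object)
y = np.array([2, 2, 4, 2])                       # plain integer dtype

X_train, X_test, y_train, y_test = simple_split(X, y)

print(X.dtype, X_train.dtype)   # the split leaves the dtype unchanged
print(y.dtype, y_train.dtype)
```

If the dtypes already differ before the split, they will differ afterwards as well, so the place to look is how X and y were built.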

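Since I want the values in y_test to be regular integers as well, I tried to change their type with astype(). Unfortunately, the following code still reports numpy.int64: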

y_test = y_test.astype(int)
print(type(y_test[0]))

<class 'numpy.int64'>
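A likely explanation, sketched with made-up values: in astype(int), NumPy treats int as a request for its default integer dtype (commonly int64, though this depends on platform and version), so the result is still a NumPy integer array and indexing it still yields a NumPy scalar. Getting Python int objects means leaving the fixed-width numeric dtype, for example with astype(object).

```python
import numpy as np

y = np.array([2, 2, 4, 2])           # made-up labels with an integer dtype

converted = y.astype(int)            # still a NumPy integer dtype
print(converted.dtype)
print(type(converted[0]))            # a NumPy scalar type, not Python int

as_objects = y.astype(object)        # object array holding Python ints
print(as_objects.dtype)
print(type(as_objects[0]))
```

An object array gives up NumPy's fast vectorized arithmetic, so this conversion is probably best done only at the point where plain Python values are actually required.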

Question: Why does train_test_split return arrays containing values of different data types? And why can't I convert the values in y_test to regular integers?

Edit: The difference in types is caused by the data. If I run

 print(type(X[:,1][0]))
 print(type(y[0]))

I get

<class 'int'>
<class 'numpy.int64'>
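One plausible reason for this, given the '?' entry in the sample above: a column containing '?' is read as strings, so the feature array built from the DataFrame falls back to the generic object dtype and holds ordinary Python objects, while the purely numeric class column gets a NumPy integer dtype. The sketch below imitates that situation with hand-made arrays rather than the real file.

```python
import numpy as np

# Hand-made stand-ins: the second feature column mixes strings and an int,
# as a column might after '?' entries are replaced with -99999.
X = np.array([[5, '1'], [8, -99999]], dtype=object)
y = np.array([2, 4])

print(X.dtype)          # object: elements are ordinary Python objects
print(type(X[0, 0]))    # Python int
print(type(X[0, 1]))    # Python str
print(y.dtype.kind)     # 'i': a fixed-width NumPy integer dtype
print(type(y[0]))       # a NumPy integer scalar type
```

Checking X.dtype and y.dtype directly is usually quicker than inspecting individual elements.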

I would still like to know why astype doesn't work! :)

Comments:

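There isn't much of a difference here apart from a few bytes. In my experience, numpy prefers to store results as int64 (hence y_test), while ordinary arrays are just stored as int. For the difference, see: ***.com/questions/9696660/…

@Shiv_90 Thanks for your reply! There are some practical differences, though. For example, inserting data into a database table column of type "numeric" works with int but not with numpy.int64.

I see. There could be many reasons behind this exclusivity; I'm not sure of the answer, though :)

Answer 1: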

To convert numpy values to Python types, there is numpy.ndarray.item:

y_test_int = [v.item() for v in y_test]
print(type(y_test_int[0]))
#<class 'int'>
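As a possible alternative to the list comprehension, numpy.ndarray.tolist() converts the whole array to Python values in one call. The sketch below uses a small made-up array in place of y_test and checks that both approaches give the same Python int values.

```python
import numpy as np

y_test = np.array([2, 4, 2, 2, 4])       # made-up labels standing in for y_test

via_item = [v.item() for v in y_test]    # element by element, as in the answer
via_tolist = y_test.tolist()             # whole array at once

print(via_item == via_tolist)
print(all(type(v) is int for v in via_tolist))
```

Either result is a plain Python list, which should suit cases like the database insert mentioned in the comments, assuming the driver expects built-in int values.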
