100% classifier accuracy after using train_test_split

Posted 2023-02-23

【Title】: 100% classifier accuracy after using train_test_split 【Posted】: 2020-05-13 04:52:46 【Question】:

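I am working with the mushroom classification dataset (available here: https://www.kaggle.com/uciml/mushroom-classification).

I am trying to split my data into training and test sets for my model, but whenever I use the train_test_split method, my model reaches 100% accuracy. This does not happen when I split the data manually.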

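import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# `data` is the already preprocessed mushroom DataFrame (loading not shown)
x = data.copy()
y = x['class']
del x['class']

# Shuffled split: 33% of the rows go to the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))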

This produces:

[[1299    0]
 [   0 1382]]
1.0

If I split the data manually, I get a more plausible result.

x = data.copy()
y = x['class']
del x['class']

# Contiguous slices, no shuffling. Note that row 5443 lands in neither set.
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Result:

[[2007    0]
 [ 336  337]]
0.8746268656716418

What could be causing this behavior?

Edit: As requested, I am including the shapes of the slices.

train_test_split:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2681, 64)
(2681,)

Manual split:

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2680, 64)
(2680,)

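I have also tried defining my own split function, and the resulting split gives 100% classifier accuracy as well.

Here is the code for that split: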

def split_data(dataFrame, testRatio):
  dataCopy = dataFrame.copy()
  testCount = int(len(dataFrame)*testRatio)
  # Shuffle all rows before slicing
  dataCopy = dataCopy.sample(frac = 1)
  y = dataCopy['class']
  del dataCopy['class']
  # Returns x_train, x_test, y_train, y_test
  return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]

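【Comments】:

What are the shapes of X_train, X_test, y_train, y_test after each split method?

@G.Anderson I have updated my question with the shapes.

Does the behavior persist if you run train_test_split again or change the test_size parameter? It is possible (though unlikely) that you got a very lucky split the first time. Otherwise, have you applied any other transformations to the data that are not shown? This looks a lot like data leakage between train and test, or between target and features.

It persists across attempts and when I change the test size (whatever I change it to, I get 100%). I did some preprocessing on the data, but all of it was done before I split the dataset.

Wait! What preprocessing did you do before splitting? You should not perform feature selection on the whole dataset. Fit it on the training set only, then use it to transform both the training and test sets. The same goes for a standard scaler: fit it on the training data after splitting, then use it to transform both train and test. If there is nothing wrong with your manual split code, you may be leaking data from the training set into the test set this way.

【Answer 1】: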

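Your manual train/test split does not shuffle the rows, but the scikit-learn function shuffles by default. The splits have essentially the same shapes, but they contain different rows.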

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Code:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)

print("\nTraining with shuffle:")
print(X_train)
print(y_train)


print("\nTesting with shuffle:")
print(X_test)
print(y_test)


print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])

Output:

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Training with shuffle:
[[ 0  1]
 [16 17]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
[0, 8, 2, 4, 3, 6]

Testing with shuffle:
[[14 15]
 [ 2  3]
 [10 11]]
[7, 1, 5]

Without Shuffle:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
[0, 1, 2, 3, 4, 5]

[[12 13]
 [14 15]
 [16 17]]
[6, 7, 8]
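
The toy output above only shows that the row order differs. Why that can change a classifier's score is easier to see on the labels: if the rows happen to be grouped by class, a contiguous slice gives train and test sets with different class proportions, while a shuffled split keeps them close to the overall balance. Below is a minimal standard-library sketch. It assumes a hypothetical label column that is fully sorted by class (the ordering of the real file is not confirmed here) and uses the UCI class counts quoted elsewhere in this thread.

```python
import random
from collections import Counter

# Hypothetical label column: 4208 edible ('e') rows followed by 3916 poisonous ('p').
# The sorted order is an assumption for illustration, not a property of the real file.
labels = ['e'] * 4208 + ['p'] * 3916

n_train = 5443  # same training size as in the question


def class_share(part):
    """Return the fraction of each class in a list of labels."""
    counts = Counter(part)
    return {cls: round(counts[cls] / len(part), 3) for cls in ('e', 'p')}


# Contiguous split: the first n_train rows train, the rest test (no shuffling).
train_plain, test_plain = labels[:n_train], labels[n_train:]

# Shuffled split: permute the rows first, then slice at the same position.
rng = random.Random(0)
shuffled = labels[:]
rng.shuffle(shuffled)
train_shuf, test_shuf = shuffled[:n_train], shuffled[n_train:]

print("contiguous train:", class_share(train_plain))
print("contiguous test: ", class_share(test_plain))
print("shuffled train:  ", class_share(train_shuf))
print("shuffled test:   ", class_share(test_shuf))
```

Under this assumed ordering the contiguous test set contains a single class, whereas both shuffled parts stay near the overall 51.8% / 48.2% balance.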

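【Comments】:

I understand that part, but why would it affect the classifier's results?

This reads more like a comment than an answer. Have you confirmed that this affects the model behavior described in the OP?

Setting shuffle to False does affect the model's accuracy. It seems that no matter what I do, if I shuffle the order of the data, I get 100% accuracy.

I just checked the UCI repository, which lists edible: 4208 (51.8%) and poisonous: 3916 (48.2%). A 33% split without shuffling is imbalanced: 8124 * 0.33 ≈ 2681 test rows and 8124 - 2681 = 5443 training rows. If the dataset is stored in order, edible first and then poisonous, the training set holds 4208 edible rows and only 5443 - 4208 = 1235 poisonous rows. That is imbalanced.

【Answer 2】: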

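It turns out the results were correct; I was simply going about testing the model's output the wrong way.

I opened another thread, where someone suggested trying cross-validation, and that seems to have settled the question.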

【Answer 3】:

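You got lucky with train_test_split. The split you made manually may contain mostly unseen data, which makes for a better validation than train_test_split, which shuffles the data internally before splitting it.

For a more reliable validation, use K-fold cross-validation. It checks the model's accuracy by using each part of the data in turn as the test set and the remaining parts as the training set.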
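
A minimal sketch of K-fold cross-validation with scikit-learn follows. It uses a synthetic dataset from make_classification and a LogisticRegression model as stand-ins; both are assumptions, since the question uses the mushroom data and XGBClassifier. The scaler sits inside a pipeline so that it is refit on each training fold only, which is the leakage precaution raised in the comments on the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 rows, 10 features, 2 classes.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Preprocessing lives inside the pipeline, so it is fit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

If the per-fold scores are all similar, a single high score is less likely to be the product of one lucky split. StratifiedKFold is used here, rather than plain KFold, so that each fold keeps roughly the overall class proportions.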
