Why is ShuffleSplit more/less random than train_test_split (with random_state=None)?
【Posted】: 2017-03-22 03:40:34

【Question】: Consider the following two options:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# sklearn.__version__ == 0.17.1
# python 3.5.2, Anaconda 4.1.1 (64-bit)
# Note: on 0.17.1, ShuffleSplit(n_splits=...) raises
#   TypeError: __init__() got an unexpected keyword argument 'n_splits'
# so the old sklearn.cross_validation API is used below.
import numpy as np
from sklearn.datasets import load_boston
#from sklearn.model_selection import train_test_split, cross_val_score
#from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cross_validation import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor
# define feature matrix and target variable
X, y = load_boston().data, load_boston().target
# Create Algorithm Object (Gradient Boosting)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
#====================================================
# Option B
#====================================================
#shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
shuffle = ShuffleSplit(n=X.shape[0], n_iter=10, train_size=0.75, random_state=0)
cross_val = cross_val_score(gbr, X, y, cv=shuffle)
print('------------------------------------------')
print('Individual performance: ', cross_val)
print('===============================================')
print('Option B: Average performance: ', cross_val.mean())
print('===============================================')
# --> different performance in every iteration because of different training
# and test sets.
#====================================================
# Option C
#====================================================
individual_results = []
iterations = np.arange(1, 11)
for i in iterations:
    # randomly split the data into train and test sets
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    # train gbr on the new split
    gbr.fit(Xtrain, ytrain)
    score = gbr.score(Xtrain, ytrain)
    individual_results.append(score)
avg_score = sum(individual_results)/len(iterations)
print('------------------------------------------')
print(individual_results)
print('===============================================')
print('Option C: Average Performance: ', avg_score)
print('===============================================')
Here is a copy of the output:
Individual performance: [ 0.77535372 0.81760604 0.87146377 0.94041114 0.92648961 0.87761488
0.82843891 0.81833855 0.90167889 0.90014986]
===============================================
Option B: Average performance: 0.865754537049
===============================================
------------------------------------------
[0.98094508160609573, 0.97773541952198795, 0.98076500920740906, 0.98313150025465956, 0.98097867267357952, 0.97918425360465322, 0.97923641784508919, 0.9785058355467865, 0.98173521302711486, 0.97866493105257402]
===============================================
Option C: Average Performance: 0.980088233434
===============================================
Can anyone help explain why the ShuffleSplit function in Option B produces much more varied results than the train_test_split function (with random_state=None) in Option C?
【Comments】:
... well, I don't see any results? Also: what is load_boston
? We have no way to reproduce what you are seeing...
How are you measuring the randomness of the results?
I have added a copy of the output. Note that the Option C results seem to cluster around 0.98, while Option B looks far more random.
This puzzles me, because train_test_split
actually calls next(ShuffleSplit().split(X, y))
...
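The comment above is on the right track: a single shuffle-split is just a permutation of the sample indices followed by a slice, which is also what train_test_split does internally. A minimal numpy sketch of that idea (the helper name shuffle_split_indices is made up for illustration; 506 is the Boston dataset's sample count):

```python
import numpy as np

def shuffle_split_indices(n_samples, test_size=0.25, seed=None):
    """One shuffle-split: permute the indices, carve off a test block."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(np.ceil(test_size * n_samples))
    # test set is the first n_test shuffled indices, train set is the rest
    return perm[n_test:], perm[:n_test]

train_idx, test_idx = shuffle_split_indices(506, test_size=0.25, seed=0)
print(len(train_idx), len(test_idx))
```

Since both Option B and Option C draw splits this way, the split mechanism itself cannot explain the difference in score spread.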
【Answer 1】:
In Option C the score is computed on Xtrain
rather than Xtest
. With

score = gbr.score(Xtest, ytest)

the scores become

[0.806, 0.906, 0.903, 0.836, 0.871, 0.920, 0.902, 0.901, 0.914, 0.916]
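The effect is easy to reproduce side by side. A hedged sketch (assumptions: the current sklearn.model_selection API, and make_regression standing in for the Boston dataset, which has been removed from recent scikit-learn releases): the training-set score is optimistic and nearly constant across splits, while the held-out score varies with the split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# synthetic regression data as a stand-in for load_boston (13 features, like Boston)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

train_scores, test_scores = [], []
for _ in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=None)
    gbr.fit(Xtr, ytr)
    train_scores.append(gbr.score(Xtr, ytr))  # what Option C did: optimistic, clustered
    test_scores.append(gbr.score(Xte, yte))   # what it should do: varies per split

print('train:', np.round(train_scores, 3))
print('test: ', np.round(test_scores, 3))
```

The training scores land in a tight, high band regardless of the split, which is exactly the "clustered around 0.98" pattern the question observed; only the test scores reflect split-to-split randomness.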
【Discussion】: