Why is ShuffleSplit more/less random than train_test_split (with random_state=None)?

【Posted】: 2017-03-22 03:40:34

【Question】:

Consider the following two options:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

#sklearn.__version__ 0.17.1
#python --version 3.5.2, Anaconda 4.1.1 (64-bit)

#ipdb> TypeError: __init__() got an unexpected keyword argument 'n_splits'
#None
#> <string>(1)<module>()

import numpy as np
from sklearn.datasets import load_boston
#from sklearn.model_selection import train_test_split, cross_val_score
#from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cross_validation import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor

# define feature matrix and target variable
X, y = load_boston().data, load_boston().target

# Create Algorithm Object (Gradient Boosting)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

#====================================================
# Option B
#====================================================
#shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
shuffle = ShuffleSplit(n=X.shape[0], n_iter=10, train_size=0.75, random_state=0)
cross_val = cross_val_score(gbr, X, y, cv=shuffle)
print('------------------------------------------')
print('Individual performance: ', cross_val)
print('===============================================')
print('Option B: Average performance: ', cross_val.mean())
print('===============================================')
# --> different performance in every iteration because of different training
# and test sets.


#====================================================
# Option C
#====================================================
individual_results = []
iterations = np.arange(1, 11)

for i in iterations:
    # randomly split the data into train and test
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    # fit gbr on the new training split (the loop repeats this 10 times)
    gbr.fit(Xtrain, ytrain)
    score = gbr.score(Xtrain, ytrain)
    individual_results.append(score)

avg_score = sum(individual_results)/len(iterations)
print('------------------------------------------')
print(individual_results)
print('===============================================')
print('Option C: Average Performance: ', avg_score)
print('===============================================')

Here is a copy of the output:

Individual performance:  [ 0.77535372  0.81760604  0.87146377  0.94041114  0.92648961  0.87761488
  0.82843891  0.81833855  0.90167889  0.90014986]
===============================================
Option B: Average performance:  0.865754537049
===============================================
------------------------------------------
[0.98094508160609573, 0.97773541952198795, 0.98076500920740906, 0.98313150025465956, 0.98097867267357952, 0.97918425360465322, 0.97923641784508919, 0.9785058355467865, 0.98173521302711486, 0.97866493105257402]
===============================================
Option C: Average Performance:  0.980088233434
===============================================

Can anyone help explain why the ShuffleSplit function in Option B gives more random results than the train_test_split function (with random_state=None) in Option C?

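【Comments】:

...Hmm, I don't see any results? Also: what is load_boston? We don't have the tools to reproduce what you are seeing...

How are you measuring the randomness of the results?

Added a copy of the output. Note that the Option C results seem to cluster around 0.98, whereas Option B looks more random.

This puzzles me, because train_test_split actually calls next(ShuffleSplit().split(X, y))...

【Answer 1】: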

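In Option C the score is computed on Xtrain instead of Xtest, so it measures how well the model fits data it has already seen. That is why the values are uniformly high (about 0.98) and barely vary. Score on the held-out data instead: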

score = gbr.score(Xtest, ytest)

The scores are now:

[0.806, 0.906, 0.903, 0.836, 0.871, 0.920, 0.902, 0.901, 0.914, 0.916]
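
The effect can be reproduced with the newer sklearn.model_selection API. The following is only a rough sketch under a few assumptions that are not part of the original post: it uses synthetic data from make_regression as a stand-in for load_boston (which is not available in recent scikit-learn releases), and it uses fixed seeds instead of random_state=None so the run is repeatable. It records both the training-set and the test-set score for each train_test_split, next to the ShuffleSplit scores from cross_val_score.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

# Synthetic stand-in for the Boston data (an assumption, not the original dataset).
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

# Option B: cross_val_score scores each fitted model on the held-out part of the split.
shuffle = ShuffleSplit(n_splits=10, train_size=0.75, test_size=0.25, random_state=0)
scores_b = cross_val_score(gbr, X, y, cv=shuffle)

# Option C: keep both the training-set and the test-set score for every split.
train_scores, test_scores = [], []
for seed in range(10):  # fixed seeds for repeatability; the post used random_state=None
    Xtrain, Xtest, ytrain, ytest = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    gbr.fit(Xtrain, ytrain)
    train_scores.append(gbr.score(Xtrain, ytrain))  # what the question measured
    test_scores.append(gbr.score(Xtest, ytest))     # what the answer suggests

print("Option B (held-out):   mean=%.3f std=%.3f" % (scores_b.mean(), scores_b.std()))
print("Option C (train set):  mean=%.3f std=%.3f" % (np.mean(train_scores), np.std(train_scores)))
print("Option C (test set):   mean=%.3f std=%.3f" % (np.mean(test_scores), np.std(test_scores)))
```

Comparing the three printed lines shows whether the apparent difference in "randomness" comes from the splitter or from which part of the data is scored; the exact numbers depend on the data and the scikit-learn version.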
