GridSearchCV does not give the same results as expected when compared to xgboost.cv
Posted: 2017-06-15 19:14:02
Question: When comparing sklearn's GridSearchCV with xgboost.cv, I get different results. Below I explain what I am trying to do:
1) Import libraries
import numpy as np
from sklearn import datasets
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import StratifiedKFold
2) Set the seed and the folds
seed = 5
n_fold_inner = 5
skf_inner = StratifiedKFold(n_splits=n_fold_inner,random_state=seed, shuffle=True)
3) Load the dataset
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
X = X.astype(np.float32)
# map labels from -1, 1 to 0, 1
labels, y = np.unique(y, return_inverse=True)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
dtrain = xgb.DMatrix(X_train, label=y_train, missing = np.nan)
4) Define the XGBoost parameters
fixed_parameters = {
    'max_depth': 3,
    'min_child_weight': 3,
    'learning_rate': 0.3,
    'colsample_bytree': 0.8,
    'subsample': 0.8,
    'gamma': 0,
    'max_delta_step': 0,
    'colsample_bylevel': 1,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'scale_pos_weight': 1,
    'base_score': 0.5,
    'seed': 5,
    'objective': 'binary:logistic',
    'silent': 1
}
5) Parameters for the grid search (only one, the number of estimators)
params_grid = {
    'n_estimators': np.linspace(1, 20, 20).astype('int')
}
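6) Run the grid search
# iid exists only in older scikit-learn releases; return_train_score=True is
# needed on newer releases so that mean_train_score is available below
bst_grid = GridSearchCV(
    estimator=XGBClassifier(**fixed_parameters), param_grid=params_grid, n_jobs=4,
    cv=skf_inner, scoring='roc_auc', iid=False, refit=False, verbose=1,
    return_train_score=True)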
bst_grid.fit(X_train,y_train)
best_params_grid_search = bst_grid.best_params_
best_score_grid_search = bst_grid.best_score_
means_train = bst_grid.cv_results_['mean_train_score']
stds_train = bst_grid.cv_results_['std_train_score']
means_test = bst_grid.cv_results_['mean_test_score']
stds_test = bst_grid.cv_results_['std_test_score']
7) Print the results
print('\ntest-auc-mean test-auc-std train-auc-mean train-auc-std')
for idx in range(len(means_test)):
    print(means_test[idx], stds_test[idx], means_train[idx], stds_train[idx])
8) Now I run xgb.cv with the same parameters as before for 20 rounds (the n_estimators I previously fed to the grid search). The problem is that I get different results...
num_rounds = 20
best_params_grid_search['objective']= 'binary:logistic'
best_params_grid_search['silent']= 1
cv_xgb = xgb.cv(best_params_grid_search,dtrain,num_boost_round =num_rounds,folds=skf_inner,metrics='auc',seed=seed,maximize=True)
print(cv_xgb)
RESULT GRIDSEARCH (each row uses n estimators: 1, 2, 3, ..., 20)
test-auc-mean test-auc-std train-auc-mean train-auc-std
0.610051313783 0.0161039540435 0.644057288587 0.0113345992869
0.69201880047 0.0162563563448 0.736006666658 0.00692672815659
0.745466211655 0.0171675737271 0.796345885396 0.00696679302744
0.783959748994 0.00705320521545 0.841463145757 0.00948465661336
0.814666429161 0.0205663250121 0.876016226998 0.00594191823748
0.834757856446 0.0380407635359 0.89839145346 0.0119466187041
0.846589877247 0.0250769570711 0.918506450202 0.00400934458132
0.856519550489 0.02076405634 0.929968936282 0.00287173282935
0.874262106553 0.0270140215944 0.940190511945 0.00335749381638
0.884796282407 0.0242102758081 0.947369708661 0.00274634034559
0.890833683342 0.0240690598159 0.953708404754 0.00332080069217
0.898287157179 0.0212975975614 0.958794323829 0.00463360376002
0.905931348284 0.0240526927266 0.963055575138 0.00385161158711
0.911782932073 0.0169788764956 0.966542306102 0.00274612227499
0.912551138778 0.0175200936415 0.969060984867 0.00135518880398
0.915046588665 0.0169918459539 0.971904231381 0.00177694652262
0.917921423036 0.0131486037603 0.975162276052 0.0025983006922
0.921909172729 0.0113192686772 0.976056924526 0.0022670828819
0.928131653291 0.0117709832599 0.978585868159 0.00211167800105
0.931493562339 0.0119475329984 0.98098486872 0.00186032225868
RESULT XGB.CV
test-auc-mean test-auc-std train-auc-mean train-auc-std
0 0.669881 0.013938 0.772116 0.011315
1 0.759682 0.019225 0.883394 0.004381
2 0.798337 0.016992 0.939274 0.005196
3 0.827751 0.007224 0.962461 0.007382
4 0.850340 0.011451 0.978809 0.001102
5 0.864438 0.020012 0.986584 0.000858
6 0.879706 0.014168 0.991765 0.001926
7 0.889308 0.013851 0.994663 0.000970
8 0.897973 0.011383 0.996704 0.000481
9 0.903878 0.012139 0.997494 0.000432
10 0.909599 0.010234 0.998301 0.000602
11 0.912682 0.014475 0.998972 0.000306
12 0.914289 0.014122 0.999392 0.000207
13 0.916273 0.011744 0.999568 0.000185
14 0.918050 0.011219 0.999718 0.000140
15 0.922161 0.011968 0.999788 0.000146
16 0.922990 0.010124 0.999863 0.000085
17 0.924221 0.009026 0.999893 0.000082
18 0.925718 0.008859 0.999929 0.000060
19 0.926104 0.007586 0.999959 0.000030
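When reading the two tables side by side, note that row i of the xgb.cv output reports the metric after i + 1 boosting rounds, so it lines up with n_estimators = i + 1 in the grid-search table. A minimal sketch of that alignment, using three test-auc-mean values copied from the tables above (the helper name auc_gap is only for illustration):

```python
# test-auc-mean values copied from the two tables above (1, 10 and 20 trees).
grid_search_auc = {1: 0.610051313783, 10: 0.884796282407, 20: 0.931493562339}
xgb_cv_auc_by_row = {0: 0.669881, 9: 0.903878, 19: 0.926104}


def auc_gap(n_estimators):
    """Return xgb.cv AUC minus GridSearchCV AUC for the same number of trees."""
    row = n_estimators - 1  # xgb.cv rows are 0-based, n_estimators is 1-based
    return xgb_cv_auc_by_row[row] - grid_search_auc[n_estimators]


for n in sorted(grid_search_auc):
    print(n, round(auc_gap(n), 6))
```

The gap is largest for the first rounds and shrinks as more trees are added, which matches what the full tables show.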
Answer 1: num_boost_round is the number of boosting iterations (i.e. n_estimators). xgboost.cv ignores n_estimators in the parameters and uses num_boost_round instead.
Try this:
cv_xgb = xgb.cv(best_params_grid_search,dtrain,num_boost_round =best_params_grid_search['n_estimators'],folds=skf_inner,metrics='auc',seed=seed,maximize=True)
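Beyond num_boost_round, it may also be worth checking which parameters actually reach xgb.cv. GridSearchCV's best_params_ only holds the parameters listed in param_grid (here just n_estimators), so the values in fixed_parameters are not carried over unless they are merged in explicitly. A minimal sketch of such a merge, assuming a hypothetical helper named to_native_cv_args (not part of xgboost or scikit-learn) and a shortened fixed_parameters dict:

```python
def to_native_cv_args(fixed_params, best_params):
    """Merge sklearn-style params and split off the number of boosting rounds.

    Returns (params, num_boost_round); the input dicts are not modified.
    """
    merged = dict(fixed_params)      # copy so the caller's dict stays intact
    merged.update(best_params)       # grid-search winners override fixed values
    num_boost_round = int(merged.pop('n_estimators'))
    return merged, num_boost_round


# Shortened stand-ins for the dicts defined in the question.
fixed_parameters = {'max_depth': 3, 'learning_rate': 0.3, 'subsample': 0.8,
                    'objective': 'binary:logistic'}
best_params_grid_search = {'n_estimators': 20}

params, num_rounds = to_native_cv_args(fixed_parameters, best_params_grid_search)
print(params, num_rounds)
```

The resulting params and num_rounds could then be passed as xgb.cv(params, dtrain, num_boost_round=num_rounds, folds=skf_inner, metrics='auc', seed=seed). Whether this closes the whole gap in the question is not confirmed by the source.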