Parameter tuning using GridSearchCV for a gradient boosting classifier in Python
【Posted】: 2020-03-05 23:22:53
【Question】: I am trying to run GradientBoostingClassifier() with the help of GridSearchCV. For every combination of parameters I also need precision, recall, and accuracy in tabular form.
The code is as follows:
scoring = ['accuracy', 'precision', 'recall']
parameters = {
    # 'nthread': [3, 4],  # when using hyperthreading, xgboost may become slower
    "criterion": ["friedman_mse", "mae"],
    "loss": ["deviance", "exponential"],
    "max_features": ["log2", "sqrt"],
    'learning_rate': [0.01, 0.05, 0.1, 1, 0.5],  # so called `eta` value
    'max_depth': [3, 4, 5],
    'min_samples_leaf': [4, 5, 6],
    'subsample': [0.6, 0.7, 0.8],
    'n_estimators': [5, 10, 15, 20],  # number of trees, change it to 1000 for better results
    'scoring': scoring
}
# sorted(sklearn.metrics.SCORERS.keys())  # To see different loss functions
# clf_xgb = GridSearchCV(xgb_model, parameters, n_jobs=5, verbose=2, refit=True, cv=8)
clf_gbm = GridSearchCV(gbm_model, parameters, n_jobs=5, cv=8)
clf_gbm.fit(X_train, y_train)
print(clf_gbm.best_params_)
print(clf_gbm.best_score_)
feature_importances = pd.DataFrame(clf_gbm.best_estimator_.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
depth = clf_gbm.cv_results_["param_max_depth"]
score = clf_gbm.cv_results_["mean_test_score"]
params = clf_gbm.cv_results_["params"]
I get the error:
ValueError: Invalid parameter seed for estimator GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.01, loss='deviance', max_depth=3,
max_features='log2', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=4, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5, presort='auto',
random_state=None, subsample=1.0, verbose=0,
warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
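The error message suggests checking `estimator.get_params().keys()`. The sketch below shows one way to run that check against a parameter grid before starting the search. The helper name `invalid_grid_keys` and the small grid are hypothetical and only illustrate the idea; they are not taken from the question's actual code.

```python
from sklearn.ensemble import GradientBoostingClassifier


def invalid_grid_keys(estimator, param_grid):
    """Return the grid keys that the estimator does not accept (hypothetical helper)."""
    valid = set(estimator.get_params().keys())
    return sorted(key for key in param_grid if key not in valid)


# Hypothetical grid mixing valid and invalid keys, for illustration only.
grid = {
    "max_depth": [3, 4, 5],
    "seed": [1337],            # not a GradientBoostingClassifier parameter
    "scoring": ["accuracy"],   # an argument of GridSearchCV, not of the estimator
    "random_state": [1337],    # the supported way to fix the random seed
}

bad_keys = invalid_grid_keys(GradientBoostingClassifier(), grid)
print(bad_keys)
```

Any key reported here would trigger the same "Invalid parameter" ValueError once GridSearchCV tries to set it on the estimator.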
【Comments】:
GradientBoostingClassifier supports only the following parameters. It has no 'seed' or 'missing' parameter and uses random_state as the seed instead. Supported parameters: loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001
So I just need to remove seed?
Yes, remove seed and missing, and use random_state as the seed.
Even after that I get an error: ValueError: Invalid parameter score for estimator GradientBoostingClassifier(criterion='friedman_mse', init=None, Can you check?
【Answer 1】:
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

# Create the scoring parameter: one named scorer per metric
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score)
}

# A sample parameter grid
# (newer scikit-learn releases rename the "deviance" loss to "log_loss" and no longer accept the "mae" criterion)
parameters = {
    "loss": ["deviance"],
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 12),
    "min_samples_leaf": np.linspace(0.1, 0.5, 12),
    "max_depth": [3, 5, 8],
    "max_features": ["log2", "sqrt"],
    "criterion": ["friedman_mse", "mae"],
    "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
    "n_estimators": [10]
}

# Pass the scoring dict to GridSearchCV (as its own argument, not inside the parameter grid)
clf = GridSearchCV(GradientBoostingClassifier(), parameters, scoring=scoring, refit=False, cv=2, n_jobs=-1)
clf.fit(trainX, trainY)  # trainX, trainY: training features and labels

# Convert clf.cv_results_ to a DataFrame
df = pd.DataFrame.from_dict(clf.cv_results_)

# With cv=2 there are two splits, split0 and split1
df[['split0_test_accuracy', 'split1_test_accuracy', 'split0_test_precision', 'split1_test_precision', 'split0_test_recall', 'split1_test_recall']]
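As a compact variant of the table above, the per-metric averages can be read directly from the `mean_test_<metric>` columns of `cv_results_`. The sketch below is a minimal example under stated assumptions: it uses synthetic data from `make_classification` in place of the real training set, and a deliberately tiny grid so that it runs quickly. Neither comes from the original answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the real training set (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Tiny illustrative grid; a real search would use a larger one.
small_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
metrics = ["accuracy", "precision", "recall"]

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    small_grid,
    scoring=metrics,  # a list of scorer names enables multi-metric evaluation
    refit=False,      # no single metric is chosen for refitting here
    cv=2,
)
search.fit(X, y)

# One row per parameter combination, one column per averaged metric.
results = pd.DataFrame(search.cv_results_)
columns = ["params"] + [f"mean_test_{m}" for m in metrics]
table = results[columns]
print(table)
```

Each row of `table` corresponds to one parameter combination, which is the tabular form the question asks for.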
Find the best parameters based on accuracy_score, precision_score, or recall, then refit the model and predict on the test data:
# Find the best parameters based on accuracy:
# average the accuracy over the two splits (cv_results_ also provides this value as 'mean_test_accuracy')
df['accuracy_score'] = (df['split0_test_accuracy'] + df['split1_test_accuracy']) / 2
df.loc[df['accuracy_score'].idxmax()]['params']
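An alternative to picking the best row by hand is to name one metric in `refit`. GridSearchCV then exposes `best_params_`, `best_score_`, and a refitted `best_estimator_` for that metric while still recording the others. This is a sketch on synthetic data with a tiny, hypothetical grid, not the answer's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the real training set (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    {"max_depth": [2, 3], "subsample": [0.8, 1.0]},  # illustrative grid
    scoring=["accuracy", "precision", "recall"],
    refit="accuracy",  # select and refit on the best mean accuracy
    cv=2,
)
search.fit(X, y)

best_params = search.best_params_
best_accuracy = search.best_score_  # mean cross-validated accuracy of the winner
print(best_params, best_accuracy)
```

With `refit` set this way, `search.predict(...)` uses the estimator refitted on the whole training set, so the manual refit step becomes optional.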
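Prediction on the test data:
# Refit a model with the best parameters found above
clf = GradientBoostingClassifier(criterion='mae',
                                 learning_rate=0.1,
                                 loss='deviance',
                                 max_depth=5,
                                 max_features='sqrt',
                                 min_samples_leaf=0.1,
                                 min_samples_split=0.42727272727272736,
                                 n_estimators=10,
                                 subsample=0.8)
clf.fit(trainX, trainY)
# correct_data, test, and predictor are not defined in this answer; they appear to be
# the author's own preprocessing function, test DataFrame, and list of feature columns
correct_test = correct_data(test)
testX = correct_test[predictor].values
result = clf.predict(testX)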
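If labels are available for the held-out data, the same three metrics can be computed on the predictions. The following is a minimal sketch that assumes synthetic data and a `train_test_split` hold-out in place of the answer's undefined `correct_data`, `test`, and `predictor`. The hyperparameters are illustrative and omit the `criterion` and `loss` settings, whose accepted values differ between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data and a hold-out split standing in for the real train/test sets (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative hyperparameters, not the result of an actual search.
model = GradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=5,
    max_features="sqrt",
    n_estimators=10,
    subsample=0.8,
    random_state=0,
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Report the three metrics the question asks about, on unseen data.
report = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
}
print(report)
```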