Optimal Hyper-parameter Tuning for Tree Based Models

Posted: 2019-02-13 01:35:56

Question:

I am trying to build five machine-learning models and tune them with the grid search class so that they are tuned as well as possible, and so that I can use them to predict new data that arrives every day. The problem is that this takes far too long. So my question is: what level of parameter tuning is absolutely necessary, while taking no more than two hours to complete? Below is the tuning and classifier code I am using:

#Imports assumed by the snippet below
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer, precision_score, accuracy_score

#Training and Test Sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .20,
                                                    random_state = 10)

#Classifiers 
dtc = DecisionTreeClassifier()
randf = RandomForestClassifier()
bag = BaggingClassifier()
gradb = GradientBoostingClassifier()
knn = KNeighborsClassifier()
ada = AdaBoostClassifier()

#Hyperparamter Tuning for the Models being used

#Scoring Criteria 
scoring = {'precision': make_scorer(precision_score),
           'accuracy': make_scorer(accuracy_score)}

#Grid Search for the Decision Tree
param_dtc = {'min_samples_split': np.arange(2, 10),
             #np.arange needs an explicit step for fractional ranges
             #(the default step of 1 would yield a single value)
             'min_samples_leaf': np.arange(.05, .2, .05),
             'max_leaf_nodes': np.arange(2, 30)}
cv_dtc = GridSearchCV(estimator = dtc, param_grid = param_dtc, cv = 3,
                      scoring = scoring, refit='precision', n_jobs=-1)
#Grid Search for the Random Forest Model  
param_randf = {'n_estimators': np.arange(10, 20),
               'min_samples_split': np.arange(2, 10),
               'min_samples_leaf': np.arange(.15, .33, .05),
               'max_leaf_nodes': np.arange(2, 30),
               #bootstrap takes real booleans, not the strings 'True'/'False'
               'bootstrap': [True, False]}
cv_randf = GridSearchCV(estimator = randf, param_grid = param_randf, cv = 3,
                        scoring = scoring, refit='precision', n_jobs=-1)
#Grid Search for the Bagging Model 
param_bag = {'n_estimators': np.arange(10, 30),
             'max_samples': np.arange(2, 30),
             #both flags take real booleans, not the strings 'True'/'False'
             'bootstrap': [True, False],
             'bootstrap_features': [True, False]}
cv_bag = GridSearchCV(estimator = bag, param_grid = param_bag, cv = 3,
                      scoring = scoring, refit='precision', n_jobs=-1)
#Grid Search for the Gradient Boosting Model 
param_gradb = {'loss': ['deviance', 'exponential'],
               #explicit steps so the fractional ranges contain >1 value
               'learning_rate': np.arange(.05, .1, .01),
               'max_depth': np.arange(2, 10),
               'min_samples_split': np.arange(2, 10),
               'min_samples_leaf': np.arange(.15, .33, .05),
               'max_leaf_nodes': np.arange(2, 30)}
cv_gradb = GridSearchCV(estimator = gradb, param_grid = param_gradb, cv = 3,
                        scoring = scoring, refit='precision', n_jobs=-1)
#Grid Search for the Adaptive Boosting Model
param_ada = {'n_estimators': np.arange(10, 30),
             #explicit step so the fractional range contains >1 value
             'learning_rate': np.arange(.05, .1, .01)}
cv_ada = GridSearchCV(estimator = ada, param_grid = param_ada, cv = 3,
                      scoring = scoring, refit='precision', n_jobs=-1)

train_dict = {'dtc': cv_dtc.fit(x_train, y_train),
              'randf': cv_randf.fit(x_train, y_train),
              'bag': cv_bag.fit(x_train, y_train),
              'gradb': cv_gradb.fit(x_train, y_train),
              'ada': cv_ada.fit(x_train, y_train)}

Comments:

Answer 1:

    You could consider an iterative grid search. For example, instead of setting 'n_estimators' to np.arange(10, 30), set it to [10, 15, 20, 25, 30]. If the best parameter turns out to be 15, continue with [11, 13, 15, 17, 19]. You can find a way to automate this process; it will save a lot of time.

    Make use of your data. You are tuning a lot of hyperparameters. In a decision tree, the effects of 'min_samples_split', 'min_samples_leaf' and 'max_leaf_nodes' overlap, so it may not be necessary to define all of them.
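The iterative (coarse-to-fine) search suggested above can be sketched as a two-pass loop. The dataset here is synthetic (`make_classification`) standing in for the asker's `x_train`/`y_train`, and the ±4/±2 refinement neighbourhood is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical data in place of the asker's x_train / y_train.
x_train, y_train = make_classification(n_samples=300, random_state=10)

# Pass 1: coarse grid over n_estimators (5 candidates instead of 20).
coarse = GridSearchCV(RandomForestClassifier(random_state=10),
                      {'n_estimators': [10, 15, 20, 25, 30]},
                      cv=3, n_jobs=-1)
coarse.fit(x_train, y_train)
best = coarse.best_params_['n_estimators']

# Pass 2: finer grid centred on the coarse winner (clipped to stay >= 1).
fine_values = sorted({max(1, best + d) for d in (-4, -2, 0, 2, 4)})
fine = GridSearchCV(RandomForestClassifier(random_state=10),
                    {'n_estimators': fine_values},
                    cv=3, n_jobs=-1)
fine.fit(x_train, y_train)
print(fine.best_params_)
```

Two passes of 5 candidates each cost 10 cross-validated fits per parameter instead of the 20 in the original grid, and the gap narrows further with each refinement round.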

Discussion:
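Another commonly used way to put a hard bound on tuning time (a sketch, not part of the answer above) is `RandomizedSearchCV`: its `n_iter` parameter caps the number of candidate settings tried regardless of how large the search space is. The synthetic data and `n_iter=20` here are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical data in place of the asker's x_train / y_train.
x_train, y_train = make_classification(n_samples=300, random_state=10)

# Distributions are sampled rather than exhaustively enumerated.
param_dist = {
    'max_depth': randint(2, 10),
    'min_samples_split': randint(2, 10),
    'max_leaf_nodes': randint(2, 30),
    'learning_rate': [0.05, 0.1, 0.2],
}

# n_iter caps the number of parameter settings tried -- this, not the
# size of the space, is what bounds the runtime (here 20 x 3 CV fits).
search = RandomizedSearchCV(GradientBoostingClassifier(), param_dist,
                            n_iter=20, cv=3, n_jobs=-1, random_state=10)
search.fit(x_train, y_train)
print(search.best_params_)
```

Raising `n_iter` trades time for search quality, so a fixed two-hour budget can be met by choosing `n_iter` per model rather than shrinking each grid by hand.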
