Hyper-parameter tuning using pure ranger package in R

Posted 2023-02-23

【Asked】: 2016-09-27 15:08:03 【Question】:

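I love the speed of the ranger package for building random forest models, but I can't see how to tune mtry or the number of trees. I realize I can do this through caret's train() syntax, but I would prefer the speed increase that comes from using pure ranger.

Here is my example of creating a basic model with ranger (which works great):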

library(ranger)
data(iris)

fit.rf = ranger(
  Species ~ .,
  training_data = iris,
  num.trees = 200
)

print(fit.rf)

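Looking at the official documentation for tuning options, it seems the csrf() function may provide the ability to tune hyper-parameters, but I can't get the syntax right: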

library(ranger)
data(iris)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  params1 = list(num.trees = 25, mtry=4),
  params2 = list(num.trees = 50, mtry=4)
)

print(fit.rf.tune)

Result:

Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : 
  unused argument (training_data = iris)

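I would prefer to tune with the regular (read: non-csrf) rf algorithm that ranger offers. Any ideas on a hyper-parameter tuning solution for either path in ranger? Thanks!

【Comments】:

【Answer 1】:

To answer my own (unclear) question: apparently ranger has no built-in CV/grid-search functionality. However, here is how to do hyper-parameter tuning with ranger (via a grid search) outside of caret. Thanks go to Marvin Wright (the maintainer of ranger) for the code. It turns out that caret CV with ranger was slow for me because I was using the formula interface (which should be avoided).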

ptm <- proc.time()
library(ranger)
library(mlr)

# Define task and learner
task <- makeClassifTask(id = "iris",
                        data = iris,
                        target = "Species")

learner <- makeLearner("classif.ranger")

# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", 3, 4),
                   makeDiscreteParam("num.trees", 200))

# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
           control = makeTuneControlGrid())

# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, task) # reuse the task defined above

print(m)
print(proc.time() - ptm) # ~6 seconds

For the curious, the caret equivalent is:

ptm <- proc.time()
library(caret)
data(iris)

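# Recent caret versions expect all three ranger tuning parameters in the grid
grid <- expand.grid(mtry = c(3, 4),
                    splitrule = "gini",
                    min.node.size = 1)

fitControl <- trainControl(method = "cv",
                           number = 5,
                           verboseIter = TRUE)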

fit = train(
  x = iris[ , names(iris) != 'Species'],
  y = iris[ , names(iris) == 'Species'],
  method = 'ranger',
  num.trees = 200,
  tuneGrid = grid,
  trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds

Overall, caret is the fastest way to do a grid search with ranger, provided you use the non-formula interface.
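
The mlr grid in this answer fixes num.trees at a single value. As a minimal sketch (not part of Marvin Wright's code), listing several values in a discrete parameter should let the same grid search cover forest size as well. The candidate values 100, 200 and 500 and the seed are arbitrary choices made for illustration.

```r
library(mlr)  # the ranger package must also be installed

set.seed(1)

# Same task, learner and resampling strategy as in the answer
task    <- makeClassifTask(id = "iris", data = iris, target = "Species")
learner <- makeLearner("classif.ranger")
rdesc   <- makeResampleDesc("CV", iters = 5)

# Grid: every mtry value crossed with every candidate forest size
ps <- makeParamSet(
  makeDiscreteParam("mtry", values = 2:4),
  makeDiscreteParam("num.trees", values = c(100, 200, 500))
)

res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid(), show.info = FALSE)

print(res$x)  # selected mtry and num.trees
print(res$y)  # cross-validated performance (misclassification error by default)
```

On a dataset as small as iris the differences between forest sizes are likely to be within resampling noise, so the selected num.trees should not be over-interpreted.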

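【Comments】:

Thanks for these solutions. Quick question: is it possible to include a list of num.trees values in the search grid?

【Answer 2】:

I think there are at least two mistakes:

First, the function ranger has no argument called training_data. That is what your error message, Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : unused argument (training_data = iris), refers to. You can see this by looking at ?ranger or args(ranger).

Second, the function csrf, on the other hand, does take training_data as input, but it also requires test_data. Most importantly, neither argument has a default value, which means you must supply both. The following works without problems: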

fit.rf = ranger(
  Species ~ ., data = iris,
  num.trees = 200
)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  test_data = iris,
  params1 = list(num.trees = 25, mtry=4),
  params2 = list(num.trees = 50, mtry=4)
)

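Here I simply supplied iris as both the training and the test dataset. You obviously would not want to do that in a real application. Also, note that ranger itself takes num.trees and mtry as inputs, so you could try tuning them there.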
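
To avoid predicting on the training rows, the data can be split first. The following is a minimal sketch, assuming a random 100/50 split and an arbitrary seed (neither comes from the answer). It is also worth knowing that csrf fits case-specific random forests: according to the ranger documentation, params1 configures the proximity forest grown in the first step and params2 the prediction forests grown in the second step, so the two lists are not two competing candidate settings.

```r
library(ranger)

set.seed(42)

# Hold out 50 of the 150 rows as a test set (split size chosen for illustration)
train_idx  <- sample(nrow(iris), size = 100)
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# csrf returns predictions for the rows of test_data
pred <- csrf(
  Species ~ .,
  training_data = train_data,
  test_data     = test_data,
  params1 = list(num.trees = 50, mtry = 4),  # proximity forest
  params2 = list(num.trees = 5)              # case-specific prediction forests
)

# Share of held-out rows that were classified correctly
accuracy <- mean(as.character(pred) == as.character(test_data$Species))
print(accuracy)
```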

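【Comments】:

Great info, thanks! To your knowledge, is there no non-csrf route to hyper-parameter tuning in ranger?

Also, Zheyuan, I did originally ask whether a non-csrf option was available (and not only for a fix to the csrf example from the documentation). Very generous, guys, thank you.

Note, coffeinjunky: although the error message I posted says I used the ranger function, I actually used the csrf function (not sure whether you want to edit your answer). I'll email Marvin Wright (the maintainer) an FYI about this. Thanks again!

Also, coffeinjunky, if you are editing, would you mind adding a params1, params2 syntax example for tuning with the ranger function? Thanks!

Just add num.trees=5 or any other number, or mtry=3 or any other value up to the number of predictors, to your call, as in ranger(Species ~ ., data = iris, num.trees = 200, mtry=3).

【Answer 3】:

Note that mlr disables ranger's internal parallelization by default. Set the hyper-parameter num.threads to the number of cores available to speed up mlr: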

learner <- makeLearner("classif.ranger", num.threads = 4)

Alternatively, start a parallel backend

library(parallelMap)
parallelStartMulticore(4) # linux/osx
parallelStartSocket(4)    # windows

before calling tuneParams to parallelize the tuning.

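【Comments】:

【Answer 4】:

Another way to tune the model is to build the grid manually. There may be better ways to train the model, but this is one more option.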

library(ranger)

hyper_grid <- expand.grid(
  mtry       = 1:4,
  node_size  = 1:3,
  num.trees = seq(50,500,50),
  OOB_RMSE   = 0
)

system.time(
  for (i in 1:nrow(hyper_grid)) {
    # train model
    rf <- ranger(
      formula        = Species ~ .,
      data           = iris,
      num.trees      = hyper_grid$num.trees[i],
      mtry           = hyper_grid$mtry[i],
      min.node.size  = hyper_grid$node_size[i],
      importance = 'impurity')
    # add OOB error to grid
    # (for classification, prediction.error is the OOB misclassification
    # rate, so this column holds its square root rather than a true RMSE)
    hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)
  }
)
user  system elapsed 
3.17    0.19    1.36

nrow(hyper_grid) # 120 models
position = which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE),],5)
     mtry node_size num.trees     OOB_RMSE
6     2         2        50 0.1825741858
23    3         3       100 0.1825741858
3     3         1        50 0.2000000000
11    3         3        50 0.2000000000
14    2         1       100 0.2000000000

# fit best model
rf.model <- ranger(
  Species ~ .,
  data          = iris,
  num.trees     = hyper_grid$num.trees[position],
  importance    = 'impurity',
  probability   = FALSE,
  min.node.size = hyper_grid$node_size[position],
  mtry          = hyper_grid$mtry[position]
)
rf.model
Ranger result

Call:
 ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position]) 

    Type:                             Classification 
Number of trees:                  50 
Sample size:                      150 
Number of independent variables:  4 
Mtry:                             2 
Target node size:                 2 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error:             5.33 %

Hope this is useful to you.
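
The manual grid in this answer scores each setting by ranger's out-of-bag error. If cross-validated error is preferred, the same loop idea can be extended with hand-made folds. This is a minimal sketch rather than the answerer's code: the fold count, the seed, the grid values, and the names fold_id, cv_error and best are all assumptions made for illustration.

```r
library(ranger)

set.seed(42)
k <- 5

# Randomly assign each row of iris to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Candidate settings; cv_error is filled in by the loop below
grid <- expand.grid(mtry = 1:4, num.trees = c(100, 200, 500))
grid$cv_error <- NA_real_

for (i in seq_len(nrow(grid))) {
  fold_err <- numeric(k)
  for (f in seq_len(k)) {
    held_out <- fold_id == f
    # Fit on the other k - 1 folds
    fit <- ranger(
      Species ~ .,
      data      = iris[!held_out, ],
      num.trees = grid$num.trees[i],
      mtry      = grid$mtry[i]
    )
    # Misclassification rate on the held-out fold
    pred <- predict(fit, data = iris[held_out, ])$predictions
    fold_err[f] <- mean(as.character(pred) != as.character(iris$Species[held_out]))
  }
  grid$cv_error[i] <- mean(fold_err)
}

# Setting with the lowest mean cross-validated error
best <- grid[which.min(grid$cv_error), ]
print(best)
```

This fits 60 small forests, so it stays fast on iris, but the cost grows with the grid size times the number of folds, which is where the OOB shortcut used in the answer becomes attractive.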

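【Comments】:

【Answer 5】:

There is also the tuneRanger R package, which is designed specifically for tuning ranger. It comes with predefined tuning parameters and hyper-parameter spaces, and tunes intelligently using the out-of-bag observations.

Note that random forest is not an algorithm where tuning usually makes a big difference. But it can normally improve performance a little.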
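
For orientation, a call to tuneRanger might look like the sketch below. It is modeled on the package's README as best I recall it, so the argument values (measure, num.trees, num.threads, iters) should be treated as assumptions and checked against ?tuneRanger before use.

```r
library(mlr)
library(tuneRanger)

# tuneRanger works on an mlr task
task <- makeClassifTask(data = iris, target = "Species")

# Model-based tuning of ranger using out-of-bag predictions;
# by default it tunes mtry, min.node.size and sample.fraction
res <- tuneRanger(task,
                  measure     = list(multiclass.brier),
                  num.trees   = 500,
                  num.threads = 2,
                  iters       = 70)

print(res$recommended.pars)  # suggested hyper-parameter values
print(res$model)             # final model trained with those values
```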

【Comments】:
