Getting a specific random forest variable importance measure from the mlr package's resample function

Posted 2023-03-12

【Posted】: 2020-03-14 07:52:51 【Question】:

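I am using the mlr package's resample() function to subsample a random forest model 4000 times (code snippet below).

As you can see, I use the randomForest package to build the random forest models inside resample().

I want to get the random forest importance results (mean decrease in accuracy over all classes) for each subsample iteration. The only importance measure I can get right now is the mean decrease in Gini index.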

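From the mlr source code I can see that the getFeatureImportanceLearner.classif.randomForest() function (line 69) in makeRLearner.classif.randomForest uses randomForest::importance() (line 83) to get the importance values from the resulting object of class randomForest. However, as the source code shows (line 73), it uses 2L as the default value. I want it to use 1L (line 75), which gives the mean decrease in accuracy.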
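For reference, the two `type` values can be compared on a plain randomForest fit, outside mlr. This is a minimal sketch on the built-in iris data; the data set, seed, and ntree value are assumptions chosen only for illustration.

```r
library(randomForest)

set.seed(1)
# importance = TRUE is needed so that the permutation-based measure is computed
fit <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)

# type = 1: mean decrease in accuracy (permutation importance)
mda <- importance(fit, type = 1)

# type = 2: mean decrease in node impurity (Gini index for classification)
gini <- importance(fit, type = 2)

print(colnames(mda))
print(colnames(gini))
```

Both calls return a one-column matrix with one row per predictor, so the column name shows which measure was returned.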

How can I pass the value 1L to the resample() function (the "extract = getFeatureImportance" line in the code below) so that getFeatureImportanceLearner.classif.randomForest() picks it up and sets ctrl$type = 1L instead of the default 2L (line 73)?

library(mlr)

# 'data' is the asker's data frame (not shown); its first column is dropped
rf_task <- makeClassifTask(id = 'task',
                           data = data[, -1], target = 'target_var',
                           positive = 'positive_var')

rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 1000, importance = TRUE),
                          predict.type = 'prob')

# rf_boot_desc (the resampling description) is not defined in the original post
base_subsample_instance <- makeResampleInstance(rf_boot_desc, rf_task)

rf_subsample_result <- resample(rf_learner, rf_task,
                                base_subsample_instance,
                                extract = getFeatureImportance,
                                measures = list(acc, auc, tpr, tnr,
                                                ppv, npv, f1, brier))

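My workaround: download the source code of the mlr package, change line 73 of the source file to 1L (https://github.com/mlr-org/mlr/blob/v2.15.0/R/RLearner_classif_randomForest.R), then install the package from the command line and use it. Not the best solution, but a solution.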

【Answer 1】:

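Many of the details you provide are not really relevant to your question, at least as I understand it. So I wrote a simple MWE that contains the answer. The idea is that you have to write a short wrapper around getFeatureImportance so that you can pass your own arguments. Fans of purrr can do this with purrr::partial(getFeatureImportance, type = 2), but here I wrote myExtractor by hand.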

library(mlr)
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 100, importance = TRUE),
                          predict.type = 'prob')

measures = list(acc, auc, tpr, tnr,
                ppv, npv, f1, brier)

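# type = 2 is the default (mean decrease in Gini);
# use type = 1 for the mean decrease in accuracy
myExtractor = function(.model, ...) {
  getFeatureImportance(.model, type = 2, ...)
}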

res = resample(rf_learner, sonar.task, cv10, 
               measures = measures, extract = myExtractor)

# first feature importance result:
res$extract[[1]]

# all values in a matrix:
sapply(res$extract, function(x) x$res)
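As a variation on the wrapper idea above, the extractor can also read the importance straight from the fitted randomForest object, which makes the per-iteration result a plain named vector. The sketch below is not part of the original answer: the helper name extractMDA, the small Subsample setup, and the ntree value are assumptions for illustration only.

```r
library(mlr)

# Hypothetical extractor: reads the importance directly from the fitted
# randomForest object instead of going through getFeatureImportance()
extractMDA = function(.model) {
  rf_fit = getLearnerModel(.model, more.unwrap = TRUE)
  # type = 1: mean decrease in accuracy; requires importance = TRUE at training
  randomForest::importance(rf_fit, type = 1)[, 1]
}

rf_learner = makeLearner('classif.randomForest',
                         par.vals = list(ntree = 50, importance = TRUE))

# Small subsampling setup, chosen only to keep the example fast
rdesc = makeResampleDesc('Subsample', iters = 5, split = 0.8)

set.seed(1)
res = resample(rf_learner, sonar.task, rdesc,
               measures = acc, extract = extractMDA, show.info = FALSE)

# One row per feature, one column per resampling iteration
imp_mat = do.call(cbind, res$extract)

# Average importance per feature across iterations, largest first
print(head(sort(rowMeans(imp_mat), decreasing = TRUE)))
```

Because each element of res$extract is a named numeric vector, the results bind into a numeric matrix without depending on the internal layout of the getFeatureImportance return value.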

If you want to do bootstrap-based learning, you may also want to look at makeBaggingWrapper instead of solving this through resample.
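A minimal sketch of what the makeBaggingWrapper route could look like. The base learner (classif.rpart) and the bw.iters and bw.size values are assumptions picked for illustration, not something the original answer specifies.

```r
library(mlr)

base_learner = makeLearner('classif.rpart')

# Wrap the base learner so that training fits several models on
# bootstrap samples (bw.replace = TRUE) of 80% of the data each
bag_learner = makeBaggingWrapper(base_learner, bw.iters = 10,
                                 bw.replace = TRUE, bw.size = 0.8)
bag_learner = setPredictType(bag_learner, 'prob')

set.seed(1)
bag_model = train(bag_learner, sonar.task)

# Predictions on the training task, so this accuracy is optimistic
bag_pred = predict(bag_model, task = sonar.task)
bag_acc = performance(bag_pred, measures = acc)
print(bag_acc)
```

The wrapped learner behaves like any other mlr learner, so it can be passed to resample() in the same way as rf_learner above.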
