Is it possible to alter the bootstrapping and/or the subsampling scheme used in randomForest?

Posted 2023-03-12


【Posted】: 2017-01-09 15:50:42 【Question】:

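I am training a random forest on multilevel data, essentially treating it as a nonparametric regression model. I account for unobserved group-level heterogeneity with a correction outside the random forest training procedure. Within the training procedure, however, I would like each tree to be grown from a random bootstrap sample (or subsample) of cross-sectional units rather than of observations. So, supposing my data are multiple observations on many individuals, I would like to bootstrap the individuals, not the observations on individuals.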
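To make the distinction concrete, here is a minimal base R sketch of resampling individuals rather than rows. The toy `id` vector (5 individuals, 3 observations each) and the names `drawn`, `inbag_rows`, and `oob_rows` are hypothetical and only serve the illustration; nothing here touches randomForest itself.

```r
# Hypothetical toy data: 5 individuals observed 3 times each
set.seed(1)
id <- rep(1:5, each = 3)

# Draw individuals (not rows) with replacement
ids <- unique(id)
drawn <- sample(ids, size = length(ids), replace = TRUE)

# Expand each drawn individual to all of its rows;
# an individual drawn twice contributes its rows twice
inbag_rows <- unlist(lapply(drawn, function(g) which(id == g)))

# Rows of individuals that were never drawn form the out-of-bag set
oob_rows <- which(!(id %in% drawn))
```

Under this scheme an individual is either entirely in the bag (possibly several times) or entirely out of it, which is the property an observation-level bootstrap does not provide.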

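The dummy example below shows that strata does not solve my problem: with strata = fac, every group still contributes only part of its 100 rows to the tree (between 54 and 72 here), instead of being either wholly in or wholly out of the bag.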

> N <- 1000
> p <- 100
> A <- matrix(rnorm(p^2),p)
> library(MASS)
> X <- mvrnorm(N, rep(0,p), A %*%t(A))
> B <- rnorm(p)
> fac <- sample(1:1000 %% 10 +1)
> y <- log(fac + exp(X%*%B)^1/fac) + rnorm(N, sd = 10)
> fac <- as.factor(fac)
> library(randomForest)
> forest <- randomForest(y = y, x = cbind(X, fac), ntree = 1, keep.inbag = TRUE, replace = FALSE
+  , strata = fac                        #Stratify by the factor
+ )
> sum(forest$inbag[fac == '1'])
[1] 62
> sum(forest$inbag[fac == '2'])
[1] 60
> sum(forest$inbag[fac == '3'])
[1] 60
> sum(forest$inbag[fac == '4'])
[1] 64
> sum(forest$inbag[fac == '5'])
[1] 64
> sum(forest$inbag[fac == '6'])
[1] 65
> sum(forest$inbag[fac == '7'])
[1] 54
> sum(forest$inbag[fac == '8'])
[1] 72
> sum(forest$inbag[fac == '9'])
[1] 62
> sum(forest$inbag[fac == '10'])
[1] 69

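Alternatively, I could grow individual random-subspace trees on samples of my own choosing and then combine the trees manually. This is done below.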

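library(foreach)
library(randomForest)
"%ni%" <- Negate("%in%")
# Assumption: Xmat is not defined in the original post; here it holds the
# predictors, the response, and the group factor in one data frame
Xmat <- data.frame(X, y = as.numeric(y), fac = fac)
rf_cluster_bootstrap <- foreach(j = 1:10) %do% {
  set.seed(j)
  sampfac <- sample(unique(fac), replace = TRUE)   # bootstrap whole groups
  unsampfac <- unique(fac[fac %ni% sampfac])       # groups left out of the bag
  Xt <- foreach(i = sampfac, .combine = rbind) %do% Xmat[Xmat$fac == i, ]
  yt <- Xt$y                                       # keep the response aligned with the resampled rows
  Xt$y <- NULL
  Xt$fac <- NULL
  fj <- randomForest(y = yt, x = Xt, ntree = 1, sampsize = nrow(Xt), replace = FALSE, keep.inbag = TRUE)
  Xnt <- foreach(i = unsampfac, .combine = rbind) %do% Xmat[Xmat$fac == i, ]
  Xnt$y <- NULL
  Xnt$fac <- NULL
  pred <- predict(fj, newdata = Xnt)               # predictions for out-of-bag groups only
  oob_outvec <- rep(NA, N)
  oob_outvec[as.numeric(names(pred))] <- pred      # place them at their original row positions
  list(fj = fj, oob = oob_outvec)
}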

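While this seems to work, I would need to write my own predict function, keep track of row names, and so on. There is plenty of room for coding errors and other unexpected behavior. For example, here is a function to combine the output: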

combineFunc <- function(x) {
  rflist <- lapply(x, `[[`, 'fj')
  # rf <- randomForest::combine(rflist)#I don't know why this doesn't work
  # (possibly because combine() expects the forests as separate arguments, not as one list)
  rf <- foreach(i = seq_along(rflist), .combine = randomForest::combine) %do% rflist[[i]]
  return(rf)
}

xx <- combineFunc(rf_cluster_bootstrap)
head(xx$inbag)
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
101    0    1    0    3    1    0    0    0    0     0
102    3    0    1    0    2    2    1    3    2     1
103    0    1    0    2    1    1    0    1    0     1
104    0    2    1    0    3    0    1    0    2     0
105    1    0    0    1    1    0    0    1    1     0
106    0    0    3    1    1    0    1    1    2     1

Even basic things like the inbag matrix are garbled. I could fix that, but I am unlikely to catch everything.
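One possible way to rebuild a consistent in-bag matrix is to derive it from the group draws rather than from the combined forest. The sketch below is base R only and rests on assumptions: the toy `id` vector, the `draws` list (one vector of drawn groups per tree), and the resulting `inbag` matrix are hypothetical names, not part of the randomForest object.

```r
# Hypothetical toy setup: 4 groups with 2 rows each, 3 trees
set.seed(2)
id <- rep(1:4, each = 2)
groups <- unique(id)
ntree <- 3

# One cluster bootstrap draw per tree
draws <- lapply(seq_len(ntree), function(j) sample(groups, replace = TRUE))

# inbag[i, j] = number of times the group of row i was drawn for tree j,
# indexed by original row position instead of resampled row names
inbag <- sapply(draws, function(d) {
  counts <- table(factor(d, levels = groups))
  as.integer(counts[as.character(id)])
})
```

Rows belonging to the same group always share the same count, and a count of zero marks the rows that are out of bag for that tree.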

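Before I do this from scratch, I would like to know whether something already implemented does what I want, or whether there is a simpler or more elegant way.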

(This thread is similar, but it uses rpart, which cannot handle random subspaces.)


【Answer 1】:

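I would have left this as a comment, but I do not have enough reputation yet. It is not a definitive answer, but I wanted to leave something for others who come across this post in a similar search.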

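I have not tried it personally, but I came across a relatively recent implementation of blocked bootstrapping for random forests in R (blockForest https://rdrr.io/cran/blockForest/) that seems to go in the direction of what you want to do.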

Reference: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2942-y

