Error with mnLogLoss for a multinomial classifier using caret/gbm

Posted: 2020-10-12 20:52:06

Question:

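I am trying to fit a multinomial classifier. It seems to work, and I can produce a plot of logLoss against boosting iterations, but I cannot extract the error value. This is the error I get when I run the mnLogLoss function: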

Error in mnLogLoss(predicted, lev = predicted$label) : 
  'data' should have columns consistent with 'lev'

The data has been partitioned into:
- training
- testing
- in both, the column "label" contains the ground truth

library(MLmetrics)
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
                           savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)


gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)

system.time(
  gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
                   verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
)

gbmPredictions <- predict(gbmFit1, testing)
predicted <- cbind(gbmPredictions, testing)

mnLogLoss(predicted, lev = levels(predicted$label))

Answer 1:

For mnLogLoss, the documentation says:

data: a data frame with columns ‘obs’ and ‘pred’ for the observed
          and predicted outcomes. For metrics that rely on class
          probabilities, such as ‘twoClassSummary’, columns should also
          include predicted probabilities for each class. See the
          ‘classProbs’ argument to ‘trainControl’.
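
To see why the call in the question fails, here is a minimal base-R sketch that compares the two layouts. It assumes, going by the error message, that mnLogLoss needs one column per class level in addition to obs and pred. The helper missing_columns and all values are hypothetical and are not part of caret.

```r
# Toy stand-in for cbind(gbmPredictions, testing) from the question.
# All values are made up for illustration.
question_layout <- data.frame(
  gbmPredictions = factor(c("a", "b", "a"), levels = c("a", "b")),
  label          = factor(c("a", "b", "b"), levels = c("a", "b")),
  X1             = c(0.12, 0.57, 0.93)
)

# Layout described in the help text: obs, pred, and one probability
# column per class, named after the class levels.
expected_layout <- data.frame(
  obs  = factor(c("a", "b", "b"), levels = c("a", "b")),
  pred = factor(c("a", "b", "a"), levels = c("a", "b")),
  a    = c(0.7, 0.2, 0.6),
  b    = c(0.3, 0.8, 0.4)
)

# Hypothetical helper (not a caret function): names of the columns
# that the documented layout asks for but the data frame lacks.
missing_columns <- function(data, lev) {
  setdiff(c("obs", "pred", lev), colnames(data))
}

lev <- c("a", "b")
missing_question <- missing_columns(question_layout, lev)
missing_expected <- missing_columns(expected_layout, lev)

print(missing_question)  # columns absent from the question's layout
print(missing_expected)  # empty when the layout matches
```

The question's data frame has the predictions and the ground truth, but under the wrong names and without any class-probability columns, which is consistent with the "columns consistent with 'lev'" message.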

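So it is not asking for the training data. The data argument here is just an input holding the observed classes, the predicted classes, and the class probabilities, so I use some simulated data: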

library(caret)

# Simulated two-class data: 100 rows, 5 random predictors.
# No seed is set, so exact numbers will differ between runs.
df = data.frame(label=factor(sample(c("a","b"),100,replace=TRUE)),
                matrix(runif(500),nrow=100))
# Disjoint rows for fitting and for evaluation
training = df[1:50,]
testing = df[51:100,]

fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
                           savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)

gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)

gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
                 verbose = FALSE, metric = "logLoss", tuneGrid = gbmGrid1)

We put obs and pred together, and the last two columns are the probabilities for each class:

predicted <- data.frame(obs=testing$label,
pred=predict(gbmFit1, testing),
predict(gbmFit1, testing,type="prob"))

head(predicted)

  obs pred         a         b
1   b    a 0.5506054 0.4493946
2   b    a 0.5107631 0.4892369
3   a    b 0.4859799 0.5140201
4   b    a 0.5090264 0.4909736
5   b    b 0.4545746 0.5454254
6   a    a 0.6211514 0.3788486

mnLogLoss(predicted, lev = levels(predicted$obs))
  logLoss 
0.6377392
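
As a cross-check, the multinomial log loss can be computed by hand in base R: it is the negative mean of the log probability assigned to the observed class. The sketch below uses a hypothetical data frame named toy with made-up probabilities. It does not clip probabilities of exactly 0 or 1, which library implementations typically do, so treat it as an illustration and not as a replacement for mnLogLoss.

```r
# Made-up predictions in the obs / pred / per-class probability layout.
toy <- data.frame(
  obs  = factor(c("a", "b", "b", "a"), levels = c("a", "b")),
  pred = factor(c("a", "b", "a", "a"), levels = c("a", "b")),
  a    = c(0.8, 0.3, 0.6, 0.9),
  b    = c(0.2, 0.7, 0.4, 0.1)
)

lev <- levels(toy$obs)

# Probability columns, ordered to match the factor levels.
prob_matrix <- as.matrix(toy[, lev, drop = FALSE])

# For each row, pick the probability assigned to the observed class.
# as.integer() on the factor gives the column index because the
# columns are in level order.
row_col <- cbind(seq_len(nrow(toy)), as.integer(toy$obs))
p_true <- prob_matrix[row_col]

# Negative mean log-likelihood of the observed classes.
manual_logloss <- -mean(log(p_true))
print(manual_logloss)
```

With the predicted data frame built above, replacing toy with predicted should give a value close to the mnLogLoss result, as long as no predicted probability is exactly 0 or 1.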
