R - Caret train() "Error: Stopping" with "Not all variable names used in object found in newdata"
Posted: 2021-02-21 17:39:48

Question:
I am trying to build a simple Naive Bayes classifier for the mushroom data. I want to use all of the variables as categorical predictors to predict whether or not a mushroom is edible.
I am using the caret package.
Here is my full code:
##################################################################################
# Prepare R and R Studio environment
##################################################################################
# Clear the R studio console
cat("\014")
# Remove objects from environment
rm(list = ls())
# Install and load packages if necessary
if (!require(tidyverse))
  install.packages("tidyverse")
library(tidyverse)

if (!require(caret))
  install.packages("caret")
library(caret)

if (!require(klaR))
  install.packages("klaR")
library(klaR)
#################################
mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)
na.omit(mushrooms)
names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")
# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'
set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)
train <- mushrooms[split, ]
test <- mushrooms[-split, ]
predictors <- names(train)[2:20] #Create response and predictor data
x <- train[,predictors] #predictors
y <- train$edibility #response
train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation
edibility_mod1 <- train( #train the model
x = x,
y = y,
method = "nb",
trControl = train_control
)
When the train() function executes, I get the following output:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :2 NA's :2
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) :
Not all variable names used in object found in newdata
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
x and y after the script has run:
> str(x)
'data.frame': 6500 obs. of 19 variables:
$ capShape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ capSurface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ cap-color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ bruises : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
$ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ gill-attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ gill-spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ gill-size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ gill-color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ stalk-shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk-root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-color-above-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk-color-below-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil-type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ veil-color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ ring-number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ ring-type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
> str(y)
Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
My environment is:
> R.version
_
platform x86_64-apple-darwin17.0
arch x86_64
os darwin17.0
system x86_64, darwin17.0
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
> RStudio.Version()
$citation
To cite RStudio in publications use:
RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
A BibTeX entry for LaTeX users is
@Manual{,
  title = {RStudio: Integrated Development Environment for R},
  author = {{RStudio Team}},
  organization = {RStudio, PBC},
  address = {Boston, MA},
  year = {2020},
  url = {http://www.rstudio.com/},
}
$mode
[1] "desktop"
$version
[1] ‘1.3.1093’
$release_name
[1] "Apricot Nasturtium"
Comments:

Could be a class-imbalance problem with the target variable: stats.stackexchange.com/questions/192884/… Also, not sure whether the target variable needs to be a factor? You are reading it in as text...

Wouldn't class imbalance give me a more explicit error than this? I'll look into it anyway. y is a factor; I've updated the question with the output to show it, in case that helps.

The x and y output in the edited question shows that all the x variables are factors except one logical. I'll check for NAs, good idea.

Answer 1:

What you are trying to do is a bit tricky: the simplest naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), use a normal distribution. You can see the details in the naiveBayes help page from e1071:
The standard naive Bayes classifier (at least this implementation) assumes independence of the predictor variables, and Gaussian distribution (given the target class) of metric predictors. For attributes with missing values, the corresponding table entries are omitted for prediction.
And your predictors are categorical, so this is likely to be the problem. You can try setting usekernel = TRUE and adjust = 1 to keep the kernel density estimate close to normal, and avoid usekernel = FALSE, which is what throws the error.
Before that, we drop the column that has only one level and tidy up the column names; in this case it is also easier to use the formula interface and avoid creating dummy variables:
df = train
levels(df[["veil-type"]])   # only one level:
[1] "p"
df[["veil-type"]] = NULL    # drop the single-level column
colnames(df) = gsub("-", "_", colnames(df))   # "-" in names breaks the formula interface

Grid = expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))

mod1 <- train(edibility ~ ., data = df,
              method = "nb",
              trControl = trainControl(method = "cv", number = 5),
              tuneGrid = Grid)
mod1
Naive Bayes
6500 samples
21 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200
Resampling results across tuning parameters:
fL Accuracy Kappa
0.2 0.9243077 0.8478624
0.5 0.9243077 0.8478624
0.8 0.9243077 0.8478624
Tuning parameter 'usekernel' was held constant at a value of TRUE
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
adjust = 1.
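A possible follow-up, not part of the original answer: a quick sketch of checking the fitted model against the held-out test split from the question. The same column clean-up has to be applied so the variable names match what the model was trained on; df_test and preds are illustrative names.

# Sketch only: evaluate mod1 on the question's test split (assumes `test` and `mod1` exist).
df_test <- test
df_test[["veil-type"]] <- NULL                          # drop the single-level column, as in training
colnames(df_test) <- gsub("-", "_", colnames(df_test))  # match the cleaned training names
preds <- predict(mod1, newdata = df_test)               # caret dispatches to the klaR naive Bayes fit
confusionMatrix(preds, df_test$edibility)               # accuracy, kappa, sensitivity, specificity, ...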
Comments:

If my predictors are non-metric, i.e. categorical/nominal/factors, why does the NB algorithm need a Gaussian distribution or a non-parametric kernel technique at all? I'm new to this, so please let me know what I'm missing here. I'm now trying to switch to the multinomial_naive_bayes() function, which I think may be a better fit, but I don't know how to do the post-processing; see my question here: ***.com/questions/64819019/…

The model needs to evaluate the conditional probability of an observation given the predictors, and most implementations assume your predictors are Gaussian. Have a look at sebastianraschka.com/Articles/…; the rest of that post explains how the multinomial version works.
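To make that last point concrete, here is a small hand-rolled sketch (not from the discussion) of how the conditional probability for a single categorical predictor can be estimated from a frequency table, which is what a categorical/multinomial-style naive Bayes does instead of fitting a Gaussian. The odor column and the Laplace smoothing of 1 are just illustrative choices.

# Illustrative only: estimate P(odor = level | edibility = class) by counting on the question's train split.
tab <- table(train$edibility, train$odor)                       # rows: class, columns: odor level
cond_prob <- sweep(tab + 1, 1, rowSums(tab) + ncol(tab), "/")   # add-one (Laplace) smoothed frequencies
cond_prob["e", "a"]                                             # e.g. P(odor = "a" | edible)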