Decision tree in R is not forming with my training data


Posted: 2019-04-27 15:05:29

Question:
library(caret)
library(rpart.plot)
car_df <- read.csv("TrainingDataSet.csv", sep = ',', header = TRUE)
str(car_df)

set.seed(3033)
intrain <- createDataPartition(y = car_df$Result, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
dim(training)
dim(testing)
anyNA(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(Result ~., data = training, method = "rpart",
               parms = list(split = "infromation"),
               trControl=trctrl,
               tuneLength = 10)

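I get this warning:

Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.

I am trying to classify whether a movie is a hit or a flop using the number of positive and negative sentiments. Here is my data: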

  dput(car_df) 

structure(list(MovieName = structure(c(20L, 5L, 31L, 26L, 27L, 
12L, 36L, 29L, 38L, 4L, 6L, 8L, 10L, 15L, 18L, 21L, 24L, 34L, 
35L, 7L, 37L, 25L, 23L, 2L, 11L, 40L, 33L, 28L, 14L, 3L, 17L, 
16L, 32L, 22L, 30L, 1L, 19L, 39L, 9L, 13L), .Label = c("#96Movie", 
"#alphamovie", "#APrivateWar", "#AStarIsBorn", "#BlackPanther", 
"#BohemianRhapsody", "#CCV", "#Creed2", "#CrimesOfGrindelwald", 
"#Deadpool2", "#firstman", "#GameNight", "#GreenBookMovie", "#grinchmovie", 
"#Incredibles2", "#indivisiblemovie", "#InstantFamily", "#JurassicWorld", 
"#KolamaavuKokila", "#Oceans8", "#Overlord", "#PariyerumPerumal", 
"#RalphBreaksTheInternet", "#Rampage", "#Ratchasan", "#ReadyPlayerOne", 
"#RedSparrow", "#RobinHoodMovie", "#Sarkar", "#Seemaraja", "#Skyscraper", 
"#Suspiria", "#TheLastKey", "#TheNun", "#ThugsOfHindostan", "#TombRaider", 
"#VadaChennai", "#Venom", "#Vishwaroopam2", "#WidowsMovie"), class = "factor"), 
    PositivePercent = c(40.10554, 67.65609, 80.46796, 71.34831, 
    45.36082, 68.82591, 46.78068, 63.85787, 47.20497, 32.11753, 
    63.7, 39.2, 82.76553, 88.78613, 72.18274, 72.43187, 31.0089, 
    38.50932, 38.9, 19.9, 84.26854, 29.4382, 58.13953, 86.9281, 
    64.54965, 56, 0, 56.61914, 58.82353, 54.98891, 78.21682, 
    90, 64.3002, 85.8, 51.625, 67.71894, 92.21557, 53.84615, 
    40.12158, 68.08081), NegativePercent = c(11.34565, 21.28966, 
    6.408952, 13.10861, 26.80412, 17.10526, 18.61167, 10.55838, 
    46.48033, 56.231, 9.9, 12.1, 9.018036, 6.473988, 13.90863, 
    16.77149, 63.20475, 42.54658, 40.9, 5.4, 3.907816, 2.022472, 
    10.51567, 3.267974, 15.12702, 15.3, 100, 18.12627, 11.76471, 
    13.41463, 5.775076, 10, 20.08114, 2.1, 5.5, 7.739308, 0, 
    34.61538, 12.86727, 10.70707), Result = structure(c(2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Flop", "Hit"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-40L))

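Comments:

Can you include a sample of your data using the dput(car_df) command?
I have edited the question; please find it above, @user10626943.
That is a warning, not an error. Does your tree fit?
No, it does not fit, @desertnaut, but not all of my features are "factors". Of my 4 columns, two are factors and two are num. Could that be a reason?
I converted them to factors but still get the same error message.

Answer 1: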
> str(car_df)
'data.frame':   40 obs. of  4 variables:
 $ MovieName      : Factor w/ 40 levels "#96Movie","#alphamovie",..: 20 5 31 26 27 12 36 29 38 4 ...
 $ PositivePercent: num  40.1 67.7 80.5 71.3 45.4 ...
 $ NegativePercent: num  11.35 21.29 6.41 13.11 26.8 ...
 $ Result         : Factor w/ 2 levels "Flop","Hit": 2 2 2 2 2 2 2 2 2 1 ...

> with(car_df, table( Result))
Result
Flop  Hit 
   5   35 

 > dtree_fit
CART 

29 samples
 3 predictor
 2 classes: 'Flop', 'Hit' 

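So you have 5 Flop outcomes, and one of your predictors is a variable with 40 distinct values. Given that every one of your cases is unique and your outcome is severely imbalanced, the result does not seem surprising. The existence of data does not guarantee that substantive conclusions can be drawn from it. If there is any error here, it is the lack of code in the fitter that would say: "Really? You think a statistics package should be able to solve a severe data deficiency?"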
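Both points (the class imbalance and the one-level-per-row predictor) can be checked in a few lines of base R. This is a minimal sketch, not the asker's actual data: `df` is a hypothetical stand-in with the same shape as `car_df` (40 rows, 5 Flop and 35 Hit), and dropping the identifier column is one option rather than a guaranteed fix.

```r
# Hypothetical stand-in with the same shape as car_df (values are made up)
set.seed(1)
df <- data.frame(
  MovieName       = factor(paste0("#Movie", 1:40)),
  PositivePercent = runif(40, 0, 100),
  NegativePercent = runif(40, 0, 100),
  Result          = factor(rep(c("Flop", "Hit"), times = c(5, 35)))
)

# Class balance of the outcome
class_counts <- table(df$Result)
print(class_counts)

# A factor with one level per row behaves like a row identifier,
# so it cannot generalise to unseen movies
is_id <- vapply(
  df,
  function(col) is.factor(col) && nlevels(col) == nrow(df),
  logical(1)
)
id_columns <- names(df)[is_id]
print(id_columns)

# Keep only the columns that are not identifiers before modelling
model_df <- df[, !is_id, drop = FALSE]
str(model_df)
```

With the formula `Result ~ .`, every column left in the data frame becomes a predictor, which is why the identifier has to be removed from the data (or excluded in the formula) beforehand.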

By the way, it should be (although, not surprisingly, this does not clear the warning):

(split = "information")

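If you change the number of cross-validation folds to a number that allows the flops to be distributed across all of the folds, you can get a result without the warning. Given the small sample size, it remains doubtful whether that result has much validity: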

# 3 folds instead of 10, so the few Flop rows can reach every fold
trctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 3)
set.seed(3333)
dtree_fit <- train(Result ~., data = training, method = "rpart",
                   parms = list(split = "information"),  # spelling corrected
                   trControl = trctrl,
                   tuneLength = 10)
# no warning on one of my runs
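The effect of the fold count can be illustrated without caret. The sketch below assumes the 29-row training split kept 4 of the 5 flops, which is what a stratified 70% split would be expected to give. `assign_folds` is a simplified, hypothetical stand-in for stratified fold assignment and is not caret's exact algorithm.

```r
# Assumed training labels: 4 Flop and 25 Hit (29 rows)
y <- factor(rep(c("Flop", "Hit"), times = c(4, 25)))

# Simplified stratified assignment: spread each class as evenly as
# possible over k folds, in random order within the class
assign_folds <- function(y, k) {
  folds <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

# Count the held-out folds that contain no Flop at all
folds_without_flop <- function(y, k) {
  folds <- assign_folds(y, k)
  flops_per_fold <- tabulate(folds[y == "Flop"], nbins = k)
  sum(flops_per_fold == 0)
}

set.seed(3333)
empty_10 <- folds_without_flop(y, 10)  # 4 flops cannot cover 10 folds
empty_3  <- folds_without_flop(y, 3)   # 4 flops can cover 3 folds
print(c(ten_folds = empty_10, three_folds = empty_3))
```

With only 4 flops, at least 6 of 10 held-out folds contain a single class. A likely reason for the warning is that a chance-corrected metric such as Kappa is undefined on such a fold, which would explain why reducing the fold count removes it.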

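Comments:

There is no need to be mean; there are much better ways to say what you said.
So what can I do to overcome this? Please help.
Get more data, or increase the amount of flop data passed to the CV step.
Is that the reason for the warning above? The lack of flop data? @42
I ran your code, @42-, and as you said the warning did not pop up. But the output is: Accuracy 0.862963, Kappa 0, and "Tuning parameter 'cp' was held constant at a value of 0".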
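The figures in that last comment are consistent with a tree that never splits and always predicts the majority class, although the thread does not confirm this. A minimal sketch, again assuming 4 Flop and 25 Hit rows in the training data, computes accuracy and Cohen's kappa by hand for such a predictor:

```r
# Assumed observed labels: 4 Flop and 25 Hit
obs  <- factor(rep(c("Flop", "Hit"), times = c(4, 25)))

# A model that always predicts the majority class
pred <- factor(rep("Hit", length(obs)), levels = levels(obs))

# Confusion matrix: rows are predictions, columns are observations
tab <- table(pred, obs)

# Observed agreement (accuracy) and agreement expected by chance
po <- sum(diag(tab)) / sum(tab)
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2

# Cohen's kappa: agreement beyond chance, scaled by its maximum
kappa <- (po - pe) / (1 - pe)

print(round(c(accuracy = po, kappa = kappa), 4))
```

Accuracy comes out at 25/29, about 0.862, while kappa is approximately 0 because the predictions agree with the labels no more often than chance would. A high accuracy on data this imbalanced therefore says little about whether the model learned anything.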
