Assign classes based on frequencies of classes in rows of a data.frame in R
【Posted】: 2014-06-14 12:45:38
【Question】: I ran 8 different classification models (classes "0", "1", "-1", meaning "neutral", "positive", "negative") and I am trying to combine them. In the end, the results should be added to my data.frame as additional columns. In Excel, for example, this would not be too hard, but I just don't know how to do something like this in R. First, here is my data.frame:
MAXENTROPY <- c("1","1","1","1","0","-1","-1","1","-1","0")
SVM <- c("1","1","1","1","0","-1","-1","0","-1","0")
BAGGING <- c("0","1","1","1","-1","-1","-1","1","-1","1")
LOGITBOOST <- c("0","1","1","1","0","-1","-1","1","-1","1")
NNETWORK <- c("-1","1","1","1","-1","-1","-1","1","-1","0")
FORESTS <- c("0","1","1","1","1","-1","-1","1","-1","1")
SLDA <- c("0","1","1","1","0","-1","0","1","-1","0")
TREE <- c("1","1","1","1","1","-1","-1","1","-1","0")
results.allm <- data.frame(MAXENTROPY, SVM, BAGGING,
                           LOGITBOOST, NNETWORK, FORESTS,
                           SLDA, TREE)
results.allm
#    MAXENTROPY SVM BAGGING LOGITBOOST NNETWORK FORESTS SLDA TREE
# 1           1   1       0          0       -1       0    0    1
# 2           1   1       1          1        1       1    1    1
# 3           1   1       1          1        1       1    1    1
# 4           1   1       1          1        1       1    1    1
# 5           0   0      -1          0       -1       1    0    1
# 6          -1  -1      -1         -1       -1      -1   -1   -1
# 7          -1  -1      -1         -1       -1      -1    0   -1
# 8           1   0       1          1        1       1    1    1
# 9          -1  -1      -1         -1       -1      -1   -1   -1
# 10          0   0       1          1        0       1    0    0
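I would like to add several columns based on the frequencies of the classes within each row (across columns 1-8):
Column 1: assign a class only if all columns show the same class. If not: ""
Column 2: majority vote; assign the most frequent class. If two classes share the highest frequency in a row, assign one of them with probability 0.5.
Column 3: like column 2, but if a row contains only 0s and 1s, or only 0s and -1s (like row 10), assign class 1 or -1 respectively.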
【Comments】:
Please provide the exact result you want.
【Answer 1】: Here is a way to get the first column using apply:
# Use a list of the classifier names to make sure you're only
# counting their votes
classifier.names <- names(results.allm)
# Apply over each row (MARGIN = 1)
results.allm$consensus <- apply(results.allm[classifier.names],
                                MARGIN = 1,
                                FUN = function(x)
                                  # If all elements match the first element...
                                  ifelse(all(x %in% x[1]),
                                         yes = x[1], # ... return that element.
                                         no = "") # Depending on your purpose, NA might be better
                                )
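The same check can also be phrased as "the row has exactly one distinct value". The following is only a minimal sketch of that idea; the helper name `consensus_vote` and the small data frame `votes` are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: return the shared class if every vote agrees, else "".
consensus_vote <- function(x) {
  x <- as.character(x)
  if (length(unique(x)) == 1) x[1] else ""
}

# Toy data (illustrative only, not the question's full data set)
votes <- data.frame(A = c("1", "0", "-1"),
                    B = c("1", "1", "-1"),
                    C = c("1", "0", "-1"),
                    stringsAsFactors = FALSE)

# One result per row: rows 1 and 3 are unanimous, row 2 is not
consensus <- apply(votes, MARGIN = 1, FUN = consensus_vote)
consensus
```

Writing the rule as a named function makes it easy to test on a single vector before applying it to the whole data.frame.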
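Here is one approach for your second column. I am assuming you mean a plurality vote rather than a majority vote (that is, the winning class does not need more than 50% of the votes, only the most votes).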
results.allm$plurality <- apply(results.allm[classifier.names],
                                MARGIN = 1,
                                FUN = function(x) {
                                  # Tally up the votes
                                  xtab <- table(unlist(x))
                                  # Get the classes with the most votes
                                  maxclass <- names(xtab)[xtab %in% max(xtab)]
                                  # Sample from maxclass with equal probability for each tied class
                                  sample(maxclass, size = 1)
                                })
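Because ties are broken with sample(), the plurality column can differ from run to run. A minimal sketch of how one might make that repeatable and check the tie-breaking, assuming a hypothetical helper `plurality_vote` (the name and the seed value are illustrative choices, not from the answer):

```r
# Hypothetical helper: most frequent class, ties broken uniformly at random.
plurality_vote <- function(x) {
  xtab <- table(as.character(x))
  winners <- names(xtab)[xtab == max(xtab)]
  sample(winners, size = 1)
}

set.seed(42)  # arbitrary seed, only so the random tie-breaks are repeatable

# Clear winner: "1" has two of the four votes
clear <- plurality_vote(c("1", "1", "0", "-1"))

# Two-way tie between "0" and "1": repeat the vote many times
tie <- replicate(200, plurality_vote(c("1", "1", "0", "0")))
table(tie)  # both classes should show up, roughly equally often
```

Calling set.seed() once before building the column is usually enough if the goal is simply to get the same result on every run of the script.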
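Here is a rough attempt at your third column. Basically, I check whether the row consists entirely of 0s and 1s; if so, I return 1.
If not, I check whether it consists entirely of 0s and -1s; if so, I return -1.
Otherwise, the function returns the same result as the second approach above. A row made up only of 0s also falls through to that rule and stays 0.
results.allm$third <- apply(results.allm[classifier.names],
                            MARGIN = 1,
                            FUN = function(x) {
                              # Tally up the votes
                              xtab <- table(unlist(x))
                              classes <- names(xtab)
                              # If the result sets are (0, 1) or (0, -1), return the non-zero class.
                              # if/else is used instead of ifelse(): ifelse() returns a single
                              # element here, which would silently drop tied classes.
                              maxclass <- if (all(classes %in% c("0", "1")) && "1" %in% classes) {
                                "1"
                              } else if (all(classes %in% c("0", "-1")) && "-1" %in% classes) {
                                "-1"
                              } else {
                                classes[xtab %in% max(xtab)]
                              }
                              # Sample from maxclass with equal probability for each tied class
                              sample(maxclass, size = 1)
                            })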
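The third rule can also be read as "if exactly one non-zero class appears in the row, pick it; otherwise fall back to the plurality vote". The sketch below expresses it that way on a single vector of votes. The helper name `nonzero_pref_vote` is hypothetical, and treating an all-0 row as class 0 is an assumption about the intended behaviour rather than something the question states.

```r
# Hypothetical helper implementing the third rule for one row of votes.
nonzero_pref_vote <- function(x) {
  x <- as.character(x)
  nonzero <- setdiff(unique(x), "0")
  # Exactly one non-zero class present (alone or mixed with "0"): pick it.
  if (length(nonzero) == 1) return(nonzero)
  # Otherwise fall back to the plurality rule with random tie-breaking.
  xtab <- table(x)
  winners <- names(xtab)[xtab == max(xtab)]
  sample(winners, size = 1)
}

# Row 10 of the question: only 0s and 1s, so the non-zero class wins
r10 <- nonzero_pref_vote(c("0", "0", "1", "1", "0", "1", "0", "0"))

# No non-zero class at all: the plurality rule applies
all_zero <- nonzero_pref_vote(c("0", "0", "0"))

# Both non-zero classes present: the plurality rule applies
mixed <- nonzero_pref_vote(c("1", "-1", "-1", "0"))
```

To build the column, the helper could be passed to apply() with MARGIN = 1 in the same way as the functions above.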
None of the code above has been checked for how it behaves in the presence of NAs, so beware if any of your classifiers can produce NAs!
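One possible way to handle missing votes, shown only as a sketch: drop the NAs before tallying and return NA when a row has no usable votes at all. Whether missing votes should be ignored or should count against a class is a modelling decision; the helper name `plurality_vote_na` and the choice to ignore NAs are assumptions made for illustration.

```r
# Hypothetical NA-aware variant of the plurality vote.
plurality_vote_na <- function(x) {
  # Drop missing votes, then work with character labels
  x <- as.character(x[!is.na(x)])
  # No usable votes in this row: report that explicitly
  if (length(x) == 0) return(NA_character_)
  xtab <- table(x)
  winners <- names(xtab)[xtab == max(xtab)]
  sample(winners, size = 1)
}

with_gap <- plurality_vote_na(c("1", NA, "1", "0"))  # NA is ignored
no_votes <- plurality_vote_na(c(NA, NA))             # nothing to tally
```

Checking the length explicitly avoids calling sample() on an empty set of winners, which would raise an error.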