Assign classes based on frequencies of classes in rows of a data.frame in R

[Posted]: 2014-06-14 12:45:38

[Question]:

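I ran 8 different classification models (classes "0", "1", "-1" meaning "neutral", "positive", "negative") and I am trying to combine them. In the end, the results should be added to my data.frame as additional columns. In Excel, for example, this would not be too hard, but I just don't know how to do something like this in R. First of all, here is my data.frame: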

MAXENTROPY <- c("1","1","1","1","0","-1","-1","1","-1","0")
SVM <- c("1","1","1","1","0","-1","-1","0","-1","0") 
BAGGING <- c("0","1","1","1","-1","-1","-1","1","-1","1")
LOGITBOOST <- c("0","1","1","1","0","-1","-1","1","-1","1")
NNETWORK <- c("-1","1","1","1","-1","-1","-1","1","-1","0")
FORESTS <- c("0","1","1","1","1","-1","-1","1","-1","1")
SLDA <- c("0","1","1","1","0","-1","0","1","-1","0")
TREE <- c("1","1","1","1","1","-1","-1","1","-1","0")

results.allm <- data.frame(MAXENTROPY,SVM,BAGGING,
                       LOGITBOOST,NNETWORK,FORESTS,
                       SLDA,TREE)

results.allm

#    MAXENTROPY SVM BAGGING LOGITBOOST NNETWORK FORESTS SLDA TREE
# 1           1   1       0          0       -1       0    0    1
# 2           1   1       1          1        1       1    1    1
# 3           1   1       1          1        1       1    1    1
# 4           1   1       1          1        1       1    1    1
# 5           0   0      -1          0       -1       1    0    1
# 6          -1  -1      -1         -1       -1      -1   -1   -1
# 7          -1  -1      -1         -1       -1      -1    0   -1
# 8           1   0       1          1        1       1    1    1
# 9          -1  -1      -1         -1       -1      -1   -1   -1
# 10          0   0       1          1        0       1    0    0

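I would like to add a few columns based on the frequencies of the classes in each row (across the 8 classifier columns):

First column: assign a class only if all columns show the same class. If not, assign "".

Second column: majority vote; assign the class with the highest frequency. If two classes share the highest frequency in a row, assign one of them with probability 0.5.

Third column: like the second column, but if a row contains only 0 and 1, or only 0 and -1 (as in row 10), assign class 1 or -1 respectively.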

[Comments]:

Please provide the exact result you want.

[Answer 1]:

Here is a way to get the first column using apply:

# Use a vector of the classifier names to make sure you're only
# counting their votes
classifier.names <- names(results.allm)

# Apply over each row (MARGIN = 1)
results.allm$consensus <- apply(results.allm[classifier.names],
                                MARGIN = 1,
                                FUN = function(x) {

    # If all elements match the first element...
    if (all(x %in% x[1])) {
        x[[1]] # ... return that element.
    } else {
        "" # Depending on your purpose, NA might be better
    }

})
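As a quick sanity check of the consensus rule, here is a minimal, self-contained sketch on a small made-up data set. The helper name `consensus_vote` and the `votes` data frame are assumptions for illustration and do not come from the question; the check uses `length(unique(x)) == 1`, which is one of several equivalent ways to test that all votes agree.

```r
# Hypothetical helper (not from the answer): return the shared class,
# or "" when the classifiers disagree
consensus_vote <- function(x) {
    x <- as.character(x)
    if (length(unique(x)) == 1) x[1] else ""
}

# Small illustrative data set (made up for this example)
votes <- data.frame(A = c("1", "0", "-1"),
                    B = c("1", "1", "-1"),
                    C = c("1", "0", "-1"),
                    stringsAsFactors = FALSE)

# Row 1 is unanimous ("1"), row 2 is split, row 3 is unanimous ("-1")
votes_consensus <- apply(votes, MARGIN = 1, FUN = consensus_vote)
print(votes_consensus)
```

Rows 1 and 3 are unanimous, so they keep their class; row 2 is split, so it gets the empty string.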

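Here is one approach for your second column. I am assuming you mean plurality voting rather than majority voting (i.e., the winning class does not need more than 50% of the votes, just the most).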

results.allm$plurality <- apply(results.allm[classifier.names],
                                MARGIN = 1,
                                FUN = function(x) {

    # Tally up the votes
    xtab <- table(unlist(x))

    # Get the classes with the most votes
    maxclass <- names(xtab)[xtab %in% max(xtab)]

    # Sample from maxclass with equal probability for each tied class
    sample(maxclass, size = 1)

})
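Because ties are broken at random, repeated runs can give different results for tied rows. The sketch below shows one way to make that repeatable with `set.seed()` and to confirm that both tied classes actually get drawn. The helper name `plurality_vote`, the seed value, and the two example rows are assumptions made for illustration.

```r
# Hypothetical helper (name is an assumption): plurality vote with
# random tie-breaking among the most frequent classes
plurality_vote <- function(x) {
    xtab <- table(as.character(x))
    tied <- names(xtab)[xtab == max(xtab)]
    # Pick one of the tied classes by index, each with equal probability
    tied[sample.int(length(tied), 1)]
}

set.seed(42)  # arbitrary seed, used only to make tie-breaking repeatable

clear_row <- c("1", "1", "0", "-1")  # "1" wins outright
tied_row  <- c("1", "1", "0", "0")   # "0" and "1" are tied

clear_result <- plurality_vote(clear_row)

# Repeat the vote on the tied row to see how the ties are split
draws <- replicate(200, plurality_vote(tied_row))
print(clear_result)
print(table(draws))
```

Setting the seed once before the `apply()` call in the answer would have the same effect on the whole column.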

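Here is a rough attempt at your third column. Basically, I check whether the row consists entirely of 0s and 1s; if it does, I return 1.

If not, I check whether it consists entirely of 0s and -1s; if it does, I return -1.

Otherwise, the function returns the same result as the second approach above.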

results.allm$third <- apply(results.allm[classifier.names],
                            MARGIN = 1,
                            FUN = function(x) {

    # Tally up the votes
    xtab <- table(unlist(x))
    classes <- names(xtab)

    # If the result sets are (0, 1) or (0, -1), return the non-zero class.
    # Plain if/else is used here because ifelse() with a length-one test
    # would keep only the first of several tied classes.
    # Note: a row of all 0s also matches the first test and returns "1".
    if (all(classes %in% c("0", "1"))) {
        maxclass <- "1"
    } else if (all(classes %in% c("0", "-1"))) {
        maxclass <- "-1"
    } else {
        maxclass <- classes[xtab %in% max(xtab)]
    }

    # Sample from maxclass with equal probability for each tied class
    sample(maxclass, size = 1)

})

None of the code above has been checked for how it behaves in the presence of NAs, so if any of your classifiers might produce NAs, beware!
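As a rough illustration of one way to deal with missing votes, the sketch below drops NAs explicitly before tallying and returns NA when a row has no votes left. By default `table()` leaves NA out of the counts (`useNA = "no"`), so a row made up only of NAs would otherwise leave nothing to sample from. The helper name `plurality_vote_na`, the example data, and the choice to return `NA_character_` are all assumptions; a different policy (for example, treating NA as its own class) may suit your data better.

```r
# Hypothetical helper (not from the answer): plurality vote that ignores
# missing votes and returns NA when no votes remain
plurality_vote_na <- function(x) {
    x <- as.character(x)
    x <- x[!is.na(x)]                          # drop missing votes explicitly
    if (length(x) == 0) return(NA_character_)  # nothing left to count
    xtab <- table(x)
    tied <- names(xtab)[xtab == max(xtab)]
    tied[sample.int(length(tied), 1)]
}

# Made-up example: row 1 has one NA, row 2 has one NA, row 3 is all NA
votes_na <- data.frame(A = c("1", NA, NA),
                       B = c("1", "0", NA),
                       C = c(NA, "0", NA),
                       stringsAsFactors = FALSE)

na_result <- apply(votes_na, MARGIN = 1, FUN = plurality_vote_na)
print(na_result)
```

Rows 1 and 2 are decided by their remaining votes, and the all-NA row yields NA instead of an error.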
