Assign classes based on frequencies of classes in rows of a data.frame in R

Posted 2023-03-12

Title: Assign classes based on frequencies of classes in rows of a data.frame in R
Asked: 2014-06-14 12:45:38
Question:

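I ran 8 different classification models (classes "0", "1", "-1", meaning "neutral", "positive", "negative") and I am trying to combine them. In the end, the results should be added to my data.frame as additional columns. In Excel, for example, this would not be too hard, but I just don't know how to do something like this in R. First, here is my data.frame: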

MAXENTROPY <- c("1","1","1","1","0","-1","-1","1","-1","0")
SVM <- c("1","1","1","1","0","-1","-1","0","-1","0") 
BAGGING <- c("0","1","1","1","-1","-1","-1","1","-1","1")
LOGITBOOST <- c("0","1","1","1","0","-1","-1","1","-1","1")
NNETWORK <- c("-1","1","1","1","-1","-1","-1","1","-1","0")
FORESTS <- c("0","1","1","1","1","-1","-1","1","-1","1")
SLDA <- c("0","1","1","1","0","-1","0","1","-1","0")
TREE <- c("1","1","1","1","1","-1","-1","1","-1","0")

results.allm <- data.frame(MAXENTROPY,SVM,BAGGING,
                       LOGITBOOST,NNETWORK,FORESTS,
                       SLDA,TREE)

results.allm

#    MAXENTROPY SVM BAGGING LOGITBOOST NNETWORK FORESTS SLDA TREE
# 1           1   1       0          0       -1       0    0    1
# 2           1   1       1          1        1       1    1    1
# 3           1   1       1          1        1       1    1    1
# 4           1   1       1          1        1       1    1    1
# 5           0   0      -1          0       -1       1    0    1
# 6          -1  -1      -1         -1       -1      -1   -1   -1
# 7          -1  -1      -1         -1       -1      -1    0   -1
# 8           1   0       1          1        1       1    1    1
# 9          -1  -1      -1         -1       -1      -1   -1   -1
# 10          0   0       1          1        0       1    0    0

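I would like to add a few columns based on the class frequencies within each row (across the 8 classifier columns):

First column: assign a class only if all classifier columns show the same class. If not, assign "".

Second column: majority vote; assign the class with the highest frequency. If two classes share the highest frequency in a row, assign one of them with probability 0.5.

Third column: like the second column, but if a row contains only 0 and 1, or only 0 and -1 (as in row 10), assign class 1 or -1 respectively.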

Comments:

Please provide the exact result you want.

Answer 1:

Here is a way to get the first column using apply:

# Use a vector of the classifier names to make sure you're only
# counting their votes
classifier.names <- names(results.allm)

# Apply over each row (MARGIN = 1)
results.allm$consensus <- apply(results.allm[classifier.names],
                                MARGIN = 1,
                                FUN = function(x) {

    # If all elements match the first element...
    ifelse(all(x %in% x[1]),
           yes = x[1], # ... return that element.
           no = "") # Depending on your purpose, NA might be better

})
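As a cross-check on the row-wise approach, the same unanimity test can be sketched without apply by comparing every column of a character matrix against the first column. The data frame `votes` and its columns A, B, C are made up for illustration and are not the poster's data.

```r
# Toy data: three hypothetical classifiers, four rows
votes <- data.frame(A = c("1", "0", "-1", "1"),
                    B = c("1", "0", "-1", "0"),
                    C = c("1", "1", "-1", "0"),
                    stringsAsFactors = FALSE)

m <- as.matrix(votes)

# m == m[, 1] recycles the first column down every column, so each cell
# is compared with the first vote in its own row
unanimous <- rowSums(m == m[, 1]) == ncol(m)

# Keep the shared class for unanimous rows, "" otherwise
consensus <- ifelse(unanimous, m[, 1], "")
consensus  # rows 1 and 3 are unanimous: "1", "", "-1", ""
```

This avoids calling a function once per row, which may matter on large data frames, but it assumes there are no NAs in the votes.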

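Here is one approach for your second column. I am assuming you mean a plurality vote rather than a majority vote (i.e., the winning class does not need more than 50% of the votes, just the most).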

results.allm$plurality <- apply(results.allm[classifier.names],
                                MARGIN = 1,
                                FUN = function(x) {

    # Tally up the votes
    xtab <- table(unlist(x))

    # Get the classes with the most votes
    maxclass <- names(xtab)[xtab %in% max(xtab)]

    # Sample from maxclass with equal probability for each tied class
    sample(maxclass, size = 1)

})
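Because ties are broken at random, the plurality column can differ from run to run. The sketch below wraps the same tally in a stand-alone helper (the name `plurality_vote` is hypothetical) and fixes the random seed so tie-breaks are repeatable. The seed value is arbitrary.

```r
# Hypothetical helper: plurality vote over one row of class labels
plurality_vote <- function(x) {
  xtab <- table(x)
  winners <- names(xtab)[xtab == max(xtab)]
  # Index with sample.int() so a single winner is returned as-is; this
  # also sidesteps sample()'s special case for a single numeric value
  winners[sample.int(length(winners), 1)]
}

set.seed(42)  # arbitrary seed, only so that tie-breaks are repeatable

plurality_vote(c("1", "1", "0"))       # clear winner: "1"
plurality_vote(c("1", "1", "0", "0"))  # tie: either "0" or "1"
```

Calling set.seed once before the apply call in the answer should make its random tie-breaks reproducible in the same way.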

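Here is a rough attempt at your third column. Basically, I check whether the row is made up only of 0s and 1s; if so, I return 1.

If not, I check whether it is made up only of 0s and -1s; if so, I return -1.

Otherwise, the function returns the same result as the second approach above.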

results.allm$third <- apply(results.allm[classifier.names],
                            MARGIN = 1,
                            FUN = function(x) {

    # Tally up the votes
    xtab <- table(unlist(x))

    # If the row contains exactly the classes (0, 1) or (0, -1), return
    # the non-zero class. if/else is used instead of ifelse(), which
    # would keep only the first of several tied classes. setequal() keeps
    # an all-0 row from being relabelled as 1 or -1.
    maxclass <- if (setequal(names(xtab), c("0", "1"))) {
        "1"
    } else if (setequal(names(xtab), c("0", "-1"))) {
        "-1"
    } else {
        names(xtab)[xtab %in% max(xtab)]
    }

    # Sample from maxclass with equal probability for each tied class
    sample(maxclass, size = 1)

})
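To see where the third rule departs from a plain plurality vote, the sketch below applies one reading of it to three rows copied from results.allm. The helper name `third_rule` is hypothetical, and treating "only 0 and 1" as "exactly the classes 0 and 1 are present" is an assumption about what the question intends.

```r
# Hypothetical helper implementing one reading of the third rule
third_rule <- function(x) {
  classes <- unique(x)
  if (setequal(classes, c("0", "1")))  return("1")
  if (setequal(classes, c("0", "-1"))) return("-1")
  # Otherwise fall back to a plurality vote with random tie-breaking
  xtab <- table(x)
  winners <- names(xtab)[xtab == max(xtab)]
  winners[sample.int(length(winners), 1)]
}

row1  <- c("1", "1", "0", "0", "-1", "0", "0", "1")        # all three classes
row7  <- c("-1", "-1", "-1", "-1", "-1", "-1", "0", "-1")  # only -1 and 0
row10 <- c("0", "0", "1", "1", "0", "1", "0", "0")         # only 0 and 1

names(which.max(table(row10)))  # plain plurality picks "0" (5 votes to 3)
third_rule(row10)               # the third rule overrides this with "1"
third_rule(row7)                # "-1"
third_rule(row1)                # falls back to plurality: "0"
```

Row 10 is the case the question singles out: neutral wins the plain vote, but the rule prefers the non-neutral class.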

None of the code above has been checked for how it behaves in the presence of NAs, so beware if any of your classifiers can produce NAs!
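As a starting point for that check, the sketch below shows how table() treats an NA vote and one possible way to handle it. Dropping NAs before tallying is only one option and is an assumption here, not something the answer prescribes.

```r
x <- c("1", NA, "1", "0")  # one hypothetical row with a missing vote

# table() drops NA by default, so the missing vote is silently ignored
table(x)

# useNA = "ifany" counts NA as its own class instead
table(x, useNA = "ifany")

# One option: drop NAs explicitly, then tally the remaining votes
valid <- x[!is.na(x)]
xtab <- table(valid)
names(xtab)[xtab == max(xtab)]  # "1"
```

Note that %in% never returns NA, so in the consensus check a row with a missing vote is simply treated as non-unanimous, while a row made up entirely of NAs would count as unanimous.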
