How to aggregate duplicate rows with multiple columns in a data frame [duplicate]

Posted 2023-02-19

Original title: How to aggregate duplicate rows with multiple columns in data frame [duplicate]
Asked: 2017-10-29 01:03:14

Question:

I have a data.frame that looks like this (but with many more columns and rows):

    Gene      Cell1    Cell2    Cell3     
1      A          2        7        8 
2      A          5        2        9 
3      B          2        7        8
4      C          1        4        3

I want to sum the rows that have the same value in Gene, to get a result like this:

    Gene      Cell1    Cell2    Cell3     
1      A          7        9       17  
2      B          2        7        8
3      C          1        4        3

Based on answers to earlier questions, I tried using aggregate, but I could not work out how to get the result above. Here is what I tried:

aggregate(df[,-1], list(df[,1]), FUN = sum)

Does anyone know what I am doing wrong?

Comments:

What is wrong with the result of aggregate?

Answer 1:

Or with dplyr:

library(dplyr)
df %>%
  group_by(Gene) %>%
  summarise_all(sum) %>%
  data.frame() -> newdf # store the result as a plain data.frame so that newdf can be reused later
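
For reference, a self-contained sketch of the same idea. It assumes dplyr 1.0.0 or later, where `across()` is the recommended replacement for `summarise_all()`; the sample data is copied from the question, and the name `newdf` simply follows the answer above.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  Gene  = c("A", "A", "B", "C"),
  Cell1 = c(2, 5, 2, 1),
  Cell2 = c(7, 2, 7, 4),
  Cell3 = c(8, 9, 8, 3)
)

# Sum every non-grouping column within each Gene.
# across(everything(), sum) skips the grouping column automatically.
newdf <- df %>%
  group_by(Gene) %>%
  summarise(across(everything(), sum)) %>%
  as.data.frame()

print(newdf)
```

As with the original answer, no column names need to be listed, so this should keep working when the data has many more Cell columns.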

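Comments:

The other approaches work too, but this one is more robust and more intuitive. I like that you do not have to declare which columns to sum.

Answer 2: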

aggregate(df[,-1], list(Gene=df[,1]), FUN = sum)
#   Gene Cell1 Cell2 Cell3
# 1    A     7     9    17
# 2    B     2     7     8
# 3    C     1     4     3

will give you the output you are looking for.

Comments:

There is an error when we run the above: Error in aggregate.data.frame(df[, -1], list(Gene = df[, 1]), FUN = sum) : arguments must have same length

@ManojKumar Please add the output of str(df) to your post.

Sure @lukeA, here it is:
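
A reproducible base R sketch of the same aggregation, using the sample data from the question. The formula interface and `rowsum()` are alternatives that are not part of the answer above; the names `res` and `sums` are only illustrative.

```r
# Sample data copied from the question
df <- data.frame(
  Gene  = c("A", "A", "B", "C"),
  Cell1 = c(2, 5, 2, 1),
  Cell2 = c(7, 2, 7, 4),
  Cell3 = c(8, 9, 8, 3)
)

# Formula interface: the dot means "all columns other than Gene"
res <- aggregate(. ~ Gene, data = df, FUN = sum)
print(res)

# rowsum() is a base R function specialised for grouped sums;
# the groups end up in the row names rather than in a column
sums <- rowsum(df[, -1], group = df$Gene)
print(sums)
```

One difference to keep in mind: the formula interface drops rows containing NA by default (na.action = na.omit), whereas the list-based call in the answer passes them on to sum.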

Classes ‘data.table’ and 'data.frame':	4 obs. of  4 variables:
 $ Gene : chr  "A" "A" "B" "C"
 $ Cell1: int  2 5 2 1
 $ Cell2: int  7 2 7 4
 $ Cell3: int  8 9 8 3
 - attr(*, ".internal.selfref")=<externalptr>

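@ManojKumar Thanks. You have a data.table object; indexing works a bit differently there. So you could, for example, do aggregate(df[,-1], list(Gene=df[[1]]), FUN = sum). But if you have a data.table, you may want to take advantage of its strengths for aggregating data: df[, lapply(.SD, sum), by=Gene].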
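
A small sketch of what this comment describes, assuming a reasonably recent version of the data.table package. The object is named `dt` here only to keep it apart from a plain data.frame; the values match the str() output above.

```r
library(data.table)

# Same data as in the str() output above
dt <- data.table(
  Gene  = c("A", "A", "B", "C"),
  Cell1 = c(2L, 5L, 2L, 1L),
  Cell2 = c(7L, 2L, 7L, 4L),
  Cell3 = c(8L, 9L, 8L, 3L)
)

# dt[, 1] stays a one-column data.table, so its length is 1 (the column
# count), not 4; that is likely why aggregate() reports
# "arguments must have same length". dt[[1]] is a plain vector.
col_as_table  <- dt[, 1]
col_as_vector <- dt[[1]]

# Base R aggregate with the vector-style extraction
res_base <- aggregate(dt[, -1], list(Gene = col_as_vector), FUN = sum)

# data.table's own grouped aggregation: .SD holds all non-grouping columns
res_dt <- dt[, lapply(.SD, sum), by = Gene]
print(res_dt)
```

Both calls should give the same sums; the data.table form avoids the extraction issue altogether and tends to scale better on large tables.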
