Remove "duplicated" rows from data frame (they differ in few columns) [duplicate]

Posted 2023-03-29

[Published]: 2015-09-18 19:29:21

[Question]:

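I have a problem similar to this question: Remove duplicate rows in R

In my case I want to keep all columns (unlike the suggestion there to use the unique function on the first 3 columns). I want to consider only 2 columns of the data frame, and keep only 1 row whenever the values in both of those columns are the same.

The data looks like this: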

structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, NA, NA, 
    NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor")), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors"), row.names = c(NA, -20L), class = "data.frame")

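The columns that matter to me are P1 and P2. Among the rows that have the same fruit/vegetable, I want to keep only one. (Keep in mind that the fruit/vegetable must match in both columns.)

Example:

Before: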

       P1       P2 P1_location_subacon            P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
2   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
3  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
4  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
5  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
6  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
7  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
8  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge

After:

       P1       P2 P1_location_subacon            P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
4  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
5  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge

It does not matter which row is kept. It can be chosen at random.

[Comments]:

The same question from yesterday: ***.com/questions/31148152/…

[Answer 1]:

Just use duplicated() on the subset of columns that you want to be unique, and use the result to subset the main data.frame. For example:

dd[ !duplicated(dd[,c("P1","P2")]) , ]
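To make the mechanics of the one-liner above easier to see, here is a minimal sketch on a small made-up data frame. The values and the `extra` column are hypothetical and are not the asker's data; only `duplicated()` and ordinary row subsetting from base R are used.

```r
# Toy data frame (hypothetical values, not the asker's data)
dd <- data.frame(
  P1    = c("Apple", "Apple", "Orange", "Tomato", "Tomato"),
  P2    = c("Orange", "Orange", "Lemon", "Potato", "Banana"),
  extra = c("x", "y", "z", "z", "z"),
  stringsAsFactors = FALSE
)

# TRUE marks a row whose (P1, P2) pair already appeared in an earlier row
dup <- duplicated(dd[, c("P1", "P2")])
print(dup)

# Keep the rows that are not flagged; every column is preserved
result <- dd[!dup, ]
print(result)
```

Row 2 is dropped even though its `extra` value differs from row 1, because only P1 and P2 take part in the comparison. Rows 4 and 5 are both kept, since they share P1 but not P2.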

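[Comments]:

Or unique with the by option, i.e. library(data.table);unique(dd, by = c('P1', 'P2'))

@akrun Does dd have to be a data.table for that to work, or does data.table also modify the standard unique() function for data.frames?

It does work with the dput data. I am using the development version of data.table.

Yes, I did not expect it to work, but I guess it is general. The dplyr analogue is dd %>% distinct(P1,P2), in case you want to cover all the answers equivalent to this one.

[Answer 2]:

If dt is your data frame: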

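library(data.table)
setDT(dt)  # convert dt to a data.table by reference

# Flag is 0 for the first row of each (P1, P2) group; keep only those rows.
# Note that := also adds the Flag column to dt itself.
dtFiltered = dt[,
   Flag := .I - min(.I),
   by = list(P1, P2)
][
   Flag == 0
]
# Drop the helper column from the filtered result
dtFiltered = dtFiltered[,
  Flag := NULL
]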

Thanks to Frank for pointing out that I had missed P2.
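As a rough illustration of what the flagging step achieves, the sketch below picks the first row of each (P1, P2) group directly by its row index and compares that with data.table's `unique()` method and its `by` argument, which is mentioned in the comments on the first answer. The toy table and the `loc` column are made up for illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

# Toy table (hypothetical values, not the asker's data)
dt <- data.table(
  P1  = c("Apple", "Apple", "Tomato", "Tomato", "Tomato"),
  P2  = c("Orange", "Orange", "Potato", "Potato", "Banana"),
  loc = c("a", "b", "c", "d", "e")
)

# Row number of the first row in every (P1, P2) group
first_idx <- dt[, .I[1L], by = .(P1, P2)]$V1
by_index <- dt[first_idx]

# The unique() method for data.table, restricted to the key columns
by_unique <- unique(dt, by = c("P1", "P2"))

print(by_index)
print(isTRUE(all.equal(by_index, by_unique)))
```

Neither variant adds a helper column to `dt`, so there is nothing to clean up afterwards.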

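[Comments]:

Oops, yes. Thanks. Corrected.

Oh, now I see how it works. Instead of flagging, you can just extract the flagged rows, e.g. dt[ dt[,head(.I,1),by=.(P1,P2)]$V1 ]; though I would just use unique from data.table (mentioned by akrun above).

[Answer 3]:

Try this: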

dat <- dat[!duplicated(dat[1:2]), ]
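The line above selects the key columns by position, so it assumes P1 and P2 are the first two columns. The following sketch, on made-up data with a hypothetical `note` column, checks that assumption and also shows two variations using base R only: `fromLast = TRUE` keeps the last row of each group instead of the first, and shuffling the rows beforehand makes the retained row random, which the question says is acceptable.

```r
# Toy data frame (hypothetical values, not the asker's data)
dat <- data.frame(
  P1   = c("Apple", "Apple", "Orange"),
  P2   = c("Orange", "Orange", "Lemon"),
  note = c("first", "second", "only"),
  stringsAsFactors = FALSE
)

# dat[1:2] picks columns by position; here that matches picking them by name
same_cols <- identical(dat[1:2], dat[c("P1", "P2")])

# Default: keep the first row of each (P1, P2) group
keep_first <- dat[!duplicated(dat[1:2]), ]

# fromLast = TRUE scans from the bottom, so the last row of each group is kept
keep_last <- dat[!duplicated(dat[1:2], fromLast = TRUE), ]

# Shuffle the rows first so that the retained row is chosen at random
shuffled <- dat[sample(nrow(dat)), ]
keep_random <- shuffled[!duplicated(shuffled[1:2]), ]

print(keep_first)
print(keep_last)
```

Selecting by name, as in `dat[c("P1", "P2")]`, is the safer choice if the column order might change.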
