Remove "duplicated" rows from data frame (they differ in few columns) [duplicate]
【Posted】: 2015-09-18 19:29:21
【Question】: I have a problem similar to this question: Remove duplicate rows in R
In my case I want to keep all columns (unlike the suggestion there to use the unique function on the first 3 columns). I want to consider only 2 columns of the data frame and keep just 1 row whenever the values in those two columns are the same.
The data looks like this:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, NA, NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor")), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors"), row.names = c(NA, -20L), class = "data.frame")
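The columns that matter to me are P1 and P2. Among the rows that have the same fruit/vegetable, I want to keep only one. (Keep in mind that the fruit/vegetable must match in both columns.)
Example:
Before: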
P1 P2 P1_location_subacon P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1 Apple Orange <NA> Table,Shelf,Cupboard,Bed,Fridge Fridge Table,Shelf,Fridge
2 Apple Orange <NA> Table,Shelf,Cupboard,Bed,Fridge Fridge Table,Shelf,Fridge
3 Orange Lemon Fridge Table,Shelf,Fridge Fridge Shelf,Fridge,Bed
4 Orange Lemon Fridge Table,Shelf,Fridge Fridge Shelf,Fridge,Bed
5 Tomato Potato Fridge Table,Shelf,Fridge <NA> Shelf,Fridge
6 Tomato Potato Fridge Table,Shelf,Fridge <NA> Shelf,Fridge
7 Tomato Potato Fridge Table,Shelf,Fridge <NA> Shelf,Fridge
8 Tomato Potato Fridge Table,Shelf,Fridge <NA> Shelf,Fridge
After:
P1 P2 P1_location_subacon P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1 Apple Orange <NA> Table,Shelf,Cupboard,Bed,Fridge Fridge Table,Shelf,Fridge
4 Orange Lemon Fridge Table,Shelf,Fridge Fridge Shelf,Fridge,Bed
5 Tomato Potato Fridge Table,Shelf,Fridge <NA> Shelf,Fridge
It doesn't matter which row is kept. It can be chosen at random.
【Comments】:
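Same question from yesterday: ***.com/questions/31148152/…
【Answer 1】:
Just use duplicated() on the subset of columns that you want to be unique, and use the result to subset the main data.frame. For example: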
dd[ !duplicated(dd[,c("P1","P2")]) , ]
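To make the mechanics of that one-liner concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its `note` column are hypothetical and only stand in for the extra columns of the question's data.

```r
# Toy data: rows 1-2 and rows 3-4 share the same (P1, P2) pair
toy <- data.frame(
  P1   = c("Apple", "Apple", "Orange", "Orange", "Tomato"),
  P2   = c("Orange", "Orange", "Lemon", "Lemon", "Potato"),
  note = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# duplicated() marks every row whose (P1, P2) pair has already been seen
is_dup <- duplicated(toy[, c("P1", "P2")])

# Negating the mask keeps the first row of each pair, with all columns intact
deduped <- toy[!is_dup, ]
print(deduped)
```

Because the logical mask is computed only from P1 and P2, the remaining columns play no part in deciding which rows count as duplicates, but they are all retained in the result.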
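【Comments】:
Or unique with the by option, i.e. library(data.table); unique(dd, by = c('P1', 'P2'))
@akrun Does dd have to be a data.table for that to work, or does data.table also modify the standard unique() function for data.frames?
It did work with the dput data. I am using the development version of data.table.
Yes, I didn't expect it to work, but I guess it is generic.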
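The dplyr analogue is dd %>% distinct(P1, P2, .keep_all = TRUE), in case you want to cover all the answers equivalent to this one (recent dplyr versions need .keep_all = TRUE to retain the other columns).
【Answer 2】: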
If dt is your data frame:
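library(data.table)
setDT(dt)  # convert dt to a data.table by reference
# Flag is 0 for the first row of each (P1, P2) group; keep only those rows
dtFiltered = dt[,
   Flag := .I - min(.I),
   list(P1,P2)
][
   Flag == 0
]
# drop the helper column again
dtFiltered = dtFiltered[,
   Flag := NULL
]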
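Thanks to Frank for pointing out that I had missed P2.
【Comments】:
Oops, yes. Thanks. Corrected.
Oh, now I see how it works. Instead of flagging, you could extract the rows directly, e.g. dt[ dt[,head(.I,1),by=.(P1,P2)]$V1 ], though I would just use unique from data.table (mentioned by akrun above).
【Answer 3】: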
Try this:
dat <- dat[!duplicated(dat[1:2]), ]
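The question notes that the surviving row may be chosen at random, whereas duplicated() always keeps the first occurrence. One way to get a random pick, sketched here under the assumption that shuffling the rows first is acceptable, is shown below. The data frame `toy` and its `note` column are hypothetical, and it relies on the default integer-like row names.

```r
set.seed(42)  # only to make the shuffle reproducible

# Toy data: rows 1-2 and rows 3-4 share the same (P1, P2) pair
toy <- data.frame(
  P1   = c("Apple", "Apple", "Orange", "Orange", "Tomato"),
  P2   = c("Orange", "Orange", "Lemon", "Lemon", "Potato"),
  note = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Shuffle the row order, then keep the first occurrence of each (P1, P2) pair
shuffled <- toy[sample(nrow(toy)), ]
picked   <- shuffled[!duplicated(shuffled[c("P1", "P2")]), ]

# Restore the original ordering using the preserved row names
picked <- picked[order(as.integer(rownames(picked))), ]
print(picked)
```

If a deterministic "keep the last row" rule is enough, passing fromLast = TRUE to duplicated() avoids the shuffle altogether.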
【Comments】: