在一个条件下保持一行数据框
Posted
技术标签:
【中文标题】在一个条件下保持一行数据框【英文标题】:Keep one row of dataframe under a condition 【发布时间】:2019-03-30 11:46:24 【问题描述】:快速提问——
我有一个包含一些重复项的数据框,我想仅在type == 'c1'
时删除它们。因此,例如,我只想为每个 id
保留 一个 行 type == 'c1'
,有没有办法用 dplyr 做到这一点?我正要使用case_when
,但转了一圈。
sample_df <- data.frame(id = c(14129, 14129, 14129, 29102, 29102, 2191, 2191, 2191, 2191, 2192, 2192, 1912, 1912, 1912)
, date = c("2018-06-15 00:15:42","2018-10-08 12:44:44",
"2018-07-09 18:14:58", "2018-06-15 00:15:40",
"2018-06-15 00:19:42", "2018-10-15 08:17:47",
"2018-09-29 10:16:34", "2018-07-09 18:28:25",
"2018-07-09 18:28:25", "2018-07-09 18:20:32",
"2018-08-30 13:06:45", "2018-10-08 11:32:55",
"2018-10-05 11:32:55", "2018-10-08 09:09:56")
, color = c("blue", "blue", "green", "red", "red", "red", "green", "blue", "green", "purple", "blue", "blue", "red", "red")
, day = rep("c1", times = 14)
, happy = c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1))
sample_df$date <- as.POSIXct(sample_df$date)
sample_df_2 <- sample_df %>%
gather(key, type, color:day) %>%
mutate(happy = case_when(key == "color" ~ 0, TRUE ~ as.numeric(happy))) %>%
select(-key) %>%
arrange(id)
> sample_df_2
id date happy type
1 1912 2018-10-08 11:32:55 0 blue
2 1912 2018-10-05 11:32:55 0 red
3 1912 2018-10-08 09:09:56 0 red
4 1912 2018-10-08 11:32:55 0 c1
5 1912 2018-10-05 11:32:55 0 c1
6 1912 2018-10-08 09:09:56 1 c1
7 2191 2018-10-15 08:17:47 0 red
8 2191 2018-09-29 10:16:34 0 green
9 2191 2018-07-09 18:28:25 0 blue
10 2191 2018-07-09 18:28:25 0 green
11 2191 2018-10-15 08:17:47 1 c1
12 2191 2018-09-29 10:16:34 0 c1
13 2191 2018-07-09 18:28:25 1 c1
14 2191 2018-07-09 18:28:25 0 c1
15 2192 2018-07-09 18:20:32 0 purple
16 2192 2018-08-30 13:06:45 0 blue
17 2192 2018-07-09 18:20:32 0 c1
18 2192 2018-08-30 13:06:45 1 c1
19 14129 2018-06-15 00:15:42 0 blue
20 14129 2018-10-08 12:44:44 0 blue
21 14129 2018-07-09 18:14:58 0 green
22 14129 2018-06-15 00:15:42 1 c1
23 14129 2018-10-08 12:44:44 0 c1
24 14129 2018-07-09 18:14:58 0 c1
25 29102 2018-06-15 00:15:40 0 red
26 29102 2018-06-15 00:19:42 0 red
27 29102 2018-06-15 00:15:40 0 c1
28 29102 2018-06-15 00:19:42 1 c1
想要的输出——
id date happy type
1 1912 2018-10-08 11:32:55 0 blue
2 1912 2018-10-05 11:32:55 0 red
3 1912 2018-10-08 09:09:56 0 red
4 1912 2018-10-08 11:32:55 0 c1
7 2191 2018-10-15 08:17:47 0 red
8 2191 2018-09-29 10:16:34 0 green
9 2191 2018-07-09 18:28:25 0 blue
10 2191 2018-07-09 18:28:25 0 green
11 2191 2018-10-15 08:17:47 1 c1
15 2192 2018-07-09 18:20:32 0 purple
16 2192 2018-08-30 13:06:45 0 blue
17 2192 2018-07-09 18:20:32 0 c1
19 14129 2018-06-15 00:15:42 0 blue
20 14129 2018-10-08 12:44:44 0 blue
21 14129 2018-07-09 18:14:58 0 green
22 14129 2018-06-15 00:15:42 1 c1
25 29102 2018-06-15 00:15:40 0 red
26 29102 2018-06-15 00:19:42 0 red
27 29102 2018-06-15 00:15:40 0 c1
【问题讨论】:
anyDuplicated(sample_df)
和 anyDuplicated(sample_df_2)
都表示您的数据中没有重复项。您的意思是在您的“重复”声明中忽略 date
和 happy
?
【参考方案1】:
基础R
sample_df_2[ !duplicated(sample_df_2[c("id","type")]) | sample_df_2$type != "c1", ]
# id date happy type
# 1 1912 2018-10-08 11:32:55 0 blue
# 2 1912 2018-10-05 11:32:55 0 red
# 3 1912 2018-10-08 09:09:56 0 red
# 4 1912 2018-10-08 11:32:55 0 c1
# 7 2191 2018-10-15 08:17:47 0 red
# 8 2191 2018-09-29 10:16:34 0 green
# 9 2191 2018-07-09 18:28:25 0 blue
# 10 2191 2018-07-09 18:28:25 0 green
# 11 2191 2018-10-15 08:17:47 1 c1
# 15 2192 2018-07-09 18:20:32 0 purple
# 16 2192 2018-08-30 13:06:45 0 blue
# 17 2192 2018-07-09 18:20:32 0 c1
# 19 14129 2018-06-15 00:15:42 0 blue
# 20 14129 2018-10-08 12:44:44 0 blue
# 21 14129 2018-07-09 18:14:58 0 green
# 22 14129 2018-06-15 00:15:42 1 c1
# 25 29102 2018-06-15 00:15:40 0 red
# 26 29102 2018-06-15 00:19:42 0 red
# 27 29102 2018-06-15 00:15:40 0 c1
Tidyverse:
library(dplyr)
sample_df_2 %>%
filter(!duplicated(cbind(id,type)) | type != "c1")
# id date happy type
# 1 1912 2018-10-08 11:32:55 0 blue
# 2 1912 2018-10-05 11:32:55 0 red
# 3 1912 2018-10-08 09:09:56 0 red
# 4 1912 2018-10-08 11:32:55 0 c1
# 5 2191 2018-10-15 08:17:47 0 red
# 6 2191 2018-09-29 10:16:34 0 green
# 7 2191 2018-07-09 18:28:25 0 blue
# 8 2191 2018-07-09 18:28:25 0 green
# 9 2191 2018-10-15 08:17:47 1 c1
# 10 2192 2018-07-09 18:20:32 0 purple
# 11 2192 2018-08-30 13:06:45 0 blue
# 12 2192 2018-07-09 18:20:32 0 c1
# 13 14129 2018-06-15 00:15:42 0 blue
# 14 14129 2018-10-08 12:44:44 0 blue
# 15 14129 2018-07-09 18:14:58 0 green
# 16 14129 2018-06-15 00:15:42 1 c1
# 17 29102 2018-06-15 00:15:40 0 red
# 18 29102 2018-06-15 00:19:42 0 red
# 19 29102 2018-06-15 00:15:40 0 c1
【讨论】:
【参考方案2】:使用dplyr
:
sample_df_2 %>%
group_by(id) %>%
filter(!duplicated(type) | type!="c1")
【讨论】:
以上是关于在一个条件下保持一行数据框的主要内容,如果未能解决你的问题,请参考以下文章
在熊猫多索引数据框中返回满足逻辑索引条件的每个组的最后一行[重复]
为 pyspark 数据帧的每一行评估多个 if elif 条件
比较两个(py)spark sql数据框并在保持连接列的同时有条件地选择列数据