Creating a new column conditionally based on previous n rows
【Posted】: 2020-03-16 18:01:30
【Question】: I have a data frame set up as follows:
df <- data.frame("id" = c(111,111,111,222,222,222,222,333,333,333,333),
"Location" = c("A","B","A","A","C","B","A","B","A","A","A"),
"Encounter" = c(1,2,3,1,2,3,4,1,2,3,4))
    id Location Encounter
1  111        A         1
2  111        B         2
3  111        A         3
4  222        A         1
5  222        C         2
6  222        B         3
7  222        A         4
8  333        B         1
9  333        A         2
10 333        A         3
11 333        A         4
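I'm basically trying to create a binary flag, within each id group, that marks whether the Location has already appeared in a previous encounter. So it would look like: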
    id Location Encounter Flag
1  111        A         1    0
2  111        B         2    0
3  111        A         3    1
4  222        A         1    0
5  222        C         2    0
6  222        B         3    0
7  222        A         4    1
8  333        B         1    0
9  333        A         2    0
10 333        A         3    1
11 333        A         4    1
I was trying to figure out how to write an if statement such as:
library(dplyr)
df$Flag <- case_when(
  (df$id - lag(df$id)) == 0 ~
    case_when(df$Location == lag(df$Location, 1) |
              df$Location == lag(df$Location, 2) |
              df$Location == lag(df$Location, 3) ~ 1, T ~ 0),
  T ~ 0)
    id Location Flag
1  111        A    0
2  111        B    0
3  111        A    1
4  222        A    0
5  222        C    0
6  222        B    0
7  222        A    1
8  333        B    0
9  333        A    1
10 333        A    1
11 333        A    1
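But this has the problem that row 9 is incorrectly assigned a 1, and the real data has cases with 15 or more encounters, so this approach becomes very cumbersome. I was hoping to find a way to do something like
lag(df$Location, 1:df$Encounter)
but I know that lag() needs a single integer for k, so that particular command doesn't work.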
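One way to read the lag(df$Location, 1:df$Encounter) idea is "compare each row with every earlier row of the same id". The sketch below does that in base R with a hypothetical helper, seen_before(), which is not part of any package; it assumes the rows are already ordered by Encounter within each id. Splitting by id first also avoids the row 9 problem, where the ungrouped lag of 2 reaches back into a row belonging to id 222.

```r
df <- data.frame(
  id = c(111, 111, 111, 222, 222, 222, 222, 333, 333, 333, 333),
  Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
  Encounter = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4)
)

# Hypothetical helper (name chosen for illustration): for each position i,
# return 1 if x[i] already occurs among x[1], ..., x[i - 1], otherwise 0.
seen_before <- function(x) {
  x <- as.character(x)
  vapply(
    seq_along(x),
    function(i) as.integer(x[i] %in% x[seq_len(i - 1)]),
    integer(1)
  )
}

# Apply the helper separately to each id, then restore the original row order.
df$Flag <- unsplit(lapply(split(df$Location, df$id), seen_before), df$id)
df
```

The look-back is quadratic in the size of each group, which should be fine for a few dozen encounters per id.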
【Answer 1】: You can also do it like this:
library(data.table)
setDT(df)[,flag:=ifelse(1:.N>1,1,0),by=.(id,Location)]
【Answer 2】: A more general data.table solution is to use .N or rowid:
library(data.table)
setDT(df)[, Flag := +(rowid(id, Location) > 1)][]
or
setDT(df)[, Flag := +(seq_len(.N)>1), .(id, Location)][]
#>      id Location Encounter Flag
#>  1: 111        A         1    0
#>  2: 111        B         2    0
#>  3: 111        A         3    1
#>  4: 222        A         1    0
#>  5: 222        C         2    0
#>  6: 222        B         3    0
#>  7: 222        A         4    1
#>  8: 333        B         1    0
#>  9: 333        A         2    0
#> 10: 333        A         3    1
#> 11: 333        A         4    1
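The +( ... ) wrapper in both snippets is a compact way to turn a logical vector into 0/1 integers. A small base R illustration, using a made-up logical vector in place of the real comparison:

```r
# Made-up values standing in for a comparison such as rowid(id, Location) > 1
is_repeat <- c(FALSE, FALSE, TRUE)

# Unary plus coerces logical to integer, the same result as as.integer()
flag <- +is_repeat
flag
class(flag)
identical(flag, as.integer(is_repeat))
```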
【Answer 3】: Using data.table:
library(data.table)
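# start with a 1 in every row
dt[, flag := 1]
# running count of rows within each id/Location pair
dt[, flag := cumsum(flag), by = .(id, Location)]
# keep 1 only from the second occurrence onward
dt[, flag := ifelse(flag > 1, 1, 0)]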
Data:
dt <- data.table("id" = c(111,111,111,222,222,222,222,333,333,333,333),
"Location" = c("A","B","A","A","C","B","A","B","A","A","A"),
"Encounter" = c(1,2,3,1,2,3,4,1,2,3,4))
【Answer 4】: In base R, we can use ave to group by id and Location, and set the flag to 1 from the second row of each group onward.
df$Flag <- as.integer(with(df, ave(Encounter, id, Location, FUN = seq_along) > 1))
df
#    id Location Encounter Flag
#1  111        A         1    0
#2  111        B         2    0
#3  111        A         3    1
#4  222        A         1    0
#5  222        C         2    0
#6  222        B         3    0
#7  222        A         4    1
#8  333        B         1    0
#9  333        A         2    0
#10 333        A         3    1
#11 333        A         4    1
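To see what the comparison with 1 operates on, it can help to look at the counter that ave() returns before it is thresholded. This sketch rebuilds the question's data and keeps the counter in a column called occurrence, a name chosen here only for illustration:

```r
df <- data.frame(
  id = c(111, 111, 111, 222, 222, 222, 222, 333, 333, 333, 333),
  Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
  Encounter = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4)
)

# Running count of each id/Location pair: 1 for the first visit, 2 for the second, ...
df$occurrence <- with(df, ave(Encounter, id, Location, FUN = seq_along))

# Any count above 1 means the location was already seen for that id.
df$Flag <- as.integer(df$occurrence > 1)
df
```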
With dplyr, that would be
library(dplyr)
df %>% group_by(id, Location) %>% mutate(Flag = as.integer(row_number() > 1))
【Answer 5】: An option with duplicated
library(dplyr)
df %>%
group_by(id) %>%
mutate(Flag = +(duplicated(Location)))
# A tibble: 11 x 4
# Groups:   id [3]
#      id Location Encounter  Flag
#   <dbl> <fct>        <dbl> <int>
# 1   111 A                1     0
# 2   111 B                2     0
# 3   111 A                3     1
# 4   222 A                1     0
# 5   222 C                2     0
# 6   222 B                3     0
# 7   222 A                4     1
# 8   333 B                1     0
# 9   333 A                2     0
#10   333 A                3     1
#11   333 A                4     1
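The same idea also works without dplyr: base R's duplicated() has a data frame method, so passing it the id and Location columns together flags every repeated pair. A minimal sketch on the question's data:

```r
df <- data.frame(
  id = c(111, 111, 111, 222, 222, 222, 222, 333, 333, 333, 333),
  Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
  Encounter = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4)
)

# A row is flagged when its (id, Location) pair already appeared in an earlier row.
df$Flag <- as.integer(duplicated(df[c("id", "Location")]))
df
```

Like the grouped version, this relies on the rows being in encounter order.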