How can we check if any 2 intervals of a unique ID overlap?

Posted 2023-02-14


【Original title】: How can we check if any 2 intervals of a unique ID overlap? 【Posted】: 2021-10-12 23:05:20 【Question】:

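I have prescription data for patients on oral DM drugs, namely DPP4 and SU, and I would like to know whether a patient took the two drugs concurrently (i.e. whether any DPP4 interval overlaps any SU interval within the same patient ID).

Sample data: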

  ID DRUG      START        END
1  1 DPP4 2020-01-01 2020-01-20
2  1 DPP4 2020-03-01 2020-04-01
3  1   SU 2020-03-15 2020-04-30
4  2   SU 2020-10-01 2020-10-31
5  2 DPP4 2020-12-01 2020-12-31

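In the sample data above, patient ID == 1 took the two drugs concurrently from 2020-03-15 to 2020-04-01. Patient ID == 2 did not take them concurrently.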

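I thought of splitting the data into 2 parts, one for DPP4 and one for SU, then doing a full join and comparing every DPP4 interval with every SU interval. That may be fine for small data, but if a patient has 5 DPP4 rows and another 5 SU rows, we would make 25 comparisons, which may not be efficient. On top of that, there are more than 10,000 patients.

I'm not sure how to do this.

New data:

I would like a new df that looks like this, or anything tidy.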
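The pairwise comparison described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the intervals are treated as closed (both START and END count as days on the drug), and the names `pairs`, `hit` and `overlaps` are made up for the example. Because the join is restricted to rows sharing the same ID, the number of comparisons per patient is the product of that patient's DPP4 and SU row counts, not of the whole table.

```r
# Sample data from the question, rebuilt with readable dates.
df <- data.frame(
  ID    = c(1L, 1L, 1L, 2L, 2L),
  DRUG  = c("DPP4", "DPP4", "SU", "SU", "DPP4"),
  START = as.Date(c("2020-01-01", "2020-03-01", "2020-03-15",
                    "2020-10-01", "2020-12-01")),
  END   = as.Date(c("2020-01-20", "2020-04-01", "2020-04-30",
                    "2020-10-31", "2020-12-31"))
)

# Pair every DPP4 row with every SU row of the same patient.
pairs <- merge(
  df[df$DRUG == "DPP4", ], df[df$DRUG == "SU", ],
  by = "ID", suffixes = c(".dpp4", ".su")
)

# Assumption: closed intervals. Two closed intervals overlap when each
# one starts no later than the other one ends.
hit <- pairs$START.dpp4 <= pairs$END.su & pairs$START.su <= pairs$END.dpp4

# The common period runs from the later start to the earlier end.
overlaps <- data.frame(
  ID    = pairs$ID[hit],
  DRUG  = rep("DPP4-SU", sum(hit)),
  START = pmax(pairs$START.dpp4, pairs$START.su)[hit],
  END   = pmin(pairs$END.dpp4, pairs$END.su)[hit]
)
overlaps
```

Patients without any overlap simply do not appear in `overlaps`; a left join back onto the unique IDs would add the NA rows shown in the desired output.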

  ID    DRUG      START        END
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>

Code for the data:

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DRUG = c("DPP4", "DPP4", 
"SU", "SU", "DPP4"), START = structure(c(18262, 18322, 18336, 
18536, 18597), class = "Date"), END = structure(c(18281, 18353, 
18382, 18566, 18627), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

df_new <- structure(list(ID = 1:2, DRUG = c("DPP4-SU", NA), START = structure(c(18336, 
NA), class = "Date"), END = structure(c(18353, NA), class = "Date")), class = "data.frame", row.names = c(NA, 
-2L))

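Edit: Looking at the sample data I gave, it may seem that there can only be 1 intersecting interval per patient, but there could be more. So I think the following data illustrates the problem better.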

structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4", 
"DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004, 
17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate", 
"Date")), END = structure(c(17039, 17405, 17405, 17521, 17625, 
17669, 17711), class = c("IDate", "Date")), duration = c(35L, 
22L, 22L, 103L, 188L, 20L, 35L), INDEX = c(1L, 0L, 0L, 0L, 0L, 
0L, 0L)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", 
"data.frame"))

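【Question discussion】:

Do you have only 2 drugs? No more than 2?
Yes, only 2 drugs. But a patient may have multiple prescription intervals for each drug, and the number of prescriptions may differ between the drugs. You can see my update.

【Answer 1】:

It's much more complicated than dear @AnoushiravanR's answer, but as an alternative you could try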

library(dplyr)
library(tidyr)
library(lubridate)

df %>% 
  full_join(x = ., y = ., by = "ID") %>% 
#  filter(DRUG.x != DRUG.y | START.x != START.y | END.x != END.y) %>% 
  filter(DRUG.x != DRUG.y) %>%
  group_by(ID, intersection = intersect(interval(START.x, END.x), interval(START.y, END.y))) %>% 
  drop_na(intersection) %>% 
  filter(START.x == first(START.x)) %>% 
  summarise(DRUG  = paste(DRUG.x, DRUG.y, sep = "-"),
            START = as_date(int_start(intersection)),
            END   = as_date(int_end(intersection)),
            .groups = "drop") %>% 
  select(-intersection)

# A tibble: 1 x 4
     ID DRUG    START      END       
  <int> <chr>   <date>     <date>    
1     1 DPP4-SU 2020-03-15 2020-04-01

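Edit: Changed the filter condition. The previous one was flawed.

【Discussion】:

Thank you dear Martin, I think I'm missing something here but can't tell what it might be! Yours sounds safer!
I was wondering: "What happens if the DPP4 intervals don't overlap each other, but an SU interval overlaps both of them?" And can the answer be generalised to more than two drugs? A larger data set would be a good test scenario.
Yes, I agree. It can be generalised to more than two drugs, but as for your first scenario, I can't be sure until a larger data set is provided.
Thanks! This looks good. I tried changing the first lines before group_by to df %>% split(., .$DRUG) %>% full_join(x = .$DPP4, y = .$SU, by = "ID")

【Answer 2】:

Updated solution: I made fairly substantial changes based on the newly provided data set. This time I first created an interval for each START and END pair and extracted the intersecting period between consecutive intervals. As dear Martin made good use of them, we can use lubridate::int_start and lubridate::int_end to extract the START and END dates of each interval: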

library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)

df %>%
  group_by(ID) %>%
  arrange(START, END) %>%
  mutate(int = interval(START, END),
         is_over = c(NA, map2(int[-n()], int[-1], 
                              ~ intersect(.x, .y)))) %>%
  unnest(cols = c(is_over)) %>% 
  select(-int) %>%
  filter(!is.na(is_over) | !is.na(lead(is_over))) %>%
  select(!c(START, END)) %>%
  mutate(grp = cumsum(is.na(is_over))) %>%
  group_by(grp) %>%
  summarise(ID = first(ID), 
            DRUG = paste0(DRUG, collapse = "-"), 
            is_over = na.omit(is_over)) %>%
  mutate(START = int_start(is_over), 
         END = int_end(is_over)) %>%
  select(!is_over)

# A tibble: 1 x 5
    grp    ID DRUG    START               END                
  <int> <int> <chr>   <dttm>              <dttm>             
1     1     1 DPP4-SU 2020-03-15 00:00:00 2020-04-01 00:00:00

For the second data set:

# A tibble: 2 x 5
    grp    ID DRUG    START               END                
  <int> <dbl> <chr>   <dttm>              <dttm>             
1     1     3 DPP4-SU 2017-08-05 00:00:00 2017-08-27 00:00:00
2     2     3 SU-DPP4 2017-09-28 00:00:00 2017-12-21 00:00:00
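
For readers unfamiliar with the lubridate helpers used in this answer, here is a minimal sketch on hand-made intervals taken from patient 1 of the sample data. The variable names are illustrative only and do not come from the answer.

```r
library(lubridate)

# Two prescription periods of patient 1 in the sample data.
dpp4 <- interval(ymd("2020-03-01"), ymd("2020-04-01"))
su   <- interval(ymd("2020-03-15"), ymd("2020-04-30"))

# intersect() on two Interval objects returns their common period.
both <- lubridate::intersect(dpp4, su)

# int_start()/int_end() return date-times; as_date() drops the time part.
overlap_start <- as_date(int_start(both))
overlap_end   <- as_date(int_end(both))

# Two intervals without a common period give an NA interval,
# which is what the is.na() filter in the pipeline relies on.
early   <- interval(ymd("2020-01-01"), ymd("2020-01-20"))
no_both <- lubridate::intersect(early, su)
is.na(no_both)
```

This also explains the `<dttm>` columns in the output above: wrapping `int_start()` and `int_end()` in `as_date()`, as Answer 1 does, would give plain dates instead.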

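【Discussion】:

Great answer, upvoted! To be safe, I think it would be better to sort the rows by START and END in ascending order first.
Thank you dear Thomas, I adopted your very good suggestion. Do you think I'm missing anything here? Maybe a scenario in a larger data set that I haven't considered?
I think it depends on the OP's needs :)
Yes, let's see how it turns out. A nice base R solution would be a great addition :)
I added a base R option, but it doesn't look fancy at all :P

【Answer 3】:

Update

With the updated df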

df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c(
  "DPP4",
  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"
), START = structure(c(
  17004,
  17383, 17383, 17418, 17437, 17649, 17676
), class = c(
  "IDate",
  "Date"
)), END = structure(c(
  17039, 17405, 17405, 17521, 17625,
  17669, 17711
), class = c("IDate", "Date")), duration = c(
  35L,
  22L, 22L, 103L, 188L, 20L, 35L
), INDEX = c(
  1L, 0L, 0L, 0L, 0L,
  0L, 0L
)), row.names = c(NA, -7L), class = c(
  "tbl_df", "tbl",
  "data.frame"
))

we get

> dfnew
    ID    DRUG      start        end
3.3  3 DPP4-SU 2017-08-05 2017-08-27
3.7  3 SU-DPP4 2017-09-28 2017-12-21

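A base R option (not as fancy as the answers by @Anoushiravan R or @Martin Gal)

f <- function(d) {
  # sort the intervals of one patient chronologically
  d <- d[with(d, order(START, END)), ]
  # u[i, j] is TRUE when interval i starts before interval j ends;
  # u & t(u) is TRUE when intervals i and j overlap.
  # row > col keeps each overlapping pair once and drops the diagonal.
  idx <- subset(
    data.frame(which((u <- with(d, outer(START, END, `<`))) & t(u), arr.ind = TRUE)),
    row > col
  )
  # no overlapping pair: return a single NA row for this patient
  if (nrow(idx) == 0) {
    return(data.frame(ID = unique(d$ID), DRUG = NA, start = NA, end = NA))
  }
  with(
    d,
    do.call(rbind, 
    apply(
      idx,
      1,
      FUN = function(v) {
        data.frame(
          ID = ID[v["row"]],
          DRUG = paste0(DRUG[sort(unlist(v))], collapse = "-"),
          start = START[v["row"]],
          end = END[v["col"]]
        )
      }
    ))
  )
}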

dfnew <- do.call(rbind, Map(f, split(df, ~ID)))

which gives

> dfnew
  ID    DRUG      start        end
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>
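
The `outer()` step in `f` is the least obvious part, so here is a small sketch of it in isolation, run on the three rows of patient 1 from the original sample data. The names `START`, `END`, `u` and `idx` mirror the answer; the sketch itself is only an illustration and not part of the answer.

```r
# Intervals of patient 1, already sorted by START.
START <- as.Date(c("2020-01-01", "2020-03-01", "2020-03-15"))
END   <- as.Date(c("2020-01-20", "2020-04-01", "2020-04-30"))

# u[i, j] is TRUE when interval i starts before interval j ends.
u <- outer(START, END, `<`)

# Intervals i and j overlap when u[i, j] and u[j, i] are both TRUE.
both <- u & t(u)

# Keep each overlapping pair once (row > col also drops the diagonal,
# where every interval trivially "overlaps" itself).
idx <- subset(data.frame(which(both, arr.ind = TRUE)), row > col)
idx
```

Here the only surviving pair is row 3 with column 2, i.e. the SU interval starting 2020-03-15 and the DPP4 interval ending 2020-04-01. Note that the comparison is strict (`<`) and that the drug names are not checked at this point, so two overlapping intervals of the same drug would be reported as well.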

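【Discussion】:

Already upvoted them all, but why don't you try it in igraph?
@AnilGoyal Thanks. I don't know how to solve it in igraph, since we need to find the intersecting date intervals. Do you have any clue?
IMHO, a base R solution is fancier than using highly developed packages. +1
@MartinGal Thanks. Actually I need to learn those highly developed packages; I know very little about them.
@AnoushiravanR Haha, my luck :)

【Answer 4】:

You can use an approach slightly different from the answers above, although it gives the result in a format different from the desired one. Obviously, the rows can then be joined to get the expected result. You could try this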

df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4",  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004,  17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate",  "Date")), END = structure(c(17039, 17405, 17405, 17521, 17625,  17669, 17711), class = c("IDate", "Date"))), row.names = c(NA,  -7L), class = c("tbl_df", "tbl", "data.frame"))

df
#> # A tibble: 7 x 4
#>      ID DRUG  START      END       
#>   <dbl> <chr> <date>     <date>    
#> 1     3 DPP4  2016-07-22 2016-08-26
#> 2     3 DPP4  2017-08-05 2017-08-27
#> 3     3 SU    2017-08-05 2017-08-27
#> 4     3 SU    2017-09-09 2017-12-21
#> 5     3 DPP4  2017-09-28 2018-04-04
#> 6     3 DPP4  2018-04-28 2018-05-18
#> 7     3 DPP4  2018-05-25 2018-06-29
library(tidyverse)

df %>%
  mutate(treatment_id = row_number()) %>%
  pivot_longer(c(START, END), names_to = 'event', values_to = 'dates') %>%
  mutate(event = factor(event, levels = c('END', 'START'), ordered = TRUE)) %>%
  group_by(ID) %>%
  arrange(dates, event, .by_group = TRUE) %>%
  mutate(overlap = cumsum(ifelse(event == 'START', 1, -1))) %>%
  filter((overlap > 1 & event == 'START') | (overlap > 0 & event == 'END'))
#> # A tibble: 4 x 6
#> # Groups:   ID [1]
#>      ID DRUG  treatment_id event dates      overlap
#>   <dbl> <chr>        <int> <ord> <date>       <dbl>
#> 1     3 SU               3 START 2017-08-05       2
#> 2     3 DPP4             2 END   2017-08-27       1
#> 3     3 DPP4             5 START 2017-09-28       2
#> 4     3 SU               4 END   2017-12-21       1

On the data initially provided

# A tibble: 2 x 6
# Groups:   ID [1]
     ID DRUG  treatment_id event dates      overlap
  <int> <chr>        <int> <ord> <date>       <dbl>
1     1 SU               3 START 2020-03-15       2
2     1 DPP4             2 END   2020-04-01       1
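
The idea behind this pipeline is a sweep over the sorted START and END events: add 1 at every START, subtract 1 at every END, and look at the running total. A minimal base R sketch of the same idea, using a hypothetical helper `sweep_overlaps()` that is not part of the answer, could look like this:

```r
# Hypothetical helper: sweep over the events of ONE patient.
sweep_overlaps <- function(start, end) {
  ev <- data.frame(
    date  = c(start, end),
    event = rep(c("START", "END"), each = length(start)),
    stringsAsFactors = FALSE
  )
  # Sort by date; on the same day process END before START,
  # matching the factor level order used in the answer.
  ev <- ev[order(ev$date, ev$event == "START"), ]
  # Running count of prescriptions that are open after each event.
  ev$depth <- cumsum(ifelse(ev$event == "START", 1, -1))
  # An overlap opens at a START that raises the count above 1
  # and closes at an END that still leaves at least one open.
  ev[(ev$depth > 1 & ev$event == "START") |
       (ev$depth > 0 & ev$event == "END"), ]
}

# Patient 1 from the original sample data.
res <- sweep_overlaps(
  start = as.Date(c("2020-01-01", "2020-03-01", "2020-03-15")),
  end   = as.Date(c("2020-01-20", "2020-04-01", "2020-04-30"))
)
res
```

Because only one sort per patient is needed, this avoids comparing every interval with every other one. As written, the sketch (like the answer) counts overlapping intervals of the same drug too, and treats an END and a START on the same day as not overlapping.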

To convert the result back to the original shape, you can join the overlapping rows back to the data

library(tidyverse)
df_new <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4",  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004,  17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate",  "Date")), END = structure(c(17039, 17405, 17405, 17521, 17625,  17669, 17711), class = c("IDate", "Date"))), row.names = c(NA,  -7L), class = c("tbl_df", "tbl", "data.frame"))



df_new %>%
  mutate(treatment_id = row_number()) %>%
  pivot_longer(c(START, END), names_to = 'event', values_to = 'dates') %>%
  mutate(event = factor(event, levels = c('END', 'START'), ordered = TRUE)) %>%
  group_by(ID) %>%
  arrange(dates, event, .by_group = TRUE) %>%
  mutate(overlap = cumsum(ifelse(event == 'START', 1, -1))) %>%
  filter((overlap > 1 & event == 'START') | (overlap > 0 & event == 'END')) %>%
  left_join(df_new %>% mutate(treatment_id = row_number()), by = c('ID', 'DRUG', 'treatment_id'))
#> # A tibble: 4 x 8
#> # Groups:   ID [1]
#>      ID DRUG  treatment_id event dates      overlap START      END       
#>   <dbl> <chr>        <int> <ord> <date>       <dbl> <date>     <date>    
#> 1     3 SU               3 START 2017-08-05       2 2017-08-05 2017-08-27
#> 2     3 DPP4             2 END   2017-08-27       1 2017-08-05 2017-08-27
#> 3     3 DPP4             5 START 2017-09-28       2 2017-09-28 2018-04-04
#> 4     3 SU               4 END   2017-12-21       1 2017-09-09 2017-12-21
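
One possible way to collapse these event rows into one row per overlap window is sketched below. It is only an illustration and rests on a loud assumption: the filtered rows strictly alternate START, END, START, END, which holds for the output above but would not hold if three or more prescriptions were open at the same time. The `events` data frame simply retypes the relevant columns of that output.

```r
# Relevant columns of the filtered output above, retyped by hand.
events <- data.frame(
  ID    = c(3, 3, 3, 3),
  DRUG  = c("SU", "DPP4", "DPP4", "SU"),
  event = c("START", "END", "START", "END"),
  dates = as.Date(c("2017-08-05", "2017-08-27", "2017-09-28", "2017-12-21"))
)

# Assumption: rows alternate START/END, so the k-th START pairs
# with the k-th END.
starts <- events[events$event == "START", ]
ends   <- events[events$event == "END", ]

overlaps <- data.frame(
  ID    = starts$ID,
  DRUG  = paste(starts$DRUG, ends$DRUG, sep = "-"),
  START = starts$dates,
  END   = ends$dates
)
overlaps
```

The DRUG label here records which drug's event opened and which closed the window, so its order may differ from the labels produced by the other answers.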

Created on 2021-08-10 by the reprex package (v2.0.0)

【Discussion】:

Brilliant! I just need to merge the dates my own way.
