How to exclude same category dates near each other in R?

Posted 2023-02-14

【Asked】: 2022-01-01 00:23:16

【Question】:

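I am working with a data frame that consists of event dates in different categories. My goal is to exclude dates of the same category that occur within 80 days of a previous observation. For example: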

observation Date       category
1           2015-05-05 green
2           2015-06-08 green  #(this should be excluded)
3           2015-09-30 green
4           2014-06-30 red
5           2014-07-30 red    #(this should be excluded)
6           2014-09-30 red    #(this should not be excluded, since it is +80 days from the first obs.)
7           2013-01-01 blue
8           2013-02-01 blue   #(this should be excluded)
9           2013-06-01 blue
10          2013-07-01 blue   #(this should be excluded)

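I am trying to exclude dates of the same category that appear within 80 days of a previous observation (in my study these are considered the same event as the previous observation). However, I do not want to exclude an observation just because it falls within 80 days of an observation that is itself excluded. For example, observation 6 is within 80 days of observation 5, but observation 5 is excluded because it falls within 80 days of observation 4, the first observation in that category. Observation 6 is more than 80 days after observation 4, so it should be kept. Maybe you get the idea of what I am aiming for :D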

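I think I could do this with group_by and then calculate the difference in days between consecutive observations within each category. The problem is that this would also exclude dates like observation 6 in my example.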

I would greatly appreciate tips on the smartest way to do this. I searched previous topics but could not find anything helpful.

Regards, Aleksi

Edit: an example of the result using Merijn van Tilborg's code:

Date       Diff_days remove1 remove2
2015-06-29 119       FALSE FALSE
2015-07-09 7         FALSE TRUE
2015-07-15 6         FALSE TRUE
2015-08-18 34        FALSE TRUE
2015-10-03 46        FALSE TRUE

In this example the last observation should be kept, because it is more than 80 days after the last observation that was actually kept (2015-06-29).

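Edit 2: Following the iterative strategy suggested by benimwolfspelz, this is the code I used to calculate the number of days between consecutive observations within the same category:

library(dplyr)

df2 %>%
  arrange(Date) %>%                         # sort chronologically
  group_by(category) %>%                    # work within each category
  mutate(diff_date = c(0, diff(Date)))      # days since the previous observation (0 for the first)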
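To make the problem with this column concrete, here is a minimal base R sketch using the red category from the example table. The vector names are illustrative and not part of the original code. Filtering on the gap to the immediately preceding observation drops observation 6, although it is 92 days after observation 4, the last observation that is kept.

```r
# Red category from the example table (observations 4, 5 and 6)
red <- as.Date(c("2014-06-30", "2014-07-30", "2014-09-30"))

# Days since the immediately preceding observation (NA for the first one)
gap_prev <- c(NA, as.numeric(diff(red)))

# Naive rule: keep an observation only if the preceding one is at least 80 days away
keep_naive <- is.na(gap_prev) | gap_prev >= 80

# Gap between observation 6 and observation 4, the last one that is kept
gap_to_first <- as.numeric(red[3] - red[1])

print(data.frame(Date = red, gap_prev = gap_prev, keep_naive = keep_naive))
print(gap_to_first)
```

The naive rule marks the third date as removed because its gap to the second date is 62 days, even though the second date is itself removed.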

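【Question comments】:

So whether an observation should be excluded does not depend only on the observation directly before it, right? Nr 6 should not be excluded because Nr 5 has already been removed, and you want to process the observations in chronological order, right? Maybe write a function that excludes only the first observation that needs to be excluded, then apply it iteratively (while) until there are none left.

That is exactly what I want to do :) I will try it.

So now I managed to create a new column that holds the number of days between consecutive observations within each category. I suppose I could exclude the second observation in each category when the gap is less than 80 days, but how do I write the code so that it only looks at the second observation in each category?

Could you edit your question to include the code that creates this new column?

Edited now :)

【Answer 1】: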

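I extended the test set a little, but you will have to test it yourself in case I missed some cases. It can probably be done in one step, but I only managed it with two temporary remove columns.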

dt <- structure(list(Date = structure(c(16560, 16594, 16708, 16251, 
16281, 16343, 15706, 15737, 15857, 15887, 15888, 16252, 16617, 
16648), class = "Date"), category = c("green", "green", "green", 
"red", "red", "red", "blue", "blue", "blue", "blue", "blue", 
"blue", "blue", "blue"), message = c(NA, "this should be excluded", 
NA, NA, "this should be excluded", "this should not be excluded", 
NA, "this should be excluded", NA, "this should be excluded", 
"this should be excluded", "this should not be excluded", "this should not be excluded", 
NA)), row.names = c(NA, -14L), spec = structure(list(cols = list(
    Date = structure(list(format = ""), class = c("collector_date", 
    "collector")), category = structure(list(), class = c("collector_character", 
    "collector")), message = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"), class = c( 
"data.frame"))

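library(lubridate)   # days()
library(dplyr)       # lead() and lag(); data.table itself only provides shift()
library(data.table)
setDT(dt)

# remove1: the row is less than 80 days after the previous row of its category,
# while the rows before and after it are more than 80 days apart
dt[, remove1 := lead(Date) - lag(Date) > days(80) & Date - lag(Date) < days(80), by = category][is.na(remove1), remove1 := FALSE]
# remove2: the row is less than 80 days after a previous row that remove1 did not flag
dt[, remove2 := lag(remove1) == FALSE & Date - lag(Date) < days(80), by = category][is.na(remove2), remove2 := FALSE]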

dt

#           Date category                     message remove1 remove2
#  1: 2015-05-05    green                        <NA>   FALSE   FALSE
#  2: 2015-06-08    green     this should be excluded    TRUE    TRUE
#  3: 2015-09-30    green                        <NA>   FALSE   FALSE
#  4: 2014-06-30      red                        <NA>   FALSE   FALSE
#  5: 2014-07-30      red     this should be excluded    TRUE    TRUE
#  6: 2014-09-30      red this should not be excluded   FALSE   FALSE
#  7: 2013-01-01     blue                        <NA>   FALSE   FALSE
#  8: 2013-02-01     blue     this should be excluded    TRUE    TRUE
#  9: 2013-06-01     blue                        <NA>   FALSE   FALSE
# 10: 2013-07-01     blue     this should be excluded   FALSE    TRUE
# 11: 2013-07-02     blue     this should be excluded    TRUE    TRUE
# 12: 2014-07-01     blue this should not be excluded   FALSE   FALSE
# 13: 2015-07-01     blue this should not be excluded   FALSE   FALSE
# 14: 2015-08-01     blue                        <NA>   FALSE    TRUE

# keep only the rows that neither column flags for removal
dt[!(remove1 | remove2)]

#          Date category                     message remove1 remove2
# 1: 2015-05-05    green                        <NA>   FALSE   FALSE
# 2: 2015-09-30    green                        <NA>   FALSE   FALSE
# 3: 2014-06-30      red                        <NA>   FALSE   FALSE
# 4: 2014-09-30      red this should not be excluded   FALSE   FALSE
# 5: 2013-01-01     blue                        <NA>   FALSE   FALSE
# 6: 2013-06-01     blue                        <NA>   FALSE   FALSE
# 7: 2014-07-01     blue this should not be excluded   FALSE   FALSE
# 8: 2015-07-01     blue this should not be excluded   FALSE   FALSE

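【Comments】:

Thanks for your help! The code seems too advanced for my R skills and I do not fully understand it, but I decided to give it a try. It works well on our example, but when I run it on my own data frame I get the following error: "Error in days(80) : could not find function "days"". Maybe it has something to do with the date data type?

Sorry, I forgot to mention that it is part of the lubridate package.

Now it works when I run it. Great work, thank you very much. When I checked my data frame I noticed a systematic error that seems to appear when there are many observations within a short period. I posted an example in the original post.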
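The case raised in the last comment (many observations within a short period) can be handled by walking through each category in date order and comparing every date with the last date that was kept, rather than with its immediate neighbours. The following is a minimal base R sketch of that idea, not code from the thread. The helper name keep_spaced and its min_gap argument are hypothetical, and it assumes that a gap of exactly 80 days counts as far enough apart and that the dates passed in are sorted in ascending order.

```r
# Hypothetical helper (not from the thread): returns TRUE for dates to keep.
# Assumes `dates` is sorted ascending and belongs to a single category.
keep_spaced <- function(dates, min_gap = 80) {
  d <- as.numeric(dates)            # Dates as day counts
  keep <- logical(length(d))
  last_kept <- NA_real_
  for (i in seq_along(d)) {
    # Keep the first date, and any date at least min_gap days after the last kept one
    if (is.na(last_kept) || d[i] - last_kept >= min_gap) {
      keep[i] <- TRUE
      last_kept <- d[i]
    }
  }
  keep
}

# The ten observations from the question
df <- data.frame(
  Date = as.Date(c("2015-05-05", "2015-06-08", "2015-09-30",
                   "2014-06-30", "2014-07-30", "2014-09-30",
                   "2013-01-01", "2013-02-01", "2013-06-01", "2013-07-01")),
  category = rep(c("green", "red", "blue"), times = c(3, 3, 4))
)

# Sort within category, then apply the helper per category with ave()
df <- df[order(df$category, df$Date), ]
df$keep <- as.logical(ave(as.numeric(df$Date), df$category, FUN = keep_spaced))
result <- df[df$keep, ]
print(result)

# The dates from the question's edit: only the first and the last should be kept
edit_dates <- as.Date(c("2015-06-29", "2015-07-09", "2015-07-15",
                        "2015-08-18", "2015-10-03"))
edit_keep <- keep_spaced(edit_dates)
print(edit_keep)
```

Because each date is compared with the last kept date, a run of closely spaced observations cannot push the reference point forward, which is the behaviour the question asks for. The explicit loop is a deliberate choice here: the decision for each row depends on the decisions made for earlier rows, so it does not reduce to a single lead/lag comparison.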
