Split date ranges into chunks ending on YYYY-12-31
Suppose I have the following df:
df <- data.frame(group = c("a", "a", "b", "b"),
                 start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
                 end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"))
group start end
1 a 2017-05-01 2018-09-01
2 a 2019-04-03 2020-04-03
3 b 2011-03-03 2012-05-03
4 b 2014-05-07 2016-04-02
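I want to convert it into this format, where each record is split at 31 December of every year it spans, so that each piece runs from its start date (or 1 January) to 31 December (or the original end date):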
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
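Any ideas on how to solve this?
Edit:
I was mainly concerned with date ranges that are not within a single year. However, as chinsoon12 pointed out, it would indeed help if the method could also handle ranges that fall within one year, as in this data set: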
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
The final result would keep the last row unchanged:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
10 c 2017-02-01 2017-04-05
Answer
A possible data.table solution:
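library(data.table)
setDT(df)  # convert by reference; start and end must already be Date (see the data below)

# repeat each row once per calendar year it spans, then clip start/end to year boundaries
df[df[, rep(.I, 1 + year(end) - year(start))]
   ][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
             end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
     , by = .(group, rleid(start))][]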
which gives:
    group      start        end
 1:     a 2017-05-01 2017-12-31
 2:     a 2018-01-01 2018-09-01
 3:     a 2019-04-03 2019-12-31
 4:     a 2020-01-01 2020-04-03
 5:     b 2011-03-03 2011-12-31
 6:     b 2012-01-01 2012-05-03
 7:     b 2014-05-07 2014-12-31
 8:     b 2015-01-01 2015-12-31
 9:     b 2016-01-01 2016-04-02
10:     c 2017-02-01 2017-04-05
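The core of this approach is clipping each calendar year to the original range with pmax and pmin. As a minimal base R sketch of the same idea for a single range, assuming start is not later than end (the helper name split_by_year is made up for illustration and is not part of data.table):

```r
# Hypothetical helper: split one [start, end] range at calendar-year boundaries.
# Assumes start <= end.
split_by_year <- function(start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  # every calendar year touched by the range
  yrs <- as.integer(format(start, "%Y")):as.integer(format(end, "%Y"))
  data.frame(
    # a piece starts on 1 January, except the first, which starts on `start`
    start = pmax(start, as.Date(paste0(yrs, "-01-01"))),
    # a piece ends on 31 December, except the last, which ends on `end`
    end = pmin(end, as.Date(paste0(yrs, "-12-31")))
  )
}

pieces <- split_by_year("2014-05-07", "2016-04-02")
print(pieces)
```

For the range 2014-05-07 to 2016-04-02 this yields three pieces, matching rows 7 to 9 of the output above. A range that stays within one year comes back as a single unchanged row.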
Two alternative solutions using data.table:
# alternative 1:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = if (.N == 1) start else c(start[1], as.Date(paste0(year(start[1]) + 1:(.N-1), '-01-01') )),
end = if (.N == 1) end else c(as.Date(paste0(year(end[.N]) - (.N-1):1, '-12-31') ), end[.N]))
, by = .(group, ri)][, ri := NULL][]
# alternative 2:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
, by = .(group, ri)][, ri := NULL][]
Data used:
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
df[2:3] <- lapply(df[2:3], as.Date)
Another answer
Here is a no-tidyverse / no-data.table version:
df <- data.frame(group = c("a", "a", "b", "b"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"), stringsAsFactors=FALSE)
# added stringsAsFactors =FALSE to your df for sanity
# reformatting start and end as Date
df$start <- as.Date(df$start)
df$end <- as.Date(df$end)
dfs <- split(df, rownames(df))
# split the data frame by rows
res <- do.call(rbind, lapply(dfs, function(.){
s <- seq(from=.$start, to=.$end, by="day")
# sequence from df$start to df$end, by days
y <- format(s, "%Y")
# years of that sequence
s2 <- as.character(s)
# formatting s as character -- otherwise sapply will get rid of the
# Date class and the result will look as numeric
ys <- split(s2,y)
# split the sequence by years
data.frame(group=.$group, start=sapply(ys, head,1), end = sapply(ys, tail, 1), stringsAsFactors=FALSE)
# take the first and last element from each "sub-vector" of the split sequence
}))
rownames(res) <- NULL # kill the nasty rownames
res
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
Note that the result has the start and end columns as character, just as in the original data frame.
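I am sorry about the way base R treats Date (and POSIXct) objects: you never know when they will lose their class and turn into plain numbers. Here I avoid this "feature" by treating the dates as character, except where date operations are needed, such as creating the date sequence.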
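The class loss mentioned here is easy to reproduce. The following short base R illustration shows sapply dropping the Date class, together with one possible workaround that the answer itself does not use: combining the pieces with do.call(c, lapply(...)).

```r
d <- as.Date(c("2017-05-01", "2018-09-01"))
ys <- split(d, format(d, "%Y"))   # list of Date vectors, one per year

# sapply simplifies its result with unlist(), which drops the Date class
lost <- sapply(ys, head, 1)
print(class(lost))

# combining the list elements with c() dispatches to the Date method
kept <- do.call(c, lapply(ys, head, 1))
print(class(kept))
```

Here lost is a plain numeric vector (days since 1970-01-01), while kept is still a Date vector. Converting to character first, as the answer does, sidesteps the issue in a different way.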
Another answer
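library(tidyverse)
library(lubridate)
df %>%
  mutate(end = as.Date(end),
         start = as.Date(start),
         # one offset per calendar year spanned: 0, 1, ..., year(end) - year(start)
         diff = Map(":", 0, year(end) - year(start))) %>%
  unnest(diff) %>%
  # clip each yearly piece to the original range
  mutate(end = pmin(end, as.Date(paste0(year(start) + diff, "-12-31"))),
         start = pmax(start, as.Date(paste0(year(start) + diff, "-01-01"))),
         diff = NULL)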
# A tibble: 9 x 3
  group start      end
  <fct> <date>     <date>
1 a     2017-05-01 2017-12-31
2 a     2018-01-01 2018-09-01
3 a     2019-04-03 2019-12-31
4 a     2020-01-01 2020-04-03
5 b     2011-03-03 2011-12-31
6 b     2012-01-01 2012-05-03
7 b     2014-05-07 2014-12-31
8 b     2015-01-01 2015-12-31
9 b     2016-01-01 2016-04-02
With the updated data, running exactly the same code gives:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
10 c 2017-02-01 2017-04-05
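Whichever solution is used, the result can be sanity-checked against the requirement from the question: no piece may cross a year boundary. A small base R sketch of such a check follows; the function name check_pieces is hypothetical, and the sample rows are copied from the expected output.

```r
# Hypothetical check: every piece must stay inside one calendar year,
# and its start must not come after its end.
check_pieces <- function(res) {
  s <- as.Date(res$start)
  e <- as.Date(res$end)
  all(format(s, "%Y") == format(e, "%Y")) && all(s <= e)
}

res <- data.frame(group = c("b", "b", "b", "c"),
                  start = c("2014-05-07", "2015-01-01", "2016-01-01", "2017-02-01"),
                  end = c("2014-12-31", "2015-12-31", "2016-04-02", "2017-04-05"),
                  stringsAsFactors = FALSE)
print(check_pieces(res))
```

The check accepts both character and Date columns, since as.Date leaves Date values unchanged. It does not verify that the pieces cover the original ranges without gaps.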