Cumulative visit time series plot in R
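【Title】: Cumulative visit time series plot in R 【Posted】: 2022-01-13 12:20:04
【Question】:
I have a large dataset that records an id at a particular location throughout the day.
What I would like to do is plot the cumulative number of visits for each individual id over the period in which the data were collected.
A sample of the data is shown below; the full dataset contains visits over a number of days.
I have tried a few variations using cumsum but can't get it to work.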
dput(df)
structure(list(date = c("06/01/2021", "06/01/2021", "06/01/2021",
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021",
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021",
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021",
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021",
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021",
"06/01/2021", "06/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021",
"07/01/2021", "07/01/2021", "08/01/2021", "08/01/2021", "08/01/2021",
"08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021",
"08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021",
"08/01/2021"), time = c("08:02:54", "08:04:48", "08:04:49", "08:05:49",
"08:05:50", "08:05:50", "08:05:51", "08:06:32", "08:06:33", "08:07:34",
"08:07:34", "08:07:35", "08:07:36", "08:07:36", "08:09:52", "08:09:53",
"08:09:53", "08:10:02", "08:10:04", "08:10:05", "08:10:05", "08:10:07",
"08:10:08", "08:10:22", "08:10:42", "08:10:43", "08:11:14", "08:11:15",
"08:11:38", "08:11:39", "08:11:39", "08:11:40", "08:11:40", "08:11:41",
"08:11:48", "08:11:50", "08:11:51", "08:11:51", "08:11:52", "08:11:53",
"08:11:54", "08:11:54", "08:12:36", "08:12:37", "08:12:38", "08:12:38",
"08:13:25", "08:13:25", "08:14:09", "08:14:18", "08:14:19", "08:14:24",
"08:14:24", "08:14:25", "08:14:37", "08:14:38", "08:14:58", "08:14:58",
"08:14:59", "08:14:59", "08:15:03", "08:15:04", "08:15:04", "08:15:05",
"08:15:12", "08:15:13", "08:15:13", "08:15:33", "08:15:34", "08:15:37",
"08:15:39", "08:15:51", "08:16:12", "08:16:13", "08:16:14", "08:16:31",
"08:16:32", "08:16:42", "08:17:00", "08:17:00", "08:17:01", "08:17:03",
"08:17:19", "08:17:20", "08:17:22", "08:17:26", "08:17:26", "08:17:27",
"08:17:27", "08:17:32", "08:17:32", "08:17:33", "08:17:50", "08:17:51",
"08:17:51", "08:17:52", "08:18:38", "08:18:39", "08:18:39", "08:18:40",
"08:18:41", "08:18:41", "08:19:44", "08:19:44", "08:19:46", "08:19:46",
"08:22:27", "08:23:20", "08:23:20", "08:23:47", "08:23:48", "08:23:48",
"08:23:52", "08:23:52"), id = c(2L, 3L, 2L, 3L, 4L, 5L, 3L, 4L,
3L, 2L, 3L, 3L, 2L, 4L, 5L, 2L, 3L, 2L, 2L, 2L, 4L, 3L, 2L, 2L,
4L, 5L, 3L, 2L, 4L, 5L, 3L, 3L, 4L, 5L, 6L, 4L, 3L, 5L, 4L, 5L,
4L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L,
5L, 3L, 2L, 4L, 5L, 3L, 2L, 2L, 4L, 5L, 3L, 3L, 4L, 5L, 6L, 4L,
3L, 5L, 4L, 5L, 4L, 3L, 2L, 2L, 3L, 2L, 4L, 5L, 3L, 3L, 4L, 5L,
6L, 4L, 3L, 5L, 4L, 5L, 4L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L,
2L, 4L, 5L, 3L, 3L, 4L, 5L, 6L, 4L, 3L)), class = "data.frame", row.names = c(NA,
-114L))
head(df)
        date     time id
1 06/01/2021 08:02:54  2
2 06/01/2021 08:04:48  3
3 06/01/2021 08:04:49  2
4 06/01/2021 08:05:49  3
5 06/01/2021 08:05:50  4
6 06/01/2021 08:05:50  5
【Comments】:
What defines a "visit"? Is every row a visit? ID 2 has times 08:02:54 and 08:04:49 on the same day. Are both of those visits?
Yes, they are; every row is a visit.
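【Answer 1】:
A ggplot() solution that treats the data as factor variables, either for selected time steps or for all time steps.
Number of visits per id and date (.N gives the total for each id on each day, not a running total):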
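library(data.table)
library(ggplot2)  # needed for the plot below
dt <- as.data.table(df)
# count = number of rows (visits) for each id on each date
dd <- dt[, count := .N, by = .(id, date)]
dd$date <- as.factor(dd$date)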
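The count above repeats one total on every row of an id/date group. If a running (cumulative) visit number is wanted instead, one possible sketch with data.table is shown here. The toy table and the column name cum_visits are made up for illustration, and the rows are assumed to be in time order already, as in the question's sample.

```r
library(data.table)

# Toy data (hypothetical), same columns as the question's df
toy <- data.table(
  date = c("06/01/2021", "06/01/2021", "06/01/2021", "07/01/2021"),
  time = c("08:02:54", "08:04:48", "08:04:49", "08:11:39"),
  id   = c(2L, 3L, 2L, 2L)
)

# Total visits per id per day: the same value on every row of a group
toy[, count := .N, by = .(id, date)]

# Running visit number per id across all days (assumes rows are in time order)
toy[, cum_visits := seq_len(.N), by = id]

print(toy)
```

Grouping by id alone lets the running total carry over from one day to the next; grouping by .(id, date) would restart it each day.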
Create the plot:
ggplot(dd, aes(y=id, x=time, fill=count)) +
geom_tile() +
scale_x_discrete(breaks = c("08:02:54","08:05:50", "08:07:34","08:10:02","08:13:25","08:16:32","08:19:44","08:23:52"))+ # remove this for all time-steps
facet_wrap(~date)+
scale_fill_gradient(low="lightyellow", high="red") +
labs(x="Time", y="Id", title="", fill="Number of visits") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5, face="bold", size=20, color="black")) +
theme(axis.title.x = element_text(family="Times", face="bold", size=16, color="black"))+
theme(axis.title.y = element_text(family="Times", face="bold", size=16, color="black"))+
theme(axis.text.x = element_text( hjust = 1, face="bold", size=14, color="black", angle=90) )+
theme(axis.text.y = element_text( hjust = 1, face="bold", size=14, color="black") )+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.title = element_text(family="Times", color = "black", size = 16,face="bold"),
legend.text = element_text(family="Times", color = "black", size = 14,face="bold"),
legend.position="right",
plot.title = element_text(hjust = 0.5))+
theme(strip.text.x = element_text(size = 16, colour = "black",family="Times", face="bold"))
Or without facet_wrap()
【Comments】:
This is a nice solution that I hadn't thought of trying.
【Answer 2】:
Do you mean something like this?
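Use lubridate to convert your data to datetime objects (easier to work with), then use cumsum(!duplicated(datetime)) to count the (unique) visits per id. Then plot with ggplot2.
The last line lets you modify the x-axis breaks.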
library(dplyr)
library(lubridate)
library(ggplot2)

df %>%
  # %Y parses the four-digit year in dates such as 06/01/2021
  mutate(datetime = as_datetime(paste(as.Date(date, "%d/%m/%Y"), time))) %>%
  group_by(id) %>%
  mutate(cumsum = cumsum(!duplicated(datetime))) %>%
  ggplot(aes(x = datetime, y = cumsum, color = factor(id), group = id)) +
  geom_line() +
  scale_x_datetime(breaks = scales::date_breaks("1 day"), date_labels = "%D - %H:%M")
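To see what cumsum(!duplicated(datetime)) counts, here is a small base R sketch on made-up data (the toy data frame and the column names cum_unique and cum_rows are assumptions for illustration). It also shows the four-digit-year format. When the same id appears twice with an identical timestamp, the duplicated-based count treats the two rows as one visit, whereas a plain row count treats each row as a visit, which is how the asker defined it in the comments.

```r
# Toy data (hypothetical); the last two rows share id and timestamp
toy <- data.frame(
  date = c("06/01/2021", "06/01/2021", "07/01/2021", "07/01/2021"),
  time = c("08:02:54", "08:04:49", "08:11:39", "08:11:39"),
  id   = c(2L, 2L, 2L, 2L)
)

# %Y reads the four-digit year; %y would read only the first two digits
toy$datetime <- as.POSIXct(paste(toy$date, toy$time),
                           format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Counts distinct timestamps per id: the repeated row adds nothing
toy$cum_unique <- ave(as.numeric(toy$datetime), toy$id,
                      FUN = function(x) cumsum(!duplicated(x)))

# Counts every row per id: each row is one visit
toy$cum_rows <- ave(toy$id, toy$id, FUN = seq_along)

print(toy[, c("datetime", "id", "cum_unique", "cum_rows")])
```

Which of the two counts is appropriate depends on whether repeated timestamps for one id are genuine separate visits or logging duplicates.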
【Comments】: