Cumulative visit time series plot in R

Posted 2023-02-14

【Asked】2022-01-13 12:20:04 【Question】:

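I have a large dataset that records ids at a specific location throughout the day. What I want to do is plot the cumulative number of visits for each individual id over the period in which the data was collected.

A sample of the data is shown below; the full dataset covers several days of visits. I tried a few variations using cumsum but could not get them to work.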

 dput(df)
structure(list(date = c("06/01/2021", "06/01/2021", "06/01/2021", 
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", 
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", 
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", 
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", 
"06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", "06/01/2021", 
"06/01/2021", "06/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", "07/01/2021", 
"07/01/2021", "07/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", 
"08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", 
"08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", "08/01/2021", 
"08/01/2021"), time = c("08:02:54", "08:04:48", "08:04:49", "08:05:49", 
"08:05:50", "08:05:50", "08:05:51", "08:06:32", "08:06:33", "08:07:34", 
"08:07:34", "08:07:35", "08:07:36", "08:07:36", "08:09:52", "08:09:53", 
"08:09:53", "08:10:02", "08:10:04", "08:10:05", "08:10:05", "08:10:07", 
"08:10:08", "08:10:22", "08:10:42", "08:10:43", "08:11:14", "08:11:15", 
"08:11:38", "08:11:39", "08:11:39", "08:11:40", "08:11:40", "08:11:41", 
"08:11:48", "08:11:50", "08:11:51", "08:11:51", "08:11:52", "08:11:53", 
"08:11:54", "08:11:54", "08:12:36", "08:12:37", "08:12:38", "08:12:38", 
"08:13:25", "08:13:25", "08:14:09", "08:14:18", "08:14:19", "08:14:24", 
"08:14:24", "08:14:25", "08:14:37", "08:14:38", "08:14:58", "08:14:58", 
"08:14:59", "08:14:59", "08:15:03", "08:15:04", "08:15:04", "08:15:05", 
"08:15:12", "08:15:13", "08:15:13", "08:15:33", "08:15:34", "08:15:37", 
"08:15:39", "08:15:51", "08:16:12", "08:16:13", "08:16:14", "08:16:31", 
"08:16:32", "08:16:42", "08:17:00", "08:17:00", "08:17:01", "08:17:03", 
"08:17:19", "08:17:20", "08:17:22", "08:17:26", "08:17:26", "08:17:27", 
"08:17:27", "08:17:32", "08:17:32", "08:17:33", "08:17:50", "08:17:51", 
"08:17:51", "08:17:52", "08:18:38", "08:18:39", "08:18:39", "08:18:40", 
"08:18:41", "08:18:41", "08:19:44", "08:19:44", "08:19:46", "08:19:46", 
"08:22:27", "08:23:20", "08:23:20", "08:23:47", "08:23:48", "08:23:48", 
"08:23:52", "08:23:52"), id = c(2L, 3L, 2L, 3L, 4L, 5L, 3L, 4L, 
3L, 2L, 3L, 3L, 2L, 4L, 5L, 2L, 3L, 2L, 2L, 2L, 4L, 3L, 2L, 2L, 
4L, 5L, 3L, 2L, 4L, 5L, 3L, 3L, 4L, 5L, 6L, 4L, 3L, 5L, 4L, 5L, 
4L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 
5L, 3L, 2L, 4L, 5L, 3L, 2L, 2L, 4L, 5L, 3L, 3L, 4L, 5L, 6L, 4L, 
3L, 5L, 4L, 5L, 4L, 3L, 2L, 2L, 3L, 2L, 4L, 5L, 3L, 3L, 4L, 5L, 
6L, 4L, 3L, 5L, 4L, 5L, 4L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 
2L, 4L, 5L, 3L, 3L, 4L, 5L, 6L, 4L, 3L)), class = "data.frame", row.names = c(NA, 
-114L))

head(df)
        date     time id
1 06/01/2021 08:02:54  2
2 06/01/2021 08:04:48  3
3 06/01/2021 08:04:49  2
4 06/01/2021 08:05:49  3
5 06/01/2021 08:05:50  4
6 06/01/2021 08:05:50  5

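【Comments】:

What defines a "visit"? Is every row a visit? ID 2 has times at 08:02:54 and 08:04:49 on the same day. Are both of these visits?

Yes, they are; every row is a visit.

【Answer 1】:

A ggplot() solution that treats the data as factor variables, either for selected time steps or for all time steps.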

Count the visits per id and date (note that .N gives the total for each group, not a running count):

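library(data.table)
library(ggplot2)  # used for the plot below

dt <- as.data.table(df)
# add the number of visits per id and date as a new column (modifies dt by reference)
dd <- dt[, count := .N, by = .(id, date)]
# treat date as a factor so that each day gets its own facet
dd$date <- as.factor(dd$date)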
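Because the question asks for a cumulative count, it may help to see how a per-group total differs from a running count in data.table. The following is only a minimal sketch: the `toy` table and the `cum_visits` column are hypothetical names with made-up values in the same shape as the question's `df`, and it assumes every row counts as one visit.

```r
library(data.table)

# Toy data in the same shape as the question's df (values made up for illustration)
toy <- data.table(
  date = c("06/01/2021", "06/01/2021", "06/01/2021", "07/01/2021", "07/01/2021"),
  time = c("08:02:54", "08:04:48", "08:04:49", "08:11:39", "08:11:40"),
  id   = c(2L, 3L, 2L, 2L, 2L)
)

# Total visits per id and date: the same value is repeated on every row of a group
toy[, count := .N, by = .(id, date)]

# Running visit number per id across all dates, following the row order
# (assumes the rows are already sorted chronologically)
toy[, cum_visits := seq_len(.N), by = id]

print(toy)
```

The `count` column is what the heat map uses as its fill, while a column like `cum_visits` is the kind of value one would put on the y-axis of a cumulative line plot.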

Create the plot:

  ggplot(dd, aes(y=id, x=time, fill=count)) +  
      geom_tile() +
      scale_x_discrete(breaks = c("08:02:54","08:05:50", "08:07:34","08:10:02","08:13:25","08:16:32","08:19:44","08:23:52"))+ # remove this for all time-steps
      facet_wrap(~date)+
      scale_fill_gradient(low="lightyellow", high="red") + 
      labs(x="Time", y="Id", title="", fill="Number of visits") + 
      theme_bw()+
      theme(plot.title = element_text(hjust = 0.5,  face="bold", size=20, color="black")) + 
      theme(axis.title.x = element_text(family="Times", face="bold", size=16, color="black"))+
      theme(axis.title.y = element_text(family="Times", face="bold", size=16, color="black"))+
      theme(axis.text.x = element_text( hjust = 1,  face="bold", size=14, color="black", angle=90) )+
      theme(axis.text.y = element_text( hjust = 1,  face="bold", size=14, color="black") )+
      theme(plot.title = element_text(hjust = 0.5))+
      theme(legend.title = element_text(family="Times", color = "black", size = 16,face="bold"),
            legend.text = element_text(family="Times", color = "black", size = 14,face="bold"),
            legend.position="right",
            plot.title = element_text(hjust = 0.5))+
      theme(strip.text.x = element_text(size = 16, colour = "black",family="Times", face="bold"))

Alternatively, drop the facet_wrap() call to show all dates in a single panel.

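【Comments】:

This is a nice solution that I had not thought to try.

【Answer 2】:

Do you mean something like this?

Use lubridate to convert your data to date-time objects (easier to work with), then use cumsum(!duplicated(datetime)) to count the (unique) visits per id. Then plot with ggplot2.

The last line lets you modify the x-axis breaks.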

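library(dplyr)
library(lubridate)
library(ggplot2)

df %>%
  # "%Y" matches the four-digit year in dates such as "06/01/2021"
  mutate(datetime = as_datetime(paste(as.Date(date, "%d/%m/%Y"), time))) %>% 
  group_by(id) %>% 
  # running count of distinct timestamps per id
  mutate(cum_visits = cumsum(!duplicated(datetime))) %>% 
  ggplot(aes(x = datetime, y = cum_visits, color = factor(id), group = id)) +
  geom_line() +
  scale_x_datetime(breaks = scales::date_breaks("1 day"), date_labels = "%D - %H:%M")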
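Since the asker confirmed in the comments that every row is a visit, a variant that counts rows rather than distinct timestamps may be closer to what is wanted. The sketch below is an assumption-laden illustration, not the answer's own method: `toy`, `visits`, `cum_rows`, and `cum_unique` are hypothetical names, the data values are made up, and `geom_step()` is swapped in for `geom_line()` simply because a count changes in jumps.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy data (made up): id 2 has two rows with the same timestamp on 07/01/2021
toy <- data.frame(
  date = c("06/01/2021", "06/01/2021", "07/01/2021", "07/01/2021", "07/01/2021"),
  time = c("08:02:54", "08:04:49", "08:13:25", "08:13:25", "08:04:48"),
  id   = c(2L, 2L, 2L, 2L, 3L)
)

visits <- toy %>%
  # dmy_hms() parses day/month/year plus a time in one step (UTC by default)
  mutate(datetime = dmy_hms(paste(date, time))) %>%
  arrange(id, datetime) %>%
  group_by(id) %>%
  mutate(
    cum_rows   = row_number(),                   # every row counts as a visit
    cum_unique = cumsum(!duplicated(datetime))   # repeated timestamps count once
  ) %>%
  ungroup()

# Step plot of the row-based running count, one line per id
p <- ggplot(visits, aes(x = datetime, y = cum_rows, color = factor(id), group = id)) +
  geom_step() +
  labs(x = "Time", y = "Cumulative visits", color = "id")
```

The two columns diverge only where an id has repeated timestamps, so comparing them on the real data is a quick way to decide which definition of a visit fits.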
