How to calculate the time difference between consecutive rows

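The raw data looks like this. I want to sort it by visitor and time, calculate the time difference between consecutive rows for each visitor, and then save the result to a new file.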

  visitor         v_time payment items
1    Jack 1/2/2018 16:07      35     3
2    Jack 1/2/2018 16:09     160     1
3   David 1/2/2018 16:12      25     2
4    Kate 1/2/2018 16:16       3     3
5   David 1/2/2018 16:21      25     5
6    Jack 1/2/2018 16:32      85     5
7    Kate 1/2/2018 16:33     639     3
8    Jack 1/2/2018 16:55       6     2

Grouping and sorting work fine, but the time difference is not calculated and the result is not written to the file.

visitor <- c("Jack", "Jack", "David", "Kate", "David", "Jack", "Kate", "Jack")
v_time <- c("1/2/2018 16:07","1/2/2018 16:09","1/2/2018 16:12","1/2/2018 16:16","1/2/2018 16:21","1/2/2018 16:32","1/2/2018 16:33", "1/2/2018 16:55")
payment <- c(35,160,25,3,25,85,639,6)
items <- c(3,1,2,3,5,5,3,2)
df <- data.frame(visitor, v_time, payment, items)

df %>%
  arrange(visitor, v_time) %>%
  group_by(visitor) %>%
  mutate(diff = strptime(v_time, "%d/%m/%Y %H:%M") - lag(strptime(v_time, "%d/%m/%Y %H:%M")), diff_secs = as.numeric(diff, units = 'secs'))

write.csv(df,"C:/output.csv", row.names = F)

What is my mistake, and what is the correct way to do this?

# A tibble: 8 x 6
# Groups: visitor [3]
  visitor v_time         payment items diff   diff_secs
  <fct>   <fct>            <dbl> <dbl> <time>     <dbl>
1 David   1/2/2018 16:12   25.0   2.00 NA            NA
2 David   1/2/2018 16:21   25.0   5.00 NA            NA
3 Jack    1/2/2018 16:07   35.0   3.00 NA            NA
4 Jack    1/2/2018 16:09  160     1.00 NA            NA
5 Jack    1/2/2018 16:32   85.0   5.00 NA            NA
6 Jack    1/2/2018 16:55    6.00  2.00 NA            NA
7 Kate    1/2/2018 16:16    3.00  3.00 NA            NA
8 Kate    1/2/2018 16:33  639     3.00 NA            NA

Answer

If you just add default = strptime(v_time, "%d/%m/%Y %H:%M")[1] to the lag part:

df <- df %>%
  arrange(visitor, v_time) %>%
  group_by(visitor) %>%
  mutate(diff = strptime(v_time, "%d/%m/%Y %H:%M") - lag(strptime(v_time, "%d/%m/%Y %H:%M"), default = strptime(v_time, "%d/%m/%Y %H:%M")[1]),
         diff_secs = as.numeric(diff, units = 'secs'))

you get the result you expect:

> df
# A tibble: 8 x 6
# Groups:   visitor [3]
  visitor v_time         payment items diff   diff_secs
  <fct>   <fct>            <dbl> <dbl> <time>     <dbl>
1 David   1/2/2018 16:12     25.    2. 0             0.
2 David   1/2/2018 16:21     25.    5. 540         540.
3 Jack    1/2/2018 16:07     35.    3. 0             0.
4 Jack    1/2/2018 16:09    160.    1. 120         120.
5 Jack    1/2/2018 16:32     85.    5. 1380       1380.
6 Jack    1/2/2018 16:55      6.    2. 1380       1380.
7 Kate    1/2/2018 16:16      3.    3. 0             0.
8 Kate    1/2/2018 16:33    639.    3. 1020       1020.

Another option is to use difftime:

df <- df %>%
  arrange(visitor, v_time) %>%
  group_by(visitor) %>%
  mutate(diff = difftime(strptime(v_time, "%d/%m/%Y %H:%M"), lag(strptime(v_time, "%d/%m/%Y %H:%M"), default = strptime(v_time, "%d/%m/%Y %H:%M")[1]), units = 'mins'),
         diff_secs = as.numeric(diff, units = 'secs'))

Now the diff column is in minutes and the diff_secs column is in seconds:

> df
# A tibble: 8 x 6
# Groups:   visitor [3]
  visitor v_time         payment items diff   diff_secs
  <fct>   <fct>            <dbl> <dbl> <time>     <dbl>
1 David   1/2/2018 16:12     25.    2. 0             0.
2 David   1/2/2018 16:21     25.    5. 9           540.
3 Jack    1/2/2018 16:07     35.    3. 0             0.
4 Jack    1/2/2018 16:09    160.    1. 2           120.
5 Jack    1/2/2018 16:32     85.    5. 23         1380.
6 Jack    1/2/2018 16:55      6.    2. 23         1380.
7 Kate    1/2/2018 16:16      3.    3. 0             0.
8 Kate    1/2/2018 16:33    639.    3. 17         1020.

You can now save the result again with write.csv(df, "C:/output.csv", row.names = FALSE).
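
A possible simplification, offered as a sketch rather than as part of the answer above: parse v_time once into POSIXct with as.POSIXct() and reuse that column, instead of calling strptime() several times inside mutate(). The data frame name visits, the tz = "UTC" setting, dropping the payment and items columns, and leaving the first row of each visitor as NA are all assumptions made for this illustration.

```r
library(dplyr)

# Hypothetical copy of the question's data (only the columns needed here)
visits <- data.frame(
  visitor = c("Jack", "Jack", "David", "Kate", "David", "Jack", "Kate", "Jack"),
  v_time = c("1/2/2018 16:07", "1/2/2018 16:09", "1/2/2018 16:12",
             "1/2/2018 16:16", "1/2/2018 16:21", "1/2/2018 16:32",
             "1/2/2018 16:33", "1/2/2018 16:55"),
  stringsAsFactors = FALSE
)

result <- visits %>%
  # Parse once; POSIXct is a plain numeric vector, which lag() handles well
  mutate(v_time = as.POSIXct(v_time, format = "%d/%m/%Y %H:%M", tz = "UTC")) %>%
  arrange(visitor, v_time) %>%
  group_by(visitor) %>%
  # Seconds since the same visitor's previous row; NA for the first row
  mutate(diff_secs = as.numeric(difftime(v_time, lag(v_time), units = "secs"))) %>%
  ungroup()

print(result)
```

Parsing before arrange() also means the rows are sorted by actual time rather than by the text of the date string, which matters once the data spans more than one day.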

Another answer

The error comes from lag(strptime(v_time, "%d/%m/%Y %H:%M"))

Error message:

# Error in format.POSIXlt(x, usetz = TRUE) : 
#  invalid component [[10]] in "POSIXlt" should be 'zone'

To avoid this, try strptime(lag(v_time), "%d/%m/%Y %H:%M") instead:

df <- df %>%
    arrange(visitor, v_time) %>%
    group_by(visitor) %>%
    mutate(diff = strptime(v_time, "%d/%m/%Y %H:%M") - strptime(lag(v_time), "%d/%m/%Y %H:%M"), diff_secs = as.numeric(diff, units = 'secs'))
print(df)

Output:

# A tibble: 8 x 6
# Groups:   visitor [3]
  visitor         v_time payment items    diff diff_secs
   <fctr>         <fctr>   <dbl> <dbl>  <time>     <dbl>
1   David 1/2/2018 16:12      25     2 NA mins        NA
2   David 1/2/2018 16:21      25     5  9 mins       540
3    Jack 1/2/2018 16:07      35     3 NA mins        NA
4    Jack 1/2/2018 16:09     160     1  2 mins       120
5    Jack 1/2/2018 16:32      85     5 23 mins      1380
6    Jack 1/2/2018 16:55       6     2 23 mins      1380
7    Kate 1/2/2018 16:16       3     3 NA mins        NA
8    Kate 1/2/2018 16:33     639     3 17 mins      1020

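Before exporting, remember to assign the result back to df with df <-; otherwise write.csv saves the original, unmodified data frame.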
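
A likely reason lag() struggles with the output of strptime() is that strptime() returns a POSIXlt object, which is stored as a list of components (seconds, minutes, hours, and so on) rather than as one value per element. The short sketch below only illustrates that difference; the variable names x, lt, ct, and gap_secs are made up for the example, and tz = "UTC" is an assumption.

```r
x <- c("1/2/2018 16:07", "1/2/2018 16:09")

lt <- strptime(x, "%d/%m/%Y %H:%M", tz = "UTC")  # class POSIXlt
ct <- as.POSIXct(lt)                              # class POSIXct

is.list(unclass(lt))     # TRUE: a list of date-time components
is.numeric(unclass(ct))  # TRUE: seconds since the epoch, one per element

# On POSIXct, element-wise operations behave like ordinary vector arithmetic
gap_secs <- as.numeric(diff(ct), units = "secs")
gap_secs                 # 120
```

Converting to POSIXct before any grouped operation is therefore usually the safer choice inside data frames.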

Another answer

Here is an approach using the lubridate package:

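library(dplyr)
library(lubridate)
# The dates are day/month/year, so parse them with dmy_hm
df$v_time <- dmy_hm(as.character(df$v_time))
df <- df %>%
  arrange(visitor, v_time) %>%
  as.data.frame()
df$diff <- rep(0, nrow(df))
for (i in seq_len(nrow(df) - 1)) {
  # Only compare rows of the same visitor; each visitor's first row stays 0
  if (df$visitor[i + 1] == df$visitor[i]) {
    # Fix the units so every difference is in seconds
    df$diff[i + 1] <- as.numeric(difftime(df$v_time[i + 1], df$v_time[i], units = "secs"))
  }
}
write.csv(df, "C:/output.csv", row.names = FALSE)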
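
The row-by-row loop can also be written without a loop. The following is a minimal base R sketch using ave() and diff(), not part of the original answer; the data frame name visits, tz = "UTC", and the choice of 0 for each visitor's first row are assumptions for illustration.

```r
# Hypothetical copy of the question's data (only the columns needed here)
visits <- data.frame(
  visitor = c("Jack", "Jack", "David", "Kate", "David", "Jack", "Kate", "Jack"),
  v_time = c("1/2/2018 16:07", "1/2/2018 16:09", "1/2/2018 16:12",
             "1/2/2018 16:16", "1/2/2018 16:21", "1/2/2018 16:32",
             "1/2/2018 16:33", "1/2/2018 16:55"),
  stringsAsFactors = FALSE
)

# Parse day/month/year strings into POSIXct
visits$v_time <- as.POSIXct(visits$v_time, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Sort by visitor, then by time
visits <- visits[order(visits$visitor, visits$v_time), ]

# Within each visitor, take successive differences of the epoch seconds;
# the first row of each visitor gets 0
visits$diff_secs <- ave(as.numeric(visits$v_time), visits$visitor,
                        FUN = function(t) c(0, diff(t)))

print(visits)
```

Because ave() applies the function per group, differences never cross from one visitor to the next.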
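
Another answer

Here is an option with difftime. We convert 'v_time' to a datetime with dmy_hm (from lubridate), then, after arranging the rows and grouping by 'visitor', use difftime to get the difference in seconds.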

library(tidyverse)
library(lubridate)  # provides dmy_hm; older tidyverse versions do not attach it
out <- df %>% 
        mutate(v_time = dmy_hm(v_time)) %>% 
        arrange(visitor, v_time) %>% 
        group_by(visitor) %>%
        mutate(diff = difftime(v_time, lag(v_time, default = first(v_time)), units = "secs"))
# A tibble: 8 x 5
# Groups: visitor [3]
#  visitor v_time              payment items diff  
#  <fctr>  <dttm>                <dbl> <dbl> <time>
#1 David   2018-02-01 16:12:00   25.0   2.00 0     
#2 David   2018-02-01 16:21:00   25.0   5.00 540   
#3 Jack    2018-02-01 16:07:00   35.0   3.00 0     
#4 Jack    2018-02-01 16:09:00  160     1.00 120   
#5 Jack    2018-02-01 16:32:00   85.0   5.00 1380  
#6 Jack    2018-02-01 16:55:00    6.00  2.00 1380  
#7 Kate    2018-02-01 16:16:00    3.00  3.00 0     
#8 Kate    2018-02-01 16:33:00  639     3.00 1020  

Then we write the CSV with write_csv:

write_csv(out, "yourfile.csv")
