Populate the session ID, generating a new session ID when two consecutive events are more than 30 minutes apart

Posted 2023-03-25


【Original title】: populate the session id and generate new session id if timestamp difference two consecutive event is more then 30 min 【Posted】: 2022-01-06 09:02:33 【Question】:

Input: read from an existing Hive or Redshift table

user   |    Timestamp    |  SessionId
---------------------------------------
u1     |    10:00AM      |      ?    
u1     |    10:05AM      |      ?    
u1     |    10:10AM      |      ?    
u1     |    10:15AM      |      ?    
u1     |    11:40AM      |      ?    
u1     |    11:50AM      |      ?    
u1     |    12:15PM      |      ?

Expected output

user   |    Timestamp    |  SessionId
---------------------------------------
u1     |    10:00AM      |      s1    
u1     |    10:05AM      |      s1    
u1     |    10:10AM      |      s1    
u1     |    10:15AM      |      s1    
u1     |    11:40AM      |      s2    
u1     |    11:50AM      |      s2    
u1     |    12:15PM      |      s3

How can we solve this and update the existing table using Hive or Redshift?

【Comments】:

Is the timestamp exactly in that format, hh:mma? 【Answer 1】:

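Convert the timestamp to a unix_timestamp (seconds), use the lag() function to get the previous timestamp for the same user, and compute the difference. Set new_session=1 when the difference exceeds 30 minutes (or when there is no previous event), then take the running sum of new_session to get the session ID.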

with mydata as (
select 'u1' as `user`, '10:00AM' as `timestamp` union all
select 'u1', '10:05AM' union all
select 'u1', '10:10AM' union all
select 'u1', '10:15AM' union all
select 'u1', '11:40AM' union all
select 'u1', '11:50AM' union all
select 'u1', '12:15AM' -- 15 min after midnight
)

select `user`, `timestamp`,
       -- the running sum of the new_session flag gives the session number
       concat('s', sum(new_session) over(partition by `user` order by `timestamp`)) as session_id
from
(
select -- calculate the new_session flag from the difference between ts and prev_ts
       `user`, `timestamp`, ts, prev_ts,
       case when ((ts - prev_ts) / 60 > 30) or prev_ts is NULL then 1 end as new_session
from
(
select `user`, `timestamp`, ts,
       -- calculate the previous event time for the same user
       lag(ts) over(partition by `user` order by ts) as prev_ts
from
(
-- convert time to seconds
select `user`, `timestamp`, unix_timestamp(`timestamp`, 'hh:mma') as ts from mydata
) t1 -- ts conversion
) t2 -- prev_ts
) t3 -- new_session

Result:

user    timestamp   session_id
u1      10:00AM      s1
u1      10:05AM      s1
u1      10:10AM      s1
u1      10:15AM      s1
u1      11:40AM      s2
u1      11:50AM      s2
u1      12:15AM      s3
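
As a cross-check outside Hive, the same lag-and-running-sum logic can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `assign_sessions`, the constant `GAP_SECONDS`, and the date 2022-01-06 attached to the question's times are all hypothetical and not part of the original answer.

```python
from datetime import datetime
from itertools import groupby

GAP_SECONDS = 30 * 60  # session timeout taken from the question (30 minutes)


def assign_sessions(events):
    """Take (user, datetime) pairs and return (user, datetime, session_id) tuples."""
    out = []
    # Sorting by user and then time mirrors "partition by user order by ts".
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for user, group in groupby(ordered, key=lambda e: e[0]):
        prev_ts = None   # plays the role of lag(ts)
        session_no = 0   # running sum of the new_session flag
        for _, ts in group:
            # New session on the first event or when the gap exceeds 30 minutes.
            if prev_ts is None or (ts - prev_ts).total_seconds() > GAP_SECONDS:
                session_no += 1
            out.append((user, ts, f"s{session_no}"))
            prev_ts = ts
    return out


# The question's original times on an assumed (hypothetical) date; 12:15 is 12:15PM.
clock_times = [(10, 0), (10, 5), (10, 10), (10, 15), (11, 40), (11, 50), (12, 15)]
events = [("u1", datetime(2022, 1, 6, h, m)) for h, m in clock_times]
sessions = assign_sessions(events)

for user, ts, session_id in sessions:
    print(user, ts.strftime("%H:%M"), session_id)
```

With the question's original 12:15PM value, the last event is only 25 minutes after 11:50AM, so this sketch keeps it in s2 rather than opening s3.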

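Note that I changed 12:15PM to 12:15AM to get the s3 session. In your sample data, 12:15PM is 15 minutes after noon, so the difference between 11:50AM and 12:15PM is 25 minutes and does not trigger a new session. To get an s3 session as in your question, the value has to be 12:15AM, which is 15 minutes after midnight; see the 12-hour clock article on Wikipedia. Be aware that, with no date part, 12:15AM parses to a time earlier than 10:00AM, and it comes out last as s3 only because the outer window orders by the `timestamp` string rather than by ts. With real data that carries full date-times, order both windows by the parsed ts.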
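
The AM/PM arithmetic can be checked with a short Python sketch. It assumes an English or C locale, since `%p` is locale-dependent, and the variable names are illustrative only. `%I:%M%p` is used here as the rough Python counterpart of the 'hh:mma' pattern.

```python
from datetime import datetime

FMT = "%I:%M%p"  # 12-hour clock with an AM/PM marker

before_noon = datetime.strptime("11:50AM", FMT)
after_noon = datetime.strptime("12:15PM", FMT)      # 15 minutes after noon
after_midnight = datetime.strptime("12:15AM", FMT)  # 15 minutes after midnight

# Gap between 11:50AM and 12:15PM in minutes: below the 30-minute threshold.
gap_minutes = (after_noon - before_noon).total_seconds() / 60
print(gap_minutes)

# Without a date part, 12:15AM falls before 11:50AM on the same day.
print(after_midnight < before_noon)
```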
