How to group sessions that have timestamps close to each other?

Posted: 2019-12-05 01:17:38

Question:

My scenario requires me to treat sessions that are less than 60 seconds apart as the same session.

The data looks like this.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8
2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0
2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

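Sessions that are less than 60 seconds apart should be grouped together, while still being kept separate per device. It would look like this.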

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8

2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0

2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73

2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null

2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

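I would like to end up with something like this. The Session_ID does not need to look like A1, B1, C1, etc.; it can simply be the first Session_ID value of the group. Note that the latest Max_Timestamp in each group becomes the new Max_Timestamp.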

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      A1          
2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      B1
2019-12-03 17:00:32.160 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      B2
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      C1
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      C2

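My idea is to make the Session_ID identical for all rows that belong to the same group, then group by Device_ID, Session_ID to get min(Min_Timestamp) and max(Max_Timestamp). I tried fiddling with first_value() on Session_ID, but I could not figure out how to partition it correctly.

Ideally this would be done in Legacy SQL. If not, Standard SQL will work too.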

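Comments:

Have you tried anything yourself? Can you show some effort? It is relatively simple, and besides, you should be able to find similar questions on SO! Have you tried searching?

I have. I mentioned in the question that I tried using first_value() on Session_ID and partitioning it, but I could not figure out how to partition it correctly. I have already searched here. Maybe my search keywords were wrong. Would you mind sharing a post that contains the answer to this question?

For me personally, just answering your question (which I will most likely do when I have time, unless it has been answered by then) is easier than searching. But SO etiquette still expects you to make some effort to search for similar questions (I know for a fact there are plenty here) and/or to show your effort with an example of the query you tried and how it did not work the way you wanted, and so on. Otherwise these posts look like an attempt to outsource your homework, which is frowned upon on SO.

I fully understand that I need to put in the effort myself. I completely understand that SO is not a place where I copy and paste a problem and expect someone to do the work for me. My efforts have been all over the place. I have no lead. I don't even know whether first_value() is the right approach, and I'm pretty sure it isn't. Code that I already know does not work is not the right approach, nor even a solution. I don't even know how to search for an answer to this. This is the type of problem where I really don't know where to start. I can't even get close to a solution.

Got it! By the way, why do you prefer Legacy?

Answer 1: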

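The following is for BigQuery Standard SQL. If you prefer, you can "translate" it to Legacy SQL, but migrating to Standard SQL is recommended anyway, so consider doing that now and using the query below.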

#standardSQL
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
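  -- The running count of flags per device numbers the groups: 1, 2, 3, ...
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    -- flag is TRUE when a row starts a new group: the gap since the previous
    -- Max_Timestamp exceeds 60 seconds, or there is no previous row (NULL becomes 999).
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag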
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
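
To see why the running COUNTIF produces group numbers, the same logic can be traced in plain Python. This is only an illustrative sketch, not BigQuery code: the names `ts` and `label_sessions` and the tuple layout are assumptions made for the example, and the rows are the BBBBB rows from the question.

```python
from datetime import datetime

def ts(text):
    # Parse the question's timestamp format (all values are UTC).
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")

def label_sessions(rows, gap_seconds=60):
    """rows: (min_ts, max_ts, device_id) tuples.

    Hypothetical helper that returns (min_ts, max_ts, device_id, flag, label)
    per row, where flag marks the start of a new group.
    """
    prev_max = {}  # device -> previous row's Max_Timestamp (the LAG value)
    running = {}   # device -> running count of flags (the COUNTIF window)
    out = []
    # Same ordering as PARTITION BY Device_ID ORDER BY Max_Timestamp.
    for min_ts, max_ts, dev in sorted(rows, key=lambda r: (r[2], r[1])):
        prev = prev_max.get(dev)
        # The first row of a device has no previous row, so it starts a group.
        # Assumption: fractional seconds are compared here, whereas
        # TIMESTAMP_DIFF(..., SECOND) works in whole seconds.
        flag = prev is None or (min_ts - prev).total_seconds() > gap_seconds
        running[dev] = running.get(dev, 0) + (1 if flag else 0)
        prev_max[dev] = max_ts
        out.append((min_ts, max_ts, dev, flag, f"{dev}{running[dev]}"))
    return out

rows_b = [
    (ts("2019-12-03 00:00:28.933"), ts("2019-12-03 00:09:03.473"), "BBBBB"),
    (ts("2019-12-03 00:09:19.106"), ts("2019-12-03 00:23:26.554"), "BBBBB"),
    (ts("2019-12-03 00:23:26.818"), ts("2019-12-03 00:23:26.837"), "BBBBB"),
    (ts("2019-12-03 17:00:32.160"), ts("2019-12-03 17:03:48.758"), "BBBBB"),
    (ts("2019-12-03 17:03:58.069"), ts("2019-12-03 17:17:12.408"), "BBBBB"),
    (ts("2019-12-03 17:18:24.528"), ts("2019-12-03 17:18:27.516"), "BBBBB"),
]
labeled = label_sessions(rows_b)
for row in labeled:
    print(row[3], row[4])
```

For these rows the labels come out as BBBBB1, BBBBB1, BBBBB1, BBBBB2, BBBBB2, BBBBB3. The gap of roughly 72 seconds before the last row exceeds the threshold, so that row starts a third group.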

You can test it with the sample data from your question, as in the example below.

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-12-03 23:05:30.416 UTC' Min_Timestamp, TIMESTAMP '2019-12-03 23:09:13.502 UTC' Max_Timestamp, 'AAAAA' Device_ID, 'I90HYTRFJI' Session_ID UNION ALL
  SELECT '2019-12-03 23:09:21.517 UTC', '2019-12-03 23:09:53.353 UTC', 'AAAAA', '98UHIGSNJR' UNION ALL
  SELECT '2019-12-03 00:00:28.933 UTC', '2019-12-03 00:09:03.473 UTC', 'BBBBB', '32QE8Y76TG' UNION ALL
  SELECT '2019-12-03 00:09:19.106 UTC', '2019-12-03 00:23:26.554 UTC', 'BBBBB', 'R4GUY432AD' UNION ALL
  SELECT '2019-12-03 00:23:26.818 UTC', '2019-12-03 00:23:26.837 UTC', 'BBBBB', 'E32GUYE328' UNION ALL
  SELECT '2019-12-03 17:00:32.160 UTC', '2019-12-03 17:03:48.758 UTC', 'BBBBB', 'GY1EW32876' UNION ALL
  SELECT '2019-12-03 17:03:58.069 UTC', '2019-12-03 17:17:12.408 UTC', 'BBBBB', '2876T128Y7' UNION ALL
  SELECT '2019-12-03 17:18:24.528 UTC', '2019-12-03 17:18:27.516 UTC', 'BBBBB', '098U6598U5' UNION ALL
  SELECT '2019-12-03 16:30:29.970 UTC', '2019-12-03 18:44:18.972 UTC', 'CCCCC', 'UWI4UII2J4' UNION ALL
  SELECT '2019-12-04 17:32:19.285 UTC', '2019-12-04 17:32:24.668 UTC', 'CCCCC', 'G3247ROIUH' 
)
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
-- ORDER BY Device_ID, Session_ID  

with this output:

Row Min_Timestamp               Max_Timestamp               Device_ID   Session_ID   
1   2019-12-03 23:05:30.416 UTC 2019-12-03 23:09:53.353 UTC AAAAA       AAAAA1   
2   2019-12-03 00:00:28.933 UTC 2019-12-03 00:23:26.837 UTC BBBBB       BBBBB1   
3   2019-12-03 17:00:32.160 UTC 2019-12-03 17:17:12.408 UTC BBBBB       BBBBB2   
4   2019-12-03 17:18:24.528 UTC 2019-12-03 17:18:27.516 UTC BBBBB       BBBBB3   
5   2019-12-03 16:30:29.970 UTC 2019-12-03 18:44:18.972 UTC CCCCC       CCCCC1   
6   2019-12-04 17:32:19.285 UTC 2019-12-04 17:32:24.668 UTC CCCCC       CCCCC2     
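
The question mentions that the group label could simply be the first Session_ID of the group. As a rough sketch of that variant, again in plain Python rather than BigQuery, the rows can be merged in one pass. The names `ts` and `merge_sessions` and the tuple layout are hypothetical and chosen only for illustration.

```python
from datetime import datetime

def ts(text):
    # Parse the question's timestamp format (all values are UTC).
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")

def merge_sessions(rows, gap_seconds=60):
    """rows: (min_ts, max_ts, device_id, session_id) tuples.

    Returns one [min_ts, max_ts, device_id, first_session_id] list per group.
    """
    merged = []
    for min_ts, max_ts, dev, sid in sorted(rows, key=lambda r: (r[2], r[1])):
        last = merged[-1] if merged else None
        same_device = last is not None and last[2] == dev
        if same_device and (min_ts - last[1]).total_seconds() <= gap_seconds:
            # Close enough to the previous row: widen the current group.
            last[0] = min(last[0], min_ts)
            last[1] = max(last[1], max_ts)
        else:
            # New device, or a gap over the threshold: start a new group
            # labeled with this row's Session_ID.
            merged.append([min_ts, max_ts, dev, sid])
    return merged

rows = [
    (ts("2019-12-03 23:05:30.416"), ts("2019-12-03 23:09:13.502"), "AAAAA", "I90HYTRFJI"),
    (ts("2019-12-03 23:09:21.517"), ts("2019-12-03 23:09:53.353"), "AAAAA", "98UHIGSNJR"),
    (ts("2019-12-03 00:00:28.933"), ts("2019-12-03 00:09:03.473"), "BBBBB", "32QE8Y76TG"),
    (ts("2019-12-03 00:09:19.106"), ts("2019-12-03 00:23:26.554"), "BBBBB", "R4GUY432AD"),
    (ts("2019-12-03 00:23:26.818"), ts("2019-12-03 00:23:26.837"), "BBBBB", "E32GUYE328"),
    (ts("2019-12-03 17:00:32.160"), ts("2019-12-03 17:03:48.758"), "BBBBB", "GY1EW32876"),
    (ts("2019-12-03 17:03:58.069"), ts("2019-12-03 17:17:12.408"), "BBBBB", "2876T128Y7"),
    (ts("2019-12-03 17:18:24.528"), ts("2019-12-03 17:18:27.516"), "BBBBB", "098U6598U5"),
    (ts("2019-12-03 16:30:29.970"), ts("2019-12-03 18:44:18.972"), "CCCCC", "UWI4UII2J4"),
    (ts("2019-12-04 17:32:19.285"), ts("2019-12-04 17:32:24.668"), "CCCCC", "G3247ROIUH"),
]
groups = merge_sessions(rows)
for g in groups:
    print(g[2], g[3], g[0], g[1])
```

On the sample data this yields the same six groups as the output above, with each group carrying the Session_ID of its first row instead of a numbered label.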

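Comments:

Works great! Thank you, and now I know how to use COUNTIF. I never realized it could be used this way!

Great. Also consider voting up answers that helped! :o)

I did :) It didn't register because of some restriction. It's there now.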
