How to group sessions that have timestamps close to each other?
Asked: 2019-12-05 01:17:38
My scenario requires me to treat sessions that are less than 60 seconds apart as the same session.
The data looks like this.
Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8
2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0
2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080
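Sessions that are less than 60 seconds apart should be grouped together, while still being kept separate per device. Grouped, the data looks like this.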
Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8

2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0

2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73

2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null

2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080
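I would like to end up with something like this. Session_ID does not need to look like A1, B1, C1, etc.; it can simply be the first Session_ID of the group. Note that the latest Max_Timestamp of a group becomes its new Max_Timestamp.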
Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      A1
2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      B1
2019-12-03 17:00:32.160 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      B2
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      C1
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      C2
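My idea is to make Session_ID the same for all rows that belong to the same group, then group by Device_ID and Session_ID to get min(Min_Timestamp) and max(Max_Timestamp).
I tried playing with first_value() on Session_ID, but I could not figure out how to partition it correctly.
Ideally this would be done in Legacy SQL. If that is not possible, Standard SQL works too.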
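Comments:
Have you tried anything yourself? Can you show some effort? It is relatively simple, and you should be able to find similar questions here on SO. Have you tried searching?
I have. I mentioned in the question that I tried using first_value() on Session_ID and partitioning it, but I couldn't figure out how to partition correctly. I did search here; maybe my search keywords were wrong. Mind sharing a post that answers this question?
For me personally, just answering your question (which I will most likely do when I have time, unless it has been answered by then) is easier than searching. But SO etiquette still expects you to put some effort into searching for similar questions (I know for a fact there are plenty here) and/or to show your effort with an example of the query you tried and how it didn't work the way you wanted. Otherwise such posts look like attempts to outsource your homework, which is not welcome on SO.
I fully understand that I need to put in my own effort, and that SO is not a place where I paste a problem and expect someone to do the work for me. My attempts were all over the place and I had no lead. I don't even know whether first_value() is the right approach, and I'm fairly sure it isn't. The non-working code I have is not the right approach, let alone a solution. I don't even know what to search for. This is the kind of problem where I really don't know where to start, and I couldn't get anywhere near a solution.
Got it! By the way, why do you prefer Legacy?
Answer 1:
Below is for BigQuery Standard SQL. If you wish, you can "translate" it to Legacy SQL, but the recommendation is to migrate to Standard SQL now and use the query below.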
#standardSQL
SELECT
  MIN(Min_Timestamp) AS Min_Timestamp,
  MAX(Max_Timestamp) AS Max_Timestamp,
  Device_ID,
  Session_ID
FROM (
  -- The running count of flags per device numbers the groups: AAAAA1, AAAAA2, ...
  SELECT * EXCEPT(flag, Session_ID),
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    -- flag is TRUE when a row starts a new group: the first row of a device (NULL gap becomes 999) or a gap above 60 seconds
    SELECT *,
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
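To make the logic of the query above easier to follow outside BigQuery, here is a minimal plain-Python sketch of the same three steps: flag the rows that start a new group, take a running count of those flags per device, then aggregate per group. The names `parse` and `group_sessions` and the tuple layout are illustrative assumptions, not part of the answer, and truncating the gap to whole seconds is only assumed to approximate TIMESTAMP_DIFF(..., SECOND).

```python
from datetime import datetime
from itertools import accumulate, groupby


def parse(ts):
    """Parse 'YYYY-MM-DD HH:MM:SS.mmm' (no time zone suffix) into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def group_sessions(rows, gap_seconds=60):
    """rows: (min_ts, max_ts, device_id) tuples holding datetime values.

    Returns (min_ts, max_ts, device_id, group_label) tuples, one per group.
    """
    result = []
    # Equivalent of PARTITION BY Device_ID ORDER BY Max_Timestamp.
    ordered = sorted(rows, key=lambda r: (r[2], r[1]))
    for device, part in groupby(ordered, key=lambda r: r[2]):
        part = list(part)
        # Innermost SELECT: the first row of a device always starts a group,
        # later rows start one when the gap to the previous row is too large.
        flags = [True]
        for prev, cur in zip(part, part[1:]):
            gap = int((cur[0] - prev[1]).total_seconds())  # whole seconds (assumption)
            flags.append(gap > gap_seconds)
        # Middle SELECT: a running count of the flags gives the group number.
        group_numbers = list(accumulate(int(f) for f in flags))
        # Outer SELECT: MIN/MAX per (device, group number).
        numbered = zip(group_numbers, part)
        for number, members in groupby(numbered, key=lambda p: p[0]):
            members = [row for _, row in members]
            result.append((
                min(r[0] for r in members),
                max(r[1] for r in members),
                device,
                f"{device}{number}",
            ))
    return result
```

With the ten sample rows from the question, this sketch should yield six groups labelled AAAAA1 through CCCCC2.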
You can test and play with the above using the sample data from your question, as in the example below.
#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-12-03 23:05:30.416 UTC' Min_Timestamp, TIMESTAMP '2019-12-03 23:09:13.502 UTC' Max_Timestamp, 'AAAAA' Device_ID, 'I90HYTRFJI' Session_ID UNION ALL
  SELECT '2019-12-03 23:09:21.517 UTC', '2019-12-03 23:09:53.353 UTC', 'AAAAA', '98UHIGSNJR' UNION ALL
  SELECT '2019-12-03 00:00:28.933 UTC', '2019-12-03 00:09:03.473 UTC', 'BBBBB', '32QE8Y76TG' UNION ALL
  SELECT '2019-12-03 00:09:19.106 UTC', '2019-12-03 00:23:26.554 UTC', 'BBBBB', 'R4GUY432AD' UNION ALL
  SELECT '2019-12-03 00:23:26.818 UTC', '2019-12-03 00:23:26.837 UTC', 'BBBBB', 'E32GUYE328' UNION ALL
  SELECT '2019-12-03 17:00:32.160 UTC', '2019-12-03 17:03:48.758 UTC', 'BBBBB', 'GY1EW32876' UNION ALL
  SELECT '2019-12-03 17:03:58.069 UTC', '2019-12-03 17:17:12.408 UTC', 'BBBBB', '2876T128Y7' UNION ALL
  SELECT '2019-12-03 17:18:24.528 UTC', '2019-12-03 17:18:27.516 UTC', 'BBBBB', '098U6598U5' UNION ALL
  SELECT '2019-12-03 16:30:29.970 UTC', '2019-12-03 18:44:18.972 UTC', 'CCCCC', 'UWI4UII2J4' UNION ALL
  SELECT '2019-12-04 17:32:19.285 UTC', '2019-12-04 17:32:24.668 UTC', 'CCCCC', 'G3247ROIUH'
)
SELECT
  MIN(Min_Timestamp) AS Min_Timestamp,
  MAX(Max_Timestamp) AS Max_Timestamp,
  Device_ID,
  Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID),
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *,
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
-- ORDER BY Device_ID, Session_ID
with output:
Row  Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
1    2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      AAAAA1
2    2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      BBBBB1
3    2019-12-03 17:00:32.160 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      BBBBB2
4    2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      BBBBB3
5    2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      CCCCC1
6    2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      CCCCC2
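Two details of this output differ from the desired table in the question. The group labels are built from Device_ID plus a counter, whereas the question says the first Session_ID of a group would also do. And the row whose Diff_Sec is 73 exceeds the 60-second limit, so it ends up in its own group (BBBBB3), while the desired table folds it into B2. The plain-Python sketch below illustrates both points; it is not part of the answer, `merge_keep_first_id` and `gap_seconds` are hypothetical names, and comparing whole seconds is an assumption about how the gap should be measured.

```python
from datetime import datetime


def parse(ts):
    """Parse 'YYYY-MM-DD HH:MM:SS.mmm' (no time zone suffix) into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def merge_keep_first_id(rows, gap_seconds=60):
    """rows: (min_ts, max_ts, device_id, session_id) tuples with datetime values.

    Each merged group keeps the Session_ID of its first row
    (first in Max_Timestamp order within the device).
    """
    merged = []
    prev = None  # previous row in (device, Max_Timestamp) order
    for row in sorted(rows, key=lambda r: (r[2], r[1])):
        min_ts, max_ts, device, session_id = row
        same_group = (
            prev is not None
            and prev[2] == device
            # Gap between this row's start and the previous row's end.
            and int((min_ts - prev[1]).total_seconds()) <= gap_seconds
        )
        if same_group:
            group = merged[-1]
            group[0] = min(group[0], min_ts)
            group[1] = max(group[1], max_ts)
        else:
            merged.append([min_ts, max_ts, device, session_id])
        prev = row
    return [tuple(g) for g in merged]
```

Called with `gap_seconds=60` on the sample rows, this should give the same six groups as the output above, labelled by their first Session_ID. A larger value such as `gap_seconds=90` should merge the 73-second row into the preceding group, giving the five rows of the desired table in the question.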
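Comments:
Works great! Thank you, now I know how to use COUNTIF. I never realized it could be used this way!
Great. Also consider voting up the answer if it helped! :o)
I did :) Because of some restriction it didn't register at first. It's there now.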