BigQuery - remove rows based on time difference
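【Title】: BigQuery - remove rows based on time difference 【Posted】: 2020-07-30 05:32:56 【Question】: I have a table in BigQuery, though the same problem could apply to any database.
I want to remove rows based on a time condition. When a user clicks too quickly, a duplicate entry is created that needs to be removed. In some cases, however, two valid leads arrive very close together because the IP address or advertiser differs, and those must be kept. Rows need to be deduplicated when they fall within 4 seconds of each other.
I also need to make sure that once a row is flagged as a duplicate, the following rows do not use that duplicate row's timestamp to derive the 4-second flag.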
ip_address datetime advertiser order_number Comment
34.195.131 2020-07-03 22:45:02.585 UTC homepage 5678 KEEP
34.195.131 2020-07-03 22:45:05.593 UTC homepage 5678 REMOVE - WITHIN 4 SECONDS OF B2
34.195.131 2020-07-03 22:45:08.923 UTC homepage 5678 KEEP - SINCE B3 WAS REMOVED, C4 IS NOW MORE THAN 4 SECONDS FROM B2
34.195.131 2020-07-03 22:45:13.788 UTC homepage 5678 KEEP
34.195.131 2020-07-03 22:45:16.523 UTC homepage 5678 REMOVE - WITHIN 4 SECONDS OF B5
34.195.131 2020-07-03 22:45:20.393 UTC homepage 5678 KEEP - SINCE B6 WAS REMOVED, LESS THAN 4 SECONDS OF B4
34.195.131 2020-07-03 22:45:21.247 UTC homepage 5678 REMOVE - WITHIN 4 SECONDS OF B7
34.195.131 2020-07-03 22:45:24.924 UTC homepage 5678 KEEP - SINCE B8 WAS REMOVED AND MORE THAN 4 SECONDS OF B7
34.195.131 2020-07-03 22:45:27.443 UTC homepage 5678 REMOVE - WITHIN 4 SECONDS OF B9
34.195.131 2020-07-03 22:45:30.561 UTC homepage 5678 KEEP - SINCE B10 WAS REMOVED AND MORE THAN 4 SECONDS OF B9
34.195.131 2020-07-03 22:45:32.561 UTC homepage 5678 REMOVE - WITHIN 4 SECONDS OF B11
34.195.131 2020-07-03 22:45:33.935 UTC homepage 5678 REMOVE - WITHIN 4 SECONDS OF B11
34.195.131 2020-07-03 22:45:36.083 UTC homepage 5678 KEEP - SINCE B12 AND B13 WERE REMOVED AND MORE THAN 4 SECONDS OF B11
34.195.132 2020-07-03 22:45:38.849 UTC homepage 5678 KEEP - EVEN THOUGH WITHIN 4 SECONDS OF B14, THIS IS A DIFFERENT IP_ADDRESS
34.195.132 2020-07-03 22:45:38.949 UTC homepage 1234 KEEP - EVEN THOUGH WITHIN 4 SECONDS OF B15 THIS IS A NEW ORDER_NUMBER
I have tried CTEs and self-joins, so far without success. Can anyone show me how to do this, or point me toward how to proceed?
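【Answer 1】: I am not entirely sure about the requirements. If you could add a description of what B2, B3, etc. refer to, I think the comments would make the required logic clearer. In any case, based on my understanding, I implemented the following logic:
Create a dummy table: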
WITH
data AS
(
  -- Sample rows from the question; datetime is still a STRING at this point
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:02.585 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:05.593 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:08.923 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:13.788 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:16.523 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:20.393 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:21.247 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:24.924 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:27.443 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:30.561 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:32.561 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:33.935 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.131' ip_address, '2020-07-03 22:45:36.083 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.132' ip_address, '2020-07-03 22:45:38.849 UTC' datetime, 'homepage' advertiser, '5678' order_number
  UNION ALL
  SELECT '34.195.132' ip_address, '2020-07-03 22:45:38.949 UTC' datetime, 'homepage' advertiser, '1234' order_number
),
data_corrected AS
(
  -- Cast the string column to TIMESTAMP so time differences can be computed
  SELECT ip_address, CAST(datetime AS TIMESTAMP) datetime, advertiser, order_number
  FROM data
)
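The logic is as follows: I use the LAG and LEAD window functions to get the preceding and following values, partitioning the records by ip_address, advertiser, order_number and ordering them by datetime, and then calculate the time delta.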
SELECT
  d.*,
  -- Next and previous click within the same ip_address / advertiser / order_number group
  LEAD(d.datetime) OVER w AS followed_by_click,
  -- Compare in milliseconds: at SECOND granularity the sub-second part is lost, so a 4.5 s gap could be misclassified
  CASE WHEN TIMESTAMP_DIFF(LEAD(d.datetime) OVER w, d.datetime, MILLISECOND) <= 4000 THEN 'Duplicate' ELSE 'Keep' END AS delta_followed_by_click,
  LAG(d.datetime) OVER w AS preceding_click,
  CASE WHEN TIMESTAMP_DIFF(d.datetime, LAG(d.datetime) OVER w, MILLISECOND) <= 4000 THEN 'Duplicate' ELSE 'Keep' END AS delta_preceding_click
FROM data_corrected d
WINDOW w AS (PARTITION BY d.ip_address, d.advertiser, d.order_number ORDER BY d.datetime ASC)
ORDER BY d.datetime DESC
I hope this helps you get to a result.
【Comments】:
A, B, C are the Excel columns and 1, 2, 3 are the Excel rows. The comments are not part of the data; they only indicate what to keep and what to remove. Column B is the datetime.