BigQuery - Remove duplicates from array
【Posted】: 2020-03-26 10:58:30
【Question】: Using BigQuery, I want to group pages by title with a single query and compute various metrics for each group. Because the title rules are not mutually exclusive, I did it like this:
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
  UNNEST([
    CASE WHEN (title LIKE '%game%')
      THEN 'games_group' END,
    CASE WHEN (title LIKE '%sport%')
      THEN 'sports_group' END
  ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group
The result looks like this:
views ... title_group
3414469869 ... null
4355264 ... games_group
1361074 ... sports_group
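However, the figure 3414469869 for pages that belong to no group is wrong. When a title contains 'sport' but not 'game', we get UNNEST([null, "sports_group"]) (or UNNEST(["games_group", null]) in the opposite case), so that page's views are still counted in the null group. When a title contains neither 'game' nor 'sport', the array is [null, null] and its views are counted twice.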
Is there a way to remove duplicates from the array?
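The double counting described in the question can be reproduced without scanning the public dataset. The following is a minimal sketch in BigQuery Standard SQL; the inline `pages` table, its four titles, and their view counts are made-up values used only for illustration.

```sql
-- Hypothetical toy data standing in for the pageviews table.
WITH pages AS (
  SELECT 'video game' AS title, 10 AS views UNION ALL
  SELECT 'sport news', 20 UNION ALL
  SELECT 'game of sport', 30 UNION ALL
  SELECT 'weather', 40
)
SELECT title_group, SUM(views) AS views
FROM pages
CROSS JOIN UNNEST([
  -- Each CASE yields NULL when its rule does not match,
  -- so every page always produces exactly two array elements.
  CASE WHEN title LIKE '%game%' THEN 'games_group' END,
  CASE WHEN title LIKE '%sport%' THEN 'sports_group' END
]) AS title_group
GROUP BY title_group
ORDER BY title_group;
```

With this toy data the NULL group should sum to 110 (10 from 'video game', 20 from 'sport news', and 40 twice from 'weather'), even though only 40 views belong to pages that match neither rule.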
【Comments on the question】:
【Answer 1】: How about adding another group?
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
     UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END,
             CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
             CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
            ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
      title_group IS NOT NULL
GROUP BY title_group;
Note: this does not account for NULL titles. I don't know whether that matters.
However, I would express this with two columns:
SELECT (title LIKE '%game%') AS is_game,
       (title LIKE '%sport%') AS is_sport,
       SUM(views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en'
-- No title_group filter here: this query has no UNNEST, so that alias does not exist.
GROUP BY is_game, is_sport;
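This does not return the same rows as your query: it returns one row per combination of is_game and is_sport, so games and sports are not merged into a single column. But you can see the combinations.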
Edit:
Now that I think about it, you just want a LEFT JOIN:
-- Caveat (untested here): BigQuery may reject a LEFT JOIN whose ON clause
-- contains no equality condition, such as the LIKE below.
SELECT g.title_group, SUM(pv.views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
     (SELECT '%game%' AS pattern, 'games_group' AS title_group UNION ALL
      SELECT '%sport%', 'sports_group' AS title_group
     ) g
     ON pv.title LIKE g.pattern
WHERE DATE(pv.datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      pv.wiki = 'en'
GROUP BY g.title_group;
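【Comments】:
Yes, adding another group could be a solution! As for the second query, I can't use it because I really need just one column.
@Rebecca . . . I think the edited solution is what you really want.
【Answer 2】: The following is for BigQuery Standard SQL: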
#standardSQL
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST(
  CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN
    [
      CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
      CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
    ]
  ELSE ['other']
  END
) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
  AND title_group IS NOT NULL
GROUP BY title_group
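A variant of the same idea, closer to the literal question of cleaning up the array, is to strip NULLs and duplicates from the array before unnesting it and to fall back to a single 'other' element when nothing is left. This is only a sketch, not part of either answer: the inline `pages` table, the `tagged` CTE, and the `title_groups` column are hypothetical names with made-up values.

```sql
-- Hypothetical toy data standing in for the pageviews table.
WITH pages AS (
  SELECT 'video game' AS title, 10 AS views UNION ALL
  SELECT 'sport news', 20 UNION ALL
  SELECT 'game of sport', 30 UNION ALL
  SELECT 'weather', 40
),
tagged AS (
  SELECT
    views,
    -- Rebuild the array without NULLs; DISTINCT also drops repeated
    -- group names if several rules map to the same group.
    ARRAY(
      SELECT DISTINCT g
      FROM UNNEST([
        CASE WHEN title LIKE '%game%' THEN 'games_group' END,
        CASE WHEN title LIKE '%sport%' THEN 'sports_group' END
      ]) AS g
      WHERE g IS NOT NULL
    ) AS title_groups
  FROM pages
)
SELECT title_group, SUM(views) AS views
FROM tagged
CROSS JOIN UNNEST(
  -- An empty array means no rule matched, so count the page once as 'other'.
  IF(ARRAY_LENGTH(title_groups) = 0, ['other'], title_groups)
) AS title_group
GROUP BY title_group
ORDER BY title_group;
```

On the toy data this should give 40 for games_group, 40 for other, and 50 for sports_group, so each unmatched page is counted exactly once. Adding a new rule only requires one more CASE expression, without repeating its pattern in a separate REGEXP_CONTAINS check.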
【Comments】: