Optimal BigQuery query over many array-typed fields
【Original title】: Optimal query for BigQuery for many fields that array type
【Posted】: 2017-01-31 13:01:43
【Question】: In Google BigQuery I have defined a table with 5 fields, which I load from JSON. The schema is shown below; let's call the table user_data.
An Array type in BigQuery is simply a repeated field.
userid: String
cats: Array[Int]
features: Array[Long]
segments: Array[Int]
tags: Array[Int]
I need to run a query along these lines (pseudocode):
select count userid
from user_data
where
(123,265) in cats and
(555,666,777) in segments and
(100, 200) in tags
What is the best-performing way to run this kind of query, and what should its syntax be?
【Answer 1】: Try the query below. It is written for BigQuery standard SQL.
#standardSQL
WITH user_data AS (
SELECT '1' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
SELECT '2' AS userid, ARRAY<INT64>[1231,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
SELECT '3' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[5551,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags
)
SELECT COUNT(userid) AS count_userid
FROM user_data
WHERE (SELECT COUNT(DISTINCT cat) FROM UNNEST(cats) AS cat WHERE cat IN (123, 265)) = 2
AND (SELECT COUNT(DISTINCT segment) FROM UNNEST(segments) AS segment WHERE segment IN (555,666,777)) = 3
AND (SELECT COUNT(DISTINCT tag) FROM UNNEST(tags) AS tag WHERE tag IN (100, 200)) = 2
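As a hedged aside that is not part of the original answer: when the list of required values is short, the same "every value must be present" check can also be spelled out with IN UNNEST, one test per required value. The sketch below reuses the sample rows from the query above (the features column is left out because the filter never touches it). The source says nothing about how this form performs on real data, so treat it as an equivalent spelling rather than a recommendation.

```sql
#standardSQL
-- Sample rows copied from the answer above; features omitted for brevity.
WITH user_data AS (
  SELECT '1' AS userid, [123, 265] AS cats, [555, 666, 777] AS segments, [100, 200] AS tags UNION ALL
  SELECT '2', [1231, 265], [555, 666, 777], [100, 200] UNION ALL
  SELECT '3', [123, 265], [5551, 666, 777], [100, 200]
)
SELECT COUNT(userid) AS count_userid
FROM user_data
-- Each required value gets its own membership test against the array.
WHERE 123 IN UNNEST(cats) AND 265 IN UNNEST(cats)
  AND 555 IN UNNEST(segments) AND 666 IN UNNEST(segments) AND 777 IN UNNEST(segments)
  AND 100 IN UNNEST(tags) AND 200 IN UNNEST(tags)
```

On these three rows only user 1 satisfies every test: user 2 lacks 123 in cats, and user 3 lacks 555 in segments.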
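【Comments】:
Thanks. Does it perform an OR between the values within a single dimension?
No. It checks that all of the expected values are present in their respective dimension. If you want at least one match per dimension instead, change =2 and =3 to >0.
【Answer 2】: A variation on Mikhail's answer. I believe Julias wants to count the users for whom the condition holds on every dimension, i.e. at least one of the constants matches in each. In that case EXISTS will be more efficient than COUNT(DISTINCT), i.e.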
#standardSQL
WITH user_data AS (
SELECT '1' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
SELECT '2' AS userid, ARRAY<INT64>[1231,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
SELECT '3' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[5551,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags
)
SELECT COUNT(userid) AS count_userid
FROM user_data
WHERE EXISTS(SELECT 1 FROM UNNEST(cats) cat WHERE cat IN (123, 265))
AND EXISTS(SELECT 1 FROM UNNEST(segments) segment WHERE segment IN (555,666,777))
AND EXISTS(SELECT 1 FROM UNNEST(tags) tag WHERE tag IN (100, 200))
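The two answers implement different semantics, which is easy to miss because both use the same sample rows. The following sketch, which is an illustration and not taken from either answer, evaluates both conditions per user and counts them side by side. The CTE name flags and the column names all_match, any_match, users_all_values, and users_any_value are made up for this example.

```sql
#standardSQL
WITH user_data AS (
  SELECT '1' AS userid, [123, 265] AS cats, [555, 666, 777] AS segments, [100, 200] AS tags UNION ALL
  SELECT '2', [1231, 265], [555, 666, 777], [100, 200] UNION ALL
  SELECT '3', [123, 265], [5551, 666, 777], [100, 200]
),
flags AS (
  SELECT
    userid,
    -- "All" semantics (Answer 1): every listed value must appear in the array.
    ((SELECT COUNT(DISTINCT cat) FROM UNNEST(cats) AS cat WHERE cat IN (123, 265)) = 2
      AND (SELECT COUNT(DISTINCT segment) FROM UNNEST(segments) AS segment WHERE segment IN (555, 666, 777)) = 3
      AND (SELECT COUNT(DISTINCT tag) FROM UNNEST(tags) AS tag WHERE tag IN (100, 200)) = 2) AS all_match,
    -- "Any" semantics (Answer 2): at least one listed value must appear in the array.
    (EXISTS(SELECT 1 FROM UNNEST(cats) AS cat WHERE cat IN (123, 265))
      AND EXISTS(SELECT 1 FROM UNNEST(segments) AS segment WHERE segment IN (555, 666, 777))
      AND EXISTS(SELECT 1 FROM UNNEST(tags) AS tag WHERE tag IN (100, 200))) AS any_match
  FROM user_data
)
SELECT
  COUNTIF(all_match) AS users_all_values,
  COUNTIF(any_match) AS users_any_value
FROM flags
```

On the sample rows this should give users_all_values = 1 (only user 1 has every value) and users_any_value = 3 (users 2 and 3 each still have at least one matching value in every array), so the choice between the two answers depends on which meaning the question intends.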