Optimal BigQuery query over many array-type fields

【Title】Optimal BigQuery query for many array-type fields 【Posted】2017-01-31 13:01:43 【Question】:

In Google BigQuery I have defined a table with 5 fields, which I load from JSON. The schema is shown below; let's call the table user_data.

An Array type in BigQuery is simply a repeated field.

userid: String
cats: Array[Int]
features: Array[Long]
segments: Array[Int]
tags: Array[Int]
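
For reference, the schema above might be declared roughly as follows in BigQuery Standard SQL DDL. This is only a sketch: the dataset name mydataset is a placeholder that does not appear in the question, and both Int and Long are assumed to map to INT64, since BigQuery stores integers as 64-bit values.

```sql
#standardSQL
-- Hypothetical dataset name "mydataset"; replace it with your own.
CREATE TABLE mydataset.user_data (
  userid   STRING,
  cats     ARRAY<INT64>,  -- ARRAY<...> corresponds to a REPEATED field
  features ARRAY<INT64>,
  segments ARRAY<INT64>,
  tags     ARRAY<INT64>
);
```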

I need to run a query along these lines:

select count userid
from user_data
where 
 (123,265) in cats and
 (555,666,777) in segments and
 (100, 200) in tags

What is the most performant way to run this kind of query, and what should its syntax be?

【Answer 1】:

Try the query below. It is written for BigQuery Standard SQL.

#standardSQL
WITH user_data AS (
  SELECT '1' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '2' AS userid, ARRAY<INT64>[1231,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '3' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[5551,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags 
)
SELECT COUNT(userid) AS count_userid
FROM user_data
WHERE (SELECT COUNT(DISTINCT cat) FROM UNNEST(cats) AS cat WHERE cat IN (123, 265)) = 2
AND (SELECT COUNT(DISTINCT segment) FROM UNNEST(segments) AS segment WHERE segment IN (555,666,777)) = 3
AND (SELECT COUNT(DISTINCT tag) FROM UNNEST(tags) AS tag WHERE tag IN (100, 200)) = 2
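
If the lists of wanted values change often, the hard-coded counts (2, 3, 2) are easy to leave out of sync with the IN lists. As a sketch of an alternative that expresses the same "contains all" condition without the counts, one can check that no wanted value is missing from the array. This variant is not part of the original answer; it assumes the same sample rows as above and relies on the IN UNNEST(...) operator. Whether it is faster or slower than the COUNT(DISTINCT ...) form has not been measured here.

```sql
#standardSQL
WITH user_data AS (
  SELECT '1' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '2' AS userid, ARRAY<INT64>[1231,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '3' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[5551,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags
)
SELECT COUNT(userid) AS count_userid
FROM user_data
-- A row qualifies when no wanted value is absent from the corresponding array.
WHERE NOT EXISTS (SELECT 1 FROM UNNEST([123, 265]) AS wanted WHERE wanted NOT IN UNNEST(cats))
  AND NOT EXISTS (SELECT 1 FROM UNNEST([555, 666, 777]) AS wanted WHERE wanted NOT IN UNNEST(segments))
  AND NOT EXISTS (SELECT 1 FROM UNNEST([100, 200]) AS wanted WHERE wanted NOT IN UNNEST(tags))
```

On the three sample rows only user 1 contains every wanted value in all three arrays, so both this form and the COUNT(DISTINCT ...) form count one user.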

【Comments】:

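Thanks. Does it perform an OR between the values within a single dimension?

It checks that all of the expected values are present in the respective dimension. If you want at least one match per dimension instead, change = 2 and = 3 to > 0.

【Answer 2】: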

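A variation on Mikhail's answer. I believe Julias wants to count the users for whom the condition holds in every dimension, i.e. at least one of the constants matches. In that case EXISTS will be more efficient than COUNT(DISTINCT), i.e.: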

#standardSQL
WITH user_data AS (
  SELECT '1' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '2' AS userid, ARRAY<INT64>[1231,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '3' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[5551,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags 
)
SELECT COUNT(userid) AS count_userid
FROM user_data
WHERE EXISTS(SELECT 1 FROM UNNEST(cats) cat WHERE cat IN (123, 265))
AND EXISTS(SELECT 1 FROM UNNEST(segments) segment WHERE segment IN (555,666,777))
AND EXISTS(SELECT 1 FROM UNNEST(tags) tag WHERE tag IN (100, 200))
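
To see how the two readings of the question differ, the following sketch evaluates both conditions per user for the cats dimension only. The column names has_all_cats and has_any_cat are made up for illustration, and the sample rows are the same as above.

```sql
#standardSQL
WITH user_data AS (
  SELECT '1' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '2' AS userid, ARRAY<INT64>[1231,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[555,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags UNION ALL
  SELECT '3' AS userid, ARRAY<INT64>[123,265] AS cats, ARRAY<INT64>[1,2] AS features, ARRAY<INT64>[5551,666,777] AS segments, ARRAY<INT64>[100, 200] AS tags
)
SELECT
  userid,
  -- "All values present" reading, as in the first answer.
  (SELECT COUNT(DISTINCT cat) FROM UNNEST(cats) AS cat WHERE cat IN (123, 265)) = 2 AS has_all_cats,
  -- "At least one value present" reading, as in this answer.
  EXISTS(SELECT 1 FROM UNNEST(cats) AS cat WHERE cat IN (123, 265)) AS has_any_cat
FROM user_data
ORDER BY userid
```

User 2, whose cats array is [1231, 265], is the row where the two readings diverge: it contains 265 but not 123, so has_any_cat is true while has_all_cats is false.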
