BigQuery: JOIN ON with repeated / array STRUCT field in Standard SQL?

Posted: 2018-07-02 12:46:30

【Question】:

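I basically have two tables, Orders and Items. Because these tables are imported from Google Cloud Datastore backup files, references are not made through a simple ID field. For one-to-one relationships they go through a <STRUCT> whose id field holds the actual unique ID I want to match on. For one-to-many relationships (REPEATED), the schema uses an array of <STRUCT>s.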

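I can query the one-to-one relationship with a LEFT OUTER JOIN, and I also know how to join on a non-repeated struct and on a repeated string or int, but I can't get a JOIN to work on a repeated struct.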

One order, one item:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, STRUCT(STRUCT(2 AS id, "default" AS ns) AS key) AS item UNION ALL 
  SELECT 2 AS __oid__, STRUCT(STRUCT(4 AS id, "default" AS ns) AS key) AS item UNION ALL 
  SELECT 3 AS __oid__, STRUCT(STRUCT(6 AS id, "default" AS ns) AS key) AS item
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,Order_item AS item
FROM Orders  

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_item
ON Order_item.key.id = item.key.id

Result (works as expected):

+-----+---------+--------------+-------------+------------+
| Row | __oid__ |  item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
|   1 |       1 |            2 |     default |       #1.2 |
+-----+---------+--------------+-------------+------------+
|   2 |       2 |            4 |     default |       #1.4 |
+-----+---------+--------------+-------------+------------+
|   3 |       3 |            6 |     default |       #1.6 |
+-----+---------+--------------+-------------+------------+

A similar query, but this time one order with many items:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,Order_items AS items
FROM Orders  

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)

Error: IN subquery is not supported inside join predicate.

I actually expected this result:

+-----+---------+--------------+-------------+------------+
| Row | __oid__ |  item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
|   1 |       1 |            1 |     default |       #1.1 |
|     |         |            2 |     default |       #1.2 |
+-----+---------+--------------+-------------+------------+
|   2 |       2 |            3 |     default |       #1.3 |
|     |         |            4 |     default |       #1.4 |
+-----+---------+--------------+-------------+------------+
|   3 |       3 |            5 |     default |       #1.5 |
|     |         |            6 |     default |       #1.6 |
+-----+---------+--------------+-------------+------------+

How can I change the second query to get the expected result?

【Answer 1】:

Another option is to use a CROSS JOIN instead of a LEFT JOIN and move the IN condition into the WHERE clause:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders  

CROSS JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
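
One caveat worth noting: ARRAY_AGG does not guarantee element order unless an ORDER BY is given inside the call. If the items should come back sorted by id, as in the expected result in the question, a minimal sketch could look like the following. The sample data is trimmed to one order whose items are deliberately listed out of order; this variation is an illustration, not part of the original answer.

```sql
#standardSQL
WITH Orders AS (
  -- One order whose items are listed out of id order on purpose
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(2 AS id, "default" AS ns) AS key), STRUCT(STRUCT(1 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title
)

SELECT
   __oid__
   -- ORDER BY inside ARRAY_AGG makes the element order deterministic
  ,ARRAY_AGG(Order_items ORDER BY Order_items.key.id) AS items
FROM Orders

CROSS JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
```

Also keep in mind that CROSS JOIN plus WHERE behaves like an inner join: an order whose array matches no row in Items is dropped from the result, whereas a LEFT JOIN would keep it.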

【Comments】:

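While the solution Elliott suggested does return the same result, the CROSS JOIN approach ran much faster, both for this example and on my production data. So I have marked this answer as the accepted one.

【Answer 2】: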

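The problem is that BigQuery has no way to hash-partition on the join key from both sides (because the join is expressed as an IN condition). You can make this work by flattening the array on the left side and then aggregating the items from the right side: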

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders,
UNNEST(items) AS item

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id = item.key.id
GROUP BY __oid__

This looks like what you want anyway, since your original query would have returned items as a struct rather than as an array of structs.
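
To make the flattening step easier to follow, here is a minimal sketch of it in isolation, before any join or aggregation. It reuses the sample Orders data trimmed to a single order; the output alias item_id is introduced here only for illustration.

```sql
#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items
)

SELECT
   __oid__
  ,item.key.id AS item_id
FROM Orders
-- UNNEST turns each array element into its own row, correlated with its parent order
CROSS JOIN UNNEST(items) AS item
```

This returns one row per array element (two rows for order 1, with item_id 1 and 2). Once the array is flattened like this, item.key.id is a plain scalar column, so it can be used in an ordinary equality join predicate.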
