BigQuery: JOIN ON with repeated / array STRUCT field in Standard SQL?
【Posted】: 2018-07-02 12:46:30
【Question】:
I basically have two tables, Orders and Items. Because these tables were imported from Google Cloud Datastore backup files, references are not made through a simple ID field. A one-to-one relationship is stored as a <STRUCT> whose id field holds the actual unique ID I want to match on. For a one-to-many relationship (REPEATED), the schema uses an array of <STRUCT>.
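I can query the one-to-one relationship with a LEFT OUTER JOIN, and I also know how to join on a non-repeated struct and on a repeated string or int, but I cannot get it to work with a repeated struct.
One order, one item: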
#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, STRUCT(STRUCT(2 AS id, "default" AS ns) AS key) AS item UNION ALL
  SELECT 2 AS __oid__, STRUCT(STRUCT(4 AS id, "default" AS ns) AS key) AS item UNION ALL
  SELECT 3 AS __oid__, STRUCT(STRUCT(6 AS id, "default" AS ns) AS key) AS item
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)
SELECT
  __oid__
  ,Order_item AS item
FROM Orders
LEFT OUTER JOIN (
  SELECT
    key
    ,title
  FROM Items
) Order_item
ON Order_item.key.id = item.key.id
Result (works as expected):
+-----+---------+--------------+-------------+------------+
| Row | __oid__ | item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
| 1 | 1 | 2 | default | #1.2 |
+-----+---------+--------------+-------------+------------+
| 2 | 2 | 4 | default | #1.4 |
+-----+---------+--------------+-------------+------------+
| 3 | 3 | 6 | default | #1.6 |
+-----+---------+--------------+-------------+------------+
A similar query, but this time one order with many items:
#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)
SELECT
  __oid__
  ,Order_items AS items
FROM Orders
LEFT OUTER JOIN (
  SELECT
    key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
Error: IN subquery is not supported inside join predicate.
This is the result I was actually expecting:
+-----+---------+--------------+-------------+------------+
| Row | __oid__ | item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
| 1 | 1 | 1 | default | #1.1 |
| | | 2 | default | #1.2 |
+-----+---------+--------------+-------------+------------+
| 2 | 2 | 3 | default | #1.3 |
| | | 4 | default | #1.4 |
+-----+---------+--------------+-------------+------------+
| 3 | 3 | 5 | default | #1.5 |
| | | 6 | default | #1.6 |
+-----+---------+--------------+-------------+------------+
How do I change the second query to get the expected result?
【Answer 1】: Another option is to use a CROSS JOIN instead of a LEFT JOIN and move the IN condition into the WHERE clause:
#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)
SELECT
  __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders
CROSS JOIN (
  SELECT
    key
    ,title
  FROM Items
) Order_items
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
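One thing to keep in mind with this approach: a CROSS JOIN filtered in the WHERE clause behaves like an inner join, so an order whose items match nothing in Items drops out of the result. That is different from the LEFT OUTER JOIN in the question. The sketch below illustrates this with made-up data; the item id 99 and the two-row Orders table are assumptions added for illustration and are not part of the original question.

```sql
#standardSQL
-- Hypothetical data: order 2 references item id 99, which is not in Items.
WITH Orders AS (
  SELECT 1 AS __oid__, [STRUCT(STRUCT(1 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, [STRUCT(STRUCT(99 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title
)
SELECT
  __oid__,
  ARRAY_AGG(Order_items) AS items
FROM Orders
CROSS JOIN Items AS Order_items
-- The filter removes every (order, item) pair without a matching id,
-- so order 2 has no surviving rows and does not appear in the output.
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
```

If every order is guaranteed to reference existing items, as in the sample data of the question, this difference does not show up.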
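【Comments】:
Although the solution suggested by Elliott does return the same result, the CROSS JOIN approach ran much faster, both for this example and on my production data, so I have marked this answer as the accepted one.
【Answer 2】: The problem is that BigQuery cannot hash-partition the join key from both sides, because the join is expressed as an IN condition. You can make this work by flattening the array on the left side and then aggregating the items from the right side: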
#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)
SELECT
  __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders,
  UNNEST(items) AS item
LEFT OUTER JOIN (
  SELECT
    key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id = item.key.id
GROUP BY __oid__
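Either way, this looks like what you want, since your original query would have returned items as a single struct rather than as an array of structs.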
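A possible refinement of the flatten-then-aggregate query, sketched under a few assumptions: ARRAY_AGG makes no ordering guarantee unless an ORDER BY is given inside the call, and BigQuery's documentation states that an error is raised if an array in the final query result contains a NULL element, which could happen here when an order references an item missing from Items. The variant below sorts the aggregated items and skips unmatched ones with IGNORE NULLS. The item id 99 and the reduced tables are hypothetical test data, and the explicit IF(...) guard is a cautious choice rather than something the original answer uses.

```sql
#standardSQL
-- Hypothetical data: order 2 references item id 99, which is not in Items.
WITH Orders AS (
  SELECT 1 AS __oid__, [STRUCT(STRUCT(2 AS id, "default" AS ns) AS key), STRUCT(STRUCT(1 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, [STRUCT(STRUCT(99 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title
)
SELECT
  __oid__,
  ARRAY_AGG(
    -- Turn unmatched rows into an explicit NULL so IGNORE NULLS can drop them.
    IF(Order_items.key.id IS NULL, NULL, Order_items)
    IGNORE NULLS
    -- Sort the aggregated items by the id referenced in the order.
    ORDER BY item.key.id
  ) AS items
FROM Orders,
  UNNEST(items) AS item
LEFT OUTER JOIN Items AS Order_items
  ON Order_items.key.id = item.key.id
GROUP BY __oid__
```

Because the join is still a LEFT OUTER JOIN, order 2 stays in the result even though none of its items could be resolved; only the unresolved entries are left out of its array.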