How to query on fields from nested records without referring to the parent records in BigQuery?
[Posted]: 2020-07-01 15:03:40
[Question]:
My data is structured as follows:
{
  "results": {
    "A": {"first": 1, "second": 2, "third": 3},
    "B": {"first": 4, "second": 5, "third": 6},
    "C": {"first": 7, "second": 8, "third": 9},
    "D": {"first": 1, "second": 2, "third": 3},
    ...
  },
  ...
}
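That is, nested records in which the lowest level has the same schema for every record in the level above. The schema would look similar to this: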
results RECORD NULLABLE
results.A RECORD NULLABLE
results.A.first INTEGER NULLABLE
results.A.second INTEGER NULLABLE
results.A.third INTEGER NULLABLE
results.B RECORD NULLABLE
results.B.first INTEGER NULLABLE
...
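Is there a way to run queries (e.g. aggregations) in BigQuery on the lowest-level fields without knowing the keys of their (direct) parent level? In other words, can I query on first for all records in results without having to specify A, B, ... in my query?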
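For example, I would like to achieve something like
SELECT SUM(results.*.first) FROM table
in order to get 1+4+7+1 = 13, but SELECT results.*.first is not supported.
(I have tried playing with STRUCTs, but have not gotten far.)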
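[Comments]:
How to create a Minimal, Reproducible Example. The schema of the table is still unclear! Is it a string field containing JSON? Or a repeated record? Please provide the schema. The best way is to provide a WITH statement that reproduces your data, so that we can help effectively.
[Answer 1]:
The following trick works in BigQuery Standard SQL: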
#standardSQL
SELECT id, (
  SELECT AS STRUCT
    SUM(first) AS sum_first,
    SUM(second) AS sum_second,
    SUM(third) AS sum_third
  FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
You can test it with the dummy/sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 AS id, STRUCT(
    STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
    STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
    STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
    STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
  ) AS results
)
SELECT id, (
  SELECT AS STRUCT
    SUM(first) AS sum_first,
    SUM(second) AS sum_second,
    SUM(third) AS sum_third
  FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
with output
Row  id  sum_first  sum_second  sum_third
1    1   13         17          21
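To make the trick easier to follow outside BigQuery, here is a minimal Python sketch of what the query does with the sample row: it puts the four sibling records into one list (the role of [a]||[b]||[c]||[d]) and sums each field over that list. The variable names are illustrative only and are not part of the answer.

```python
# Sample row from the question, written as a plain Python dict.
results = {
    "A": {"first": 1, "second": 2, "third": 3},
    "B": {"first": 4, "second": 5, "third": 6},
    "C": {"first": 7, "second": 8, "third": 9},
    "D": {"first": 1, "second": 2, "third": 3},
}

# Equivalent of [a]||[b]||[c]||[d]: one array holding the sibling records.
children = [results["A"], results["B"], results["C"], results["D"]]

# Equivalent of SELECT AS STRUCT SUM(first), SUM(second), SUM(third).
sums = {
    f"sum_{field}": sum(child[field] for child in children)
    for field in ("first", "second", "third")
}

print(sums)  # {'sum_first': 13, 'sum_second': 17, 'sum_third': 21}
```

Note that, as in the SQL, the parent keys still have to be listed once when the array is built.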
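[Discussion]:
[Answer 2]:
"Is there a way to run queries (e.g. aggregations) in BigQuery on the lowest-level fields without knowing the keys of their (direct) parent level?"
The following is BigQuery Standard SQL that completely avoids referencing the parent records (A, B, C, D, etc.):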
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
  -- Sum the values of every "field_name" key found in the child records
  SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
  -- Each entry is the text between the braces of one child record
  FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":\{(.*?)\}')) entry,
  UNNEST(SPLIT(entry)) kv
  WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
SELECT id,
  Nested_SUM(results, 'first') AS first_sum,
  Nested_SUM(results, 'second') AS second_sum,
  Nested_SUM(results, 'third') AS third_sum,
  Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
Applied to the sample data from your question, as in the example below:
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
  -- Sum the values of every "field_name" key found in the child records
  SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
  -- Each entry is the text between the braces of one child record
  FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":\{(.*?)\}')) entry,
  UNNEST(SPLIT(entry)) kv
  WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
WITH `project.dataset.table` AS (
  SELECT 1 AS id, STRUCT(
    STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
    STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
    STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
    STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
  ) AS results
)
SELECT id,
  Nested_SUM(results, 'first') AS first_sum,
  Nested_SUM(results, 'second') AS second_sum,
  Nested_SUM(results, 'third') AS third_sum,
  Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
the output is
Row  id  first_sum  second_sum  third_sum  forth_sum
1    1   13         17          21         null
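The string handling inside Nested_SUM can be hard to follow, so here is a rough Python sketch of the same steps: serialize the record to compact JSON, pull out the text between the braces of each child record, split it into key:value pairs, and sum the values whose key matches. It assumes the regular expression is meant to capture the contents of each child record's braces; nested_sum is a hypothetical helper written for illustration and is not a BigQuery API.

```python
import json
import re


def nested_sum(entries, field_name):
    """Mimic the Nested_SUM temp function on a plain Python dict."""
    # Compact JSON without spaces, similar to what TO_JSON_STRING returns.
    text = json.dumps(entries, separators=(",", ":"))
    total = None  # SUM over zero rows is NULL in SQL; None plays that role
    # Each match is the body of one child record,
    # e.g. '"first":1,"second":2,"third":3'
    for entry in re.findall(r'":\{(.*?)\}', text):
        for kv in entry.split(","):
            key, _, value = kv.partition(":")
            if key.strip('"') == field_name:
                total = (total or 0) + int(value)
    return total


results = {
    "A": {"first": 1, "second": 2, "third": 3},
    "B": {"first": 4, "second": 5, "third": 6},
    "C": {"first": 7, "second": 8, "third": 9},
    "D": {"first": 1, "second": 2, "third": 3},
}

print(nested_sum(results, "first"))  # 13
print(nested_sum(results, "forth"))  # None
```

Because this works on the serialized text, it relies on the values being simple scalars: a string value containing a comma, colon or brace, or a deeper level of nesting, would break the splitting. The same caveat should apply to the SQL version.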
[Discussion]:
[Answer 3]:
I modified Mikhail's answer to support grouping on the values of the lowest-level fields:
#standardSQL
CREATE TEMP FUNCTION Nested_AGGREGATE(entries ANY TYPE, field_name STRING) AS ((
  -- Return one (value, count) struct per distinct value of "field_name"
  SELECT ARRAY(
    SELECT AS STRUCT TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') AS value, COUNT(SPLIT(kv, ':')[OFFSET(1)]) AS count
    -- Each entry is the text between the braces of one child record
    FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":\{(.*?)\}')) entry,
    UNNEST(SPLIT(entry)) kv
    WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
    GROUP BY TRIM(SPLIT(kv, ':')[OFFSET(1)], '"')
  )
));
SELECT id,
  Nested_AGGREGATE(results, 'first') AS first_agg,
  Nested_AGGREGATE(results, 'second') AS second_agg,
  Nested_AGGREGATE(results, 'third') AS third_agg,
FROM `project.dataset.table`
Output for
WITH `project.dataset.table` AS (SELECT 1 AS id, STRUCT( STRUCT(1 AS first, 2 AS second, 3 AS third) AS A, STRUCT(4 AS first, 5 AS second, 6 AS third) AS B, STRUCT(7 AS first, 8 AS second, 9 AS third) AS C, STRUCT(1 AS first, 2 AS second, 3 AS third) AS D) AS results )
:
Row  id  first_agg.value  first_agg.count  second_agg.value  second_agg.count  third_agg.value  third_agg.count
1    1   1                2                2                 2                 3                2
         4                1                5                 1                 6                1
         7                1                8                 1                 9                1
你可能想接受我各自的回答,因为它看起来真的很有帮助!以上是关于如何在不参考 BigQuery 中的父记录的情况下查询嵌套记录中的字段?的主要内容,如果未能解决你的问题,请参考以下文章