Big Query - Transpose array/json objects into columns
Posted: 2020-10-21 03:52:05
Question:
This question is a continuation of these two questions:
Big Query - Transpose arrays into colums
Big Query - Transpose Specific fields into Columns
We have a table in Big Query like the one below.
Input table:
Name | Question | Answer
-----+-----------+-------
Bob | Interest | ["a"]
Sue | Interest | ["a", "b"]
Joe | Interest | ["b"]
Joe | Gender | Male
Bob | Gender | Female
Sue | DOB | 2020-10-17
Bob | Others | "country" : "es", "language" : "ca"
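Note: all values in the Answer column are stringified, and the arrays / JSON objects are dynamic.
We want to convert the table above into the following format to make it BI/Visualisation friendly.
Desired table: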
+-------------------------------------------------------------+
| Name | a | b | c | Gender | DOB | country | language |
+-------------------------------------------------------------+
| Bob | 1 | 0 | 0 | Female | 2020-10-17 | es | ca |
| Sue | 1 | 1 | 0 | - | - | - | - |
| Joe | 0 | 1 | 0 | Male | - | - | - |
+-------------------------------------------------------------+
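Comments:
Have you at least tried something yourself? Almost everything has already been answered for you; it only takes a little extra effort! So, have you tried anything? What problem did you run into?
@mikhail I can extract the JSON values using the JSON_EXTRACT function. But extracting them dynamically and turning them into separate columns is where I am stuck.
I see. Anyway, take a look at the answer!

Answer 1:
Below is for BigQuery Standard SQL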
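#standardSQL
-- Step 1: normalize the raw rows into (name, question, answer) triples.
create temp table data as
-- Interest: strip brackets, quotes and spaces, then emit one row per array element.
select name, question, value as answer
from `project.dataset.table`,
  unnest(split(translate(answer, '[]" ', ''))) value
where question = 'Interest'
union all
-- Scalar questions (Gender, DOB, ...) pass through unchanged.
select name, question, answer
from `project.dataset.table`
where not question in ('Interest', 'Others')
union all
-- Others: each key:value pair becomes its own question/answer row.
select name,
  split(value, ':')[offset(0)] as question,
  split(value, ':')[offset(1)] as answer
from `project.dataset.table`,
  unnest(split(translate(answer, '" ', ''))) value
where question = 'Others';

-- Step 2: build the pivot query as a string from the distinct values, then run it.
EXECUTE IMMEDIATE (
  -- One 0/1 flag column per distinct Interest value.
  SELECT """
SELECT name, """ || STRING_AGG("""MAX(IF(answer = '""" || value || """', 1, 0)) AS """ || value, ', ')
  FROM (
    SELECT DISTINCT answer value FROM data
    WHERE question = 'Interest' ORDER BY value
  )) || (
  -- One value column per remaining question, with '-' where a name has no answer.
  SELECT ", " || STRING_AGG("""MAX(IF(question = '""" || value || """', answer, '-')) AS """ || value, ', ')
  FROM (
    SELECT DISTINCT question value FROM data
    WHERE question != 'Interest' ORDER BY value
  )) || """
FROM data
GROUP BY name
""";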
If applied to the sample data from your question:
with `project.dataset.table` AS (
select 'Bob' name, 'Interest' question, '["a"]' answer union all
select 'Sue', 'Interest', '["a", "b"]' union all
select 'Joe', 'Interest', '["b"]' union all
select 'Joe', 'Gender', 'Male' union all
select 'Bob', 'Gender', 'Female' union all
select 'Sue', 'DOB', '2020-10-17' union all
select 'Bob', 'Others', ' "country" : "es", "language" : "ca"'
)
The output is: [result table shown as an image in the original answer; not recoverable here]
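Note: the EXECUTE IMMEDIATE part of the script above is exactly the same as in the previous post. The only change is that the raw data is first prepared into the temp table data, instead of being handled inside EXECUTE IMMEDIATE.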
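To make the dynamic part easier to follow, here is a sketch of the static query that the EXECUTE IMMEDIATE string should expand to for the sample data. The inline `data` CTE is my own stand-in for the temp table (it lists the rows the normalization step would produce from the sample), and the column order is an assumption, since STRING_AGG is not strictly guaranteed to follow the subquery's ORDER BY.

```sql
#standardSQL
-- Stand-in for the temp table `data` (assumption: rows as produced by the normalization step).
WITH data AS (
  SELECT 'Bob' AS name, 'Interest' AS question, 'a' AS answer UNION ALL
  SELECT 'Sue', 'Interest', 'a' UNION ALL
  SELECT 'Sue', 'Interest', 'b' UNION ALL
  SELECT 'Joe', 'Interest', 'b' UNION ALL
  SELECT 'Joe', 'Gender', 'Male' UNION ALL
  SELECT 'Bob', 'Gender', 'Female' UNION ALL
  SELECT 'Sue', 'DOB', '2020-10-17' UNION ALL
  SELECT 'Bob', 'country', 'es' UNION ALL
  SELECT 'Bob', 'language', 'ca'
)
-- The generated pivot: one MAX(IF(...)) column per distinct value.
SELECT
  name,
  MAX(IF(answer = 'a', 1, 0)) AS a,
  MAX(IF(answer = 'b', 1, 0)) AS b,
  MAX(IF(question = 'DOB', answer, '-')) AS DOB,
  MAX(IF(question = 'Gender', answer, '-')) AS Gender,
  MAX(IF(question = 'country', answer, '-')) AS country,
  MAX(IF(question = 'language', answer, '-')) AS language
FROM data
GROUP BY name
```

Because the column list is derived from the distinct values in `data`, a new Interest value or a new JSON key adds a column without any change to the script. A column such as `c` from the desired table only appears if some row actually contains that value.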
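Comments:
The query above gives me the expected result. However, in the "Others" case (JSON objects), some of the values are empty JSON strings. As a result, the offset raises an Array index X is out of bounds (overflow) error. So I replaced it with SAFE_OFFSET() and now it works fine. How can I add a condition to the current query to ignore empty values?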
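One possible way to skip the empty values, sketched under assumptions: I am guessing that an "empty JSON string" looks like `''` or `'{}'`, and the `raw` CTE below is a hypothetical stand-in for `project.dataset.table`. The idea is to keep SAFE_OFFSET (which returns NULL instead of raising the out-of-bounds error) and then filter out pairs that have no value part.

```sql
#standardSQL
-- Hypothetical sample rows, including two assumed shapes of "empty" JSON.
WITH raw AS (
  SELECT 'Bob' AS name, 'Others' AS question,
         ' "country" : "es", "language" : "ca"' AS answer UNION ALL
  SELECT 'Sue', 'Others', '' UNION ALL
  SELECT 'Joe', 'Others', '{}'
)
SELECT
  name,
  SPLIT(value, ':')[SAFE_OFFSET(0)] AS question,
  SPLIT(value, ':')[SAFE_OFFSET(1)] AS answer
FROM raw,
  -- Also strip braces, so '{}' collapses to an empty string.
  UNNEST(SPLIT(TRANSLATE(answer, '{}" ', ''))) AS value
WHERE question = 'Others'
  -- Drop pairs with a missing (NULL) or empty value part.
  AND IFNULL(SPLIT(value, ':')[SAFE_OFFSET(1)], '') != ''
```

The same WHERE condition could be added to the third branch of the `create temp table data` statement. Note that this split-based parsing would still break on values that themselves contain a comma or a colon.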