在 BigQuery 中,将对象的字符串化数组转换为非字符串化
Posted
技术标签:
【中文标题】在 BigQuery 中,将对象的字符串化数组转换为非字符串化【英文标题】:In BigQuery, convert stringified array of objects into non-stringified 【发布时间】:2020-07-27 02:14:11 【问题描述】:我正在将 .json
数据提取到 Google BigQuery 中,并且在提取时,来自 .json
的 arrays
和 objects
的数据类型都被转换为 string
列。 BigQuery 中的数据如下所示:
select 1 as id, '[]' as stringCol1, '[]' as stringCol2 union all
select 2 as id, null as stringCol1, null as stringCol2 union all
select 3 as id, "'game': '22', 'year': 'sophomore'" as stringCol1, "['teamName': 'teamA', 'teamAge': 37, 'teamName': 'teamB', 'teamAge': 32]" as stringCol2 union all
select 4 as id, "'game': '17', 'year': 'freshman'" as stringCol1, "['teamName': 'teamA', 'teamAge': 32, 'teamName': 'teamB', 'teamAge': 33]" as stringCol2 union all
select 5 as id, "'game': '9', 'year': 'senior'" as stringCol1, "['teamName': 'teamC', 'teamAge': 31, 'teamName': 'teamD', 'teamAge': 17]" as stringCol2 union all
select 6 as id, "'game': '234', 'year': 'junior'" as stringCol1, "['teamName': 'teamC', 'teamAge': 42, 'teamName': 'teamD', 'teamAge': 25]" as stringCol2
数据有点乱。
在stringCol1
中,有null
和'[]'
缺失数据的值。我想从这个字符串化对象创建两列 game
和 year
。
对于stringCol2
,这始终是一个包含两个对象的数组,具有相同的键(teamName
和teamAge
,在这种情况下)。然后需要将其转换为 4 列 teamName1
、teamAge1
、teamName2
、teamAge2
。
This similar post 解决了将基本字符串化数组转换为非字符串化数组的问题,但这里的示例稍微复杂一些。特别是,其他帖子中的解决方案在这种情况下不起作用。
【问题讨论】:
【参考方案1】:以下是 BigQuery 标准 SQL
#standardSQL
SELECT id,
JSON_EXTRACT_SCALAR(stringCol1, '$.game') AS game,
JSON_EXTRACT_SCALAR(stringCol1, '$.year') AS year,
JSON_EXTRACT_SCALAR(t1, '$.teamName') AS teamName1,
JSON_EXTRACT_SCALAR(t1, '$.teamAge') AS teamAge1,
JSON_EXTRACT_SCALAR(t2, '$.teamName') AS teamName2,
JSON_EXTRACT_SCALAR(t2, '$.teamAge') AS teamAge2
FROM `project.dataset.table`,
UNNEST([STRUCT(
JSON_EXTRACT_ARRAY(stringCol2)[SAFE_OFFSET(0)] AS t1,
JSON_EXTRACT_ARRAY(stringCol2)[SAFE_OFFSET(1)] AS t2
)])
如果适用于您问题中的示例数据
WITH `project.dataset.table` AS (
SELECT 1 AS id, '[]' AS stringCol1, '[]' AS stringCol2 UNION ALL
SELECT 2 AS id, NULL AS stringCol1, NULL AS stringCol2 UNION ALL
SELECT 3 AS id, "'game': '22', 'year': 'sophomore'" AS stringCol1, "['teamName': 'teamA', 'teamAge': 37, 'teamName': 'teamB', 'teamAge': 32]" AS stringCol2 UNION ALL
SELECT 4 AS id, "'game': '17', 'year': 'freshman'" AS stringCol1, "['teamName': 'teamA', 'teamAge': 32, 'teamName': 'teamB', 'teamAge': 33]" AS stringCol2 UNION ALL
SELECT 5 AS id, "'game': '9', 'year': 'senior'" AS stringCol1, "['teamName': 'teamC', 'teamAge': 31, 'teamName': 'teamD', 'teamAge': 17]" AS stringCol2 UNION ALL
SELECT 6 AS id, "'game': '234', 'year': 'junior'" AS stringCol1, "['teamName': 'teamC', 'teamAge': 42, 'teamName': 'teamD', 'teamAge': 25]" AS stringCol2
)
输出是
Row id game year teamName1 teamAge1 teamName2 teamAge2
1 1 null null null null null null
2 2 null null null null null null
3 3 22 sophomore teamA 37 teamB 32
4 4 17 freshman teamA 32 teamB 33
5 5 9 senior teamC 31 teamD 17
6 6 234 junior teamC 42 teamD 25
上面可以有很多变体来提高例如可读性
#standardSQL
SELECT id,
JSON_EXTRACT_SCALAR(stringCol1, '$.game') AS game,
JSON_EXTRACT_SCALAR(stringCol1, '$.year') AS year,
JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(0)], '$.teamName') AS teamName1,
JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(0)], '$.teamAge') AS teamAge1,
JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(1)], '$.teamName') AS teamName2,
JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(1)], '$.teamAge') AS teamAge2
FROM `project.dataset.table`,
UNNEST([STRUCT(JSON_EXTRACT_ARRAY(stringCol2) AS t)])
【讨论】:
非常有帮助,谢谢。json_extract_*
似乎是 BigQuery 中的一个强大功能以上是关于在 BigQuery 中,将对象的字符串化数组转换为非字符串化的主要内容,如果未能解决你的问题,请参考以下文章