In BigQuery, convert stringified array of objects into non-stringified

【Posted】: 2020-07-27 02:14:11

【Question】:

I am ingesting .json data into Google BigQuery, and on ingestion the arrays and objects in the .json are all converted into string columns. The data in BigQuery looks like this:

select 1 as id, '[]' as stringCol1, '[]' as stringCol2 union all
select 2 as id, null as stringCol1, null as stringCol2 union all
select 3 as id, "{'game': '22', 'year': 'sophomore'}" as stringCol1, "[{'teamName': 'teamA', 'teamAge': 37}, {'teamName': 'teamB', 'teamAge': 32}]" as stringCol2 union all
select 4 as id, "{'game': '17', 'year': 'freshman'}" as stringCol1, "[{'teamName': 'teamA', 'teamAge': 32}, {'teamName': 'teamB', 'teamAge': 33}]" as stringCol2 union all
select 5 as id, "{'game': '9', 'year': 'senior'}" as stringCol1, "[{'teamName': 'teamC', 'teamAge': 31}, {'teamName': 'teamD', 'teamAge': 17}]" as stringCol2 union all
select 6 as id, "{'game': '234', 'year': 'junior'}" as stringCol1, "[{'teamName': 'teamC', 'teamAge': 42}, {'teamName': 'teamD', 'teamAge': 25}]" as stringCol2

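The data is a bit messy.

In stringCol1, missing data appears as both null and '[]'. I would like to create two columns, game and year, from this stringified object. stringCol2 is always an array of two objects with the same keys (teamName and teamAge in this case). It then needs to be converted into 4 columns: teamName1, teamAge1, teamName2, teamAge2.

This similar post addresses converting a basic stringified array into a non-stringified array, but the example here is a bit more complex. In particular, the solution from the other post does not work in this case.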

【Comments】:

【Answer 1】:

Below is for BigQuery Standard SQL

#standardSQL
SELECT id,
  -- stringCol1 holds one stringified object; NULL and '[]' rows yield NULL here
  JSON_EXTRACT_SCALAR(stringCol1, '$.game') AS game,
  JSON_EXTRACT_SCALAR(stringCol1, '$.year') AS year,
  JSON_EXTRACT_SCALAR(t1, '$.teamName') AS teamName1,
  JSON_EXTRACT_SCALAR(t1, '$.teamAge') AS teamAge1,
  JSON_EXTRACT_SCALAR(t2, '$.teamName') AS teamName2,
  JSON_EXTRACT_SCALAR(t2, '$.teamAge') AS teamAge2
FROM `project.dataset.table`,
-- Cross join with a one-element array of STRUCT so each array element gets a name;
-- SAFE_OFFSET returns NULL instead of failing when the array is empty or NULL
UNNEST([STRUCT(
  JSON_EXTRACT_ARRAY(stringCol2)[SAFE_OFFSET(0)] AS t1,
  JSON_EXTRACT_ARRAY(stringCol2)[SAFE_OFFSET(1)] AS t2
)])

If applied to the sample data from your question

WITH `project.dataset.table` AS (
  SELECT 1 AS id, '[]' AS stringCol1, '[]' AS stringCol2 UNION ALL
  SELECT 2 AS id, NULL AS stringCol1, NULL AS stringCol2 UNION ALL
  SELECT 3 AS id, "{'game': '22', 'year': 'sophomore'}" AS stringCol1, "[{'teamName': 'teamA', 'teamAge': 37}, {'teamName': 'teamB', 'teamAge': 32}]" AS stringCol2 UNION ALL
  SELECT 4 AS id, "{'game': '17', 'year': 'freshman'}" AS stringCol1, "[{'teamName': 'teamA', 'teamAge': 32}, {'teamName': 'teamB', 'teamAge': 33}]" AS stringCol2 UNION ALL
  SELECT 5 AS id, "{'game': '9', 'year': 'senior'}" AS stringCol1, "[{'teamName': 'teamC', 'teamAge': 31}, {'teamName': 'teamD', 'teamAge': 17}]" AS stringCol2 UNION ALL
  SELECT 6 AS id, "{'game': '234', 'year': 'junior'}" AS stringCol1, "[{'teamName': 'teamC', 'teamAge': 42}, {'teamName': 'teamD', 'teamAge': 25}]" AS stringCol2
)
-- ...followed by the same SELECT statement as in the first query of this answer

the output is

Row id  game    year        teamName1   teamAge1    teamName2   teamAge2     
1   1   null    null        null        null        null        null     
2   2   null    null        null        null        null        null     
3   3   22      sophomore   teamA       37          teamB       32   
4   4   17      freshman    teamA       32          teamB       33   
5   5   9       senior      teamC       31          teamD       17   
6   6   234     junior      teamC       42          teamD       25      

There are many possible variations of the above, for example to improve readability

#standardSQL
SELECT id,
  JSON_EXTRACT_SCALAR(stringCol1, '$.game') AS game,
  JSON_EXTRACT_SCALAR(stringCol1, '$.year') AS year,
  JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(0)], '$.teamName') AS teamName1,
  JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(0)], '$.teamAge') AS teamAge1,
  JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(1)], '$.teamName') AS teamName2,
  JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(1)], '$.teamAge') AS teamAge2
FROM `project.dataset.table`,
UNNEST([STRUCT(JSON_EXTRACT_ARRAY(stringCol2) AS t)])
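
One thing the queries above leave open is typing: JSON_EXTRACT_SCALAR returns STRING, so game, teamAge1 and teamAge2 come back as strings. If numeric columns are wanted, a minimal sketch could wrap them in SAFE_CAST. The single inline row and the CTE name sample are illustrative stand-ins for `project.dataset.table`, and treating game as an integer is an assumption, not something the question states.

```sql
#standardSQL
-- Sketch only: `sample` is a hypothetical one-row stand-in for the real table.
WITH sample AS (
  SELECT 3 AS id,
    "{'game': '22', 'year': 'sophomore'}" AS stringCol1,
    "[{'teamName': 'teamA', 'teamAge': 37}, {'teamName': 'teamB', 'teamAge': 32}]" AS stringCol2
)
SELECT id,
  -- SAFE_CAST yields NULL instead of raising an error when the text is not a valid integer
  SAFE_CAST(JSON_EXTRACT_SCALAR(stringCol1, '$.game') AS INT64) AS game,
  JSON_EXTRACT_SCALAR(stringCol1, '$.year') AS year,
  JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(0)], '$.teamName') AS teamName1,
  SAFE_CAST(JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(0)], '$.teamAge') AS INT64) AS teamAge1,
  JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(1)], '$.teamName') AS teamName2,
  SAFE_CAST(JSON_EXTRACT_SCALAR(t[SAFE_OFFSET(1)], '$.teamAge') AS INT64) AS teamAge2
FROM sample,
UNNEST([STRUCT(JSON_EXTRACT_ARRAY(stringCol2) AS t)])
```

Rows where the source column is NULL or '[]' should still produce NULL in every extracted column, since both the extraction and SAFE_CAST pass NULL through.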

【Comments】:

Very helpful, thank you. json_extract_* seems to be a powerful feature in BigQuery.
