Flatten BigQuery nested field contents into new columns instead of rows
Published: 2016-08-08 22:41:56
Question:
I have some BigQuery data in the following format:
{
  "thing": [
    {
      "name": "gameLost",
      "params": [
        {
          "key": "total_games",
          "val": {
            "str_val": "3",
            "int_val": null
          }
        },
        {
          "key": "games_won",
          "val": {
            "str_val": "2",
            "int_val": null
          }
        },
        {
          "key": "game_time",
          "val": {
            "str_val": "44",
            "int_val": null
          }
        }
      ],
      "dt_a": "1470625311138000",
      "dt_b": "1470620345566000"
    }
  ]
}
I know that the FLATTEN() function produces 3 rows of output, like this:
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
| thing.name | thing.dt_a       | thing.dt_b       | thing.params.key   | thing.params.val.str_val | thing.params.val.int_val |
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
| gameLost   | 1470625311138000 | 1470620345566000 | total_games        | 3                        | null                     |
| gameLost   | 1470625311138000 | 1470620345566000 | games_won          | 2                        | null                     |
| gameLost   | 1470625311138000 | 1470620345566000 | game_time          | 44                       | null                     |
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
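The higher-level keys/values are repeated on a new row for each deeper-level object.
However, I need the deeper-level keys/values to be output as entirely new columns rather than as repeated fields, so that the result looks like this: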
+------------+------------------+------------------+--------------------+-----------+-----------+
| thing.name | thing.dt_a       | thing.dt_b       | total_games_played | games_won | game_time |
+------------+------------------+------------------+--------------------+-----------+-----------+
| gameLost   | 1470625311138000 | 1470620345566000 | 3                  | 2         | 44        |
+------------+------------------+------------------+--------------------+-----------+-----------+
How can I do this? Thanks!
Answer 1:
Standard SQL makes this easier to express (uncheck "Use Legacy SQL" under "Show Options"):
WITH T AS (
  SELECT STRUCT(
    "gameLost" AS name,
    ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
      STRUCT("total_games", STRUCT("3", NULL)),
      STRUCT("games_won", STRUCT("2", NULL)),
      STRUCT("game_time", STRUCT("44", NULL))] AS params,
    1470625311138000 AS dt_a,
    1470620345566000 AS dt_b) AS thing
)
SELECT
  (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
  thing.params[OFFSET(0)].val.str_val AS total_games_played,
  thing.params[OFFSET(1)].val.str_val AS games_won,
  thing.params[OFFSET(2)].val.str_val AS game_time
FROM T;
+-------------------------------------------------------------------------+--------------------+-----------+-----------+
| thing                                                                   | total_games_played | games_won | game_time |
+-------------------------------------------------------------------------+--------------------+-----------+-----------+
| {"name":"gameLost","dt_a":"1470625311138000","dt_b":"1470620345566000"} | 3                  | 2         | 44        |
+-------------------------------------------------------------------------+--------------------+-----------+-----------+
If you don't know the order of the keys in the array, you can use subselects to extract the relevant values by key:
WITH T AS (
  SELECT STRUCT(
    "gameLost" AS name,
    ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
      STRUCT("total_games", STRUCT("3", NULL)),
      STRUCT("games_won", STRUCT("2", NULL)),
      STRUCT("game_time", STRUCT("44", NULL))] AS params,
    1470625311138000 AS dt_a,
    1470620345566000 AS dt_b) AS thing
)
SELECT
  (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
  (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "total_games") AS total_games_played,
  (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "games_won") AS games_won,
  (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "game_time") AS game_time
FROM T;
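The key-based lookup in the query above can be checked outside BigQuery with a small model of the same logic. The following is only a Python sketch, not BigQuery code: the `record` dict mirrors the sample data from the question, and `param_value` and `flatten_to_columns` are hypothetical helper names introduced here for illustration.

```python
# Sample record mirroring the question's data (illustrative, not read from BigQuery).
record = {
    "name": "gameLost",
    "params": [
        {"key": "total_games", "val": {"str_val": "3", "int_val": None}},
        {"key": "games_won", "val": {"str_val": "2", "int_val": None}},
        {"key": "game_time", "val": {"str_val": "44", "int_val": None}},
    ],
    "dt_a": 1470625311138000,
    "dt_b": 1470620345566000,
}


def param_value(params, key):
    """Return str_val of the entry whose key matches, or None if there is none.

    Models: (SELECT val.str_val FROM UNNEST(params) WHERE key = ...).
    """
    matches = [p["val"]["str_val"] for p in params if p["key"] == key]
    if len(matches) > 1:
        # A scalar subquery that yields more than one row fails in BigQuery too.
        raise ValueError(f"more than one value for key {key!r}")
    return matches[0] if matches else None


def flatten_to_columns(rec):
    """Build one flat row: the top-level fields plus one column per known key."""
    row = {k: v for k, v in rec.items() if k != "params"}
    row["total_games_played"] = param_value(rec["params"], "total_games")
    row["games_won"] = param_value(rec["params"], "games_won")
    row["game_time"] = param_value(rec["params"], "game_time")
    return row


print(flatten_to_columns(record))
```

Because the lookup goes by key rather than by position, reordering the entries in `params` does not change the resulting row.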
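Comments:
Love the new features of Standard SQL! Really do! That said, I don't think you can rely on order/offset to retrieve the value for a key unless the keys are guaranteed to be in a specific order, which in my experience is usually not the case.
Thanks! I updated my answer to show how that works as well.
Thanks @MikhailBerlyant and @ElliottBrossard! You got me started, but I ran into more problems when trying to apply this to my more complex data source. I opened a new related question here: ***.com/questions/38860534/…
Answer 2:
Try the following (Legacy SQL):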
SELECT
  thing.name AS name,
  thing.dt_a AS dt_a,
  thing.dt_b AS dt_b,
  MAX(IF(thing.params.key = "total_games", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS total_games_played,
  MAX(IF(thing.params.key = "games_won", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS games_won,
  MAX(IF(thing.params.key = "game_time", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS game_time
FROM YourTable
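The `MAX(IF(...)) WITHIN RECORD` pattern in the query above is a conditional aggregation over the repeated `params` field of each record. A rough Python sketch of that logic follows, assuming every `str_val` is a non-negative integer string; `max_if` is a hypothetical helper name, and the key names are taken from the question's sample JSON.

```python
# Repeated field of a single record, simplified to the two columns the query reads.
params = [
    {"key": "total_games", "str_val": "3"},
    {"key": "games_won", "str_val": "2"},
    {"key": "game_time", "str_val": "44"},
]


def max_if(params, key):
    """Model MAX(IF(key = <key>, INTEGER(str_val), 0)) within one record.

    Entries with a different key contribute 0, so the maximum is the
    matching value. An empty params list yields None (NULL in SQL).
    """
    return max(
        (int(p["str_val"]) if p["key"] == key else 0 for p in params),
        default=None,
    )


print(max_if(params, "total_games"), max_if(params, "games_won"), max_if(params, "game_time"))
```

One consequence of the `0` fallback is that a key that is absent from `params` produces 0 rather than NULL, so a missing value cannot be told apart from a stored zero.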
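For Standard SQL, you can try the following (inspired by Elliott's answer, with one important difference: the array is ordered by key, so the order of the values is guaranteed):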
WITH Temp AS (
  SELECT
    (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
    ARRAY(SELECT val.str_val AS val FROM UNNEST(thing.params) ORDER BY key) AS params
  FROM YourTable
)
SELECT
  thing,
  params[OFFSET(2)] AS total_games_played,
  params[OFFSET(1)] AS games_won,
  params[OFFSET(0)] AS game_time
FROM Temp
Note: if params contains other keys, you should add a WHERE clause to the SELECT inside ARRAY so that only the expected keys are kept.
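To see why the WHERE clause matters, the sort-then-index approach can be modeled in a few lines of Python. This is only an illustrative sketch: `values_sorted_by_key` is a hypothetical helper, and the extra `level` key is made up to show how an unexpected key shifts the offsets.

```python
def values_sorted_by_key(params, wanted=None):
    """Model ARRAY(SELECT val.str_val FROM UNNEST(params) [WHERE ...] ORDER BY key).

    When `wanted` is given it plays the role of the WHERE clause.
    """
    rows = [p for p in params if wanted is None or p["key"] in wanted]
    return [p["val"]["str_val"] for p in sorted(rows, key=lambda p: p["key"])]


params = [
    {"key": "total_games", "val": {"str_val": "3"}},
    {"key": "games_won", "val": {"str_val": "2"}},
    {"key": "game_time", "val": {"str_val": "44"}},
]

# Sorted key order is game_time, games_won, total_games,
# so offset 2 holds the total_games value.
ordered = values_sorted_by_key(params)

# A hypothetical extra key sorts between games_won and total_games,
# so offset 2 no longer points at total_games.
extra = params + [{"key": "level", "val": {"str_val": "7"}}]
shifted = values_sorted_by_key(extra)

# Filtering to the expected keys restores stable offsets.
wanted = {"total_games", "games_won", "game_time"}
filtered = values_sorted_by_key(extra, wanted)

print(ordered, shifted, filtered)
```

Looking values up by key, as in the subselect approach from the first answer, avoids this dependence on position altogether.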