Flatten BigQuery nested field contents into new columns instead of rows

【Posted】: 2016-08-08 22:41:56

【Question】:

I have some BigQuery data in the following format:

"thing": [
  {
    "name": "gameLost",
    "params": [
      {
        "key": "total_games",
        "val": {
          "str_val": "3",
          "int_val": null
        }
      },
      {
        "key": "games_won",
        "val": {
          "str_val": "2",
          "int_val": null
        }
      },
      {
        "key": "game_time",
        "val": {
          "str_val": "44",
          "int_val": null
        }
      }
    ],
    "dt_a": "1470625311138000",
    "dt_b": "1470620345566000"
  }
]

I know that the FLATTEN() function produces 3 rows of output, like this:

+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
| thing.name | thing.dt_a       | thing.dt_b       | thing.params.key   | thing.params.val.str_val | thing.params.val.int_val |
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
| gameLost   | 1470625311138000 | 1470620345566000 | total_games_played | 3                        | null                     |
|            |                  |                  |                    |                          |                          |
| gameLost   | 1470625311138000 | 1470620345566000 | games_won          | 2                        | null                     |
|            |                  |                  |                    |                          |                          |
| gameLost   | 1470625311138000 | 1470620345566000 | game_time          | 44                       | null                     |
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+

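The higher-level keys/values are repeated on a new row for each deeper-level object.

However, I need the deeper-level keys/values to be output as entirely new columns rather than as repeated fields, so that the result looks like this: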

+------------+------------------+------------------+--------------------+-----------+-----------+
| thing.name | thing.dt_a       | thing.dt_b       | total_games_played | games_won | game_time |
+------------+------------------+------------------+--------------------+-----------+-----------+
| gameLost   | 1470625311138000 | 1470620345566000 | 3                  | 2         | 44        |
+------------+------------------+------------------+--------------------+-----------+-----------+

How can I do this? Thanks!

【Comments】:

【Answer 1】:

Standard SQL makes this easier to express (uncheck "Use Legacy SQL" under "Show Options"):

WITH T AS (
  SELECT STRUCT(
    "gameLost" AS name,
    ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
      STRUCT("total_games", STRUCT("3", NULL)),
      STRUCT("games_won", STRUCT("2", NULL)),
      STRUCT("game_time", STRUCT("44", NULL))] AS params,
    1470625311138000 AS dt_a,
    1470620345566000 AS dt_b) AS thing
)
SELECT
  (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
  thing.params[OFFSET(0)].val.str_val AS total_games_played,
  thing.params[OFFSET(1)].val.str_val AS games_won,
  thing.params[OFFSET(2)].val.str_val AS game_time
FROM T;

+-------------------------------------------------------------------------+--------------------+-----------+-----------+
|                                  thing                                  | total_games_played | games_won | game_time |
+-------------------------------------------------------------------------+--------------------+-----------+-----------+
| {"name":"gameLost","dt_a":"1470625311138000","dt_b":"1470620345566000"} | 3                  | 2         | 44        |
+-------------------------------------------------------------------------+--------------------+-----------+-----------+

If you don't know the order of the keys in the array, you can use subselects to extract the relevant values:

WITH T AS (
  SELECT STRUCT(
    "gameLost" AS name,
    ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
      STRUCT("total_games", STRUCT("3", NULL)),
      STRUCT("games_won", STRUCT("2", NULL)),
      STRUCT("game_time", STRUCT("44", NULL))] AS params,
    1470625311138000 AS dt_a,
    1470620345566000 AS dt_b) AS thing
)
SELECT
  (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
  (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "total_games") AS total_games_played,
  (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "games_won") AS games_won,
  (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "game_time") AS game_time
FROM T;
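
To make the key-based lookup concrete outside BigQuery, here is a minimal Python sketch of the same idea: each output column is found by scanning the params array for a matching key, so the order of the array does not matter. The record literal mirrors the sample data from the question (with params deliberately shuffled), and the helper name `param_value` is made up for this illustration; it is not a BigQuery API.

```python
# Sample record shaped like the question's data (params deliberately shuffled).
thing = {
    "name": "gameLost",
    "params": [
        {"key": "game_time", "val": {"str_val": "44", "int_val": None}},
        {"key": "total_games", "val": {"str_val": "3", "int_val": None}},
        {"key": "games_won", "val": {"str_val": "2", "int_val": None}},
    ],
    "dt_a": 1470625311138000,
    "dt_b": 1470620345566000,
}


def param_value(params, key):
    """Hypothetical helper: str_val of the first param matching key, else None."""
    for param in params:
        if param["key"] == key:
            return param["val"]["str_val"]
    return None


# One output row: the non-repeated fields plus one column per key of interest.
row = {k: v for k, v in thing.items() if k != "params"}
row["total_games_played"] = param_value(thing["params"], "total_games")
row["games_won"] = param_value(thing["params"], "games_won")
row["game_time"] = param_value(thing["params"], "game_time")

print(row)
```

One difference worth keeping in mind: a scalar subquery in BigQuery fails if more than one array element matches the key, whereas this sketch simply returns the first match.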

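【Comments】:

Love the new features of Standard SQL!!! Really do! At the same time, I don't think you can rely on order/offset to retrieve the value for a key unless the keys are guaranteed to be in a specific order, which in my practice is usually not the case.

Thanks! I updated my answer to show how that can work as well.

Thanks @MikhailBerlyant and @ElliottBrossard! You got me started, but I ran into more problems when trying to apply this to my more complex data source. I opened a new related question here: ***.com/questions/38860534/…

【Answer 2】:

Try the following (legacy SQL):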

SELECT 
  thing.name AS name,
  thing.dt_a AS dt_a,
  thing.dt_b AS dt_b,
  -- One aggregate per key: pick the matching param's value within each record
  MAX(IF(thing.params.key = "total_games_played", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS total_games_played,
  MAX(IF(thing.params.key = "games_won", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS games_won,
  MAX(IF(thing.params.key = "game_time", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS game_time
FROM YourTable
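
The MAX(IF(...)) pattern in the query above is a conditional aggregation: for each record, every param contributes its value if its key matches and the default otherwise, and the maximum is kept. A rough Python sketch of that logic, under the assumption that the key names are the ones used in the query; `max_if` is a hypothetical helper written for this illustration, not a BigQuery function:

```python
# Record shaped like the question's data; key names follow the legacy SQL query.
thing = {
    "name": "gameLost",
    "params": [
        {"key": "total_games_played", "val": {"str_val": "3", "int_val": None}},
        {"key": "games_won", "val": {"str_val": "2", "int_val": None}},
        {"key": "game_time", "val": {"str_val": "44", "int_val": None}},
    ],
}


def max_if(params, key, default=0):
    """Hypothetical helper mirroring MAX(IF(key = ..., INTEGER(str_val), default))."""
    return max(
        (int(p["val"]["str_val"]) if p["key"] == key else default for p in params),
        default=None,  # empty params: nothing to aggregate
    )


# One output row per record, with one column per key of interest.
row = {
    "name": thing["name"],
    "total_games_played": max_if(thing["params"], "total_games_played"),
    "games_won": max_if(thing["params"], "games_won"),
    "game_time": max_if(thing["params"], "game_time"),
}

print(row)
```

Because the fallback is 0, a key that is absent from params yields 0, which cannot be told apart from a real value of 0, and a negative value would be masked by the fallback.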

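For Standard SQL, you can try the following (inspired by Elliott's answer; the important difference is that the array is ordered by key, so the order of the key values is guaranteed):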

WITH Temp AS (
  SELECT 
    (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
    ARRAY(SELECT val.str_val AS val FROM UNNEST(thing.params) ORDER BY key) AS params
  FROM YourTable
)
SELECT 
  thing, 
  params[OFFSET(2)] AS total_games_played,
  params[OFFSET(1)] AS games_won,
  params[OFFSET(0)] AS game_time
FROM Temp 

Note: if there are other keys in params, you should add a WHERE clause to the SELECT inside ARRAY.
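
The reason for that note can be seen in a small Python sketch of the sort-then-index approach. The extra key `level` and its value are invented for this illustration; the other keys follow the query above. Sorting by key fixes the positions only as long as the set of keys is fixed, so an unexpected key shifts the offsets unless it is filtered out first.

```python
# Params with one extra, unexpected key ("level" is a made-up example).
params = [
    {"key": "total_games_played", "val": {"str_val": "3"}},
    {"key": "level", "val": {"str_val": "7"}},
    {"key": "games_won", "val": {"str_val": "2"}},
    {"key": "game_time", "val": {"str_val": "44"}},
]

# Keys the output columns are built from (plays the role of the WHERE clause).
wanted = {"game_time", "games_won", "total_games_played"}

by_key = sorted(params, key=lambda p: p["key"])

# Without filtering: the extra key lands at offset 2 and shifts total_games_played.
unfiltered = [p["val"]["str_val"] for p in by_key]

# With filtering: offsets 0, 1, 2 are game_time, games_won, total_games_played.
filtered = [p["val"]["str_val"] for p in by_key if p["key"] in wanted]

print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```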

【Comments】:
