BigQuery 使用从宽到长的结构重塑表

Posted

技术标签:

【中文标题】BigQuery 使用从宽到长的结构重塑表【英文标题】:BigQuery reshape table with structs from wide to long 【发布时间】:2019-11-09 05:13:34 【问题描述】:

删除了此帖子的先前版本,以代替此带有可重现示例的清理帖子。我有以下格式的表格:

WITH wide_stats AS (
  (
    SELECT 
      'joe' name, 'bills' team,
      struct(struct(7 as fga) as o, struct(8 as fga) as d) as t1,
      struct(struct(3 as fga) as o, struct(9 as fga) as d) as t2,
      struct(3 as pts, 9 as ast, 5 as reb) as t3,    
      7 tov, 3 blk
  ) UNION ALL (
    SELECT 'nick' name, 'jets' team,
      struct(struct(12 as fga) as o, struct(13 as fga) as d) as t1,
      struct(struct(15 as fga) as o, struct(22 as fga) as d) as t2,
      struct(13 as pts, 5 as ast, 15 as reb) as t3,    
      75 tov, 23 blk
  )
)

SELECT 
  name, team, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT 
    name, team, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
  FROM wide_stats,
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(wide_stats), r'|', ''))) pair
)

WHERE NOT LOWER(metric) IN ('name', 'team')

我正在努力将表格重塑为以下输出:

name   team       metric   value    
joe   bills     t1_o_fga       7 
joe   bills     t1_d_fga       8
joe   bills     t2_o_fga       3
joe   bills     t2_d_fga       9
joe   bills       t3_pts       3
joe   bills       t3_ast       9
joe   bills       t3_reb       5
joe   bills          tov       7
joe   bills          blk       3
nick   jets     t1_o_fga      12
nick   jets     t1_d_fga      13
nick   jets     t2_o_fga      15
nick   jets     t2_d_fga      22
nick   jets       t3_pts      13
nick   jets       t3_ast       5
nick   jets       t3_reb      15
nick   jets          tov      75
nick   jets          blk      23

这个任务很容易解释 - 从宽到长,但在表中使用 struct 和嵌套的 structs。我在另一个 stackoveflow 帖子中的正则表达式工作是以错误的方式拆分列名,并且当前输出与它需要的不匹配。

行的顺序无关紧要。有了名字,不管它是 t1_o_fga 还是 t1-o-fga 还是 t1/o/fga,只要有一些分隔符并且清楚变量是什么。非常感谢任何帮助或指导,谢谢!

【问题讨论】:

一个在不使用 UDF 的情况下模拟 pandas.melt() 函数的查询:***.com/a/62057616/13632099 【参考方案1】:

以下是 BigQuery 标准 SQL

#standardSQL
WITH wide_stats AS (
    SELECT 'joe' name, 'bills' team,
      STRUCT(STRUCT(7 AS fga) AS o, STRUCT(8 AS fga) AS d) AS t1,
      STRUCT(STRUCT(3 AS fga) AS o, STRUCT(9 AS fga) AS d) AS t2,
      STRUCT(3 AS pts, 9 AS ast, 5 AS reb) AS t3, 7 tov, 3 blk UNION ALL 
    SELECT 'nick' name, 'jets' team,
      STRUCT(STRUCT(12 AS fga) AS o, STRUCT(13 AS fga) AS d) AS t1,
      STRUCT(STRUCT(15 AS fga) AS o, STRUCT(22 AS fga) AS d) AS t2,
      STRUCT(13 AS pts, 5 AS ast, 15 AS reb) AS t3, 75 tov, 23 blk
), flat_stats AS (
  SELECT name, team,
    t1.o.fga AS t1_o_fga,
    t1.d.fga AS t1_d_fga,
    t2.o.fga AS t2_o_fga,
    t2.d.fga AS t2_d_fga,
    t3.pts AS t3_pts,
    t3.ast AS t3_ast,
    t3.reb AS t3_reb,
    tov, blk
  FROM wide_stats
)
SELECT name, team, metric, SAFE_CAST(value AS FLOAT64) value 
FROM (
  SELECT name, team, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
  FROM flat_stats, 
  UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(flat_stats), r'|', ''))) pair
)
WHERE NOT LOWER(metric) IN ('name', 'team')   

有输出

Row name    team    metric      value    
1   joe     bills   t1_o_fga    7.0  
2   joe     bills   t1_d_fga    8.0  
3   joe     bills   t2_o_fga    3.0  
4   joe     bills   t2_d_fga    9.0  
5   joe     bills   t3_pts      3.0  
6   joe     bills   t3_ast      9.0  
7   joe     bills   t3_reb      5.0  
8   joe     bills   tov         7.0  
9   joe     bills   blk         3.0  
10  nick    jets    t1_o_fga    12.0     
11  nick    jets    t1_d_fga    13.0     
12  nick    jets    t2_o_fga    15.0     
13  nick    jets    t2_d_fga    22.0     
14  nick    jets    t3_pts      13.0     
15  nick    jets    t3_ast      5.0  
16  nick    jets    t3_reb      15.0     
17  nick    jets    tov         75.0     
18  nick    jets    blk         23.0      

如果由于某种原因您在手动组装 flat_stats 临时表时遇到问题 - 您可以做一个小技巧,如下所示

第 1 步 - 只需在旧模式下使用目标表 [project:dataset.flat_stats] 运行以下查询

#legacySQL
SELECT *
FROM [project:dataset.wide_stats]    

“令人惊讶”,这将创建具有以下结构的表[project:dataset.flat_stats]

Row name    team    t1_o_fga    t1_d_fga    t2_o_fga    t2_d_fga    t3_pts  t3_ast  t3_reb  tov blk  
1   joe     bills   7           8           3           9           3       9       5       7   3    
2   nick    jets    12          13          15          22          13      5       15      75  23     

第 2 步 - 之后您可以简单地在下面运行(现在在标准 SQL 中)

#standardSQL
SELECT name, team, metric, SAFE_CAST(value AS FLOAT64) value 
FROM (
  SELECT name, team, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
  FROM `project.dataset.flat_stats` flat_stats, 
  UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(flat_stats), r'|', ''))) pair
)
WHERE NOT LOWER(metric) IN ('name', 'team')  

【讨论】:

再次感谢您为此和今天 - 帮助了很多

以上是关于BigQuery 使用从宽到长的结构重塑表的主要内容,如果未能解决你的问题,请参考以下文章

在大查询中从宽到长重塑(标准 SQL)

R语言使用reshape2包的melt函数将dataframe从宽表到长表(Wide- to long-format)如果没有指定行标识符号,则所有的字段都会放入variable变量中

R语言使用reshape2包的melt函数将dataframe从宽表到长表(Wide- to long-format)指定行标识符变量并自定义生成的长表的标识符列的名称

R,通过提取前缀从宽到长旋转。整齐划一

使用多个变量和一些时间不变将数据框从宽重塑为面板

将时间序列数据从宽格式重塑为高格式(用于绘图)[重复]