Reshape from wide to long in BigQuery (standard SQL)

Posted: 2017-12-05 10:02:31

[Question]:

Unfortunately, reshaping in BQ is not as easy as in R, and I can't export my data for this project.

Here is the input:

date    country A             B         C      D
20170928    CH  3000.3        121       13     3200
20170929    CH  2800.31       137       23     1614.31

Expected output:

date    country Metric  Value  
20170928    CH  A       3000.3  
20170928    CH  B       121     
20170928    CH  C       13     
20170928    CH  D       3200
20170929    CH  A       2800.31 
20170929    CH  B       137       
20170929    CH  C       23     
20170929    CH  D       1614.31

My real table has many more columns and rows (but I assume a lot of manual work will be needed).

[Answer 1]:

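The query below is for BigQuery standard SQL and does not require repeating a SELECT for each column. It picks up however many columns the table has and turns them into metric and value pairs.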

#standardSQL
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'[{}]', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

You can test / play with the above using dummy data like that in your question:

#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'[{}]', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

The result is as expected:

DATE        country metric  value    
20170928    CH      A       3000.3   
20170928    CH      B       121.0    
20170928    CH      C       13.0     
20170928    CH      D       3200.0   
20170929    CH      A       2800.31  
20170929    CH      B       137.0    
20170929    CH      C       23.0     
20170929    CH      D       1614.31  
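
For anyone wondering how this works: TO_JSON_STRING(t) serializes each row to a JSON object, the outer braces are stripped, and SPLIT (whose default delimiter for strings is a comma) yields one "column":value fragment per column. A minimal sketch for inspecting those intermediate fragments, using the same dummy data and assuming the pattern r'[{}]' as one way of stripping the braces:

```sql
#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D
)
SELECT
  TO_JSON_STRING(t) AS json_row,  -- the whole row as one JSON object
  pair                            -- one "column":value fragment per output row
FROM `project.dataset.yourtable` t,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) pair
```

Because the split is purely textual, a string value that itself contains a comma or a colon would be cut in the wrong place, so the trick is best kept to numeric metrics like the ones in the question.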

[Comments]:

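This won't work if any of the values are strings that contain spaces.

Sure, it doesn't, because that is not the case in this post. Do you need a solution that handles spaces?

Where do you see any possibility of spaces? Country code? No! Column names? No! So what is the point of your comment, and why the downvote?!

Unfortunately I can't retract the vote now. The reason for it in the first place was that the question title is very generic and doesn't restrict itself to numeric data. A generic question made me expect a generic answer, which this is not.

[Answer 2]: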

You need UNION, which BigQuery's legacy SQL expresses with commas between the subqueries:

SELECT date, country, Metric, Value
FROM (
  SELECT date, country, 'A' as Metric,  A as Value FROM your_table
), (
  SELECT date, country, 'B' as Metric,  B as Value FROM your_table
), (
  SELECT date, country, 'C' as Metric,  C as Value FROM your_table
) , (
  SELECT date, country, 'D' as Metric,  D as Value FROM your_table
)
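
In standard SQL a comma between subqueries is a cross join rather than a union, so the same idea has to be written with UNION ALL. A sketch with the question's data inlined (the CTE name your_table is only a placeholder for the real table):

```sql
#standardSQL
WITH your_table AS (
  SELECT '20170928' AS date, 'CH' AS country, 3000.3 AS A, 121 AS B, 13 AS C, 3200 AS D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
-- one SELECT per column to melt; the first SELECT fixes the output column names
SELECT date, country, 'A' AS Metric, A AS Value FROM your_table
UNION ALL
SELECT date, country, 'B', B FROM your_table
UNION ALL
SELECT date, country, 'C', C FROM your_table
UNION ALL
SELECT date, country, 'D', D FROM your_table
ORDER BY date, Metric
```

The integer and floating-point columns end up in a single Value column, so this relies on BigQuery coercing them to a common numeric type; an explicit CAST(... AS FLOAT64) in each branch would make that intent visible.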

[Comments]:

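This works, but for standard SQL you actually need to use UNION ALL, because a comma is a cross join. Thanks!

@AlienDeg If date is unique (and I believe it is), there will be no difference between UNION and UNION ALL. You're welcome :)

I meant that UNION is not expressed with a comma in standard SQL, only in legacy SQL.

[Answer 3]: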

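Most answers I managed to find require specifying the name of every column to be melted. That is not tractable when a table has hundreds to thousands of columns. Here is an answer that works for an arbitrarily wide table.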

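It uses dynamic SQL: it extracts the column names from the table schema automatically, assembles a command string, and then evaluates that string. This is intended to mimic the behavior of Python pandas.melt() / R reshape2::melt().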

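I intentionally avoided creating user-defined functions because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do the same.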

Input:

id0 id1 _2020_05_27 _2020_05_28
1   1   11          12
1   2   13          14
2   1   15          16
2   2   17          18

Output:

id0 id1 date         value
1   2   _2020_05_27  13
1   2   _2020_05_28  14
2   2   _2020_05_27  17
2   2   _2020_05_28  18
1   1   _2020_05_27  11
1   1   _2020_05_28  12
2   1   _2020_05_27  15
2   1   _2020_05_28  16

#standardSQL

-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: lunahuang@google.com

-- run this script with Google BigQuery Web UI in the Cloud Console

-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments

DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
  WITH original AS (
    -- query to retrieve the original table
    %s
  ),
  nested AS (
    SELECT
    [
      -- sub command to be automatically generated
      %s
    ] as s,
    -- equivalent to id_vars in pandas.melt()
    %s,
    FROM original
  )
  SELECT
    -- equivalent to id_vars in pandas.melt()
    %s,
    -- equivalent to var_name in pandas.melt()
    s.key AS %s,
    -- equivalent to value_name in pandas.melt()
    s.value AS %s,
  FROM nested
  CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
  WITH
  columns AS (
    -- query to retrieve the column names
    -- equivalent to value_vars in pandas.melt()
    -- the resulting table should have only one column
    -- with the name: column_name
    %s
  ),
  scs AS (
    SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
    FROM columns
  )
  SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");

-- -- -- EXAMPLE BELOW -- -- --

-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
  id0 INT64,
  id1 INT64,
  _2020_05_27 INT64,
  _2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);

-- MELTING STARTS --
-- execute these two commands to melt the table

-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  subcmd,
  -- query to retrieve the column names
  -- equivalent to value_vars in pandas.melt()
  -- the resulting table should have only one column
  -- with the name: column_name
  """
    SELECT column_name
    FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
    WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
  """
) INTO subcmd;

-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  cmd,
  -- query to retrieve the original table
  """
    SELECT *
    FROM `tmp.example`
  """,
  -- please do not change this argument
  subcmd,
  -- equivalent to id_vars in pandas.melt()
  -- !!please type these twice!!
  "id0, id1", "id0, id1",
  -- equivalent to var_name in pandas.melt()
  "date",
  -- equivalent to value_name in pandas.melt()
  "value"
);
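
For the example table, the command string that the second EXECUTE IMMEDIATE runs should expand to roughly the query below. It is written out by hand for illustration, not captured from a run, and the order of the STRUCT entries depends on the order in which INFORMATION_SCHEMA returns the column names:

```sql
WITH original AS (
  -- query to retrieve the original table
  SELECT *
  FROM `tmp.example`
),
nested AS (
  SELECT
    [
      -- one STRUCT per melted column, generated by the first command
      STRUCT('_2020_05_27' AS key, _2020_05_27 AS value),
      STRUCT('_2020_05_28' AS key, _2020_05_28 AS value)
    ] AS s,
    id0, id1
  FROM original
)
SELECT
  id0, id1,
  s.key AS date,
  s.value AS value
FROM nested
CROSS JOIN UNNEST(nested.s) AS s
```

Seen this way, the approach is simple: every row gets an array holding one STRUCT per melted column, and CROSS JOIN UNNEST turns each array element into its own output row.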
