Reshape from wide to long in BigQuery (standard SQL)
【Title】: Reshape from wide to long in big query (standard SQL)
【Posted】: 2017-12-05 10:02:31
【Question】: Unfortunately, reshaping in BQ is not as easy as it is in R, and I can't export my data for this project.
Here is the input:
date country A B C D
20170928 CH 3000.3 121 13 3200
20170929 CH 2800.31 137 23 1614.31
Expected output:
date country Metric Value
20170928 CH A 3000.3
20170928 CH B 121
20170928 CH C 13
20170928 CH D 3200
20170929 CH A 2800.31
20170929 CH B 137
20170929 CH C 23
20170929 CH D 1614.31
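My real table has many more columns and rows (though I assume a lot of manual work will be required).
【Discussion】:
【Answer 1】: The query below is for BigQuery standard SQL and does not need one repeated SELECT per column. It picks up however many columns the table has and turns them into metric/value pairs.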
#standardSQL
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) AS value
FROM (
  SELECT DATE, country,
    -- each pair looks like "name":value; strip the surrounding quotes
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') AS metric,
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') AS value
  FROM `project.dataset.yourtable` t,
    -- drop the outer braces of the row's JSON text, then split it on commas
    UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) pair
)
WHERE LOWER(metric) NOT IN ('date', 'country')
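The query above works on the JSON text of each row, so it can help to look at that intermediate string on its own. A minimal sketch, using a made-up one-row inline table named `sample` (not part of the original answer) in place of the real table:

```sql
#standardSQL
-- `sample` is a hypothetical stand-in for the real table.
WITH sample AS (
  SELECT '20170928' AS date, 'CH' AS country, 3000.3 AS A, 121 AS B
)
-- Serialize the whole row; column names become JSON keys.
SELECT TO_JSON_STRING(t) AS row_json
FROM sample AS t
```

Each row should come back as one compact JSON object along the lines of {"date":"20170928","country":"CH","A":3000.3,"B":121}. Splitting that text on commas and colons is what yields the metric/value pairs, and it is also why the approach breaks once a value itself contains a comma or a colon.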
You can test and play with the above using dummy data like that in your question:
#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) AS value
FROM (
  SELECT DATE, country,
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') AS metric,
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') AS value
  FROM `project.dataset.yourtable` t,
    -- drop the outer braces of the row's JSON text, then split it on commas
    UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) pair
)
WHERE LOWER(metric) NOT IN ('date', 'country')
The result is as expected:
DATE country metric value
20170928 CH A 3000.3
20170928 CH B 121.0
20170928 CH C 13.0
20170928 CH D 3200.0
20170929 CH A 2800.31
20170929 CH B 137.0
20170929 CH C 23.0
20170929 CH D 1614.31
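Newer BigQuery releases also provide an UNPIVOT operator, which this answer does not use. The following is a hedged sketch on the same dummy data, assuming the operator is available in your project. The intermediate name `casted` is made up for illustration, and the metric columns are cast to FLOAT64 first because UNPIVOT expects the unpivoted columns to share one data type. Unlike the JSON approach, the columns to unpivot have to be listed explicitly.

```sql
#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' AS date, 'CH' AS country, 3000.3 AS A, 121 AS B, 13 AS C, 3200 AS D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
),
-- `casted` is a hypothetical helper CTE: give every metric column the same type.
casted AS (
  SELECT date, country,
    CAST(A AS FLOAT64) AS A,
    CAST(B AS FLOAT64) AS B,
    CAST(C AS FLOAT64) AS C,
    CAST(D AS FLOAT64) AS D
  FROM `project.dataset.yourtable`
)
-- One output row per (input row, listed column).
SELECT date, country, metric, value
FROM casted
UNPIVOT (value FOR metric IN (A, B, C, D))
```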
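【Discussion】:
This won't work if any of the values is a string containing spaces.
Sure, it doesn't, because that is not the case in this post. Do you need a solution that handles spaces?
Where do you see any possibility of spaces? In country codes? No! In column names? No! So what is the point of your comment, and why the downvote?!
Unfortunately, I can't retract the vote now. The reason for it in the first place was that the question title is very generic and does not specify numeric-only data. A generic question led me to expect a generic answer, which this is not.
【Answer 2】: You need UNION, which in BigQuery is written with a comma (this is legacy SQL syntax):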
SELECT date, country, Metric, Value
FROM (
  SELECT date, country, 'A' AS Metric, A AS Value FROM your_table
), (
  SELECT date, country, 'B' AS Metric, B AS Value FROM your_table
), (
  SELECT date, country, 'C' AS Metric, C AS Value FROM your_table
), (
  SELECT date, country, 'D' AS Metric, D AS Value FROM your_table
)
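In standard SQL a comma between subqueries is a cross join, so the same idea has to be spelled out with UNION ALL. A minimal sketch, keeping the placeholder table name `your_table` from the query above and assuming the four metric columns have compatible numeric types:

```sql
#standardSQL
-- One SELECT per metric column, stacked on top of each other.
SELECT date, country, 'A' AS Metric, A AS Value FROM your_table
UNION ALL
SELECT date, country, 'B' AS Metric, B AS Value FROM your_table
UNION ALL
SELECT date, country, 'C' AS Metric, C AS Value FROM your_table
UNION ALL
SELECT date, country, 'D' AS Metric, D AS Value FROM your_table
```

BigQuery standard SQL does not accept a bare UNION; it has to be written as UNION ALL or UNION DISTINCT. UNION ALL is the natural choice here because the rows for different metrics can never be duplicates of one another.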
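【Discussion】:
This works, but for standard SQL you actually need to use UNION ALL, because a comma is a cross join. Thanks!
@AlienDeg If date is unique (and I believe it is), there will be no difference between UNION and UNION ALL. You're welcome :)
What I meant is that UNION is not written with a comma in standard SQL, only in legacy SQL.
【Answer 3】: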
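Most of the answers I managed to find require naming every column to be melted. That is hard to manage when a table has hundreds to thousands of columns. Here is an answer that works for arbitrarily wide tables.
It uses dynamic SQL: it pulls the column names automatically from the data schema, assembles a command string, and then executes that string. This is meant to mimic the behavior of Python pandas.melt() / R reshape2::melt().
I deliberately did not create a user-defined function, because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do so.
Input: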
id0 id1 _2020_05_27 _2020_05_28
1 1 11 12
1 2 13 14
2 1 15 16
2 2 17 18
Output:
id0 id1 date value
1 2 _2020_05_27 13
1 2 _2020_05_28 14
2 2 _2020_05_27 17
2 2 _2020_05_28 18
1 1 _2020_05_27 11
1 1 _2020_05_28 12
2 1 _2020_05_27 15
2 1 _2020_05_28 16
#standardSQL
-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: lunahuang@google.com
-- run this script with Google BigQuery Web UI in the Cloud Console
-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments
DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
WITH original AS (
-- query to retrieve the original table
%s
),
nested AS (
SELECT
[
-- sub command to be automatically generated
%s
] as s,
-- equivalent to id_vars in pandas.melt()
%s,
FROM original
)
SELECT
-- equivalent to id_vars in pandas.melt()
%s,
-- equivalent to var_name in pandas.melt()
s.key AS %s,
-- equivalent to value_name in pandas.melt()
s.value AS %s,
FROM nested
CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
WITH
columns AS (
-- query to retrieve the column names
-- equivalent to value_vars in pandas.melt()
-- the resulting table should have only one column
-- with the name: column_name
%s
),
scs AS (
SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
FROM columns
)
SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");
-- -- -- EXAMPLE BELOW -- -- --
-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
id0 INT64,
id1 INT64,
_2020_05_27 INT64,
_2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);
-- MELTING STARTS --
-- execute these two command to melt the table
-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
-- please do not change this argument
subcmd,
-- query to retrieve the column names
-- equivalent to value_vars in pandas.melt()
-- the resulting table should have only one column
-- with the name: column_name
"""
SELECT column_name
FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
"""
) INTO subcmd;
-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
-- please do not change this argument
cmd,
-- query to retrieve the original table
"""
SELECT *
FROM `tmp.example`
""",
-- please do not change this argument
subcmd,
-- equivalent to id_vars in pandas.melt()
-- !!please type these twice!!
"id0, id1", "id0, id1",
-- equivalent to var_name in pandas.melt()
"date",
-- equivalent to value_name in pandas.melt()
"value"
);
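For readers tracing the script, this is roughly the statement that the second EXECUTE IMMEDIATE should end up running for the `tmp.example` table once FORMAT has substituted every %s. It is written out by hand from the two templates rather than captured from BigQuery, with the optional trailing commas dropped, so treat it as an illustration; the order of the STRUCT entries depends on the order in which INFORMATION_SCHEMA returns the columns.

```sql
WITH original AS (
  SELECT *
  FROM `tmp.example`
),
nested AS (
  SELECT
    -- generated by the first EXECUTE IMMEDIATE: one STRUCT per melted column
    [
      STRUCT('_2020_05_27' AS key, _2020_05_27 AS value),
      STRUCT('_2020_05_28' AS key, _2020_05_28 AS value)
    ] AS s,
    id0, id1
  FROM original
)
SELECT
  id0, id1,
  s.key AS date,
  s.value AS value
FROM nested
-- one output row per (input row, array element)
CROSS JOIN UNNEST(nested.s) AS s
```

Because all the STRUCTs sit in one array, every melted column needs the same value type, which holds here since both date columns are INT64.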
【Discussion】: