How to call a builtin function on every column in a table with BigQuery without specifying column names?

Posted: 2019-10-25 14:29:49

Question:

Given a table that looks like this:

WITH
table_one AS(
  SELECT 0.123 feat_1, 0.645 feat_2 , 1 label_bc, 'a' dontcare UNION ALL
  SELECT 0.567, 0.456, 0, 'v'   UNION ALL
  SELECT 0.243, 0.734, 1, 'x'   UNION ALL
  SELECT 0.456, 0.888, 0, 'c'   UNION ALL
  SELECT 0.645, 0.222, 1, 'x'   UNION ALL
  SELECT 0.321, 0.123, 0, 'z'  
)
SELECT * from table_one

I want to compute the correlation between each feature and label_bc. I can specify the features manually to get the desired result, like this:

SELECT corr(label_bc, feat_1) AS feat1_corr, corr(label_bc, feat_2) AS feat2_corr FROM table_one

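While this produces exactly what I am looking for, it is undesirable because I have to type in the column names. My real-world data has a large number of features, and they may change from time to time, so this approach is not practical.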

I can do the following to get the desired columns. Note that this only works with a real table, not with the dummy table_one from this example.

all_columns AS (
  SELECT column_name 
  FROM `my-proj`.my_dataset.INFORMATION_SCHEMA.COLUMNS where column_name not in ('dontcare', 'label_bc') AND
  TABLE_NAME = 'table_one' -- will work if real table
)

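This returns a table with a single column called "column_name", where each row holds a string value that is a column name. In this case it would be two rows, "feat_1" and "feat_2".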

I can also do this:

column_name_array AS (
  SELECT ARRAY_AGG(column_name) AS all_columns FROM all_columns
)

This returns a table with one column and one row. The column is "all_columns" and its value is an array of the column names: ['feat_1', 'feat_2']

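That said, I have not been able to combine these primitives into a workable solution that, for any given table with varying columns, computes corr() between every feature column and the label_bc column. Any guidance is appreciated.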
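
One way to combine the two pieces, sketched in Python rather than in pure BigQuery: assuming the column names have already been fetched (for example with the INFORMATION_SCHEMA query above) into a Python list, the SELECT text can be generated from that list. The function name `build_corr_query` and the output alias pattern `<column>_corr` are hypothetical choices for this illustration, and the sketch does no escaping, so it is only suitable for trusted column names.

```python
def build_corr_query(feature_columns, label_column, table):
    """Return a SELECT that computes CORR(label, feature) for each feature.

    Hypothetical helper: identifiers are inserted as-is (no escaping),
    so pass only trusted names, e.g. ones read from INFORMATION_SCHEMA.
    """
    if not feature_columns:
        raise ValueError("need at least one feature column")
    select_items = [
        f"CORR({label_column}, {col}) AS {col}_corr" for col in feature_columns
    ]
    return "SELECT\n  " + ",\n  ".join(select_items) + f"\nFROM {table}"


# Column names as the INFORMATION_SCHEMA query would return them.
columns = ["feat_1", "feat_2"]
query = build_corr_query(columns, "label_bc", "`my-proj.my_dataset.table_one`")
print(query)
```

The generated text would then be submitted as an ordinary query. When the feature list changes, only the list changes, not the query-building code.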


This is not an answer, but some additional effort toward a solution.

-- Assume a table like this
----------------------------------------------------
-- feat_1   feat_2  feat_3  label_bc   dontcare
----------------------------------------------------
-- 0.123    0.645   0.656   1.0    a
-- 0.567    0.456   -0.056  0.0    b
-- 0.243    0.734   0.754   1.0    c
-- 0.456    0.888   -0.858  0.0    i
-- 0.645    0.222   0.252   1.0    j
-- 0.321    0.123   -0.153  0.0    c
----------------------------------------------------

WITH
-- my_colnames is currently not used
my_colnames AS (
  SELECT 'feat_1' colname UNION ALL
  SELECT 'feat_2' UNION ALL
  SELECT 'feat_3' UNION ALL
  SELECT 'dontcare' UNION ALL
  SELECT 'label_bc'  
),
-- the static sample table
table_one AS (
  SELECT 0.123 feat_1, 
         0.645 feat_2, 
         0.656 feat_3, 
         1.0 label_bc, 
         'a' dontcare 
         UNION ALL
  SELECT 0.567, 
         0.456, 
         -0.056, 
         0.0, 
         'b'   
         UNION ALL
  SELECT 0.243, 
         0.734, 
         0.754, 
         1.0, 
         'c'   
         UNION ALL
  SELECT 0.456, 
         0.888, 
         -0.858, 
         0.0, 
         'i'   
         UNION ALL
  SELECT 0.645, 
         0.222, 
         0.252, 
         1.0, 
         'j'   
         UNION ALL
  SELECT 0.321, 
         0.123, 
         -0.153, 
         0.0, 
         'c'  
),

-- transform of table_one where feat values are moved
-- from column based to row based, so that table_one is
-- represented like this:
----------------------------------------------------
-- feat_id  feat_val    label_bc    dontcare
----------------------------------------------------
-- 1        0.123   1.0     a
-- 1        0.567   0.0     b
-- 1        0.243   1.0     c
-- 1        0.456   0.0     i
-- 1        0.645   1.0     j
-- 1        0.321   0.0     c
-- 2        0.645   1.0     a
-- 2        0.456   0.0     b
-- 2        0.734   1.0     c
-- 2        0.888   0.0     i
-- 2        0.222   1.0     j
-- 2        0.123   0.0     c
-- 3        0.656   1.0     a
-- 3       -0.056   0.0     b
-- 3        0.754   1.0     c
-- 3       -0.858   0.0     i
-- 3        0.252   1.0     j
-- 3       -0.153   0.0     c
----------------------------------------------------
--
-- Is it possible to create this table given the my_colnames table
-- and not have to manually specify the field names below?
table_two AS (
  select 1 as feat_id, 
       feat_1 as feat_val, 
       label_bc, 
       dontcare 
  from table_one union all
  select 2, 
       feat_2, 
       label_bc, 
       dontcare 
  from table_one union all
  select 3, 
       feat_3, 
       label_bc, 
       dontcare 
  from table_one 
)

-- works
--select * from my_colnames
--select * from table_one
select * from table_two
--select corr(label_bc, feat_1) as feat1_corr, corr(label_bc, feat_2) as feat2_corr, corr(label_bc, feat_3) as feat3_corr from table_one
--select feat_id, corr(label_bc, feat_val) as feat_corr from table_two GROUP BY feat_id;
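
On the question in the comments above (whether table_two can be produced without typing each field name): a minimal sketch, assuming a small Python script is acceptable for generating the SQL text. It writes one UNION ALL branch per feature column. `build_unpivot_query` is a hypothetical helper, it uses the column name as a `feat_name` string instead of a numeric feat_id, and it does no identifier escaping.

```python
def build_unpivot_query(feature_columns, keep_columns, table):
    """Return a UNION ALL query with one SELECT branch per feature column.

    Hypothetical helper: each branch turns one feature column into
    (feat_name, feat_val) rows and carries the kept columns along.
    """
    kept = ", ".join(keep_columns)
    branches = [
        f"SELECT '{col}' AS feat_name, {col} AS feat_val, {kept} FROM {table}"
        for col in feature_columns
    ]
    return "\nUNION ALL\n".join(branches)


print(
    build_unpivot_query(
        ["feat_1", "feat_2", "feat_3"], ["label_bc", "dontcare"], "table_one"
    )
)
```

The result can be pasted in as the body of the table_two CTE, after which the GROUP BY query on the feature identifier works as before.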

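This is Mikhail's answer, annotated for SQL newbies like me. I had to look at the intermediate results to understand his excellent approach. His version extracts each feature as a number, like an id; here I modified his regular expression to simply use the name.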

#standardSQL

-- Assume a table like this
----------------------------------------------------
-- feat_1   feat_2  feat_3  label_bc   dontcare
----------------------------------------------------
-- 0.123    0.645   0.656   1.0        a
-- 0.567    0.456   -0.056  0.0        b
-- 0.243    0.734   0.754   1.0        c
-- 0.456    0.888   -0.858  0.0        i
-- 0.645    0.222   0.252   1.0        j
-- 0.321    0.123   -0.153  0.0        c
----------------------------------------------------



WITH table_one AS (
  SELECT 0.123 feat_1, 0.645 feat_2, 0.656 feat_3, 1.0 label_bc, 'a' dontcare UNION ALL
  SELECT 0.567, 0.456, -0.056, 0.0, 'b' UNION ALL
  SELECT 0.243, 0.734, 0.754, 1.0, 'c' UNION ALL
  SELECT 0.456, 0.888, -0.858, 0.0, 'i' UNION ALL
  SELECT 0.645, 0.222, 0.252, 1.0, 'j' UNION ALL
  SELECT 0.321, 0.123, -0.153, 0.0, 'c' 
), 

-- transform of table_one where feat values are moved
-- from column based to row based, so that table_one is
-- represented like this:
----------------------------------------------------
-- feat_id  feat_val    label_bc    dontcare
----------------------------------------------------
-- 1        0.123   1.0     a
-- 1        0.567   0.0     b
-- 1        0.243   1.0     c
-- 1        0.456   0.0     i
-- 1        0.645   1.0     j
-- 1        0.321   0.0     c
-- 2        0.645   1.0     a
-- 2        0.456   0.0     b
-- 2        0.734   1.0     c
-- 2        0.888   0.0     i
-- 2        0.222   1.0     j
-- 2        0.123   0.0     c
-- 3        0.656   1.0     a
-- 3       -0.056   0.0     b
-- 3        0.754   1.0     c
-- 3       -0.858   0.0     i
-- 3        0.252   1.0     j
-- 3       -0.153   0.0     c
----------------------------------------------------


-- regular expression looks for a quotation mark, feat_,  a number one or more times, a quotation mark,
-- a colon, a negative sign zero or more times,  a number one or more times, a decimal point, a number one or more times 
-- SELECT 
--   REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+\.\d+') 
-- from table_one as t;


-- Shows how to create a table of coefficients without specifying
-- the table column names for features, using regular expressions
-- to examine each row as a JSON string of "key":value pairs.
table_two AS (
  SELECT  
    CAST(REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'feat_(\d+)') AS INT64) AS feat_id, 
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc, 
    dontcare
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+\.\d+')) kv
),
-- table_two_b is the first step in building table_two.
-- It shows how table_one is cross joined with the unnested
-- array extracted from the JSON string of each row of table_one.
-- REGEXP_EXTRACT_ALL does not return a table. It returns an array with
-- one element per matching "feat_N":value pair in the row's JSON string.
-- A kv value looks like
-- "feat_1":0.123
-- The table looks like this
-- Row  feat_1  feat_2  feat_3  label_bc  dontcare  kv
-- 1    0.123   0.645   0.656   1.0       a         "feat_1":0.123
-- 2    0.123   0.645   0.656   1.0       a         "feat_2":0.645
-- 3    0.123   0.645   0.656   1.0       a         "feat_3":0.656
-- 4    0.567   0.456  -0.056   0.0       b         "feat_1":0.567

table_two_b AS (
  SELECT  
    *
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+\.\d+')) kv
),
table_two_c AS (
  SELECT  
    -- split returns an array of strings. offset(0) is the first one 
    -- corresponding to the key name which is the feature_name
    REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'(.+)') AS feat_name,
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc    
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+\.\d+')) kv
)




-- select * from table_two_c


-- Shows what I want from table_two
SELECT feat_name, CORR(label_bc, feat_val) AS feat_corr 
FROM table_two_c 
GROUP BY feat_name   

-- Shows how the table is now t.feat_1, and an extra column json_row which is
-- a second copy of the table in one row using "key":value, "key":value ... syntax.
--SELECT
--  t,
--  TO_JSON_STRING(t) AS json_row
-- FROM table_one AS t;
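
To see what the kv values look like without running BigQuery, the extraction step can be imitated in Python. This is only a sketch: it assumes that `json.dumps` with compact separators is close enough to the string `TO_JSON_STRING` produces for this row, and it uses the same pattern as the queries with the decimal point escaped. The variable names are illustrative.

```python
import json
import re

# One row of table_one, written as a dict (second sample row).
row = {"feat_1": 0.567, "feat_2": 0.456, "feat_3": -0.056,
       "label_bc": 0.0, "dontcare": "b"}

# Compact JSON, standing in for TO_JSON_STRING(t) (an approximation).
json_row = json.dumps(row, separators=(",", ":"))

# Same idea as REGEXP_EXTRACT_ALL: every "feat_N":number pair in the string.
pattern = r'"feat_\d+":-?\d+\.\d+'
kv_pairs = re.findall(pattern, json_row)

# Same idea as SPLIT(kv, ':'): the key on the left, the value on the right.
parsed = []
for kv in kv_pairs:
    key, value = kv.split(":")
    parsed.append((key.strip('"'), float(value)))

print(json_row)
print(kv_pairs)
print(parsed)
```

Because the pattern only matches keys starting with feat_, the label_bc and dontcare entries are skipped, which is what keeps them out of the feature list. Values that are not written with a decimal point would not match this pattern.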

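Comments:

Are you 100% set on using BigQuery? This problem screams Python. Either use Python to compute the correlations directly, or use a Python script to loop over the fields and dynamically build the required query.

I did this in R with 4 lines of code. I had to download the csv file separately, though. I was hoping for a pure BigQuery solution. I am still going through the BigQuery scripting reference, so I am optimistic that it can be done.

Answer 1: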

Below is for BigQuery Standard SQL

#standardSQL
WITH table_two AS (
  SELECT  
    CAST(REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'feat_(\d+)') AS INT64) AS feat_id, 
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc, dontcare
  FROM `project.dataset.table_one` t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+\.\d+')) kv
)
SELECT feat_id, CORR(label_bc, feat_val) AS feat_corr 
FROM table_two 
GROUP BY feat_id   

You can test this with sample/dummy data, as in the example below

#standardSQL
WITH table_one AS (
  SELECT 0.123 feat_1, 0.645 feat_2, 0.656 feat_3, 1.0 label_bc, 'a' dontcare UNION ALL
  SELECT 0.567, 0.456, -0.056, 0.0, 'b' UNION ALL
  SELECT 0.243, 0.734, 0.754, 1.0, 'c' UNION ALL
  SELECT 0.456, 0.888, -0.858, 0.0, 'i' UNION ALL
  SELECT 0.645, 0.222, 0.252, 1.0, 'j' UNION ALL
  SELECT 0.321, 0.123, -0.153, 0.0, 'c' 
), table_two AS (
  SELECT  
    CAST(REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'feat_(\d+)') AS INT64) AS feat_id, 
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc, dontcare
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+\.\d+')) kv
)
SELECT feat_id, CORR(label_bc, feat_val) AS feat_corr 
FROM table_two 
GROUP BY feat_id   

Result

Row feat_id feat_corr    
1   1       -0.30526201038849277     
2   2       0.0818318512559385   
3   3       0.838349444539397    
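
As a cross-check outside BigQuery, the same numbers can be recomputed in plain Python from the six sample rows. This is a sketch, not part of the answer: it reshapes the rows into one list per feature (the same wide-to-long idea as table_two) and applies a hand-written Pearson correlation, which is assumed here to be what CORR computes. The helper name `pearson` is illustrative.

```python
import math
from collections import defaultdict

# Sample rows copied from table_one: (feat_1, feat_2, feat_3, label_bc).
rows = [
    (0.123, 0.645, 0.656, 1.0),
    (0.567, 0.456, -0.056, 0.0),
    (0.243, 0.734, 0.754, 1.0),
    (0.456, 0.888, -0.858, 0.0),
    (0.645, 0.222, 0.252, 1.0),
    (0.321, 0.123, -0.153, 0.0),
]
feature_names = ["feat_1", "feat_2", "feat_3"]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Wide to long: feature name -> list of (feat_val, label_bc) pairs.
long_rows = defaultdict(list)
for row in rows:
    label = row[-1]
    # zip stops after the three feature values, so the label is not included.
    for name, value in zip(feature_names, row):
        long_rows[name].append((value, label))

# One correlation per feature, like GROUP BY on the feature identifier.
correlations = {
    name: pearson([v for v, _ in pairs], [l for _, l in pairs])
    for name, pairs in long_rows.items()
}
for name, r in correlations.items():
    print(name, r)
```

The printed values should agree with the feat_corr column above up to floating-point rounding.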

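Comments:

This is cool. I will try to understand your approach. Thanks a lot. TTYL

What is the reason for specifying the project and dataset in the WITH clause? Does this cause the table to show up as a table in the actual project.dataset?

It is just a template I use when showing examples, so when you test it with dummy data using WITH, you can replace project.dataset.table_one with just table_one. I updated the answer so this should be easier for you to follow.

Note: when you run this against real data, you need to provide the fully qualified table name `project.dataset.table_one`

Got it, that makes sense.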
