In BigQuery, how do I filter an array of STRUCTs on multiple matching fields in the STRUCT using Standard SQL?


【Title】In BigQuery, how do I filter an array of STRUCTs on multiple matching fields in the STRUCT using Standard SQL? 【Posted】2017-06-28 21:11:29 【Question】:

This is the record layout of the table (load_history) that I am trying to filter using Standard SQL (since Legacy SQL may be deprecated at some point):

[
  {
    "mode": "NULLABLE",
    "name": "Job",
    "type": "RECORD",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "name",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "start_time",
        "type": "TIMESTAMP"
      },
      {
        "mode": "NULLABLE",
        "name": "end_time",
        "type": "TIMESTAMP"
      }
    ]
  },
  {
    "mode": "REPEATED",
    "name": "source",
    "type": "RECORD",
    "description": "source tables touched by this job",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "database",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "schema",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "table",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "partition_time",
        "type": "TIMESTAMP"
      }
    ]
  }
]

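I need to filter and select only the records that have an entry in the "source" array whose "schema" and "table" fields match certain values (for example, schema='log' AND table='customer' in the same array entry).

When filtering on only one field of the struct (the schema name), the following approach works: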

select name, array(select x from unnest(schema) as x where x ='log' ), table
from (select job.name , array(select schema from unnest(source)) as schema, 
      array(select table from unnest(source)) as table
      from  config.load_history)

However, I cannot filter on a combination of fields within the same array entry.

Thanks for your help.

【Comments】:

【Answer 1】:

For BigQuery Standard SQL:

#standardSQL
SELECT data
FROM data, UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')  

#standardSQL
SELECT *
FROM data
WHERE EXISTS (
  SELECT 1 FROM UNNEST(source) AS s 
  WHERE (s.schema, s.table) = ('log', 'customer')
)

You can test and play with it using the dummy data below:

#standardSQL
WITH data AS (
  SELECT 
    STRUCT<name STRING, start_time INT64, end_time INT64>('jobA', 1, 2) AS job,
    [STRUCT<database STRING, schema STRING, table STRING, partition_time INT64>
      ('d1', 's1', 't1', 1), 
      ('d1', 's2', 't2', 2), 
      ('d1', 's3', 't3', 3) 
    ] AS source UNION ALL
  SELECT 
    STRUCT<name STRING, start_time INT64, end_time INT64>('jobB', 1, 2) AS job,
    [STRUCT<database STRING, schema STRING, table STRING, partition_time INT64>
      ('d1', 's1', 't1', 1), 
      ('d2', 's4', 't2', 2), 
      ('d2', 's3', 't3', 3) 
    ] AS source 
)
SELECT *
FROM data
WHERE EXISTS (
  SELECT 1 FROM UNNEST(source) AS s 
  WHERE (s.schema, s.table) = ('s2', 't2')
)
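
One difference between the two forms above is worth keeping in mind: the comma join with UNNEST emits one output row per matching array entry, while the EXISTS form emits each row of the table at most once. A minimal sketch of that difference, using made-up inline data (the `data` CTE, its values, and the output column names `rows_from_join` and `rows_from_exists` are illustrative assumptions, not part of the original table), where a single job has two entries matching the same (schema, table) pair:

```sql
#standardSQL
WITH data AS (
  SELECT
    'jobA' AS name,
    -- Hypothetical data: two of the three entries match ('log', 'customer').
    [STRUCT<schema STRING, table STRING, partition_time INT64>
      ('log', 'customer', 1),
      ('log', 'customer', 2),
      ('log', 'orders', 1)
    ] AS source
)
SELECT
  -- Comma join with UNNEST: one row per matching array entry.
  (SELECT COUNT(*)
   FROM data, UNNEST(source) AS s
   WHERE (s.schema, s.table) = ('log', 'customer')) AS rows_from_join,
  -- EXISTS: the row is returned once, however many entries match.
  (SELECT COUNT(*)
   FROM data
   WHERE EXISTS (
     SELECT 1 FROM UNNEST(source) AS s
     WHERE (s.schema, s.table) = ('log', 'customer')
   )) AS rows_from_exists
```

With this data, rows_from_join should come out as 2 and rows_from_exists as 1, so the EXISTS form is probably the better fit when each matching record should appear only once.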

【Comments】:

Thank you very much. The first solution is short and effective.

【Answer 2】:

It sounds like you want something like this:

SELECT
  job.name,
  ARRAY(SELECT schema FROM UNNEST(matching_sources)) AS matching_schemas,
  ARRAY(SELECT table FROM UNNEST(matching_sources)) AS matching_tables
FROM (
  SELECT *,
    ARRAY(SELECT AS STRUCT * FROM UNNEST(source)
          WHERE schema = 'log' AND `table` = 'customer') AS matching_sources
  FROM YourTable
)
WHERE ARRAY_LENGTH(matching_sources) > 0;

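This returns an array of schemas and an array of tables taken from the entries that match the condition, and excludes rows where no entry in the array matches.

【Comments】:

【Answer 3】:

"I need to filter and select only the records that have an entry in the "source" array whose "schema" and "table" fields match certain values"

This sounds like it can be solved with a simple WHERE clause, like so: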

WITH data AS (
  SELECT
    STRUCT<name STRING, start_time TIMESTAMP, end_time TIMESTAMP>
      ('job_1', TIMESTAMP("2017-06-10"), TIMESTAMP("2017-06-11")) Job,
    ARRAY<STRUCT<database STRING, schema STRING, table STRING, partition_time TIMESTAMP> > [
      STRUCT('database_1', "schema_1", "table_1", TIMESTAMP("2017-06-10")),
      STRUCT('database_1', "schema_1", "table_2", TIMESTAMP("2017-06-10")),
      STRUCT('database_1', "schema_3", "table_1", TIMESTAMP("2017-06-10")),
      STRUCT('database_2', "schema_2", "table_2", TIMESTAMP("2017-06-10"))
    ] source
  UNION ALL
  SELECT
    STRUCT<name STRING, start_time TIMESTAMP, end_time TIMESTAMP>
      ('job_2', TIMESTAMP("2017-06-10"), TIMESTAMP("2017-06-11")) Job,
    ARRAY<STRUCT<database STRING, schema STRING, table STRING, partition_time TIMESTAMP> > [
      STRUCT('database_2', "schema_2", "table_2", TIMESTAMP("2017-06-10")),
      STRUCT('database_2', "schema_2", "table_3", TIMESTAMP("2017-06-10")),
      STRUCT('database_1', "schema_1", "table_3", TIMESTAMP("2017-06-10"))
    ] source
)

SELECT
  *
FROM data
WHERE EXISTS(SELECT 1 FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2")

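This returns every row whose source array contains at least one entry with the given schema and the given table.

If you also want the output to contain only the array entries that match the filter, you can run this instead: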

SELECT
  job.*,
  ARRAY(SELECT AS STRUCT database, schema, table, partition_time FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2") filtered_data
FROM data
  WHERE EXISTS(SELECT 1 FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2")

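I am not sure this is exactly what you asked for, but it may give you an idea of how to filter values out of an ARRAY.

【Comments】:

【Answer 4】:

As Mikhail-berlyant https://***.com/users/5221944/mikhail-berlyant explained well, I used the first example: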

SELECT data
FROM data, UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')  

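Let me explain with my own example: suppose I want to get the rows from the Google public patents dataset that exactly match a specific CPC code.

Normally I would use a LIKE condition: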

SELECT cpc
FROM
`patents-public-data.patents.publications`
where cpc like "%G01R31/007"

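I cannot use it for this purpose, because each cpc cell contains an array of structs, for example [{'code': 'G01R31/007', 'inventive': True, 'first': False, 'tree': []}]

So I need to unnest this array into its entries, address the code field, and compare it to the exact value I want to extract, in this case G01R31/007.

The code is as follows: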

SELECT publication_number, cpc
FROM `patents-public-data.patents.publications`, 
UNNEST(cpc) AS s
WHERE (s.code) = ('G01R31/007')
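
If the goal is one row per publication rather than one row per matching cpc entry, the same filter can be sketched with EXISTS, and a second field of the struct can be tested within the same entry. This assumes the cpc entries carry the `code` and `inventive` fields shown in the sample above and that `inventive` is a BOOL; the query is a sketch and has not been run against the dataset here:

```sql
#standardSQL
SELECT publication_number
FROM `patents-public-data.patents.publications`
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(cpc) AS s
  -- Both conditions must hold for the same array entry.
  WHERE s.code = 'G01R31/007' AND s.inventive
)
```

Dropping the `AND s.inventive` condition gives the single-field match from the query above, but with each publication returned at most once.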

【Comments】:
