In BigQuery, how do I filter an array of STRUCTs on multiple matching fields of the STRUCT using standard SQL?
【Posted】: 2017-06-28 21:11:29
【Question】: This is the record layout of the table (load_history) that I am trying to filter using standard SQL (since legacy SQL may be deprecated at some point):
[
  {
    "mode": "NULLABLE",
    "name": "Job",
    "type": "RECORD",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "name",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "start_time",
        "type": "TIMESTAMP"
      },
      {
        "mode": "NULLABLE",
        "name": "end_time",
        "type": "TIMESTAMP"
      }
    ]
  },
  {
    "mode": "REPEATED",
    "name": "source",
    "type": "RECORD",
    "description": "source tables touched by this job",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "database",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "schema",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "table",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "partition_time",
        "type": "TIMESTAMP"
      }
    ]
  }
]
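I need to filter and select only the records whose "source" array has an entry in which both the "schema" and "table" fields match given values (for example, schema='log' AND table='customer' within the same array entry).
When filtering on just one field of the struct (the schema name), the following works: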
select name, array(select x from unnest(schema) as x where x ='log' ), table
from (select job.name , array(select schema from unnest(source)) as schema,
array(select table from unnest(source)) as table
from config.load_history)
However, I cannot filter on a combination of fields within the same array entry.
Thanks for your help.
【Answer 1】: For BigQuery standard SQL:
#standardSQL
SELECT data
FROM data, UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')
or
#standardSQL
SELECT *
FROM data
WHERE EXISTS (
SELECT 1 FROM UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')
)
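One difference between the two queries above is worth keeping in mind: the comma join with UNNEST(source) produces one output row per matching array element, while the EXISTS form returns each row of data at most once. The following is only a sketch to illustrate that; the inline data (job name, databases, the 'orders' table) is made up for the example and is not from the original table.

```sql
#standardSQL
-- Hypothetical inline data: one job whose source array contains
-- the ('log', 'customer') pair twice, in two different databases.
WITH data AS (
  SELECT
    'jobA' AS name,
    ARRAY<STRUCT<database STRING, schema STRING, table STRING>>[
      ('d1', 'log', 'customer'),
      ('d2', 'log', 'customer'),
      ('d1', 'log', 'orders')
    ] AS source
)
SELECT
  -- Comma join: one row per matching array element.
  (SELECT COUNT(*)
   FROM data, UNNEST(source) AS s
   WHERE (s.schema, s.table) = ('log', 'customer')) AS rows_from_join,
  -- EXISTS: each row of data is returned at most once.
  (SELECT COUNT(*)
   FROM data
   WHERE EXISTS (
     SELECT 1 FROM UNNEST(source) AS s
     WHERE (s.schema, s.table) = ('log', 'customer'))) AS rows_from_exists
-- Expected: rows_from_join = 2, rows_from_exists = 1
```

If the goal is "which jobs touched this table", the EXISTS form is usually the safer choice; the join form is handy when the matching element itself (s) is needed in the output.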
Both queries can be tested with the dummy data below:
#standardSQL
WITH data AS (
SELECT
STRUCT<name STRING, start_time INT64, end_time INT64>('jobA', 1, 2) AS job,
[STRUCT<database STRING, schema STRING, table STRING, partition_time INT64>
('d1', 's1', 't1', 1),
('d1', 's2', 't2', 2),
('d1', 's3', 't3', 3)
] AS source UNION ALL
SELECT
STRUCT<name STRING, start_time INT64, end_time INT64>('jobB', 1, 2) AS job,
[STRUCT<database STRING, schema STRING, table STRING, partition_time INT64>
('d1', 's1', 't1', 1),
('d2', 's4', 't2', 2),
('d2', 's3', 't3', 3)
] AS source
)
SELECT *
FROM data
WHERE EXISTS (
SELECT 1 FROM UNNEST(source) AS s
WHERE (s.schema, s.table) = ('s2', 't2')
)
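To see why the match has to happen inside a single array element, here is a small sketch in the same style as the dummy data above, trimmed to the two relevant fields (the values are illustrative). It compares checking each field independently, which is roughly what happens once schema and table are split into separate arrays, with checking both fields on the same element.

```sql
#standardSQL
-- Illustrative data only: jobA has 's1' and 't2' in different elements,
-- jobB has them together in one element.
WITH data AS (
  SELECT 'jobA' AS name,
    ARRAY<STRUCT<schema STRING, table STRING>>[('s1', 't1'), ('s2', 't2')] AS source
  UNION ALL
  SELECT 'jobB' AS name,
    ARRAY<STRUCT<schema STRING, table STRING>>[('s1', 't2'), ('s3', 't3')] AS source
)
SELECT
  name,
  -- Independent checks: true when 's1' and 't2' appear anywhere in the array.
  ('s1' IN (SELECT s.schema FROM UNNEST(source) AS s)
   AND 't2' IN (SELECT s.table FROM UNNEST(source) AS s)) AS fields_match_anywhere,
  -- Paired check: true only when one element carries both values.
  EXISTS (
    SELECT 1 FROM UNNEST(source) AS s
    WHERE s.schema = 's1' AND s.table = 't2') AS same_element_matches
FROM data
```

For jobA the independent check is true but the paired check is false, which is the false positive the question is trying to avoid; for jobB both are true.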
【Comments】:
Thank you very much. The first solution is short and effective.
【Answer 2】: It sounds like you want something like this:
SELECT
job.name,
ARRAY(SELECT schema FROM UNNEST(matching_sources)) AS matching_schemas,
ARRAY(SELECT table FROM UNNEST(matching_sources)) AS matching_tables
FROM (
SELECT *,
ARRAY(SELECT AS STRUCT * FROM UNNEST(source)
WHERE schema = 'log' AND `table` = 'customer') AS matching_sources
FROM YourTable
)
WHERE ARRAY_LENGTH(matching_sources) > 0;
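This returns an array of schemas and an array of tables taken from the entries that match the condition, and excludes rows whose array has no matching entry. (YourTable is a placeholder for your own table name.)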
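【Answer 3】: "I need to filter and select only the records whose 'source' array has an entry in which both the 'schema' and 'table' fields match given values."
This sounds like it can be solved with a simple WHERE clause, like this: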
WITH data AS (
  SELECT
    STRUCT<name STRING, start_time TIMESTAMP, end_time TIMESTAMP>
      ('job_1', TIMESTAMP("2017-06-10"), TIMESTAMP("2017-06-11")) Job,
    ARRAY<STRUCT<database STRING, schema STRING, table STRING, partition_time TIMESTAMP> > [
      STRUCT('database_1', "schema_1", "table_1", TIMESTAMP("2017-06-10")),
      STRUCT('database_1', "schema_1", "table_2", TIMESTAMP("2017-06-10")),
      STRUCT('database_1', "schema_3", "table_1", TIMESTAMP("2017-06-10")),
      STRUCT('database_2', "schema_2", "table_2", TIMESTAMP("2017-06-10"))
    ] source
  UNION ALL
  SELECT
    STRUCT<name STRING, start_time TIMESTAMP, end_time TIMESTAMP>
      ('job_2', TIMESTAMP("2017-06-10"), TIMESTAMP("2017-06-11")) Job,
    ARRAY<STRUCT<database STRING, schema STRING, table STRING, partition_time TIMESTAMP> > [
      STRUCT('database_2', "schema_2", "table_2", TIMESTAMP("2017-06-10")),
      STRUCT('database_2', "schema_2", "table_3", TIMESTAMP("2017-06-10")),
      STRUCT('database_1', "schema_1", "table_3", TIMESTAMP("2017-06-10"))
    ] source
)
SELECT
*
FROM data
WHERE EXISTS(SELECT 1 FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2")
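This returns every row whose source array contains at least one entry with the given schema and the given table.
If you also want the output to contain only the array entries that match the filter, you can run this: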
SELECT
job.*,
ARRAY(SELECT AS STRUCT database, schema, table, partition_time FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2") filtered_data
FROM data
WHERE EXISTS(SELECT 1 FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2")
I am not sure this is exactly what you asked for, but it should give you an idea of how to filter values out of an ARRAY.
【Answer 4】: As Mikhail-berlyant https://***.com/users/5221944/mikhail-berlyant explained well, I used the first example:
SELECT data
FROM data, UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')
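Let me explain with my own example. Suppose I want to get the rows from the Google public patents dataset that exactly match a specific CPC code.
Normally I would use a LIKE condition: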
SELECT cpc
FROM
`patents-public-data.patents.publications`
where cpc like "%G01R31/007"
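I cannot use it for this purpose, because each cpc cell contains an array of records, for example [{'code': 'G01R31/007', 'inventive': True, 'first': False, 'tree': []}].
So I need to unnest this array into its elements, address the code field, and compare it with the exact value I want to extract, which in my case is G01R31/007.
The code is as follows: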
SELECT publication_number, cpc
FROM `patents-public-data.patents.publications`,
UNNEST(cpc) AS s
WHERE (s.code) = ('G01R31/007')
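Note that the comma join above returns a publication once for every cpc entry that matches. If each publication should appear at most once, the EXISTS pattern from the first answer can be applied to the same table. This is a sketch under the assumption, taken from the description above, that cpc is an array of records with a code field:

```sql
#standardSQL
-- Return each publication at most once, even if several cpc entries match.
SELECT publication_number, cpc
FROM `patents-public-data.patents.publications`
WHERE EXISTS (
  SELECT 1 FROM UNNEST(cpc) AS s
  WHERE s.code = 'G01R31/007'
)
```

An equivalent way to write the condition is `WHERE 'G01R31/007' IN (SELECT code FROM UNNEST(cpc))`; which one reads better is mostly a matter of taste.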