BigQuery: Querying Repeated Fields with Cross Reference
Posted: 2018-07-04 13:46:10
Question: I inherited a BigQuery table with a somewhat peculiar schema:
[
  {"name":"hardware_id", "type":"STRING", "mode":"NULLABLE"},
  {"name":"manufacturer", "type":"STRING", "mode":"NULLABLE"},
  {"name":"model", "type":"STRING", "mode":"NULLABLE"},
  {"fields":[
    {"name":"brand", "type":"STRING", "mode":"REPEATED"},
    {"name":"model_name", "type":"STRING", "mode":"NULLABLE"}
  ], "name":"components", "type":"RECORD", "mode":"REPEATED"},
  {"name":"ram", "type":"INTEGER", "mode":"NULLABLE"},
  {"name":"hdd", "type":"INTEGER", "mode":"NULLABLE"}
]
The data is structured as follows:
hw_id | manufacturer | model | components.type | components.model_name | ram | hdd
------+--------------+-------+-----------------+-----------------------+-----+-----
1 | Lenovo | ABX | GPU | Radeon 5500 | 16 | 1000
| | | CPU | Core i7 | |
| | | SCSI Controller | Adaptec 2940 | |
------+--------------+-------+-----------------+-----------------------+-----+-----
2 | Dell | ZXV | CPU | Core i7 | 4 | 500
| | | GPU | GeForce | |
| | | Sound | SoundBlaster | |
------+--------------+-------+-----------------+-----------------------+-----+-----
3 | IBM | PS/2 | CPU | i386 | 1 | 100
| | | Sound | SoundBlaster | |
| | | GPU | GeForce | |
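I want to query several components at once, for example to find all hardware that has both a Core i7 CPU and a SoundBlaster sound card. Unfortunately, the entries in the "components" field are not in a consistent order, and "model_name" alone can be ambiguous, so I also need to match the corresponding "brand" field.
I can write a query for a single component, but not yet for several components at once. Could you point me in the right direction?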
Answer 1: The following is for BigQuery Standard SQL:
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE 2 = (
  SELECT COUNT(1)
  FROM UNNEST(components)
  WHERE (type, model_name) IN (
    ('Sound', 'SoundBlaster'), ('CPU', 'Core i7')
  )
)
You can test this with the dummy data from your question, as shown below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 hw_id, 'Lenovo' manufacturer, 'ABX' model,
    [STRUCT<type STRING, model_name STRING>('GPU', 'Radeon 5500'), ('CPU', 'Core i7'), ('SCSI Controller', 'Adaptec 2940')] components,
    16 ram, 1000 hdd UNION ALL
  SELECT 2, 'Dell', 'ZXV', [('CPU', 'Core i7'), ('GPU', 'GeForce'), ('Sound', 'SoundBlaster')], 4, 500 UNION ALL
  SELECT 3, 'IBM', 'PS/2', [('CPU', 'i386'), ('Sound', 'SoundBlaster'), ('GPU', 'GeForce')], 1, 100
)
SELECT *
FROM `project.dataset.table`
WHERE 2 = (
  SELECT COUNT(1)
  FROM UNNEST(components)
  WHERE (type, model_name) IN (
    ('Sound', 'SoundBlaster'), ('CPU', 'Core i7')
  )
)
The result will be:
hw_id | manufacturer | model | components.type | components.model_name | ram | hdd
------+--------------+-------+-----------------+-----------------------+-----+-----
2 | Dell | ZXV | CPU | Core i7 | 4 | 500
| | | GPU | GeForce | |
| | | Sound | SoundBlaster | |
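One caveat with the count-based check: it compares the number of matching entries with 2, so it implicitly assumes that each (type, model_name) pair occurs at most once per row. If a row could list the same component twice, a variant with one EXISTS per required component avoids that assumption. The following is a minimal sketch, not part of the original answer; the CTE name `sample` and the row with hw_id 4 are hypothetical additions for illustration only.

```sql
#standardSQL
WITH sample AS (
  SELECT 1 hw_id, 'Lenovo' manufacturer,
    [STRUCT<type STRING, model_name STRING>('GPU', 'Radeon 5500'), ('CPU', 'Core i7'), ('SCSI Controller', 'Adaptec 2940')] components UNION ALL
  SELECT 2, 'Dell', [('CPU', 'Core i7'), ('GPU', 'GeForce'), ('Sound', 'SoundBlaster')] UNION ALL
  SELECT 3, 'IBM', [('CPU', 'i386'), ('Sound', 'SoundBlaster'), ('GPU', 'GeForce')] UNION ALL
  -- Hypothetical row (not in the question): two sound cards, no Core i7 CPU
  SELECT 4, 'Acme', [('Sound', 'SoundBlaster'), ('Sound', 'SoundBlaster'), ('GPU', 'GeForce')]
)
SELECT hw_id, manufacturer
FROM sample
-- One EXISTS per required component: each must be present at least once
WHERE EXISTS (SELECT 1 FROM UNNEST(components) WHERE type = 'CPU' AND model_name = 'Core i7')
  AND EXISTS (SELECT 1 FROM UNNEST(components) WHERE type = 'Sound' AND model_name = 'SoundBlaster')
```

On this data the count-based query would also return hw_id 4, because its two SoundBlaster entries make the count equal 2, whereas the EXISTS version returns only hw_id 2. The trade-off is that each additional required component needs its own EXISTS clause.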