How to group by a repeated field in BigQuery


[Title]: How to do group by on repeated field in BigQuery [Posted]: 2017-05-18 07:15:48 [Question]:

In BigQuery, I created a table with the following schema:

id                  INTEGER NULLABLE    
visits              INTEGER NULLABLE    
dimensions          RECORD  REPEATED    
dimensions.value    STRING  
dimensions.key      STRING  

How can I get sum(visits) grouped by the values of the device and state dimensions?

Sample data:

{"id": 1, "visits": 100, "dimensions": [{"key":"device","value":"mobile"}, {"key":"state","value":"CA"}]}
{"id": 1, "visits": 500, "dimensions": [{"key":"device","value":"desktop"}, {"key":"state","value":"CA"}]}
{"id": 1, "visits": 200, "dimensions": [{"key":"device","value":"mobile"}, {"key":"state","value":"NY"}]}
{"id": 2, "visits": 100, "dimensions": [{"key":"device","value":"mobile"}, {"key":"state","value":"CA"}]}
{"id": 2, "visits": 500, "dimensions": [{"key":"device","value":"desktop"}, {"key":"state","value":"CA"}]}
{"id": 2, "visits": 200, "dimensions": [{"key":"device","value":"mobile"}, {"key":"state","value":"NY"}]}
{"id": 2, "visits": 780, "dimensions": [{"key":"device","value":"desktop"}, {"key":"state","value":"NY"}]}

I want id, device, state, and sum(visits) in the output.

I can group by a single dimension with the query below, but I don't know how to group by multiple dimensions.

SELECT id,d.value, sum(visits) FROM dataset.tabe_name,UNNEST(dimensions) as d where d.key = "device" group by id, d.value LIMIT 1000

Is it possible to write a generic query if the keys are not known in advance?

[Answer 1]:

The following is for BigQuery Standard SQL:

#standardSQL
SELECT 
  id,
  (SELECT value FROM UNNEST(dimensions) WHERE key = "device") AS device,
  (SELECT value FROM UNNEST(dimensions) WHERE key = "state") AS state,
  SUM(visits) AS visits
FROM `dataset.tabe_name`  
GROUP BY id, device, state
LIMIT 1000   

You can test it against dummy data taken from your example, as follows:

#standardSQL
WITH data AS (
  SELECT 1 AS id, 100 AS visits, ARRAY<STRUCT<key STRING, value STRING>>[("device", "mobile"), ("state", "CA")] AS dimensions UNION ALL
  SELECT 1, 500, [STRUCT<key STRING, value STRING>("device", "desktop"), ("state", "CA")] UNION ALL
  SELECT 1, 200, [STRUCT<key STRING, value STRING>("device", "mobile"), ("state", "NY")] UNION ALL
  SELECT 2, 100, [STRUCT<key STRING, value STRING>("device", "mobile"), ("state", "CA")] UNION ALL
  SELECT 2, 500, [STRUCT<key STRING, value STRING>("device", "desktop"), ("state", "CA")] UNION ALL
  SELECT 2, 200, [STRUCT<key STRING, value STRING>("device", "mobile"), ("state", "NY")] UNION ALL
  SELECT 2, 780, [STRUCT<key STRING, value STRING>("device", "desktop"), ("state", "NY")] 
)
SELECT 
  id,
  (SELECT value FROM UNNEST(dimensions) WHERE key = "device") AS device,
  (SELECT value FROM UNNEST(dimensions) WHERE key = "state") AS state,
  SUM(visits) AS visits
FROM data  
GROUP BY id, device, state
-- ORDER BY id, device, state
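
The question also asks about keys that are not known in advance, which the queries above do not cover because they name "device" and "state" explicitly. One possible approach, offered here as a sketch and not as part of the original answer, is to collapse each row's key/value pairs into a single sorted label and group by that label. The inline `data` rows and the `dims` alias are assumptions made for illustration only.

```sql
#standardSQL
-- Sketch only: `data` is a hypothetical inline sample; replace it with the real table.
WITH data AS (
  SELECT 1 AS id, 100 AS visits,
    [STRUCT<key STRING, value STRING>("device", "mobile"), ("state", "CA")] AS dimensions UNION ALL
  -- Same pairs as the first row, listed in a different order
  SELECT 1, 50, [STRUCT<key STRING, value STRING>("state", "CA"), ("device", "mobile")] UNION ALL
  SELECT 1, 500, [STRUCT<key STRING, value STRING>("device", "desktop"), ("state", "CA")]
)
SELECT
  id,
  -- Build one label per row, sorted by key so the order of array elements does not matter
  (SELECT STRING_AGG(CONCAT(key, "=", value), ", " ORDER BY key)
   FROM UNNEST(dimensions)) AS dims,
  SUM(visits) AS visits
FROM data
GROUP BY id, dims
```

In this sample the first two rows list the same pairs in a different order, so they fall into the same group. The trade-off is that the dimensions come back as one string instead of one column per key, and values that themselves contain "=" or ", " could make the label ambiguous. Returning a separate column for each key would need the key names to be discovered first and the query text generated from them.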
