Summing repeated fields in BigQuery
【Title】Summing repeated fields in BigQuery 【Posted】2016-08-16 15:26:34 【Question】I'll try to explain my problem as clearly as I can; let me know if anything is unclear.
I have a table, [MyTable], that looks like this:
----------------------------------------
|chn:integer | auds:integer (repeated) |
----------------------------------------
|1 |3916 |
|1 |4983 |
|1 |6233 |
|1 |1214 |
|2 |1200 |
|2 |900 |
|2 |2030 |
|2 |2345 |
----------------------------------------
auds is always repeated 4 times.
If I query SELECT chn, auds FROM [MyTable] WHERE chn = 1, I get the following result:
-------------------
|Row | chn | auds |
-------------------
|1 |1 |3916 |
|2 |1 |4983 |
|3 |1 |6233 |
|4 |1 |1214 |
-------------------
If I query SELECT chn, auds FROM [MyTable] WHERE (chn = 1 OR chn = 2), I get the following result:
-------------------
|Row | chn | auds |
-------------------
|1 |1 |1200 |
|2 |1 |900 |
|3 |1 |2030 |
|4 |2 |2345 |
-------------------
Logically, I get twice as many results, but what I want is the SUM() of the repeated field auds for chn = 1 and chn = 2, or, visually, this:
-------------------
|Row | chn | auds |
-------------------
|1 |3 |5116 |
|2 |3 |5883 |
|3 |3 |8263 |
|4 |3 |3559 |
-------------------
I tried a few things:
SELECT a1+a2 FROM
  (SELECT auds AS a1 FROM [MyTable] WHERE chn = 1),
  (SELECT auds AS a2 FROM [MyTable] WHERE chn = 2)
But I get the following error:
Error: Cannot query the cross product of repeated fields a1 and a2.
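The error message mentions a cross product. Combining every value of a1 with every value of a2 would give 4 × 4 = 16 sums, whereas the desired output pairs the values by position and has only 4. The JavaScript sketch below contrasts the two using the sample values from the table above. It is an illustration only: the variable names are made up, and it does not model how BigQuery evaluates the query.

```javascript
// Sample values copied from the table in the question.
const chn1 = [3916, 4983, 6233, 1214]; // auds for chn = 1
const chn2 = [1200, 900, 2030, 2345];  // auds for chn = 2

// Cross product: every a1 is combined with every a2 (not what is wanted).
const crossProduct = chn1.flatMap((a1) => chn2.map((a2) => a1 + a2));
console.log(crossProduct.length); // 16

// Positional pairing: element i of chn1 is added to element i of chn2.
const positional = chn1.map((a1, i) => a1 + chn2[i]);
console.log(positional); // [ 5116, 5883, 8263, 3559 ]
```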
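【Comments】:
I suggest simplifying your example to 2-4 "repetitions" instead of 1440, and providing a clear example of the input and the expected output.
@MikhailBerlyant Just edited, thanks for the suggestion. I hope it's clearer now.
【Answer 1】: This kind of logic is much easier to express in standard SQL (uncheck "Use Legacy SQL" under "Show Options"). Here is an example that computes the sum of each auds array: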
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
)
SELECT
  chn,
  (SELECT SUM(aud) FROM UNNEST(auds) AS aud) AS auds_sum
FROM MyTable;
+-----+----------+
| chn | auds_sum |
+-----+----------+
| 1 | 20 |
| 2 | 45 |
+-----+----------+
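For readers less familiar with UNNEST, the query above totals each row's own array and never combines rows. A rough JavaScript analogue, assuming the same two sample rows (the object layout and the names myTable and perRow are illustrative, not BigQuery API):

```javascript
// Sample rows mirroring the MyTable CTE in the query above.
const myTable = [
  { chn: 1, auds: [2, 3, 4, 5, 6] },
  { chn: 2, auds: [7, 8, 9, 10, 11] },
];

// For each row, add up that row's array only.
// Note: SQL SUM over an empty array yields NULL, whereas this reduce yields 0.
const perRow = myTable.map(({ chn, auds }) => ({
  chn,
  auds_sum: auds.reduce((total, aud) => total + aud, 0),
}));

console.log(perRow); // [ { chn: 1, auds_sum: 20 }, { chn: 2, auds_sum: 45 } ]
```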
Here is another that computes the pairwise sums for chn = 1 and chn = 2 (which, based on your question, I think is what you want):
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
)
SELECT
  ARRAY(SELECT first_aud + second_auds[OFFSET(off)]
        FROM UNNEST(first_auds) AS first_aud WITH OFFSET off)
    AS summed_auds
FROM (
  SELECT
    (SELECT auds FROM MyTable WHERE chn = 1) AS first_auds,
    (SELECT auds FROM MyTable WHERE chn = 2) AS second_auds
);
+---------------------+
| summed_auds |
+---------------------+
| [9, 11, 13, 15, 17] |
+---------------------+
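Edit: here is yet another example, which sums the corresponding array elements across all rows. It is probably not particularly efficient, but it should produce the expected result: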
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
  UNION ALL SELECT
    3 AS chn,
    [-1, -6, 2, 3, 2] AS auds
)
SELECT
  ARRAY(SELECT
          (SELECT SUM(auds[OFFSET(off)]) FROM UNNEST(all_auds))
        FROM UNNEST(all_auds[OFFSET(0)].auds) WITH OFFSET off)
    AS summed_auds
FROM (
  SELECT
    ARRAY_AGG(STRUCT(auds)) AS all_auds
  FROM MyTable
);
+--------------------+
| summed_auds |
+--------------------+
| [8, 5, 15, 18, 19] |
+--------------------+
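The nested subqueries can be hard to follow, so here is a rough JavaScript sketch of the same idea: for each offset in the first row's array, add up the element at that offset in every row. It assumes every array is at least as long as the first one, as in the question; the helper name sumByPosition is hypothetical.

```javascript
// Hypothetical helper: sums element `off` of every row's array, for each `off`.
// The first row's array decides how many positions are summed.
function sumByPosition(rows) {
  if (rows.length === 0) return [];
  return rows[0].auds.map((_, off) =>
    rows.reduce((total, row) => total + row.auds[off], 0)
  );
}

// Sample rows mirroring the MyTable CTE in the query above.
const myTable = [
  { chn: 1, auds: [2, 3, 4, 5, 6] },
  { chn: 2, auds: [7, 8, 9, 10, 11] },
  { chn: 3, auds: [-1, -6, 2, 3, 2] },
];

console.log(sumByPosition(myTable)); // [ 8, 5, 15, 18, 19 ]
```

Because the helper never looks at chn, the number of rows can vary freely, which mirrors the point of the query above.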
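【Comments】:
This seems to do what I want, thanks! I simplified my question to 2 different chn values, but the real number varies between roughly 450 and 500 depending on the table. Is there a simple way to adapt the solution to a varying number of chn values?
OK, see what you think of the new example in the post. It simply sums the corresponding array elements across all rows (and makes no assumptions about chn). Hopefully you can adapt it to your use case.
【Answer 2】:
Elliott's answers are always an inspiration to me! If his answer works for you, please upvote and accept it (it should :o)). Meanwhile, I wanted to add an alternative using a scalar JS UDF: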
CREATE TEMPORARY FUNCTION mySUM(a ARRAY<INT64>, b ARRAY<INT64>)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
  var sum = [];
  for (var i = 0; i < a.length; i++)
    sum.push(parseInt(a[i]) + parseInt(b[i]));
  return sum;
""";
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
)
SELECT
  first_auds.chn AS first_auds_chn,
  second_auds.chn AS second_auds_chn,
  mySUM(first_auds.auds, second_auds.auds) AS summed_auds
FROM MyTable AS first_auds
JOIN MyTable AS second_auds
  ON first_auds.chn = 1 AND second_auds.chn = 2;
I like this option because it involves fewer UNNEST, ARRAY, and similar constructs, so it is cleaner to read.
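To try the UDF body outside BigQuery, the same loop can be wrapped in a plain JavaScript function. The sketch below assumes, as the parseInt calls in the UDF suggest, that the values may arrive as strings. The length check and the explicit radix are additions that are not in the original body, and the name mySumLocal is hypothetical.

```javascript
// Hypothetical local stand-in for the body of mySUM.
function mySumLocal(a, b) {
  // Added guard: without it, a shorter b would silently produce NaN entries.
  if (a.length !== b.length) {
    throw new Error("a and b must have the same length");
  }
  const sum = [];
  for (let i = 0; i < a.length; i++) {
    // Radix 10 is passed explicitly so strings are always read as decimal.
    sum.push(parseInt(a[i], 10) + parseInt(b[i], 10));
  }
  return sum;
}

console.log(mySumLocal(["2", "3", "4", "5", "6"], ["7", "8", "9", "10", "11"]));
// [ 9, 11, 13, 15, 17 ]
```

Keep in mind that JavaScript numbers lose integer precision beyond Number.MAX_SAFE_INTEGER, so very large values would need different handling.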
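【Comments】:
I will accept Elliott's answer because it is easier to adapt to my real table and problem. Thank you very much for the suggestion :)
【Answer 3】: Just use GROUP BY together with SUM.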
SELECT SUM(auds), chn FROM [MyTable] GROUP BY chn
【Comments】:
This adds up all 1440 auds entries for chn = 1 and for chn = 2 and gives me only 2 rows, which is not what I want. I am trying to sum the auds "arrays" element by element, in parallel.