Multiple BigQuery subselects for new columns

Posted: 2016-06-16 21:27:23

Question:

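I have a table with a one-to-many relationship of business names to ZIP codes, because multiple industry codes match the same business for a given ZIP code. A separate table contains households by ZIP code. To sum households with ZIP code as the row and a given business as a column, while deduplicating households that match the same ZIP code on multiple rows (to avoid over-counting households), I query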

SELECT ZIPCode, SUM(SumHouseholds1) AS Company1  
FROM (  
    SELECT ZIPCode, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds1  
    FROM Business  
    JOIN Location  
    ON Location.ZIPCode = Business.ZIPCode  
    WHERE DBAName='Company1'  
GROUP BY DBAName, ZIPCode, Households)  
GROUP BY ZIPCode  

with output like this:

ZIPCode  Company1
10001    17007
10003    54084
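
To make the over-counting concern concrete, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for BigQuery. The row values and the IndustryCode column are made up for illustration; only the names Business, Location, DBAName, ZIPCode and Households come from the question. It deduplicates with SELECT DISTINCT rather than the window function used above, which is a swapped-in technique and not the asker's method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Business (DBAName TEXT, ZIPCode TEXT, IndustryCode TEXT);
    CREATE TABLE Location (ZIPCode TEXT, Households INTEGER);
    -- Hypothetical data: Company1 appears twice in 10001 because two
    -- industry codes match it there.
    INSERT INTO Business VALUES
        ('Company1', '10001', 'A'),
        ('Company1', '10001', 'B'),
        ('Company1', '10003', 'A');
    INSERT INTO Location VALUES ('10001', 100), ('10003', 50);
""")

# Naive join: each duplicate Business row adds the ZIP code's households again.
naive = conn.execute("""
    SELECT b.ZIPCode, SUM(l.Households)
    FROM Business AS b
    JOIN Location AS l ON l.ZIPCode = b.ZIPCode
    WHERE b.DBAName = 'Company1'
    GROUP BY b.ZIPCode
    ORDER BY b.ZIPCode
""").fetchall()

# Deduplicate (DBAName, ZIPCode) pairs before joining to Location.
deduped = conn.execute("""
    SELECT b.ZIPCode, SUM(l.Households)
    FROM (SELECT DISTINCT DBAName, ZIPCode FROM Business) AS b
    JOIN Location AS l ON l.ZIPCode = b.ZIPCode
    WHERE b.DBAName = 'Company1'
    GROUP BY b.ZIPCode
    ORDER BY b.ZIPCode
""").fetchall()

print(naive)    # 10001 is counted twice
print(deduped)  # 10001 is counted once
```

With this toy data the naive join reports 200 households for 10001, while the deduplicated version reports 100.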

When I try to add additional columns (Company2, Company3, etc.) to the original SELECT statement:

SELECT ZIPCode, SUM(SumHouseholds1) AS Company1  
FROM (  
    SELECT ZIPCode, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds1  
    FROM Business  
    JOIN Location  
    ON Location.ZIPCode = Business.ZIPCode  
    WHERE DBAName='Company1'  
GROUP BY DBAName, ZIPCode, Households),  
SUM(SumHouseholds2) AS Company2  
FROM (  
    SELECT ZIPCode, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds2  
    FROM Business  
    JOIN Location  
    ON Location.ZIPCode = Business.ZIPCode  
    WHERE DBAName='Company2'  
GROUP BY DBAName, ZIPCode, Households)
GROUP BY ZIPCode 

I get an error: Encountered " "FROM" "FROM "".

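Comments:

Can you explain two things: 1. why you are using SUM() OVER() here, and 2. why you group by Households? Overall, your code makes no sense to me so far! I suggest providing examples of your input and desired output, along with some details on the logic.

Thanks - I added sample output. The SUM() OVER() deduplicates Households that match the same ZIPCode value, and the GROUP BY on Households cannot be omitted.

How many companies/columns do you expect to have? Is it just a few, or hundreds or even more? The solution will depend on it.

Answer 1: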

OK, assuming your initial code really works for you, the below should fix the problem with the second query

SELECT
  c1.ZIPCode AS ZIPCode,
  c1.Company1 AS Company1,
  c2.Company2 AS Company2,
  c3.Company3 AS Company3
FROM (
  -- Each subquery is the original single-company query, including its outer GROUP BY ZIPCode
  SELECT ZIPCode, SUM(SumHouseholds1) AS Company1
  FROM (
      SELECT ZIPCode, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds1
      FROM Business
      JOIN Location
      ON Location.ZIPCode = Business.ZIPCode
      WHERE DBAName='Company1'
      GROUP BY DBAName, ZIPCode, Households)
  GROUP BY ZIPCode
) AS c1
JOIN (
  SELECT ZIPCode, SUM(SumHouseholds2) AS Company2
  FROM (
      SELECT ZIPCode, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds2
      FROM Business
      JOIN Location
      ON Location.ZIPCode = Business.ZIPCode
      WHERE DBAName='Company2'
      GROUP BY DBAName, ZIPCode, Households)
  GROUP BY ZIPCode
) AS c2
ON c1.ZIPCode = c2.ZIPCode
JOIN (
  SELECT ZIPCode, SUM(SumHouseholds3) AS Company3
  FROM (
      SELECT ZIPCode, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds3
      FROM Business
      JOIN Location
      ON Location.ZIPCode = Business.ZIPCode
      WHERE DBAName='Company3'
      GROUP BY DBAName, ZIPCode, Households)
  GROUP BY ZIPCode
) AS c3
ON c1.ZIPCode = c3.ZIPCode

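But even if it works now (I hope so, because I have not tested it at all), it is too heavy and hard to manage. The below addresses this (still not tested, but it should work, or at least give you an idea)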

SELECT
  ZIPCode,
  SUM(CASE WHEN DBAName='Company1' THEN Company ELSE 0 END) AS Company1,
  SUM(CASE WHEN DBAName='Company2' THEN Company ELSE 0 END) AS Company2,
  SUM(CASE WHEN DBAName='Company3' THEN Company ELSE 0 END) AS Company3
FROM (
  SELECT ZIPCode, DBAName, SUM(SumHouseholds) AS Company
  FROM (
      SELECT ZIPCode, DBAName, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds
      FROM Business
      JOIN Location
      ON Location.ZIPCode = Business.ZIPCode
      GROUP BY DBAName, ZIPCode, Households)
  GROUP BY DBAName, ZIPCode
)
GROUP BY ZIPCode

Update of July 12, 2016, based on more information in the comments

SELECT
 ZIPCode,
 SUM(CASE WHEN DBAName='Company1' THEN Company ELSE 0 END) AS Company1,
 SUM(CASE WHEN DBAName='Company2' THEN Company ELSE 0 END) AS Company2,
 SUM(CASE WHEN DBAName='Company3' THEN Company ELSE 0 END) AS Company3
FROM (
 SELECT ZIPCode, DBAName, SUM(SumHouseholds) AS Company
 FROM (  
    SELECT ZIPCode, DBAName, SUM(Households) OVER (PARTITION BY ZIPCode, DBAName) AS SumHouseholds  
    FROM Business  
    JOIN Location  
    ON Location.ZIPCode = Business.Market  
    GROUP BY DBAName, ZIPCode, Households
 )
 GROUP BY DBAName, ZIPCode
)
GROUP BY ZIPCode

The output is

ZIPCode Company1    Company2    Company3     
10001    5           5           5   
10016    8           8           8   
12345   17          17          17   
16420   10           0           0   
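
The 16420 row, with zeros for Company2 and Company3, highlights a difference between the two approaches: an inner JOIN of per-company subqueries keeps only ZIP codes that appear in every subquery, whereas the CASE-based aggregation keeps every ZIP code and fills in 0. The sketch below illustrates this in Python with the standard-library sqlite3 module standing in for BigQuery. The Totals table and its values are hypothetical, pre-aggregated figures invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical per-company, per-ZIP totals (already deduplicated).
    CREATE TABLE Totals (DBAName TEXT, ZIPCode TEXT, Households INTEGER);
    INSERT INTO Totals VALUES
        ('Company1', '10001', 5),
        ('Company2', '10001', 5),
        ('Company1', '16420', 10);
""")

# One subquery per company, inner-joined on ZIP code.
inner = conn.execute("""
    SELECT c1.ZIPCode, c1.Households, c2.Households
    FROM (SELECT ZIPCode, Households FROM Totals WHERE DBAName = 'Company1') AS c1
    JOIN (SELECT ZIPCode, Households FROM Totals WHERE DBAName = 'Company2') AS c2
    ON c1.ZIPCode = c2.ZIPCode
    ORDER BY c1.ZIPCode
""").fetchall()

# Conditional aggregation: one pass, one column per company.
pivot = conn.execute("""
    SELECT ZIPCode,
           SUM(CASE WHEN DBAName = 'Company1' THEN Households ELSE 0 END),
           SUM(CASE WHEN DBAName = 'Company2' THEN Households ELSE 0 END)
    FROM Totals
    GROUP BY ZIPCode
    ORDER BY ZIPCode
""").fetchall()

print(inner)  # 16420 is missing: Company2 has no row there
print(pivot)  # 16420 is present, with 0 for Company2
```

If ZIP codes served by only some of the companies should still appear in the result, the conditional aggregation is the simpler way to get that.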

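Further thoughts

The above "fix" still relies entirely on the assumption that your logic is correct.

I do have a feeling that it is not.

I think the adjustment below makes it more correct:

First, grouping by Households looks very suspicious; after looking at your notes, I think you need the below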

SELECT
  ZIPCode,
  SUM(CASE WHEN DBAName='Company1' THEN Company ELSE 0 END) AS Company1,
  SUM(CASE WHEN DBAName='Company2' THEN Company ELSE 0 END) AS Company2,
  SUM(CASE WHEN DBAName='Company3' THEN Company ELSE 0 END) AS Company3
FROM (  
  SELECT ZIPCode, DBAName, SUM(Households) AS Company  
  FROM (
    SELECT Market, DBAName 
    FROM Business
    GROUP BY Market, DBAName
  ) AS Business
  JOIN Location  
  ON Location.ZIPCode = Business.Market  
  GROUP BY DBAName, ZIPCode
)
GROUP BY ZIPCode

This, in turn, can be further simplified to

SELECT
  ZIPCode,
  SUM(CASE WHEN DBAName='Company1' THEN Households ELSE 0 END) AS Company1,
  SUM(CASE WHEN DBAName='Company2' THEN Households ELSE 0 END) AS Company2,
  SUM(CASE WHEN DBAName='Company3' THEN Households ELSE 0 END) AS Company3
FROM (  
  SELECT Market, DBAName 
  FROM Business 
  GROUP BY Market, DBAName
) AS Business
JOIN Location  
ON Location.ZIPCode = Business.Market  
GROUP BY ZIPCode
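
As a rough check of how this simplified query behaves, here is a sketch that runs an equivalent statement on toy data in Python, using the standard-library sqlite3 module as a stand-in for BigQuery. The data and the IndustryCode column are made up, only two companies are shown, and the subquery is aliased as b (instead of Business) purely to keep the names unambiguous in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Business (DBAName TEXT, Market TEXT, IndustryCode TEXT);
    CREATE TABLE Location (ZIPCode TEXT, Households INTEGER);
    -- Hypothetical data: Company1 has two industry codes in market 10001.
    INSERT INTO Business VALUES
        ('Company1', '10001', 'A'),
        ('Company1', '10001', 'B'),
        ('Company2', '10001', 'A'),
        ('Company1', '16420', 'A');
    INSERT INTO Location VALUES ('10001', 5), ('16420', 10);
""")

rows = conn.execute("""
    SELECT
      ZIPCode,
      SUM(CASE WHEN DBAName = 'Company1' THEN Households ELSE 0 END) AS Company1,
      SUM(CASE WHEN DBAName = 'Company2' THEN Households ELSE 0 END) AS Company2
    FROM (
      -- Collapse duplicate (Market, DBAName) pairs before joining.
      SELECT Market, DBAName
      FROM Business
      GROUP BY Market, DBAName
    ) AS b
    JOIN Location
    ON Location.ZIPCode = b.Market
    GROUP BY ZIPCode
    ORDER BY ZIPCode
""").fetchall()

for row in rows:
    print(row)
```

Because the duplicate pairs are collapsed before the join, each ZIP code's households are counted once per company, with no window function and no grouping by Households.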

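Somehow, my feeling is that the last query is what you are looking for!

But it is still only one option. I do not know some details of your real (and most likely more complex) use case, so your original logic may well be correct in that case.

Comments:

Thanks again - both versions return a "SELECT clause has mix of aggregations 'Company[3]' and fields [...] without GROUP BY clause" error. Both source tables are strings, but perhaps the data type of the Company[1/2/3] fields needs to be defined somehow in the query.

The first version is exactly your initial query, just joined three times on ZIP code. If your original code really works (I have asked you this a few times), it should work too, unless you modified it somehow. If you need further help, you should provide some examples of input data and output!

Thanks - here are the notes, input data and output: dropbox.com/s/lab1eigzefld94p/***.zip

Do you still need help with this, or is it for reference only?

I am still stumped by the issue mentioned in queries.txt in the .zip file: the field 'DBAName' is not found unless the window function "OVER (PARTITION BY ZIPCode, DBAName)" is removed. Is there any option to make an alias of DBAName to avoid the "not found" error? Any help would be great!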