Select distinct values in BigQuery using Standard SQL

Posted: 2017-07-26 16:19:18

[Question]

I want to select multiple columns and group the rows by email with GROUP BY:

#standardSQL
SELECT
      customers.orderCustomerEmail AS email,      
      customers.orderCustomerNumber AS customerNumber,
      customers.billingFirstname AS billingFirstname,
      customers.billingLastname AS billingLastname
FROM dim_customers AS customers
GROUP BY customers.orderCustomerEmail

It fails with:

Error: SELECT list expression references customers.orderCustomerNumber
       which is neither grouped nor aggregated at [4:7]

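This is similar to the question "Bigquery select distinct values".

But that does not solve my problem, because adding all columns to GROUP BY gives the same result as SELECT DISTINCT.

dim_customers schema: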

orderCustomerEmail:STRING,
billingFirstname:STRING,
billingLastname:STRING,
orderCustomerNumber:STRING,
OrderNumber:STRING

Dummy data: https://docs.google.com/spreadsheets/d/1T1JZRWni18hhU4tO-9kQqq5Y3hVWgpP-aE7o6ij9bDE/edit?usp=sharing

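[Comments]

What do you mean by "adding all columns to GROUP BY gives the same result as SELECT DISTINCT"? Do you have an example of the input and of the result you want?

Because in BigQuery, grouping by several columns, e.g. email and firstname, returns both "a@email.com, Alex" and "a@email.com, A.", but in this case I need only one result.

[Answer 1]

When you group by some columns, you need to apply an aggregate function to each of the remaining columns. Otherwise you get exactly the error shown in your question.

Try the BigQuery Standard SQL example below: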

#standardSQL
SELECT 
  customers.orderCustomerEmail AS email,      
  ARRAY_AGG(STRUCT(customers.orderCustomerNumber AS customerNumber,
  customers.billingFirstname AS billingFirstname,
  customers.billingLastname AS billingLastname)) AS info
FROM `dim_customers`, UNNEST(customers) AS customers
GROUP BY email
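
If a nested array per email is not needed, the same rule (every non-grouped column must be aggregated) can be satisfied with ANY_VALUE. The following is only a minimal sketch: it assumes a flat dim_customers table with no repeated field, and the inline rows are hypothetical sample data, not taken from the question.

```sql
#standardSQL
-- Hypothetical inline rows standing in for a flat dim_customers table.
WITH dim_customers AS (
  SELECT 'a@email.com' AS orderCustomerEmail, '10201' AS orderCustomerNumber,
         'Alex' AS billingFirstname, 'Miller' AS billingLastname UNION ALL
  SELECT 'a@email.com', '10201', 'A.', 'Miller' UNION ALL
  SELECT 'b@email.com', '10202', 'Ben', 'Williams'
)
SELECT
  orderCustomerEmail AS email,
  -- Each non-grouped column is wrapped in an aggregate, which avoids the
  -- "neither grouped nor aggregated" error.
  ANY_VALUE(orderCustomerNumber) AS customerNumber,
  ANY_VALUE(billingFirstname) AS billingFirstname,
  ANY_VALUE(billingLastname) AS billingLastname
FROM dim_customers
GROUP BY email
```

This returns one row per email, but each ANY_VALUE call picks its value independently, so the number, first name and last name in a result row are not guaranteed to come from the same source row. Aggregating a whole STRUCT, as in the query above, keeps the details of one row together.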

Or just a simple DISTINCT:

#standardSQL
SELECT DISTINCT 
  customers.orderCustomerEmail AS email,      
  customers.orderCustomerNumber AS customerNumber,
  customers.billingFirstname AS billingFirstname,
  customers.billingLastname AS billingLastname
FROM `dim_customers`, UNNEST(customers) AS customers

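Please note: your question is not specific about the output you expect, so these queries will most likely need some adjustment for your particular needs.

Update

For the requirement "I basically need one row per customer (email is the unique identifier, hence the group); the details (number, first name, last name) can be taken from the last entry", for example: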

#standardSQL
WITH `dim_customers` AS (
  SELECT [
    STRUCT('a' AS orderCustomerEmail, 1 AS orderCustomerNumber, 'af' AS billingFirstname, 'al' AS billingLastname),
    STRUCT('a' AS orderCustomerEmail, 4 AS orderCustomerNumber, 'af1' AS billingFirstname, 'al2' AS billingLastname),
    STRUCT('b' AS orderCustomerEmail, 2 AS orderCustomerNumber, 'bf' AS billingFirstname, 'bl' AS billingLastname),
    STRUCT('c' AS orderCustomerEmail, 3 AS orderCustomerNumber, 'cf' AS billingFirstname, 'cl' AS billingLastname)
    ] AS customers UNION ALL
  SELECT [
    STRUCT('a' AS orderCustomerEmail, 1 AS orderCustomerNumber, 'af' AS billingFirstname, 'al' AS billingLastname),
    STRUCT('a' AS orderCustomerEmail, 4 AS orderCustomerNumber, 'af1' AS billingFirstname, 'al2' AS billingLastname),
    STRUCT('b' AS orderCustomerEmail, 2 AS orderCustomerNumber, 'bf' AS billingFirstname, 'bl' AS billingLastname),
    STRUCT('c' AS orderCustomerEmail, 3 AS orderCustomerNumber, 'cf' AS billingFirstname, 'cl' AS billingLastname)
    ] AS customers
)
SELECT
  customers.orderCustomerEmail AS email,      
  ARRAY_AGG(STRUCT(customers.orderCustomerNumber AS customerNumber,
    customers.billingFirstname AS billingFirstname,
    customers.billingLastname AS billingLastname))[OFFSET(0)] AS info
FROM `dim_customers`, UNNEST(customers) AS customers
GROUP BY email

Update

Below is a version for the updated schema!

dim_customers schema:

orderCustomerEmail:STRING, billingFirstname:STRING, billingLastname:STRING, orderCustomerNumber:STRING, OrderNumber:STRING

#standardSQL
WITH `dim_customers` AS (
  SELECT 10201 AS orderCustomerNumber, 'a@email.com' AS orderCustomerEmail, 'Alex' AS billingFirstname, 'Miller' AS billingLastname UNION ALL
  SELECT 10202, 'b@email.com', 'Ben', 'Williams' UNION ALL
  SELECT 10203, 'c@email.com', 'Chris', 'Collins' UNION ALL
  SELECT 10204, 'd@email.com', 'David', 'Hems' UNION ALL
  SELECT 10201, 'a@email.com', 'A.', 'Miller' UNION ALL
  SELECT 10201, 'a@email.com', 'A.', 'Miller' UNION ALL
  SELECT 10202, 'b@email.com', 'Ben', 'Williams' UNION ALL
  SELECT 10202, 'b@email.com', 'Bens Father', 'Williams' UNION ALL
  SELECT 10205, 'a@email.com', 'A.', 'Miller' UNION ALL
  SELECT 10206, 'e@email.com', 'Ed', 'Winchell'
)
SELECT info.* FROM (
  SELECT
    orderCustomerEmail AS email, 
    ARRAY_AGG(STRUCT(
      orderCustomerEmail AS email, 
      orderCustomerNumber AS customerNumber,
      billingFirstname AS billingFirstname,
      billingLastname AS billingLastname))[OFFSET(0)] AS info
  FROM `dim_customers`
  GROUP BY email
)
-- ORDER BY email
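
ARRAY_AGG without an ORDER BY clause does not guarantee which element ends up at OFFSET(0), so the query above returns some entry per email rather than specifically the last one. If "last" can be defined by a column, an ORDER BY ... LIMIT 1 inside ARRAY_AGG makes the choice explicit. The sketch below assumes that a higher OrderNumber means a later entry; that is an assumption, not something the question states, and the OrderNumber values in the sample rows are invented for illustration.

```sql
#standardSQL
-- Hypothetical sample rows; the OrderNumber values are invented.
WITH dim_customers AS (
  SELECT 'a@email.com' AS orderCustomerEmail, '10201' AS orderCustomerNumber,
         'Alex' AS billingFirstname, 'Miller' AS billingLastname,
         '1001' AS OrderNumber UNION ALL
  SELECT 'a@email.com', '10201', 'A.', 'Miller', '1005' UNION ALL
  SELECT 'b@email.com', '10202', 'Ben', 'Williams', '1002' UNION ALL
  SELECT 'b@email.com', '10202', 'Bens Father', 'Williams', '1004'
)
SELECT info.* FROM (
  SELECT
    orderCustomerEmail AS email,
    -- Keep only the struct with the highest OrderNumber in each email group.
    ARRAY_AGG(STRUCT(
      orderCustomerEmail AS email,
      orderCustomerNumber AS customerNumber,
      billingFirstname AS billingFirstname,
      billingLastname AS billingLastname)
      ORDER BY OrderNumber DESC LIMIT 1)[OFFSET(0)] AS info
  FROM dim_customers
  GROUP BY email
)
```

Because OrderNumber is a STRING in the schema, the ordering here is lexicographic. If the values are numeric strings of varying length, ordering by SAFE_CAST(OrderNumber AS INT64) would be the safer choice.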

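[Discussion]

Good idea with the aggregation. I was struggling to flatten the table again, and UNNEST(customers) does not work here either. As for the output, I basically need one row per customer (email is the unique identifier, hence the group); the details (number, first name, last name) can be taken from the last entry.

I tested it with dummy data and it works as expected. If you have problems making the above code work, please provide more details about your data and clarify what kind of error you are getting.

I just get an error: Unrecognized name: customers at [7:33], but customers is only defined after that.

So you should show your table's schema. My assumption could be wrong.

See the update in my answer: I added fully dummy data so you can play with it, and added an option that selects only one detail entry per customer. Hope this gives you a better idea of the "technique" used.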
