Is there a way to use an ORDER BY clause in the COUNT aggregate analytic function? If not, what is a suitable alternative?


Posted: 2020-06-25 18:52:07

I have a table of orders that looks like this:

WITH my_table_of_orders AS (
  SELECT
    1 AS order_id,
    DATE(2019, 5, 12) AS date,
    5 AS customer_id,
    TRUE AS is_from_particular_store

  UNION ALL SELECT
    2 AS order_id,
    DATE(2019, 5, 11) AS date,
    5 AS customer_id,
    TRUE AS is_from_particular_store

  UNION ALL SELECT
    3 AS order_id,
    DATE(2019, 5, 11) AS date,
    4 AS customer_id,
    FALSE AS is_from_particular_store
)

My real table contains about 59 million rows.

What I actually want is one row per order date, with a second column giving the percentage of customers over the past year (relative to the current row's date) whose orders came from a particular store (this is where my made-up is_from_particular_store column comes in handy).

Ideally I could use the following query without running into resource problems. The only issue is that you cannot use ORDER BY when DISTINCT is used in an analytic function, and I get this error: Window ORDER BY is not allowed if DISTINCT is specified

SELECT
  date,
  last_year_customer_id_that_ordered_from_a_particular_store / last_year_customer_id_that_ordered AS number_i_want
FROM (
  SELECT
    date,
    ROW_NUMBER() OVER (
      PARTITION BY
        date
    ) AS row_num,
    COUNT(DISTINCT customer_id) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered,
    COUNT(DISTINCT IF(is_from_particular_store, customer_id, NULL)) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered_from_a_particular_store
  FROM my_table_of_orders
)
WHERE
  -- only return one row per date
  row_num = 1
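The calculation the query above is trying to express can be sketched in Python on the sample data; the rolling_store_share helper here is purely illustrative, not part of any library:

```python
from datetime import date, timedelta

# Sample rows mirroring my_table_of_orders:
# (order_id, date, customer_id, is_from_particular_store)
orders = [
    (1, date(2019, 5, 12), 5, True),
    (2, date(2019, 5, 11), 5, True),
    (3, date(2019, 5, 11), 4, False),
]

def rolling_store_share(orders, window_days=365):
    """For each order date, the share of distinct customers in the trailing
    window whose orders came from the particular store."""
    result = {}
    for d in sorted({o[1] for o in orders}):
        # Orders within the trailing window ending at the current date.
        window = [o for o in orders if d - timedelta(days=window_days) <= o[1] <= d]
        all_customers = {o[2] for o in window}
        store_customers = {o[2] for o in window if o[3]}
        result[d] = len(store_customers) / len(all_customers)
    return result

print(rolling_store_share(orders))  # both dates -> 0.5
```

On the sample data, both dates see one store customer (5) out of two distinct customers (4 and 5), hence 0.5.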

So I then tried using ARRAY_AGG and UNNEST instead:

SELECT
  date,
  SAFE_DIVIDE((SELECT COUNT(DISTINCT customer_id)
    FROM UNNEST(last_year_customer_id_that_ordered_from_a_particular_store) AS customer_id
  ), (SELECT COUNT(DISTINCT customer_id)
    FROM UNNEST(last_year_customer_id_that_ordered) AS customer_id
  )) AS number_i_want_to_calculate
FROM (
  SELECT
    date,
    ROW_NUMBER() OVER (
      PARTITION BY
        date
    ) AS row_num,
    ARRAY_AGG(customer_id) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered,
    ARRAY_AGG(IF(is_from_particular_store, customer_id, NULL)) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered_from_a_particular_store
  FROM my_table_of_orders
)
WHERE
  -- only return one row per date
  row_num = 1

The only problem is that I run into the following resource error...

Resources exceeded during query execution: The query could not be executed in the allotted memory.

This question is very similar to https://***.com/a/42567839/3902555, which also suggests using ARRAY_AGG + UNNEST, but as I said, that gives me resource problems :(

Does anyone know a more resource-efficient way to calculate the statistic I'm after?

Comments:

Answer 1:

Another, more fully refactored version (BigQuery Standard SQL):

#standardSQL
WITH temp AS (
  SELECT DISTINCT DATE, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT a.date, 
  SAFE_DIVIDE(
    COUNT(DISTINCT IF(b.is_from_particular_store, b.customer_id, NULL)),
    COUNT(DISTINCT b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
WHERE DATE_DIFF(a.date, b.date, YEAR) < 1
GROUP BY a.date   
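A Python sketch of this self-join logic on the sample rows (the cross_join_share helper is illustrative). One thing worth noting: DATE_DIFF(a.date, b.date, YEAR) in BigQuery counts year-boundary crossings rather than elapsed days, so the condition is not exactly a trailing 365 days, and it also admits b rows dated after a.date:

```python
from datetime import date

# Deduplicated (date, customer_id, is_from_particular_store) rows, as in the temp CTE.
temp = [
    (date(2019, 5, 12), 5, True),
    (date(2019, 5, 11), 5, True),
    (date(2019, 5, 11), 4, False),
]

def cross_join_share(temp):
    """Mirror the self-join: for each a.date, aggregate over b rows where
    DATE_DIFF(a.date, b.date, YEAR) < 1. With BigQuery's boundary-counting
    semantics that is b.year >= a.year, which also admits later-dated b rows."""
    result = {}
    for a_date in sorted({r[0] for r in temp}):
        b_rows = [r for r in temp if a_date.year - r[0].year < 1]
        total = {r[1] for r in b_rows}
        store = {r[1] for r in b_rows if r[2]}
        result[a_date] = len(store) / len(total)
    return result

print(cross_join_share(temp))  # both dates -> 0.5
```

On the sample rows this agrees with the window-function versions, but on real data the YEAR boundary semantics may differ slightly from a strict 365-day window.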

An alternative to the above is to use approximate aggregation, as in the example below:

#standardSQL
WITH temp AS (
  SELECT DISTINCT DATE, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT a.date, 
  SAFE_DIVIDE(
    APPROX_COUNT_DISTINCT(IF(b.is_from_particular_store, b.customer_id, NULL)),
    APPROX_COUNT_DISTINCT(b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
WHERE DATE_DIFF(a.date, b.date, YEAR) < 1
GROUP BY a.date
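APPROX_COUNT_DISTINCT trades a small, bounded error for a fixed memory footprint (BigQuery documents it as HyperLogLog++ based). The general idea behind such distinct-count sketches can be illustrated with a simple k-minimum-values estimator in Python; kmv_estimate is an illustrative toy, not anything BigQuery exposes:

```python
import hashlib
import heapq

def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values by tracking only the k smallest
    hash values (k-minimum-values sketch): if hashes are uniform on [0, 1),
    the k-th smallest is about k / n, so n is about (k - 1) / kth_smallest."""
    heap = []   # max-heap (negated) holding the k smallest distinct hashes
    seen = set()
    for v in values:
        # Deterministic hash mapped to [0, 1).
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) / 2**128
        if h in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            seen.add(h)
        elif h < -heap[0]:
            # Evict the current largest of the k smallest.
            seen.discard(-heapq.heappushpop(heap, -h))
            seen.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    return (k - 1) / (-heap[0])
```

With k = 1024 the relative error is typically a few percent, while memory stays bounded by k entries regardless of input size, the same trade-off APPROX_COUNT_DISTINCT makes.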

Comments:

It worked after 20 minutes (your second version)! I'll mark this as the accepted answer! I think that to speed it up I'll have to try trimming down my input table in some way specific to my business logic.

Sure. Makes sense. Glad it worked for you. What about the first version, is it still running? Just curious :o)

I stopped it after 20 minutes, but I'll try it again and let you know.

So I killed it after 1 hour 25 minutes. It didn't complain about resource errors, but I got impatient haha.

Thanks for the update. It still makes sense, at least you now have a working version in the answer above. The other one may still be useful for smaller sets.

Answer 2:

The following is for BigQuery Standard SQL.

Try the slightly refactored version below. It is mostly based on first de-duplicating customers within the same date, and on removing the use of ROW_NUMBER(), which is usually a heavy resource consumer. Obviously I can't test against your real data, so I don't know whether this is enough or needs further improvement, so please try it and let us know.

#standardSQL
SELECT DISTINCT DATE,
  SAFE_DIVIDE(
    (SELECT COUNT(DISTINCT customer_id) FROM UNNEST(last_year_customer_id_that_ordered_from_a_particular_store) AS customer_id), 
    (SELECT COUNT(DISTINCT customer_id) FROM UNNEST(last_year_customer_id_that_ordered) AS customer_id)
  ) AS number_i_want_to_calculate
FROM (
  SELECT DATE,  
    ARRAY_AGG(customer_id) OVER(win) AS last_year_customer_id_that_ordered,
    ARRAY_AGG(IF(is_from_particular_store, customer_id, NULL)) OVER(win) AS last_year_customer_id_that_ordered_from_a_particular_store
  FROM (
    SELECT DISTINCT DATE, customer_id, is_from_particular_store
    FROM my_table_of_orders
  ) 
  WINDOW win AS (ORDER BY UNIX_SECONDS(TIMESTAMP(DATE)) RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW)
)
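The main saving here is the inner SELECT DISTINCT: collapsing repeat (date, customer_id, is_from_particular_store) rows before the window functions run shrinks the arrays being built, without changing any distinct counts. A tiny Python illustration with hypothetical duplicate rows:

```python
# Hypothetical raw order rows: (date, customer_id, is_from_particular_store)
rows = [
    ("2019-05-11", 5, True),
    ("2019-05-11", 5, True),   # same customer ordered twice on the same day
    ("2019-05-11", 4, False),
    ("2019-05-12", 5, True),
]

# The SELECT DISTINCT step: each (date, customer, flag) combination survives once,
# so the ARRAY_AGG windows scan fewer rows while the distinct counts stay the same.
deduped = sorted(set(rows))
print(len(rows), "->", len(deduped))  # 4 -> 3
```

On a 59-million-row table with repeat purchasers, this kind of pre-aggregation can cut the windowed input substantially.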

Comments:

Thanks for your answer. I actually tried to tag you in this because I saw you gave a neat answer to another question I posted. So far your query hasn't timed out with resource issues, but it's still running after about 18 minutes haha.

I think that makes sense because the logic is quite heavy, so let's see whether it finishes or fails again. In the meantime, try the other version I just added :o)
