Difference between "ROWS BETWEEN" and "RANGE BETWEEN" in (Presto) window function "OVER" clause

Posted: 2020-06-03 17:49:01

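This question is mainly about older versions of PrestoSQL; the issues described here were resolved in the (now renamed) Trino project as of version 346. However, Amazon's Athena is based on Presto 0.217 (Athena engine 2) and 0.172 (Athena engine 1), both of which do have the issues described below. This question was written specifically around Athena engine 1 / PrestoSQL version 0.172.

Questions (tl;dr)

    What is the difference between ROWS BETWEEN and RANGE BETWEEN in Presto window functions?
    Are these just synonyms for each other, or is there a core conceptual difference?
    If they are just synonyms, why does ROWS BETWEEN allow more options than RANGE BETWEEN?
    Is there a query scenario where you can use exactly the same parameters with ROWS BETWEEN and RANGE BETWEEN and get different results?
    If you only use unbounded / current row, is there a scenario where you would use RANGE instead of ROWS (or vice versa)?
    Since ROWS has more options, why isn't it mentioned in the documentation at all? o_O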

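Commentary

The Presto documentation is fairly quiet even about RANGE, and does not mention ROWS at all. I haven't found many discussions or examples of window functions in Presto. I'm starting to set up the Presto code base to try to figure this out. Hopefully someone can save me from that, and we can improve the documentation together.

The Presto code has a parser and test cases for the ROWS variant, but ROWS is not mentioned in the documentation.

The test cases I found that use both ROWS and RANGE don't test anything that differs between the two syntaxes.

They almost look like synonyms, but they do behave differently in my testing, and they have different allowed parameters and validation rules.

The following examples can be run with the starburstdata/presto Docker image running Presto 0.213-e-0.1. Typically I run Presto 0.172 through Amazon Athena, and I have almost always ended up using ROWS.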

RANGE

RANGE seems to be limited to "UNBOUNDED" and "CURRENT ROW". The following returns an error:

range between 1 preceding and 1 following

use tpch.tiny;

select custkey, orderdate,
       array_agg(orderdate) over ( 
           partition by custkey 
           order by orderdate asc 
           range between 1 preceding and 1 following
       ) previous_orders 
from orders where custkey in (419, 320) and orderdate < date('1996-01-01')
order by custkey, orderdate asc;

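Error: Window frame RANGE PRECEDING is only supported with UNBOUNDED

The following range syntaxes do work fine (with the expected differing results). All of the following examples are based on the query above, changing only the frame clause.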

range between unbounded preceding and current row

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30]
     320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29]
     419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

range between current row and unbounded following

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1992-07-30 | [1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-07-08 | [1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-08-04 | [1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-09-18 | [1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1993-12-29 | [1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1995-01-30]

range between unbounded preceding and unbounded following

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

The three working examples for RANGE above all work with ROWS as well, and produce identical output.

rows between unbounded preceding and current row
rows between current row and unbounded following
rows between unbounded preceding and unbounded following

Output omitted: identical to the above.

However, ROWS allows far more control, because you can also use the syntax that fails with RANGE above:

rows between 1 preceding and 1 following

 custkey | orderdate  |           previous_orders
---------+------------+--------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-07-08 | [1992-07-30, 1994-07-08, 1994-08-04]
     320 | 1994-08-04 | [1994-07-08, 1994-08-04, 1994-09-18]
     320 | 1994-09-18 | [1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1993-12-29, 1995-01-30]
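
As a way to reason about the output above, here is a minimal Python sketch of what a ROWS frame with numeric offsets appears to do: it counts physical rows around the current row and clamps at the partition boundaries. The function name `rows_frame` is hypothetical, and the input is assumed to be a single partition already sorted by the ORDER BY key; this emulates the observed behavior and is not Presto's implementation.

```python
def rows_frame(values, preceding, following):
    """Emulate ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING
    over one partition that is already sorted by the ORDER BY key."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - preceding)               # clamp at the partition start
        end = min(len(values), i + following + 1)   # clamp at the partition end
        frames.append(values[start:end])
    return frames


# Order dates for custkey 419, taken from the result table above.
orders_419 = ["1992-03-16", "1993-12-29", "1995-01-30"]
frames_419 = rows_frame(orders_419, 1, 1)

for orderdate, frame in zip(orders_419, frames_419):
    print(orderdate, frame)
```

The first and last rows get only two entries because there is no row before the first or after the last within the partition.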

rows between current row and 1 following

 custkey | orderdate  |     previous_orders
---------+------------+--------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30]
     320 | 1992-07-30 | [1992-07-30, 1994-07-08]
     320 | 1994-07-08 | [1994-07-08, 1994-08-04]
     320 | 1994-08-04 | [1994-08-04, 1994-09-18]
     320 | 1994-09-18 | [1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29]
     419 | 1993-12-29 | [1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1995-01-30]

rows between 5 preceding and 2 preceding

 custkey | orderdate  |                 previous_orders
---------+------------+--------------------------------------------------
     320 | 1992-07-10 | NULL
     320 | 1992-07-30 | NULL
     320 | 1994-07-08 | [1992-07-10]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
     419 | 1992-03-16 | NULL
     419 | 1993-12-29 | NULL
     419 | 1995-01-30 | [1992-03-16]
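
The NULL entries above are what you would expect if the frame can be empty: for the first two rows of each partition there is no row two or more positions back. A minimal sketch of that reasoning, assuming a sorted single partition; the helper name `rows_between_preceding` is hypothetical, and mapping an empty frame to `None` simply mirrors the NULLs in the output above.

```python
def rows_between_preceding(values, start_offset, end_offset):
    """Emulate ROWS BETWEEN <start_offset> PRECEDING AND <end_offset> PRECEDING
    over one partition that is already sorted by the ORDER BY key."""
    results = []
    for i in range(len(values)):
        start = max(0, i - start_offset)   # clamp at the partition start
        end = i - end_offset + 1           # exclusive slice end
        # If end <= 0, no row lies far enough back, so the frame is empty.
        frame = values[start:end] if end > 0 else []
        results.append(frame or None)      # empty frame shown as NULL above
    return results


# Order dates for custkey 320, taken from the result table above.
orders_320 = ["1992-07-10", "1992-07-30", "1994-07-08",
              "1994-08-04", "1994-09-18", "1994-10-12"]
frames_320 = rows_between_preceding(orders_320, 5, 2)

for orderdate, frame in zip(orders_320, frames_320):
    print(orderdate, frame)
```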

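Comments on the question:

range is used to define a window that covers something like "the last 6 months", regardless of how many rows that includes. But I don't know Presto.

@a_horse_with_no_name So it sounds like other SQL engines/syntaxes allow RANGE based on column values, which Presto doesn't seem to support. If that is the case, then Presto should lean towards documenting ROWS instead, since that is basically all it supports. That does make sense: "UNBOUNDED" would be the same in both cases.

For an explanation, see: modern-sql.com/blog/2019-02/postgresql-11#over

sqlitetutorial.net/sqlite-window-functions/sqlite-window-frame This explanation for SQLite also matches the behavior I see in Presto very well.

Just wanted to chime in and thank you for your thoughtfulness, diligence, and commitment to clarity in this post. This is one of the "gems" we all hope to find when desperately searching SO. Kudos!

Answer 1:

ROWS is literally the number of rows before and after the current row that you want to aggregate. So ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ends up with at most 3 rows: the current row, 1 row before it and 1 row after it, regardless of the value of orderdate.

RANGE looks at the values of orderdate and decides which rows should be aggregated and which should not. So ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING would theoretically take every row whose value lies between orderdate-1 and orderdate+1 inclusive, which can be more than 3 rows (the original answer links to a fuller explanation).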
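
To make the distinction concrete, here is a small Python sketch contrasting the two frame types with an offset of 1. It follows the standard SQL semantics described in this answer, which the old Presto versions discussed here reject for RANGE. The sample `days` values and both function names are hypothetical, and the keys are assumed to be sorted integers within one partition.

```python
def rows_frame(keys, i, preceding, following):
    """ROWS: take physical rows by position around row i."""
    return keys[max(0, i - preceding): i + following + 1]


def range_frame(keys, i, preceding, following):
    """RANGE: take every row whose ORDER BY value lies within
    [current - preceding, current + following]."""
    low, high = keys[i] - preceding, keys[i] + following
    return [k for k in keys if low <= k <= high]


# Hypothetical ORDER BY values (for example a day number); note the duplicate 2.
days = [1, 2, 2, 3, 5]

rows_frames = [rows_frame(days, i, 1, 1) for i in range(len(days))]
range_frames = [range_frame(days, i, 1, 1) for i in range(len(days))]

for day, by_rows, by_range in zip(days, rows_frames, range_frames):
    print(day, by_rows, by_range)
```

For the rows with day 2, the ROWS frame holds three values while the RANGE frame holds four (1, 2, 2, 3). For day 5, the RANGE frame holds only the row itself, because no other value falls between 4 and 6.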

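In Presto, ROWS is fully implemented, but RANGE is for some reason only partially implemented: you can use it only with CURRENT ROW and UNBOUNDED.

Note: recent versions of Trino (formerly known as Presto SQL) fully support RANGE and GROUPS frames. See this blog post for an explanation of how they work.

In Presto, one way to see the difference between the two is to make sure the ORDER BY column contains duplicate values: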

WITH
   tt1  (custkey, orderdate, product) AS 
      ( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
                               ('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) )  )

SELECT *, 
       array_agg(product) OVER (partition by custkey) c, 
       array_agg(product) OVER (partition by custkey order by orderdate) c_order,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,
       
       -- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING)  range22,
          -- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN 2 PRECEDING AND 2 FOLLOWING)  rows22

from tt1
order by custkey, orderdate, product

You can run it, look at the full results, and learn from them.

Here are just some of the interesting columns:

custkey   orderdate     product    range_ubc           rows_ubc
a         10/07/1992    3          [3]                 [3]
a         10/08/1993    4          [3, 4]              [3, 4]
a         13/07/1994    5          [3, 4, 5]           [3, 4, 5]
a         13/09/1995    5          [3, 4, 5, 5, 9]     [3, 4, 5, 5]
a         13/09/1995    9          [3, 4, 5, 5, 9]     [3, 4, 5, 5, 9]
a         13/01/1997    4          [3, 4, 5, 5, 9, 4]  [3, 4, 5, 5, 9, 4]
b         10/07/1992    4          [6, 4]              [6, 4]
b         10/07/1992    6          [6, 4]              [6]
b         13/07/1994    5          [6, 4, 5, 9]        [6, 4, 5]
b         13/07/1994    9          [6, 4, 5, 9]        [6, 4, 5, 9]
b         11/11/1998    9          [6, 4, 5, 9, 9]     [6, 4, 5, 9, 9]

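Look at the 5th line of the table, counting the header: orderdate:13/09/1995, product:5 (note that 13/09/1995 appears twice for custkey:a). You can see that ROWS did take all rows from the top down to the current row. RANGE, however, also includes the value from the row after it, because that row has exactly the same orderdate and is therefore considered part of the same window.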
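
The peer-row behavior can be reproduced with a short Python sketch. It assumes that, for RANGE, CURRENT ROW means "the current row plus all rows with the same ORDER BY value", which is consistent with the table above. The function names are hypothetical, and the rows are custkey 'a' from the sample data, already sorted by orderdate.

```python
# (orderdate, product) rows for custkey 'a' from the query above.
rows_a = [
    ("1992-07-10", 3),
    ("1993-08-10", 4),
    ("1994-07-13", 5),
    ("1995-09-13", 5),
    ("1995-09-13", 9),
    ("1997-01-13", 4),
]


def rows_unbounded_to_current(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    the frame ends at the current row's own position."""
    return [[p for _, p in rows[: i + 1]] for i in range(len(rows))]


def range_unbounded_to_current(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    the frame ends at the last row sharing the current ORDER BY value.
    ISO date strings compare correctly as plain strings."""
    return [[p for k, p in rows if k <= key] for key, _ in rows]


rows_ubc = rows_unbounded_to_current(rows_a)
range_ubc = range_unbounded_to_current(rows_a)

for (orderdate, product), r, g in zip(rows_a, rows_ubc, range_ubc):
    print(orderdate, product, g, r)
```

The two results differ only on the first of the two 1995-09-13 rows, which is exactly the row highlighted above.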

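Comments on the answer:

Thanks! Very clear :) Is it also possible to do the calculation only if the entire window is full? Or should you wrap it in a case when for that? :/

Thanks. I'm not aware of any way other than wrapping it in a case when.

It's encouraging to see that Trino (formerly Presto) has implemented the feature more fully. Thank you. For context, Amazon's Athena product is still built on Presto 0.217 (Athena engine v2) or Presto 0.172 (Athena engine v1), so AWS Athena's capabilities remain limited. Hopefully they make a bigger leap with the v3 engine release and we can see these enhancements there.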
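
The "only compute when the window is full" idea from the comments can be sketched as follows. This is an assumption about what the suggested case when wrapper would check, namely that the frame contains the expected number of rows, expressed in Python rather than SQL. The function name `full_window_only` and the sample `totals` are hypothetical.

```python
def full_window_only(values, preceding, following, aggregate):
    """Apply `aggregate` to a ROWS-style frame only when the frame holds
    all preceding + following + 1 rows; otherwise yield None (NULL)."""
    expected = preceding + following + 1
    results = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + following + 1]
        results.append(aggregate(frame) if len(frame) == expected else None)
    return results


# Hypothetical per-row values within one sorted partition.
totals = [10, 20, 30, 40]
moving_sums = full_window_only(totals, 1, 1, sum)
print(moving_sums)
```

The rows at the partition edges produce None because their frames are truncated to two rows.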
