How to count distinct values with partition by and order by in Snowflake SQL?

Posted: 2020-12-09 01:50:47

**Question:**

My data looks like this:

| user | eventorder| postal|
|:---- |:---------:| -----:|
| A    | 1         | 60616 |
| A    | 2         | 10000 |
| A    | 3         | 60616 |
| B    | 1         | 20000 |
| B    | 2         | 30000 |
| B    | 3         | 40000 |
| B    | 4         | 30000 |
| B    | 5         | 20000 |

The question I need to answer: for each event order in a user's journey, how many distinct stops has the user travelled through so far?

The ideal result would look like this:

| user | eventorder| postal| travelledStop|
|:---- |:---------:| -----:| ------------:|
| A    | 1         | 60616 |  1    |
| A    | 2         | 10000 |  2    |
| A    | 3         | 60616 |  2    |
| B    | 1         | 20000 |  1    |
| B    | 2         | 30000 |  2    |
| B    | 3         | 40000 |  3    |
| B    | 4         | 30000 |  3    |
| B    | 5         | 20000 |  3    |

Take A as an example: at event order 1, the user has travelled through 60616 only - 1 stop. At event order 2, the user has travelled through 60616 and 10000 - 2 stops. At event order 3, the distinct stops travelled are still 60616 and 10000 - 2 stops.

I am not allowed to combine count(distinct ...) with partition by and order by. I would like to write something like count(distinct postal) over (partition by user order by eventorder), but that is not allowed.

Does anyone know how to solve this? Thanks a lot!
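To make the target semantics concrete, the running distinct count described above can be sketched in plain Python (a minimal illustration of the expected output; the function and variable names are mine, not part of the question):

```python
# For each user, count the distinct postals seen up to and including
# each event order - the "travelledStop" column from the question.
rows = [
    ("A", 1, "60616"), ("A", 2, "10000"), ("A", 3, "60616"),
    ("B", 1, "20000"), ("B", 2, "30000"), ("B", 3, "40000"),
    ("B", 4, "30000"), ("B", 5, "20000"),
]

def travelled_stops(rows):
    seen = {}      # user -> set of postals seen so far
    result = []
    for user, eventorder, postal in sorted(rows, key=lambda r: (r[0], r[1])):
        seen.setdefault(user, set()).add(postal)
        result.append((user, eventorder, postal, len(seen[user])))
    return result

for row in travelled_stops(rows):
    print(row)
```

Running this reproduces the ideal result table, e.g. `('A', 3, '60616', 2)` and `('B', 5, '20000', 3)`.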

**Comments:**

**Answer 1:**

I used the sample data you provided (only the subset for A, but this should scale). The goal here is essentially to build, for each row, an array that accumulates all the postals of the preceding events.

```sql
with _temp as (
    select 'A' as usr, 1 as EventOrder, '60616' as Postal
    union all
    select 'A' as usr, 2 as EventOrder, '10000' as Postal
    union all
    select 'A' as usr, 3 as EventOrder, '60616' as Postal
),
_intermediate as (
    select usr
         , eventorder
         , postal
         , array_slice(
               array_agg(postal)
                   within group (order by eventorder)
                   over (partition by usr)
               , 0, eventorder) as full_array
    from _temp
    group by usr, eventorder, postal
)
select usr, eventorder, postal, count(distinct f.value) as travelledStop
from _intermediate i, lateral flatten(input => i.full_array) f
group by usr, eventorder, postal
```
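The array logic above can be mirrored step by step in plain Python to see what each stage produces (a sketch of the idea only, not Snowflake itself; all names besides the question's columns are illustrative):

```python
# Mirror of the array approach: per user, aggregate all postals in
# event order (array_agg ... over), slice the prefix for each row
# (array_slice), then count distinct values in the slice
# (flatten + count(distinct)).
rows = [("A", 1, "60616"), ("A", 2, "10000"), ("A", 3, "60616")]

# array_agg(postal) within group (order by eventorder) over (partition by usr)
full = {}
for usr, eventorder, postal in sorted(rows, key=lambda r: (r[0], r[1])):
    full.setdefault(usr, []).append(postal)

# array_slice(..., 0, eventorder), then count distinct per row
result = [
    (usr, eventorder, postal, len(set(full[usr][:eventorder])))
    for usr, eventorder, postal in rows
]
print(result)  # [('A', 1, '60616', 1), ('A', 2, '10000', 2), ('A', 3, '60616', 2)]
```

Note this slicing assumes eventorder runs 1..n with no gaps, the same assumption the SQL's `array_slice(..., 0, eventorder)` makes.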

**Comments:**

Very nice solution! I was trying to do the same thing but couldn't figure out how to build the array for each row (I was thinking about window frames, but they aren't supported here); array_slice() is a great approach.

**Answer 2:**

Perhaps the simplest method is a subquery and counting the "1"s:

```sql
select t.*,
       sum(case when seqnum = 1 then 1 else 0 end)
           over (partition by usr order by eventorder) as num_postals
from (select t.*,
             row_number() over (partition by usr, postal order by eventorder) as seqnum
      from t
     ) t
```
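Snowflake isn't needed to check this logic: the same query runs in SQLite (3.25+ supports window functions), used here purely as a stand-in. The table and column names come from the question; the script itself is illustrative:

```python
# Verify the seqnum trick: row_number() = 1 flags the first occurrence
# of each (usr, postal) pair, and a cumulative sum over eventorder
# counts those flags - i.e. distinct stops travelled so far.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (usr text, eventorder int, postal text)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [("A", 1, "60616"), ("A", 2, "10000"), ("A", 3, "60616"),
     ("B", 1, "20000"), ("B", 2, "30000"), ("B", 3, "40000"),
     ("B", 4, "30000"), ("B", 5, "20000")],
)

rows = conn.execute("""
    select usr, eventorder, postal,
           sum(case when seqnum = 1 then 1 else 0 end)
               over (partition by usr order by eventorder) as travelledstop
    from (select t.*,
                 row_number() over (partition by usr, postal
                                    order by eventorder) as seqnum
          from t)
    order by usr, eventorder
""").fetchall()

for r in rows:
    print(r)
```

The output matches the ideal result table from the question, including the tricky repeats: `('A', 3, '60616', 2)` and `('B', 4, '30000', 3)`.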

**Comments:**

**Answer 3:**

I like @Daniel Zagales' answer, but here is a workaround using dense_rank and sum:

```sql
with temp as (
    select 'A' as usr, 1 as EventOrder, '60616' as Postal
    union all
    select 'A' as usr, 2 as EventOrder, '10000' as Postal
    union all
    select 'A' as usr, 3 as EventOrder, '60616' as Postal
    union all
    select 'B' as usr, 1 as EventOrder, '20000' as Postal
    union all
    select 'B' as usr, 2 as EventOrder, '30000' as Postal
    union all
    select 'B' as usr, 3 as EventOrder, '40000' as Postal
    union all
    select 'B' as usr, 4 as EventOrder, '30000' as Postal
    union all
    select 'B' as usr, 5 as EventOrder, '20000' as Postal
),
temp2 as (
    select temp.*,
           dense_rank() over (partition by usr, Postal order by EventOrder) as rks
    from temp
)
select usr, eventorder, postal,
       sum(case when rks = 1 then 1 else 0 end)
           over (partition by usr order by EventOrder) as travelledStop
from temp2
order by usr, EventOrder
```

Basically, dense_rank is used to flag the first occurrence of each stop, and those flags are then summed up.

db<>fiddle

**Comments:**
