Calculate the rolling count using window functions

【Title】: Calculate the rolling count using window functions 【Posted】: 2019-07-02 01:26:06 【Question】:

I have a table containing our customers' orders:
order_date: the order date (not unique, because each order may contain several products)
customer_id: not unique

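I want to count, for each customer, the number of orders placed up to the current order_date, but because order_date contains duplicates, the result is not what I expect.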

I am using window functions in Postgres 11.2.

CREATE TABLE "public"."orders" (
    "order_date" timestamp,
    "customer_id" integer
);

Insert data:

INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-12-13 20:45:24.571964', 402) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-12-13 20:45:24.571964', 402) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-10-12 20:08:39.635959', 466) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-11-04 22:15:14.905851', 483) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-11-04 22:15:14.905851', 483) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") 

I used the following query to produce what I want, but it does not work:

select *,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date
                               range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0) AS customer_orders_count_up_to_now,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date asc
                               range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0) AS customer_orders_last_seven_days
from orders

I expect both output columns, customer_orders_count_up_to_now and customer_orders_last_seven_days, to be 0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0. However, because of the duplicates in order_date, the actual values differ.
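The effect can be reproduced with a minimal sketch, assuming Postgres 11 or later. The table name orders_peer_demo is hypothetical, and its two rows copy the duplicated timestamp of customer 402. In a RANGE frame, CURRENT ROW as the frame end extends to every row that shares the current row's order_date (its peers), so EXCLUDE CURRENT ROW removes only the row itself and each duplicate still counts the other one:

```sql
-- Hypothetical demo table; the rows copy the duplicated timestamp of customer 402.
CREATE TEMP TABLE orders_peer_demo (
    order_date  timestamp,
    customer_id integer
);

INSERT INTO orders_peer_demo (order_date, customer_id) VALUES
    ('2018-12-13 20:45:24.571964', 402),
    ('2018-12-13 20:45:24.571964', 402);

-- Same frame as in the question: the peer row stays inside the frame,
-- so each of the two rows reports 1 instead of the desired 0.
SELECT d.*,
       COUNT(*) OVER (PARTITION BY d.customer_id
                      ORDER BY d.order_date
                      RANGE BETWEEN interval '100 years' PRECEDING AND CURRENT ROW
                      EXCLUDE CURRENT ROW) AS customer_orders_count_up_to_now
FROM orders_peer_demo d;
```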

【Answer 1】:

If I understand correctly, you basically want count(distinct) as a window function. Postgres does not support that (yet). But you can use select distinct in a subquery:

select o.*,
       COALESCE(COUNT(*) OVER (partition by o.customer_id
                               order by o.order_date
                               range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0) AS customer_orders_count_up_to_now,
       COALESCE(COUNT(*) OVER (partition by o.customer_id
                               order by o.order_date asc
                               range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0) AS customer_orders_last_seven_days 
from (SELECT DISTINCT o.customer_id, o.order_date from orders o) o

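【Comments】:

Thanks for your answer. I want every record: for each row, the count should ignore the orders that share its order_date and consider only the other ones (within the customer_id partition). In fact, I do want all of the records, not just the distinct ones.

【Answer 2】: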

I have found a solution; I am sharing it here in case anyone else has the same problem:

select *,
       -- rows of this customer up to the current order_date (peers included), minus the current row
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date
                               range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW), 0)
       -- minus the other rows that share the current row's customer_id and order_date
       - COALESCE(COUNT(*) OVER (partition by orders.customer_id, orders.order_date
                                 order by orders.order_date
                                 range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW), 0)
         AS customer_orders_count_up_to_now,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date asc
                               range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW), 0)
       - COALESCE(COUNT(*) OVER (partition by orders.customer_id, orders.order_date
                                 order by orders.order_date asc
                                 range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW), 0)
         AS customer_orders_last_seven_days
from orders

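The idea is that, to remove the duplicates from the rolling count, we subtract from the computed rolling count the number of other records that share the current row's order_date.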
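As an alternative sketch, not the approach used in this answer, the frame option EXCLUDE GROUP (available since Postgres 11) drops the current row together with every row that shares its order_date, so no subtraction should be needed. The table orders_demo is hypothetical, its third row is invented so that one row has earlier orders to count, and UNBOUNDED PRECEDING is swapped in for the '100 years' interval:

```sql
-- Hypothetical demo table; the first two rows copy the question's data for customer 402.
CREATE TEMP TABLE orders_demo (
    order_date  timestamp,
    customer_id integer
);

INSERT INTO orders_demo (order_date, customer_id) VALUES
    ('2018-12-13 20:45:24.571964', 402),
    ('2018-12-13 20:45:24.571964', 402),
    ('2018-12-15 09:00:00', 402);  -- invented row, not part of the original data

-- EXCLUDE GROUP removes the current row and its peers (same order_date)
-- from the frame, leaving only strictly earlier orders of the same customer.
SELECT o.*,
       COUNT(*) OVER (PARTITION BY o.customer_id
                      ORDER BY o.order_date
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                      EXCLUDE GROUP) AS customer_orders_count_up_to_now,
       COUNT(*) OVER (PARTITION BY o.customer_id
                      ORDER BY o.order_date
                      RANGE BETWEEN interval '7 days' PRECEDING AND CURRENT ROW
                      EXCLUDE GROUP) AS customer_orders_last_seven_days
FROM orders_demo o;
```

On this sample, the two rows dated 2018-12-13 get 0 in both columns and the invented row dated 2018-12-15 gets 2 in both. COUNT(*) over an empty frame returns 0, so COALESCE is not required here.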
