Calculate the rolling count using window functions
[Posted]: 2019-07-02 01:26:06
[Question]:
I have a table that contains our customers' orders:
order_date: the date of the order (not unique, because each order may contain several products)
customer_id: not unique
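For each customer, I want to count the orders placed before the current order_date, but because order_date contains duplicates, the results are not what I expect.
I am using window functions in Postgres 11.2.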
CREATE TABLE "public"."orders" (
"order_date" timestamp,
"customer_id" integer
);
Insert the data:
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-12-13 20:45:24.571964', 402) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-12-13 20:45:24.571964', 402) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-10-12 20:08:39.635959', 466) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-11-04 22:15:14.905851', 483) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id") VALUES('2018-11-04 22:15:14.905851', 483) RETURNING "order_date", "customer_id";
INSERT INTO "public"."orders"("order_date", "customer_id")
I use this query to produce what I want, but it does not work:
select *,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date
                               range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0) AS customer_orders_count_up_to_now,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date asc
                               range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0) AS customer_orders_last_seven_days
from orders
I expect the output columns customer_orders_count_up_to_now and customer_orders_last_seven_days to be 0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0. However, because of the duplicates in order_date, the actual values differ.
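To see where the extra counts come from, the sketch below runs the same frame over two rows that share a timestamp. In RANGE mode a frame that ends at CURRENT ROW extends to the last peer of the current row (any row with an equal order_date), and EXCLUDE CURRENT ROW removes only the row itself, so each duplicate still counts the other one. The inline VALUES list and the alias o stand in for the orders table and are assumptions made for illustration.

```sql
-- Two orders of customer 402 with an identical timestamp (taken from the sample data).
SELECT o.customer_id,
       o.order_date,
       COUNT(*) OVER (PARTITION BY o.customer_id
                      ORDER BY o.order_date
                      RANGE BETWEEN INTERVAL '100 years' PRECEDING AND CURRENT ROW
                      EXCLUDE CURRENT ROW) AS customer_orders_count_up_to_now
FROM (VALUES (TIMESTAMP '2018-12-13 20:45:24.571964', 402),
             (TIMESTAMP '2018-12-13 20:45:24.571964', 402)
     ) AS o(order_date, customer_id);
-- Each row reports 1 rather than 0: the frame still contains
-- the other row that has the same order_date.
```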
[Comments on the question]:
[Answer 1]: If I understand correctly, you basically want count(distinct) as a window function. Postgres does not (yet) support that, but you can use select distinct in a subquery:
select o.*,
COALESCE(COUNT(*) OVER (partition by o.customer_id
order by o.order_date
range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
0) AS customer_orders_count_up_to_now,
COALESCE(COUNT(*) OVER (partition by o.customer_id
order by o.order_date asc
range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
0) AS customer_orders_last_seven_days
from (SELECT DISTINCT o.customer_id, o.order_date from orders o) o
[Comments]:
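Thanks for the answer. I want a result for every record: each one should ignore the orders that share its order_date and count only the other orders (within the customer_id partition). In fact, I do want to keep all the records, not just the distinct ones.
[Answer 2]: I have found a solution, and I am sharing it here in case anyone else has the same problem: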
-- For each column: rolling count (current row excluded) minus the count of rows sharing the same order_date
select *,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date
                               range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0)
       - COALESCE(COUNT(*) OVER (partition by orders.customer_id, orders.order_date
                                 order by orders.order_date
                                 range between interval '100 years' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                  0) AS customer_orders_count_up_to_now,
       COALESCE(COUNT(*) OVER (partition by orders.customer_id
                               order by orders.order_date asc
                               range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                0)
       - COALESCE(COUNT(*) OVER (partition by orders.customer_id, orders.order_date
                                 order by orders.order_date asc
                                 range BETWEEN interval '7 days' PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW),
                  0) AS customer_orders_last_seven_days
from orders
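The idea is that, to remove the duplicates from the rolling count, we subtract from the computed rolling count the number of records that share the same order_date.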
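A possibly simpler alternative, offered as a sketch and not as this answer's method: PostgreSQL 11 and later also accept EXCLUDE GROUP as a frame exclusion, which drops the current row together with its ordering peers, so no subtraction is needed. The alias o and the UNBOUNDED PRECEDING frame start (used here in place of interval '100 years') are choices made for illustration. COUNT(*) returns 0 for an empty frame, so COALESCE is left out.

```sql
-- Count only orders with a strictly earlier order_date for the same customer.
SELECT o.*,
       COUNT(*) OVER (PARTITION BY o.customer_id
                      ORDER BY o.order_date
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                      EXCLUDE GROUP) AS customer_orders_count_up_to_now,
       -- Same idea, limited to the 7 days before the current order_date.
       COUNT(*) OVER (PARTITION BY o.customer_id
                      ORDER BY o.order_date
                      RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
                      EXCLUDE GROUP) AS customer_orders_last_seven_days
FROM orders AS o;
```

Like the subtraction above, this keeps every row of orders in the result; rows that share an order_date simply do not count one another.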
[Comments]: