SQL: How to query the average of monthly sum, when some months don't have record?
Posted: 2021-02-26 18:40:55
TL;WR: How do I query the average of monthly sums when some months have no record (and should therefore count as 0)?
Background
My kids report every day how much time they spent on chores (in a PostgreSQL database). My dataset then looks like this:
date,user,duration
2020-01-01,Alice,120
2020-01-02,Bob,30
2020-01-03,Charlie,10
2020-01-23,Charlie,10
2020-02-03,Charlie,10
2020-02-23,Charlie,10
2020-03-02,Bob,30
2020-03-03,Charlie,10
2020-03-23,Charlie,10
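I want to know how much they do per month, on average. Specifically, the result I want is:
Alice: 40 = (120 + 0 + 0) ÷ 3
Bob: 20 = (30 + 0 + 30) ÷ 3
Charlie: 20 = ([10 + 10] + [10 + 10] + [10 + 10]) ÷ 3
Problem
In some months I have no record for some users (e.g. Alice in February and March). Running the following nested query therefore does not return the result I want: it does not take into account that, because there are no records for those months, Alice's contribution in February and March should be 0 (here her average is wrongly computed as 120).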
-- this does not work
SELECT
    "user",
    round(avg(monthly_duration)) as avg_monthly_sum
FROM (
    SELECT
        date_trunc('month', date),
        "user",
        sum(duration) as monthly_duration
    FROM
        public.chores_record
    GROUP BY
        date_trunc('month', date),
        "user"
) AS monthly_sum
GROUP BY
    "user"
;
-- Doesn't return what I want:
--
-- "user","avg_monthly_sum"
-- "Alice",120
-- "Bob",30
-- "Charlie",20
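The reason the naive average is too high can be made visible with a small diagnostic query. This is only a sketch against the `public.chores_record` table from the question; the column aliases `months_with_records` and `months_in_table` are made up for illustration. It compares the number of months in which each user has at least one record (the divisor that `avg()` effectively uses above) with the number of distinct months in the whole table (the divisor that is actually wanted).

```sql
-- Diagnostic sketch: which divisor does avg() use for each user?
SELECT
    "user",
    -- months in which this user has at least one record
    count(DISTINCT date_trunc('month', date)) AS months_with_records,
    -- months that appear anywhere in the table
    (SELECT count(DISTINCT date_trunc('month', date))
       FROM public.chores_record)             AS months_in_table
FROM public.chores_record
GROUP BY "user"
ORDER BY "user";
```

With the sample data, Alice has records in 1 of the 3 months and Bob in 2 of 3, which is why their averages come out as 120 ÷ 1 and 60 ÷ 2 instead of 120 ÷ 3 and 60 ÷ 3.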
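So I built a rather cumbersome query that does the following:
- list the unique months,
- list the unique users,
- generate the month × user combinations,
- add the monthly sums from the raw data,
- take the average of the monthly sums (treating 'null' as 0).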
SELECT
    unique_user,
    round(avg(COALESCE(monthly_duration, 0))) -- COALESCE transforms 'null' into 0
FROM (
    -- monthly duration with 'null' if no record for that user×month
    SELECT
        month_user_combinations.month,
        month_user_combinations.unique_user,
        monthly_duration.monthly_duration
    FROM
        (
            (
                -- all months×users combinations
                SELECT
                    month,
                    unique_user
                FROM (
                    (
                        -- list of unique months
                        SELECT DISTINCT
                            date_trunc('month', date) as month
                        FROM
                            public.chores_record
                    ) AS unique_months
                    CROSS JOIN
                    (
                        -- list of unique users
                        SELECT DISTINCT
                            "user" as "unique_user"
                        FROM
                            public.chores_record
                    ) AS unique_users
                )
            ) AS month_user_combinations
            LEFT OUTER JOIN
            (
                -- monthly duration for existing month×user combination only
                SELECT
                    date_trunc('month', date) as month,
                    "user",
                    sum(duration) as monthly_duration
                FROM
                    public.chores_record
                GROUP BY
                    date_trunc('month', date),
                    "user"
            ) AS monthly_duration
            ON (
                month_user_combinations.month = monthly_duration.month
                AND
                month_user_combinations.unique_user = monthly_duration.user
            )
        )
) AS monthly_duration_for_all_combinations
GROUP BY
    unique_user
;
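This query works, but it is rather bulky.
Question
How can I query the average of monthly sums more elegantly than above, while still accounting for "no record ⇒ monthly sum = 0"?
Note: it is safe to assume that I only want to average over months that have at least one record (i.e. it is fine not to count December or April here).
MWE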
CREATE TABLE public.chores_record
(
    date date NOT NULL,
    "user" text NOT NULL,
    duration integer NOT NULL,
    PRIMARY KEY (date, "user")
);
INSERT INTO
    public.chores_record(date, "user", duration)
VALUES
    ('2020-01-01','Alice',120),
    ('2020-01-02','Bob',30),
    ('2020-01-03','Charlie',10),
    ('2020-01-23','Charlie',10),
    ('2020-02-03','Charlie',10),
    ('2020-02-23','Charlie',10),
    ('2020-03-02','Bob',30),
    ('2020-03-03','Charlie',10),
    ('2020-03-23','Charlie',10)
;
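Comments:
Consider handling issues of data display (e.g. missing data) in application code.
@Strawberry Sounds interesting, but I'm not sure I fully understand. Could you elaborate or give an example of what you mean?
Answer 1:
You can use a CTE to construct a calendar table: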
-- EXPLAIN
WITH cal AS ( -- The unique months
    SELECT DISTINCT date_trunc('month', date) AS tick
    FROM chores_record
)
, cnt AS ( -- the number of months (a scalar)
    SELECT COUNT(*) AS nmonth
    FROM cal
)
SELECT
    x."user"
    , SUM(x.duration) AS tot_duration
    -- nmonth is the same on every row, so MAX() just picks it up
    -- (summing it would multiply the divisor by the row count); this is ugly ...
    , SUM(x.duration) / MAX(c.nmonth) AS average_month -- integer division
FROM cal t
JOIN cnt c ON true -- This is ugly
LEFT JOIN chores_record x ON date_trunc('month', x.date) = t.tick
GROUP BY x."user"
;
Answer 2:
For this you need two additional datasets: the list of kids and the list of months:
with
    k as (/* list of kids */
        select distinct "user" from chores_record),
    m as (/* list of months in format "yyyy-mm-01" */
        select distinct date_trunc('month', "date") as "month" from chores_record),
    d as (/* sums by months and kids */
        select
            date_trunc('month', "date") as "month",
            "user",
            sum(duration) as duration
        from chores_record
        group by 1, 2)
select
    m."month",
    k."user",
    coalesce(d.duration, 0) as duration
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
order by "month", "user";
┌────────────────────────┬─────────┬──────────┐
│ month │ user │ duration │
├────────────────────────┼─────────┼──────────┤
│ 2020-01-01 00:00:00+02 │ Alice │ 120 │
│ 2020-01-01 00:00:00+02 │ Bob │ 30 │
│ 2020-01-01 00:00:00+02 │ Charlie │ 20 │
│ 2020-02-01 00:00:00+02 │ Alice │ 0 │
│ 2020-02-01 00:00:00+02 │ Bob │ 0 │
│ 2020-02-01 00:00:00+02 │ Charlie │ 20 │
│ 2020-03-01 00:00:00+02 │ Alice │ 0 │
│ 2020-03-01 00:00:00+02 │ Bob │ 30 │
│ 2020-03-01 00:00:00+02 │ Charlie │ 20 │
└────────────────────────┴─────────┴──────────┘
The last step is to compute the average:
with
    ...
select
    k."user",
    avg(coalesce(d.duration, 0)) as duration
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
group by k."user"
order by k."user";
┌─────────┬─────────────────────┐
│ user │ duration │
├─────────┼─────────────────────┤
│ Alice │ 40.0000000000000000 │
│ Bob │ 20.0000000000000000 │
│ Charlie │ 20.0000000000000000 │
└─────────┴─────────────────────┘
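For convenience, the two snippets of this answer can be assembled into one statement. The following is a sketch that reuses the CTEs `k`, `m` and `d` from the first snippet and the final `select` from the second; the `round()` call and the alias `avg_monthly_sum` are additions borrowed from the question, not part of this answer.

```sql
WITH
    -- list of kids
    k AS (SELECT DISTINCT "user" FROM chores_record),
    -- list of months that have at least one record
    m AS (SELECT DISTINCT date_trunc('month', "date") AS "month" FROM chores_record),
    -- sums by month and kid (only for combinations that have records)
    d AS (
        SELECT
            date_trunc('month', "date") AS "month",
            "user",
            sum(duration) AS duration
        FROM chores_record
        GROUP BY 1, 2
    )
SELECT
    k."user",
    -- a missing kid × month combination counts as 0
    round(avg(coalesce(d.duration, 0))) AS avg_monthly_sum
FROM k
CROSS JOIN m
LEFT JOIN d ON k."user" = d."user" AND m."month" = d."month"
GROUP BY k."user"
ORDER BY k."user";
```

On the sample data from the question this should give 40 for Alice and 20 for both Bob and Charlie, matching the table above after rounding.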
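Answer 3:
In Postgres, I would recommend using generate_series() to build a calendar table, then aggregating. The upside is that it works properly even if there are months in which no user was active.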
select u."user", avg(coalesce(c.duration, 0)) avg_duration
from (
    select generate_series(date_trunc('month', min(date)), date_trunc('month', max(date)), '1 month') as dt
    from chores_record
) d
cross join (select distinct "user" from chores_record) u
left join (
    select "user", date_trunc('month', date) as dt, sum(duration) as duration
    from chores_record c
    group by "user", date_trunc('month', date)
) c on c."user" = u."user" and c.dt = d.dt
group by u."user"
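generate_series() generates all month start dates between the earliest and the latest date in the table. We then cross join it with the list of distinct users (in real life, you probably have a reference table that stores the users, which you would use instead). Then we aggregate the original table by user and month, and left join the result with the user/month combinations. The last step is the outer aggregation.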
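To see what the calendar part produces on its own, generate_series() can be run with literal bounds. This is a minimal sketch: the two dates are made-up stand-ins for `min(date)` and `max(date)`, and the cast to `date` is only there to make the output easier to read.

```sql
-- One row per month start between the two (hypothetical) bounds,
-- regardless of whether any chores were recorded in that month.
SELECT generate_series(
           date_trunc('month', date '2020-01-15'),
           date_trunc('month', date '2020-04-20'),
           interval '1 month'
       )::date AS month_start;
```

This should return the first day of January, February, March and April 2020. Because every month in the range is generated, a month with no record at all still contributes a 0 to each user's average, which is where this approach differs from building the month list with `select distinct`.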
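Comments:
generate_series() is both clever and powerful, thanks: this is exactly the kind of "smarter way" I was looking for! (Small typo: the second-to-last line should be […] and c.dt = d.dt)
@ebosi: Indeed. I fixed the typo.
Answer 4: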
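Since the use case is small (not millions of rows), a simple approach is to separately find
- the total hours per user, and
- the total number of distinct months across all users,
then join the two to get the answer.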
select "user", totalHours/monthCount from
(select "user", sum(duration) totalHours from chores_record group by "user") as a,
(select count(distinct(to_char(date, 'YYYYMM'))) monthCount from chores_record) as b
;
Alice,40
Bob,20
Charlie,20
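This works because averaging monthly sums with the missing months counted as 0 is the same as dividing each user's overall total by the number of distinct months. Note that dividing one integer by another in Postgres truncates the result. A variant that rounds instead could look like the sketch below; the snake_case aliases and the `date_trunc` month count are stylistic substitutions, not part of the answer above.

```sql
SELECT
    a."user",
    -- cast to numeric so the division is not truncated before rounding
    round(a.total_duration::numeric / b.month_count) AS avg_monthly_sum
FROM (
    -- total duration per user
    SELECT "user", sum(duration) AS total_duration
    FROM chores_record
    GROUP BY "user"
) AS a
CROSS JOIN (
    -- number of distinct months that have at least one record (a single row)
    SELECT count(DISTINCT date_trunc('month', date)) AS month_count
    FROM chores_record
) AS b
ORDER BY a."user";
```

Like the original, this only counts months that have at least one record from any user, which matches the assumption stated in the question.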
Comments:
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.