SQL: How to query the average of monthly sums when some months have no records?

Posted 2023-02-16


【Asked】: 2021-02-26 18:40:55 【Question】:

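TL;WR: How can I query the average of monthly sums when some months have no records (and should therefore count as 0)?

Background

My kids report every day how long they spent doing chores (in a PostgreSQL database). My dataset then looks like this: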

date,user,duration

2020-01-01,Alice,120
2020-01-02,Bob,30
2020-01-03,Charlie,10
2020-01-23,Charlie,10

2020-02-03,Charlie,10
2020-02-23,Charlie,10

2020-03-02,Bob,30
2020-03-03,Charlie,10
2020-03-23,Charlie,10

I would like to know how much they do per month, on average. Specifically, the result I want is:

Alice: 40 = (120+0+0)÷3
Bob: 20 = (30+0+30)÷3
Charlie: 20 = ([10+10]+[10+10]+[10+10])÷3
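
The arithmetic above can be reproduced outside the database with a minimal Python sketch. This is not part of the original question: the names `rows` and `zero_filled_monthly_average` are hypothetical, and the sample rows are simply copied into a list of tuples. It builds the full month × user grid and treats every missing cell as 0.

```python
from collections import defaultdict

# Sample rows from the question: (date, user, duration).
rows = [
    ("2020-01-01", "Alice", 120),
    ("2020-01-02", "Bob", 30),
    ("2020-01-03", "Charlie", 10),
    ("2020-01-23", "Charlie", 10),
    ("2020-02-03", "Charlie", 10),
    ("2020-02-23", "Charlie", 10),
    ("2020-03-02", "Bob", 30),
    ("2020-03-03", "Charlie", 10),
    ("2020-03-23", "Charlie", 10),
]


def zero_filled_monthly_average(rows):
    """Average each user's monthly sum, counting months without records as 0."""
    monthly_sum = defaultdict(int)  # (month, user) -> summed duration
    for day, user, duration in rows:
        monthly_sum[(day[:7], user)] += duration  # "YYYY-MM" identifies the month

    months = sorted({month for month, _ in monthly_sum})
    users = sorted({user for _, user in monthly_sum})

    averages = {}
    for user in users:
        # Every month present in the data counts, even if this user has no row in it.
        per_month = [monthly_sum.get((month, user), 0) for month in months]
        averages[user] = sum(per_month) / len(months)
    return averages


print(zero_filled_monthly_average(rows))
```

With the sample data this gives 40 for Alice and 20 for both Bob and Charlie, matching the figures listed above.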

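Problem

For some months, I have no records for some users (e.g. Alice in February and March). Hence, running the following nested query does not return the result I want: it does not take into account that, since there are no records for those months, Alice's contribution for February and March should be 0 (here her average is wrongly computed as 120).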

-- this does not work
SELECT
    "user",
    round(avg(monthly_duration)) as avg_monthly_sum
FROM (
    SELECT
        date_trunc('month', date),
        "user",
        sum(duration) as monthly_duration
    FROM
        public.chores_record
    GROUP BY
        date_trunc('month', date),
        "user"
) AS monthly_sum
GROUP BY
    "user"
;
-- Doesn't return what I want:
--
-- "user","avg_monthly_sum"
-- "Alice",120
-- "Bob",30
-- "Charlie",20
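
To see why the averages come out too high, it helps to look at the rows the inner GROUP BY actually produces. The sketch below is an illustration only, not the asker's setup: it loads the sample data into an in-memory SQLite database (an assumption made so it runs with the Python standard library), and uses `strftime('%Y-%m', date)` as a stand-in for PostgreSQL's `date_trunc('month', date)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE chores_record (date TEXT, "user" TEXT, duration INTEGER)')
conn.executemany(
    "INSERT INTO chores_record VALUES (?, ?, ?)",
    [
        ("2020-01-01", "Alice", 120),
        ("2020-01-02", "Bob", 30),
        ("2020-01-03", "Charlie", 10),
        ("2020-01-23", "Charlie", 10),
        ("2020-02-03", "Charlie", 10),
        ("2020-02-23", "Charlie", 10),
        ("2020-03-02", "Bob", 30),
        ("2020-03-03", "Charlie", 10),
        ("2020-03-23", "Charlie", 10),
    ],
)

# Same shape as the inner subquery: one row per (month, user) pair that has data.
# SQLite has no date_trunc(); strftime('%Y-%m', date) plays that role here.
inner_rows = conn.execute(
    """
    SELECT strftime('%Y-%m', date) AS month, "user", sum(duration) AS monthly_duration
    FROM chores_record
    GROUP BY strftime('%Y-%m', date), "user"
    ORDER BY "user", month
    """
).fetchall()

for row in inner_rows:
    print(row)
```

Only six (month, user) rows come back instead of the nine combinations that exist. Alice has a single row, so the outer avg() divides her 120 by 1 rather than by 3.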

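So I built a rather cumbersome query that works as follows:

1. list the unique months,
2. list the unique users,
3. generate the month × user combinations,
4. add the monthly sums from the original data,
5. take the average of the monthly sums (treating 'null' as 0).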

SELECT
    unique_user,
    round(avg(COALESCE(monthly_duration, 0))) -- COALESCE transforms 'null' into 0
FROM (
    -- monthly duration with 'null' if no record for that user×month
    SELECT
        month_user_combinations.month,
        month_user_combinations.unique_user,
        monthly_duration.monthly_duration
    FROM
    (
        (
            -- all months×users combinations
            SELECT
                month,
                unique_user
            FROM (
                (
                    -- list of unique months
                    SELECT DISTINCT
                        date_trunc('month', date) as month
                    FROM
                        public.chores_record
                ) AS unique_months
                CROSS JOIN
                (
                    -- list of unique users
                    SELECT DISTINCT
                        "user" as "unique_user"
                    FROM
                        public.chores_record
                ) AS unique_users
            )
        ) AS month_user_combinations
        LEFT OUTER JOIN
        (
            -- monthly duration for existing month×user combination only
            SELECT
                date_trunc('month', date) as month,
                "user",
                sum(duration) as monthly_duration
            FROM
                public.chores_record
            GROUP BY
                date_trunc('month', date),
                "user"
        ) AS monthly_duration
        ON (
            month_user_combinations.month = monthly_duration.month
            AND
            month_user_combinations.unique_user = monthly_duration.user
        )
    )
) AS monthly_duration_for_all_combinations
GROUP BY
    unique_user
;

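This query works, but it is rather bulky.

Question

How can I query the average of monthly sums more elegantly than above, while still taking into account that "no record ⇒ monthly sum = 0"?

Note: it is safe to assume that I only want to compute the average over months that have at least one record (i.e. it is fine not to consider December or April here).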

MWE

CREATE TABLE public.chores_record
(
    date date NOT NULL,
    "user" text NOT NULL,
    duration integer NOT NULL,
    PRIMARY KEY (date, "user")
);

INSERT INTO
    public.chores_record(date, "user", duration)
VALUES
    ('2020-01-01','Alice',120),
    ('2020-01-02','Bob',30),
    ('2020-01-03','Charlie',10),
    ('2020-01-23','Charlie',10),
    ('2020-02-03','Charlie',10),
    ('2020-02-23','Charlie',10),
    ('2020-03-02','Bob',30),
    ('2020-03-03','Charlie',10),
    ('2020-03-23','Charlie',10)
;

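【Comments on the question】:

Consider handling issues of data display (e.g. missing data) in application code.

@Strawberry Sounds interesting, but I'm not sure I fully understand. Could you elaborate or give an example of what you mean?

【Answer 1】:

You can use a CTE to construct a calendar table: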

-- EXPLAIN
WITH cal AS (   -- the distinct months that have at least one record
        SELECT DISTINCT date_trunc('month', date) AS tick
        FROM chores_record
        )
, cnt AS (      -- the number of months (a scalar)
        SELECT COUNT(*) AS nmonth
        FROM cal
        )
SELECT
        x."user"
        , SUM(x.duration) AS tot_duration
        -- divide by the month count itself: SUM(c.nmonth) would grow with the
        -- number of joined rows; the cast avoids integer division
        , SUM(x.duration)::numeric / c.nmonth AS average_month
FROM cal t
CROSS JOIN cnt c        -- attach the single-row month count to every month
LEFT JOIN chores_record x ON date_trunc('month', x.date) = t.tick
GROUP BY x."user", c.nmonth
        ;

【Comments】:

【Answer 2】:

For this you need two additional datasets: the list of kids and the list of months:

with
    k as (/* list of kids */
        select distinct "user" from chores_record),
    m as (/* list of months in format "yyyy-mm-01" */
        select distinct date_trunc('month', "date") as "month" from chores_record),
    d as (/* sums by months and kids */
        select
            date_trunc('month', "date") as "month",
            "user",
            sum(duration) as duration
        from chores_record
        group by 1, 2)
select
    m."month",
    k."user",
    coalesce(d.duration, 0) as duration
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
order by "month", "user";

┌────────────────────────┬─────────┬──────────┐
│         month          │  user   │ duration │
├────────────────────────┼─────────┼──────────┤
│ 2020-01-01 00:00:00+02 │ Alice   │      120 │
│ 2020-01-01 00:00:00+02 │ Bob     │       30 │
│ 2020-01-01 00:00:00+02 │ Charlie │       20 │
│ 2020-02-01 00:00:00+02 │ Alice   │        0 │
│ 2020-02-01 00:00:00+02 │ Bob     │        0 │
│ 2020-02-01 00:00:00+02 │ Charlie │       20 │
│ 2020-03-01 00:00:00+02 │ Alice   │        0 │
│ 2020-03-01 00:00:00+02 │ Bob     │       30 │
│ 2020-03-01 00:00:00+02 │ Charlie │       20 │
└────────────────────────┴─────────┴──────────┘

The last step is to compute the average:

with
    ...
select
    k."user",
    avg(coalesce(d.duration, 0)) as duration
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
group by k."user"
order by k."user";

┌─────────┬─────────────────────┐
│  user   │      duration       │
├─────────┼─────────────────────┤
│ Alice   │ 40.0000000000000000 │
│ Bob     │ 20.0000000000000000 │
│ Charlie │ 20.0000000000000000 │
└─────────┴─────────────────────┘

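【Comments】:

【Answer 3】:

In Postgres, I would recommend using generate_series() to build a calendar table, and then aggregating. The upside is that it works properly even if there are months in which no user at all was active.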

select u."user", avg(coalesce(c.duration, 0)) avg_duration 
from (
    select generate_series(date_trunc('month', min(date)), date_trunc('month', max(date)), '1 month') as dt
    from chores_record
) d
cross join (select distinct "user" from chores_record) u
left join (
    select "user", date_trunc('month', date) as dt, sum(duration) as duration
    from chores_record c 
    group by "user", date_trunc('month', date)
) c on c."user" = u."user" and c.dt = d.dt
group by u."user"

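generate_series() generates the start of every month between the earliest and the latest date in the table. We then cross join that with the list of distinct users (in real life, you would probably have a reference table that stores the users, which you would use instead). Next, we aggregate the original table by user and month, and bring it in with a left join against the user/month combinations. The final step is the outer aggregation.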
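
The difference between a generated calendar and a list of distinct months only shows up when a whole month is empty. The Python sketch below illustrates that idea under stated assumptions: `month_starts` is a hypothetical helper that mimics what generate_series() yields for month-truncated bounds and a '1 month' step, and `record_dates` is made-up data in which nobody logged anything in February.

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of every month from first's month to last's month, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        # Advance by one calendar month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


# Hypothetical data with a gap: no record at all in February 2020.
record_dates = [date(2020, 1, 1), date(2020, 1, 23), date(2020, 3, 2)]

# Calendar built from the min/max dates: January, February and March.
calendar = list(month_starts(min(record_dates), max(record_dates)))

# Months that actually appear in the data: February is missing.
distinct_months = sorted({d.replace(day=1) for d in record_dates})

print(len(calendar), len(distinct_months))
```

With a calendar of three months the averages are divided by 3, whereas an approach based on distinct months would divide by 2. Which of the two is wanted depends on whether fully empty months should count.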

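【Comments】:

generate_series() is both clever and powerful, thanks: this is exactly the kind of "smarter way" I was looking for! (Small typo: the second-to-last line should read […] and c.dt = d.dt)

@ebosi: Indeed. I fixed the typo.

【Answer 4】:

Since the use case is small (not millions of rows), a simple approach is to find separately:

1. the total hours for each user, and
2. the total number of distinct months across all users,

then join the two to get the answer: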

-- the numeric cast avoids integer division; round() keeps whole-number output
select "user", round(totalHours::numeric / monthCount) from
(select "user", sum(duration) totalHours from chores_record group by "user") as a,
(select count(distinct(to_char(date, 'YYYYMM'))) monthCount from chores_record) as b
;

Alice,40
Bob,20
Charlie,20
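
Why dividing a user's total by the overall month count is enough: a month without rows for a user adds 0 to that user's total, so the zero-filled sum is just the plain sum, and only the divisor has to be the number of months. The following Python sketch of the two steps is an illustration only; the variable names are hypothetical and the sample rows are copied from the question.

```python
# Sample rows from the question: (date, user, duration).
rows = [
    ("2020-01-01", "Alice", 120),
    ("2020-01-02", "Bob", 30),
    ("2020-01-03", "Charlie", 10),
    ("2020-01-23", "Charlie", 10),
    ("2020-02-03", "Charlie", 10),
    ("2020-02-23", "Charlie", 10),
    ("2020-03-02", "Bob", 30),
    ("2020-03-03", "Charlie", 10),
    ("2020-03-23", "Charlie", 10),
]

# Step 1: total duration per user over the whole table.
totals = {}
for _, user, duration in rows:
    totals[user] = totals.get(user, 0) + duration

# Step 2: number of distinct months that appear anywhere in the table.
month_count = len({day[:7] for day, _, _ in rows})

# Missing months contribute 0 to a total, so total / month_count is the zero-filled average.
averages = {user: total / month_count for user, total in totals.items()}
print(averages)
```

Note that this shortcut relies on the question's assumption that only months with at least one record count.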

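【Comments】:

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.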
