SQL: How to query the average of monthly sum, when some months don't have record?

[Published]: 2021-02-26 18:40:55

[Question]:

TL;WR: How do I query the average of monthly sums when some months have no records (and should therefore count as 0)?


Background

My kids report every day how much time they spent on chores (in a PostgreSQL database). My dataset then looks like this:

date,user,duration

2020-01-01,Alice,120
2020-01-02,Bob,30
2020-01-03,Charlie,10
2020-01-23,Charlie,10

2020-02-03,Charlie,10
2020-02-23,Charlie,10

2020-03-02,Bob,30
2020-03-03,Charlie,10
2020-03-23,Charlie,10

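I would like to know how much they do per month on average. Specifically, the result I want is:

Alice: 40 = (120+0+0)÷3
Bob: 20 = (30+0+30)÷3
Charlie: 20 = ([10+10]+[10+10]+[10+10])÷3

Problem

In some months, I have no records for some users (e.g. Alice in February and March). As a result, the following nested query does not return what I want: it does not take into account that, since there are no records for those months, Alice's contribution for February and March should be 0 (here her average is wrongly computed as 120).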

-- this does not work
SELECT
    "user",
    round(avg(monthly_duration)) as avg_monthly_sum
FROM (
    SELECT
        date_trunc('month', date),
        "user",
        sum(duration) as monthly_duration
    FROM
        public.chores_record
    GROUP BY
        date_trunc('month', date),
        "user"
) AS monthly_sum
GROUP BY
    "user"
;
-- Doesn't return what I want:
--
-- "user","avg_monthly_sum"
-- "Alice",120
-- "Bob",30
-- "Charlie",20
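
A quick way to see where the 120 comes from is to count how many monthly rows each user contributes to the inner query. The sketch below only reuses the table and columns from the question; the aliases months_with_records and total_duration are made up for illustration.

```sql
-- Diagnostic sketch: how many months does avg() actually see per user?
SELECT
    "user",
    count(*)              AS months_with_records,  -- illustrative alias
    sum(monthly_duration) AS total_duration        -- illustrative alias
FROM (
    SELECT
        date_trunc('month', date) AS month,
        "user",
        sum(duration) AS monthly_duration
    FROM
        public.chores_record
    GROUP BY
        date_trunc('month', date),
        "user"
) AS monthly_sum
GROUP BY
    "user"
ORDER BY
    "user";
-- With the sample data this gives:
-- "Alice",1,120
-- "Bob",2,60
-- "Charlie",3,60
```

avg() divides by months_with_records (1 for Alice, 2 for Bob) rather than by the 3 months present in the table, which is why only Charlie's average comes out right.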

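I therefore built a rather cumbersome query that does the following:

    1. list the unique months,
    2. list the unique users,
    3. generate the month×user combinations,
    4. add the monthly sums from the original data,
    5. take the average of the monthly sums (treating 'null' as 0).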
SELECT
    unique_user,
    round(avg(COALESCE(monthly_duration, 0))) -- COALESCE transforms 'null' into 0
FROM (
    -- monthly duration with 'null' if no record for that user×month
    SELECT
        month_user_combinations.month,
        month_user_combinations.unique_user,
        monthly_duration.monthly_duration
    FROM
    (
        (
            -- all months×users combinations
            SELECT
                month,
                unique_user
            FROM (
                (
                    -- list of unique months
                    SELECT DISTINCT
                        date_trunc('month', date) as month
                    FROM
                        public.chores_record
                ) AS unique_months
                CROSS JOIN
                (
                    -- list of unique users
                    SELECT DISTINCT
                        "user" as "unique_user"
                    FROM
                        public.chores_record
                ) AS unique_users
            )
        ) AS month_user_combinations
        LEFT OUTER JOIN
        (
            -- monthly duration for existing month×user combination only
            SELECT
                date_trunc('month', date) as month,
                "user",
                sum(duration) as monthly_duration
            FROM
                public.chores_record
            GROUP BY
                date_trunc('month', date),
                "user"
        ) AS monthly_duration
        ON (
            month_user_combinations.month = monthly_duration.month
            AND
            month_user_combinations.unique_user = monthly_duration.user
        )
    )
) AS monthly_duration_for_all_combinations
GROUP BY
    unique_user
;

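This query works, but it is rather bulky.

Question

How can I query the average of monthly sums more elegantly than above, while still accounting for "no record ⇒ monthly sum = 0"?

Note: it is safe to assume that I only want the average over months that have at least one record (i.e. it is fine not to consider December or April here).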


MWE

CREATE TABLE public.chores_record
(
    date date NOT NULL,
    "user" text NOT NULL,
    duration integer NOT NULL,
    PRIMARY KEY (date, "user")
);

INSERT INTO
    public.chores_record(date, "user", duration)
VALUES
    ('2020-01-01','Alice',120),
    ('2020-01-02','Bob',30),
    ('2020-01-03','Charlie',10),
    ('2020-01-23','Charlie',10),
    ('2020-02-03','Charlie',10),
    ('2020-02-23','Charlie',10),
    ('2020-03-02','Bob',30),
    ('2020-03-03','Charlie',10),
    ('2020-03-23','Charlie',10)
;

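[Comments]:

Consider handling data-display issues (such as missing data) in application code.

@Strawberry That sounds interesting, but I'm not sure I fully understand. Could you elaborate, or give an example of what you mean?

[Answer 1]:

You can use a CTE to construct a calendar table: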


-- EXPLAIN
WITH cal AS ( -- The unique months
        SELECT DISTINCT date_trunc('month', date) AS tick
        FROM chores_record
        )
, cnt AS (      -- the number of months (a scalar)
        SELECT COUNT(*) AS nmonth
        FROM cal
        )
SELECT
        x."user"
        , SUM(x.duration) AS tot_duration
        -- nmonth is the same on every joined row, so MAX() just picks it up
        -- (summing it would multiply it by the number of records per user);
        -- the numeric cast avoids integer division
        , SUM(x.duration)::numeric / MAX(c.nmonth) AS average_month
FROM cal t
JOIN cnt c ON true -- This is ugly
LEFT JOIN chores_record x ON date_trunc('month', x.date) = t.tick
GROUP BY x."user"
        ;

[Comments]:

[Answer 2]:

For this you need two additional datasets: the list of kids and the list of months:

with
    k as (/* list of kids */
        select distinct "user" from chores_record),
    m as (/* list of months in format "yyyy-mm-01" */
        select distinct date_trunc('month', "date") as "month" from chores_record),
    d as (/* sums by months and kids */
        select
            date_trunc('month', "date") as "month",
            "user",
            sum(duration) as duration
        from chores_record
        group by 1, 2)
select
    m."month",
    k."user",
    coalesce(d.duration, 0) as duration
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
order by "month", "user";

┌────────────────────────┬─────────┬──────────┐
│         month          │  user   │ duration │
├────────────────────────┼─────────┼──────────┤
│ 2020-01-01 00:00:00+02 │ Alice   │      120 │
│ 2020-01-01 00:00:00+02 │ Bob     │       30 │
│ 2020-01-01 00:00:00+02 │ Charlie │       20 │
│ 2020-02-01 00:00:00+02 │ Alice   │        0 │
│ 2020-02-01 00:00:00+02 │ Bob     │        0 │
│ 2020-02-01 00:00:00+02 │ Charlie │       20 │
│ 2020-03-01 00:00:00+02 │ Alice   │        0 │
│ 2020-03-01 00:00:00+02 │ Bob     │       30 │
│ 2020-03-01 00:00:00+02 │ Charlie │       20 │
└────────────────────────┴─────────┴──────────┘

The final step is to compute the average:

with
    ...
select
    k."user",
    avg(coalesce(d.duration, 0)) as duration
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
group by k."user"
order by k."user";

┌─────────┬─────────────────────┐
│  user   │      duration       │
├─────────┼─────────────────────┤
│ Alice   │ 40.0000000000000000 │
│ Bob     │ 20.0000000000000000 │
│ Charlie │ 20.0000000000000000 │
└─────────┴─────────────────────┘
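
For reference, the three CTEs and the final select can be put together into a single statement, roughly as follows. This is only a sketch assembled from the snippets above; the round() call and the avg_monthly_sum alias are borrowed from the question and are not part of this answer.

```sql
with
    k as (/* list of kids */
        select distinct "user" from chores_record),
    m as (/* list of months */
        select distinct date_trunc('month', "date") as "month" from chores_record),
    d as (/* sums by month and kid */
        select
            date_trunc('month', "date") as "month",
            "user",
            sum(duration) as duration
        from chores_record
        group by 1, 2)
select
    k."user",
    -- missing kid/month pairs come back as NULL from the left join and count as 0
    round(avg(coalesce(d.duration, 0))) as avg_monthly_sum
from
    k cross join m left join d on (k."user" = d."user" and m."month" = d."month")
group by k."user"
order by k."user";
-- With the sample data: Alice 40, Bob 20, Charlie 20
```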

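[Comments]:

[Answer 3]:

In Postgres, I would recommend building the calendar table with generate_series() and then aggregating. The upside is that it works correctly even if there are months in which no user was active.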

select u."user", avg(coalesce(c.duration, 0)) avg_duration 
from (
    select generate_series(date_trunc('month', min(date)), date_trunc('month', max(date)), '1 month') as dt
    from chores_record
) d
cross join (select distinct "user" from chores_record) u
left join (
    select "user", date_trunc('month', date) as dt, sum(duration) as duration
    from chores_record c 
    group by "user", date_trunc('month', date)
) c on c."user" = u."user" and c.dt = d.dt
group by u."user"

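generate_series() generates the start dates of all months between the earliest and the latest date in the table. We then cross join that with the list of distinct users (in real life, you would probably have a reference table that stores the users, and you would use that instead). Next, we aggregate the original table by user and month, and left join the result onto the user/month combinations. The last step is the outer aggregation.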
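
To get a feel for the calendar part on its own, generate_series() can be run without any table. The literal start and end dates below are arbitrary values chosen for illustration:

```sql
-- One row per month start, from January to April 2020 inclusive
select generate_series(
    timestamp '2020-01-01',
    timestamp '2020-04-01',
    interval '1 month'
) as dt;
-- dt
-- 2020-01-01 00:00:00
-- 2020-02-01 00:00:00
-- 2020-03-01 00:00:00
-- 2020-04-01 00:00:00
```

Note that this behaves slightly differently from a calendar built with SELECT DISTINCT on the recorded dates: a month inside the range with no record for anyone still gets a row, so it counts as 0 for every user instead of being left out of the average.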

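[Comments]:

generate_series() is both clever and powerful, thanks: this is exactly the kind of "smarter way" I was looking for! (Small typo: the second-to-last line should read […] and c.dt = d.dt)

@ebosi: Indeed. I fixed the typo.

[Answer 4]:

Since the use case is small (not millions of rows), a simple approach is to compute separately

    the total hours per user, and
    the total number of distinct months across all users,

and then join the two to get the answer: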

select "user", totalHours/monthCount from
(select "user", sum(duration) totalHours from chores_record group by "user") as a,
(select count(distinct(to_char(date, 'YYYYMM'))) monthCount from chores_record) as b
;
-- Result:
-- Alice,40
-- Bob,20
-- Charlie,20
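
The query works because a month without records adds nothing to a user's total, so dividing that total by the overall number of distinct months gives the same result as averaging with zeros filled in. One thing to watch for, assuming duration stays an integer column as in the MWE: sum() and count() both return bigint, and dividing two integers in PostgreSQL truncates the result. It makes no difference for the sample data because every total is divisible by 3. A hedged variant that casts before dividing could look like this (the avg_monthly_sum alias is borrowed from the question):

```sql
-- Integer division truncates: this returns 16, not 16.67
select 50 / 3;

-- Same query as above, with a numeric cast so the average keeps its fraction
select "user", round(totalHours::numeric / monthCount, 2) as avg_monthly_sum from
(select "user", sum(duration) totalHours from chores_record group by "user") as a,
(select count(distinct(to_char(date, 'YYYYMM'))) monthCount from chores_record) as b
;
```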

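[Comments]:

While this code may answer the question, providing additional context on how and/or why it solves the problem would improve the answer's long-term value.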
