Very slow MySQL COUNT DISTINCT query, even with indexes - how can this be optimised?
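【Posted】: 2021-09-17 08:19:30 【Question】: I have a MySQL (MariaDB 10.3) query that takes around 60 seconds to run. I need to optimise it significantly, because it is frustrating the users of my web application.
The query returns the user's name, followed by 12 columns showing how many customers eligible to earn commission they signed up in each month. It then returns another 12 columns showing how many commission entries were recorded for the user in each month. (For compatibility reasons, the query needs to return its results in this 24-column format.)
Here is the query: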
SELECT
people.full_name AS "Name",
/* Count how many unique customers are eligible for commission in each month, for a rolling 12-month window */
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2020-08-01" AND "2020-08-31" THEN customers.id END)) AS "eligible_customers_month_1",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2020-09-01" AND "2020-09-30" THEN customers.id END)) AS "eligible_customers_month_2",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2020-10-01" AND "2020-10-31" THEN customers.id END)) AS "eligible_customers_month_3",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2020-11-01" AND "2020-11-30" THEN customers.id END)) AS "eligible_customers_month_4",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2020-12-01" AND "2020-12-31" THEN customers.id END)) AS "eligible_customers_month_5",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-01-01" AND "2021-01-31" THEN customers.id END)) AS "eligible_customers_month_6",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-02-01" AND "2021-02-28" THEN customers.id END)) AS "eligible_customers_month_7",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-03-01" AND "2021-03-31" THEN customers.id END)) AS "eligible_customers_month_8",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-04-01" AND "2021-04-30" THEN customers.id END)) AS "eligible_customers_month_9",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-05-01" AND "2021-05-31" THEN customers.id END)) AS "eligible_customers_month_10",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-06-01" AND "2021-06-30" THEN customers.id END)) AS "eligible_customers_month_11",
COUNT(DISTINCT(CASE WHEN customers.commission_start_date BETWEEN "2021-07-01" AND "2021-07-31" THEN customers.id END)) AS "eligible_customers_month_12",
/* In each month of a rolling 12-month window, count how many unique commission entries were recorded. */
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2020-08-01" AND "2020-08-31" THEN user_commission.id END)) AS "total_sales_1",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2020-09-01" AND "2020-09-30" THEN user_commission.id END)) AS "total_sales_2",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2020-10-01" AND "2020-10-31" THEN user_commission.id END)) AS "total_sales_3",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2020-11-01" AND "2020-11-30" THEN user_commission.id END)) AS "total_sales_4",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2020-12-01" AND "2020-12-31" THEN user_commission.id END)) AS "total_sales_5",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-01-01" AND "2021-01-31" THEN user_commission.id END)) AS "total_sales_6",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-02-01" AND "2021-02-28" THEN user_commission.id END)) AS "total_sales_7",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-03-01" AND "2021-03-31" THEN user_commission.id END)) AS "total_sales_8",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-04-01" AND "2021-04-30" THEN user_commission.id END)) AS "total_sales_9",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-05-01" AND "2021-05-31" THEN user_commission.id END)) AS "total_sales_10",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-06-01" AND "2021-06-30" THEN user_commission.id END)) AS "total_sales_11",
COUNT(DISTINCT(CASE WHEN user_commission.commission_paid_at BETWEEN "2021-07-01" AND "2021-07-31" THEN user_commission.id END)) AS "total_sales_12"
FROM users
LEFT JOIN people ON people.id = users.person_id
LEFT JOIN customers ON customers.user_id = users.id
LEFT JOIN user_commission ON user_commission.user_id = users.id
WHERE users.id NOT IN (103, 2, 155, 24, 137, 141, 143, 149, 152, 3, 135)
GROUP BY users.id
Here is the output of EXPLAIN SELECT:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | users | index | PRIMARY | PRIMARY | 4 | | 16 | Using where |
1 | SIMPLE | people | eq_ref | PRIMARY | PRIMARY | 4 | users.person_id | 1 | Using where |
1 | SIMPLE | customers | ref | user_id | user_id | 5 | users.id | 284 | Using where |
1 | SIMPLE | user_commission | ref | comm_index,user_id | comm_index | 4 | users.id | 465 | Using index |
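comm_index is a UNIQUE index on the user_commission table, covering user_id, order_id, commission_paid_at.
I'm a bit stumped about what to do next: the indexes are there, and there are not that many rows for the engine to parse in each table.
Any pointers would be greatly appreciated. Thank you!
【Comments】:
Edit the question and add the CREATE statements for the tables and indexes involved.
How many rows is "not many"? How fast is a query that joins only customers and returns only the customer columns, and one that joins only commissions and returns only the commission columns? If those are faster on their own, run them as two subqueries and join both of them (and people) on the user ID.
This is a reporting query and is not suited to a highly interactive application. You could run the query only once an hour and cache the results. I don't see why you would run it every time for every user.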
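【Answer 1】:
First, you are selecting all rows instead of only the months you are interested in.
Solution: a WHERE clause that limits the rows considered.
Then, you are cross joining a user's customers with the user's commissions, building a huge intermediate result that you neither need nor want.
Solution: aggregate before joining. For example, it could look like this: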
SELECT
people.full_name AS "Name",
cu.eligible_customers_month_1,
cu.eligible_customers_month_2,
...
co.total_sales_1,
co.total_sales_2,
...
FROM users
LEFT JOIN people ON people.id = users.person_id
LEFT JOIN
(
select
user_id,
max(case when month_index = 1 then cnt else 0 end) as eligible_customers_month_1,
max(case when month_index = 2 then cnt else 0 end) as eligible_customers_month_2,
...
from
(
select
user_id,
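/* month_index counts back from the current month: 1 = current month, 12 = oldest month in the window */
(year(current_date) * 12 + month(current_date))
- (year(commission_start_date) * 12 + month(commission_start_date))
+ 1 as month_index,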
count(*) as cnt
from customers
where commission_start_date >=
last_day(current_date) + interval 1 day - interval 1 year
group by user_id, month_index
) months
group by user_id
) cu ON cu.user_id = users.id
LEFT JOIN
(
select
user_id,
max(case when month_index = 1 then cnt else 0 end) as total_sales_1,
max(case when month_index = 2 then cnt else 0 end) as total_sales_2,
...
from
(
select
user_id,
(year(current_date) * 12 + month(current_date))
- (year(commission_paid_at) * 12 + month(commission_paid_at))
+ 1 as month_index,
count(*) as cnt
from user_commission
where commission_paid_at >=
last_day(current_date) + interval 1 day - interval 1 year
group by user_id, month_index
) months
group by user_id
) co ON co.user_id = users.id
WHERE users.id NOT IN (103, 2, 155, 24, 137, 141, 143, 149, 152, 3, 135)
ORDER BY users.id;
Recommended indexes:
create index idx1 on customers (commission_start_date, user_id);
create index idx2 on user_commission (commission_paid_at, user_id);
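To illustrate why aggregating before the join is safe, here is a minimal sketch that compares the two query shapes on a tiny data set. It uses Python's built-in sqlite3 as a stand-in for MariaDB, covers a single month only, and the sample rows are invented purely for illustration; the table and column names follow the question.

```python
import sqlite3

# Hypothetical miniature data set; sqlite3 stands in for MariaDB here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, user_id INTEGER, commission_start_date TEXT);
    CREATE TABLE user_commission (id INTEGER PRIMARY KEY, user_id INTEGER, commission_paid_at TEXT);
    INSERT INTO users (id) VALUES (1), (2);
    INSERT INTO customers (user_id, commission_start_date) VALUES
        (1, '2021-07-03'), (1, '2021-07-15'), (1, '2021-06-20'), (2, '2021-07-09');
    INSERT INTO user_commission (user_id, commission_paid_at) VALUES
        (1, '2021-07-01'), (1, '2021-07-28');
""")

# Shape of the original query: join both child tables, then COUNT(DISTINCT ...).
joined = conn.execute("""
    SELECT u.id,
           COUNT(DISTINCT CASE WHEN c.commission_start_date BETWEEN '2021-07-01' AND '2021-07-31'
                               THEN c.id END),
           COUNT(DISTINCT CASE WHEN uc.commission_paid_at BETWEEN '2021-07-01' AND '2021-07-31'
                               THEN uc.id END)
    FROM users u
    LEFT JOIN customers c ON c.user_id = u.id
    LEFT JOIN user_commission uc ON uc.user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()

# Aggregate each child table on its own (restricted by WHERE), then join one row per user.
pre_aggregated = conn.execute("""
    SELECT u.id, COALESCE(cu.cnt, 0), COALESCE(co.cnt, 0)
    FROM users u
    LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM customers
               WHERE commission_start_date >= '2021-07-01' AND commission_start_date < '2021-08-01'
               GROUP BY user_id) cu ON cu.user_id = u.id
    LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM user_commission
               WHERE commission_paid_at >= '2021-07-01' AND commission_paid_at < '2021-08-01'
               GROUP BY user_id) co ON co.user_id = u.id
    ORDER BY u.id
""").fetchall()

print(joined)
print(pre_aggregated)
```

Both queries report the same counts per user, but the second never builds the customers-times-commissions intermediate result, so COUNT(*) can replace COUNT(DISTINCT).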
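【Discussion】:
【Answer 2】: Let's start with the fact that this query runs for every user (apart from the few exceptions you want to exclude; I did not include that exclusion list in my query). I would ask why you want to show sales and commission counts for all users, so that every user can see how everyone else is performing. If I were a rep at your company, I think I would only care about how my own activity is going.
Next, this may be a good case for a pre-aggregated table of per-user, per-month counts, so that you do not have to keep recalculating them on the fly. If the data only changes when a new customer signs up or a sales commission is entered, you might be better off storing, at the end of each day, the counts computed for each user/month/year. But that is just one option.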
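As a rough sketch of the pre-aggregation idea, the snippet below rebuilds a small summary table from the detail rows. The table `monthly_customer_counts` and the function `refresh_summary` are hypothetical names, sqlite3 stands in for MariaDB, and the sample rows are invented; a real job would likely recompute only recent months rather than everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, user_id INTEGER, commission_start_date TEXT);
    -- Hypothetical summary table: one row per user and calendar month.
    CREATE TABLE monthly_customer_counts (
        user_id INTEGER,
        month TEXT,
        customer_count INTEGER,
        PRIMARY KEY (user_id, month)
    );
    INSERT INTO customers (user_id, commission_start_date) VALUES
        (1, '2021-06-20'), (1, '2021-07-03'), (1, '2021-07-15'), (2, '2021-07-09');
""")

def refresh_summary(conn):
    """Rebuild the summary table from scratch inside one transaction."""
    with conn:
        conn.execute("DELETE FROM monthly_customer_counts")
        conn.execute("""
            INSERT INTO monthly_customer_counts (user_id, month, customer_count)
            SELECT user_id, substr(commission_start_date, 1, 7), COUNT(*)
            FROM customers
            GROUP BY user_id, substr(commission_start_date, 1, 7)
        """)

refresh_summary(conn)
summary = conn.execute(
    "SELECT user_id, month, customer_count FROM monthly_customer_counts ORDER BY user_id, month"
).fetchall()
print(summary)
```

The report query would then read from the small summary table instead of scanning the detail tables on every page load.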
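Now, the reason you are probably seeing such long run times, and the reason you need COUNT(DISTINCT) on the customer and commission tables, is that you are getting a Cartesian result. Let's assume you have 100 users. Of those, in a given month, one user has 3 new customers and 2 commissions because they are new. A long-term rep, however, has 37 new customers and 45 commissions. Those are the ones killing you. Because your LEFT JOINs are on the user ID, each record from the customer table for a given user is joined to every record in the commission table with the same user ID. So for the first rep it creates 6 rows (3 * 2), but for the second user it goes through 1,665 iterations (37 * 45). This Cartesian (or cross join) result is what is killing you.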
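The fan-out described above can be reproduced with a small sketch. It uses Python's sqlite3 as a stand-in for MariaDB and a made-up data set with exactly the row counts from the example (3 customers and 2 commissions for one user, 37 and 45 for the other).

```python
import sqlite3

# Hypothetical miniature data set matching the figures in the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE user_commission (id INTEGER PRIMARY KEY, user_id INTEGER);
""")
sizes = {1: (3, 2), 2: (37, 45)}  # user_id -> (customers, commissions)
for user_id, (n_customers, n_commissions) in sizes.items():
    conn.executemany("INSERT INTO customers (user_id) VALUES (?)",
                     [(user_id,)] * n_customers)
    conn.executemany("INSERT INTO user_commission (user_id) VALUES (?)",
                     [(user_id,)] * n_commissions)

# Joining both child tables on user_id alone pairs every customer row
# with every commission row belonging to the same user.
fan_out = dict(conn.execute("""
    SELECT c.user_id, COUNT(*)
    FROM customers c
    JOIN user_commission uc ON uc.user_id = c.user_id
    GROUP BY c.user_id
""").fetchall())
print(fan_out)
```

The intermediate result grows with the product of the two row counts per user, which is why COUNT(DISTINCT) has so much work to undo.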
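That is why it is failing. Now, on to the solution I have for you. You appear to have a bunch of hard-coded dates throughout your code. What happens when next month rolls around? Do you have to fix the hard-coded begin/end dates by hand? If so, the solution I am offering will simplify all of that.
By using WITH (a Common Table Expression, or CTE), you write each query up front and give it an alias, instead of writing each one as a nested subquery. The benefit is that each query is written only once, even if you reference its alias repeatedly.
Here is the query; afterwards I will describe and break it down so you can follow along.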
with Rolling12 as
(
select
@rptMonth := @rptMonth +1 as QryMonth,
@beginDate as AtLeastDate,
date_add( @beginDate, interval 1 month ) as AndLessThanDate,
@beginDate := date_add( @beginDate, interval 1 month )
from
user_commission
JOIN ( select @rptMonth := 0,
@beginDate := date_sub(
date_add(
date_sub( curdate(),
interval day( curdate()) -1 day ),
interval 1 month ),
interval 1 year )
) sqlvars
limit 12
),
MinMaxDates as
(
select
min( AtLeastDate ) MinDate,
max( AndLessThanDate ) MaxDate
from
Rolling12
),
SumCommission as
(
select
uc.user_id,
coalesce( sum( CASE WHEN R12.QryMonth = 1 then 1 else 0 end ), 0) commission01,
coalesce( sum( CASE WHEN R12.QryMonth = 2 then 1 else 0 end ), 0) commission02,
coalesce( sum( CASE WHEN R12.QryMonth = 3 then 1 else 0 end ), 0) commission03,
coalesce( sum( CASE WHEN R12.QryMonth = 4 then 1 else 0 end ), 0) commission04,
coalesce( sum( CASE WHEN R12.QryMonth = 5 then 1 else 0 end ), 0) commission05,
coalesce( sum( CASE WHEN R12.QryMonth = 6 then 1 else 0 end ), 0) commission06,
coalesce( sum( CASE WHEN R12.QryMonth = 7 then 1 else 0 end ), 0) commission07,
coalesce( sum( CASE WHEN R12.QryMonth = 8 then 1 else 0 end ), 0) commission08,
coalesce( sum( CASE WHEN R12.QryMonth = 9 then 1 else 0 end ), 0) commission09,
coalesce( sum( CASE WHEN R12.QryMonth = 10 then 1 else 0 end ), 0) commission10,
coalesce( sum( CASE WHEN R12.QryMonth = 11 then 1 else 0 end ), 0) commission11,
coalesce( sum( CASE WHEN R12.QryMonth = 12 then 1 else 0 end ), 0) commission12
from
user_commission uc
JOIN Rolling12 R12
on uc.commission_paid_at >= R12.AtLeastDate
AND uc.commission_paid_at < R12.AndLessThanDate
-- only a single row returned for MinMaxDates source
JOIN MinMaxDates mm
where
uc.commission_paid_at >= mm.MinDate
AND uc.commission_paid_at < mm.MaxDate
group by
uc.user_id
),
SumCustomers as
(
select
c.user_id,
coalesce( sum( CASE WHEN R12.QryMonth = 1 then 1 else 0 end ), 0) customers01,
coalesce( sum( CASE WHEN R12.QryMonth = 2 then 1 else 0 end ), 0) customers02,
coalesce( sum( CASE WHEN R12.QryMonth = 3 then 1 else 0 end ), 0) customers03,
coalesce( sum( CASE WHEN R12.QryMonth = 4 then 1 else 0 end ), 0) customers04,
coalesce( sum( CASE WHEN R12.QryMonth = 5 then 1 else 0 end ), 0) customers05,
coalesce( sum( CASE WHEN R12.QryMonth = 6 then 1 else 0 end ), 0) customers06,
coalesce( sum( CASE WHEN R12.QryMonth = 7 then 1 else 0 end ), 0) customers07,
coalesce( sum( CASE WHEN R12.QryMonth = 8 then 1 else 0 end ), 0) customers08,
coalesce( sum( CASE WHEN R12.QryMonth = 9 then 1 else 0 end ), 0) customers09,
coalesce( sum( CASE WHEN R12.QryMonth = 10 then 1 else 0 end ), 0) customers10,
coalesce( sum( CASE WHEN R12.QryMonth = 11 then 1 else 0 end ), 0) customers11,
coalesce( sum( CASE WHEN R12.QryMonth = 12 then 1 else 0 end ), 0) customers12
from
customers c
JOIN Rolling12 R12
on c.commission_start_date >= R12.AtLeastDate
AND c.commission_start_date < R12.AndLessThanDate
-- only a single row returned for MinMaxDates source
JOIN MinMaxDates mm
where
c.commission_start_date >= mm.MinDate
AND c.commission_start_date < mm.MaxDate
group by
c.user_id
)
select
u.id,
p.full_name AS "Name",
com.Commission01,
com.Commission02,
com.Commission03,
com.Commission04,
com.Commission05,
com.Commission06,
com.Commission07,
com.Commission08,
com.Commission09,
com.Commission10,
com.Commission11,
com.Commission12,
cst.Customers01,
cst.Customers02,
cst.Customers03,
cst.Customers04,
cst.Customers05,
cst.Customers06,
cst.Customers07,
cst.Customers08,
cst.Customers09,
cst.Customers10,
cst.Customers11,
cst.Customers12
from
users u
JOIN people p
ON u.person_id = p.id
LEFT JOIN SumCommission com
on u.id = com.user_id
LEFT JOIN SumCustomers cst
on u.id = cst.user_id;
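You stated you are running on a rolling 12-month period. For that, I have my first CTE, aliased "Rolling12". This query is the setup for the rest of the queries. It creates MySQL variables and keeps computing an updated begin/end date for each month. It starts by taking the current date, for example July 6, and rolling it back to July 1. It then adds 1 month to get August 1, and subtracts 1 year to get August 1, 2020 (assuming the current year is 2021) as the beginning of the rolling 12-month period. I then simply join to the commission table, which serves only as a source of at least 12 rows, and limit to 12 records, moving forward one month each time, creating a column for the begin and end date of the period and assigning it a month ID sequence.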
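The date arithmetic described above can be checked outside the database. The following is a minimal sketch in plain Python; the function name `rolling_12_months` is made up for illustration and is not part of the answer's query.

```python
from datetime import date

def rolling_12_months(today):
    """Return 12 (month_no, start, end_exclusive) tuples ending with today's month."""
    # First day of next month minus one year is the start of the window.
    if today.month == 12:
        start = date(today.year, 1, 1)
    else:
        start = date(today.year - 1, today.month + 1, 1)
    windows = []
    for month_no in range(1, 13):
        # The exclusive end is the first day of the following month.
        if start.month == 12:
            end = date(start.year + 1, 1, 1)
        else:
            end = date(start.year, start.month + 1, 1)
        windows.append((month_no, start, end))
        start = end
    return windows

windows = rolling_12_months(date(2021, 7, 6))
for month_no, start, end in windows:
    print(month_no, start, end)
```

For July 6, 2021 the first window starts on 2020-08-01 and the twelfth ends (exclusively) on 2021-08-01, matching the AtLeastDate / AndLessThanDate pairs that Rolling12 is meant to produce.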
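If you highlight and run just the query inside WITH Rolling12 AS ( query ), you will see what it builds. This avoids all the hard-coded dates tied to your current 24 CASE / COUNT(DISTINCT) conditions.
Then comes a comma and the next CTE, MinMaxDates. Here, I query from the 12 rolling months to get the minimum begin date and maximum end date for the whole period you are reporting on, so that when querying customers and commissions I can join to it as a single-row result holding the begin/end dates for the detail rows.
Next come SumCommission and SumCustomers. Each is JOINed to the "Rolling12" CTE records so that a given commission or customer can be associated with its date-range entry. From that I take the Rolling12 query month and SUM() it. But since SUM() of NULL results in NULL, I wrap it in COALESCE(calculation, 0) to show 0 as the worst case.
The reason each is run on its own and grouped by user is to prevent the Cartesian result mentioned earlier.
Once those individual parts are done, I start from users, join to people to get the name, and then LEFT JOIN to the respective SUM() queries. So, if a user has a new customer in a month but no commissions, you will have one record from that set and none from the other, which prevents the duplicated rows that required DISTINCT in the first place.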
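One detail worth checking when trying this: a user with no rows at all in one of the pre-grouped sets gets NULLs from the LEFT JOIN, so a COALESCE in the outer SELECT may also be wanted. The sketch below shows this with sqlite3 as a stand-in for MariaDB; the simplified one-column SumCommission and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE user_commission (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users (id) VALUES (1), (2);
    -- Only user 1 has commission rows; user 2 has none.
    INSERT INTO user_commission (user_id) VALUES (1), (1);
""")

rows = conn.execute("""
    WITH SumCommission AS (
        SELECT user_id, SUM(1) AS commission01
        FROM user_commission
        GROUP BY user_id
    )
    SELECT u.id,
           com.commission01,               -- NULL when the user has no group
           COALESCE(com.commission01, 0)   -- 0 instead of NULL
    FROM users u
    LEFT JOIN SumCommission com ON com.user_id = u.id
    ORDER BY u.id
""").fetchall()
print(rows)
```

User 2 has no group in SumCommission, so the raw column comes back as NULL (None in Python) and only the outer COALESCE turns it into 0.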
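So, even though it looks long and possibly confusing, especially the WITH CTE context, look at its individual parts. The SUM()s are pre-grouped by user ID, so each SUM() result has only one possible record per user for the given period.
As for indexes to help optimise the query, I would make sure the commission and customer tables each have an index on (dateField, useridField).
If you try it, I would be curious to know how it performs.
【Discussion】: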