DISTINCT INNER JOIN slow

Posted 2023-02-24

【Title】: DISTINCT INNER JOIN slow 【Asked】: 2013-11-11 11:25:54 【Question】:

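I have written the following PostgreSQL query, and it returns the correct results. However, it is very slow, sometimes taking up to 10 seconds to return. I'm sure something in my statement is slowing it down.

Can anyone help identify why this query is slow?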

SELECT DISTINCT ON (school_classes.class_id,attendance_calendar.school_date)
  school_classes.class_id, school_classes.class_name, school_classes.grade_id
, school_gradelevels.linked_calendar, attendance_calendars.calendar_id
, attendance_calendar.school_date, attendance_calendar.minutes
, teacher_join_classes_subjects.staff_id, staff.first_name, staff.last_name  

FROM school_classes 
INNER JOIN school_gradelevels ON school_gradelevels.id=school_classes.grade_id 
INNER JOIN teacher_join_classes_subjects ON teacher_join_classes_subjects.class_id=school_classes.class_id 
INNER JOIN staff ON staff.staff_id=teacher_join_classes_subjects.staff_id 
INNER JOIN attendance_calendars ON attendance_calendars.title=school_gradelevels.linked_calendar 
INNER JOIN attendance_calendar ON attendance_calendar.calendar_id=attendance_calendars.calendar_id 

WHERE teacher_join_classes_subjects.syear='2013' 
AND staff.syear='2013' 
AND attendance_calendars.syear='2013' 
AND teacher_join_classes_subjects.does_attendance='Y' 
AND teacher_join_classes_subjects.subject_id IS NULL 
AND attendance_calendar.school_date<CURRENT_DATE 

AND attendance_calendar.school_date NOT IN (

SELECT com.school_date FROM attendance_completed com
WHERE  com.class_id=school_classes.class_id
AND   (com.period_id='101' AND attendance_calendar.minutes>='151' OR
       com.period_id='95'  AND attendance_calendar.minutes='150') )

I have replaced the NOT IN with the following:

AND NOT EXISTS (
    SELECT com.school_date
    FROM attendance_completed com
    WHERE com.class_id=school_classes.class_id
    AND com.school_date=attendance_calendar.school_date
    AND (com.period_id='101' AND attendance_calendar.minutes>='151' OR
         com.period_id='95'  AND attendance_calendar.minutes='150') )

Results of EXPLAIN ANALYZE:

Unique  (cost=2998.39..2998.41 rows=3 width=85) (actual time=10751.111..10751.118 rows=1 loops=1)
  ->  Sort  (cost=2998.39..2998.40 rows=3 width=85) (actual time=10751.110..10751.110 rows=2 loops=1)
        Sort Key: school_classes.class_id, attendance_calendar.school_date
        Sort Method: quicksort  Memory: 25kB
        ->  Hash Join  (cost=2.03..2998.37 rows=3 width=85) (actual time=6409.471..10751.045 rows=2 loops=1)
              Hash Cond: ((teacher_join_classes_subjects.class_id = school_classes.class_id) AND (school_gradelevels.id = school_classes.grade_id))
              Join Filter: (NOT (SubPlan 1))
              ->  Nested Loop  (cost=0.00..120.69 rows=94 width=81) (actual time=2.468..1187.397 rows=26460 loops=1)
                    Join Filter: (attendance_calendars.calendar_id = attendance_calendar.calendar_id)
                    ->  Nested Loop  (cost=0.00..42.13 rows=1 width=70) (actual time=0.087..3.247 rows=735 loops=1)
                          Join Filter: ((attendance_calendars.title)::text = (school_gradelevels.linked_calendar)::text)
                          ->  Nested Loop  (cost=0.00..40.80 rows=1 width=277) (actual time=0.077..1.005 rows=245 loops=1)
                                ->  Nested Loop  (cost=0.00..39.61 rows=1 width=27) (actual time=0.064..0.572 rows=49 loops=1)
                                      ->  Seq Scan on teacher_join_classes_subjects  (cost=0.00..10.48 rows=4 width=14) (actual time=0.022..0.143 rows=49 loops=1)
                                            Filter: ((subject_id IS NULL) AND (syear = 2013::numeric) AND ((does_attendance)::text = 'Y'::text))
                                      ->  Index Scan using staff_pkey on staff  (cost=0.00..7.27 rows=1 width=20) (actual time=0.006..0.007 rows=1 loops=49)
                                            Index Cond: (staff.staff_id = teacher_join_classes_subjects.staff_id)
                                            Filter: (staff.syear = 2013::numeric)
                                ->  Seq Scan on attendance_calendars  (cost=0.00..1.18 rows=1 width=250) (actual time=0.003..0.006 rows=5 loops=49)
                                      Filter: (attendance_calendars.syear = 2013::numeric)
                          ->  Seq Scan on school_gradelevels  (cost=0.00..1.15 rows=15 width=11) (actual time=0.001..0.005 rows=15 loops=245)
                    ->  Seq Scan on attendance_calendar  (cost=0.00..55.26 rows=1864 width=18) (actual time=0.003..1.129 rows=1824 loops=735)
                          Filter: (attendance_calendar.school_date < [remainder of condition lost in source])
              ->  Hash  (cost=1.41..1.41 rows=41 width=18) (actual time=0.040..0.040 rows=41 loops=1)
                    ->  Seq Scan on school_classes  (cost=0.00..1.41 rows=41 width=18) (actual time=0.006..0.015 rows=41 loops=1)
              SubPlan 1
                ->  Seq Scan on attendance_completed com  (cost=0.00..958.28 rows=5 width=4) (actual time=0.228..5.411 rows=17 loops=1764)
                      Filter: ((class_id = $0) AND (((period_id = 101::numeric) AND ($1 >= 151::numeric)) OR ((period_id = 95::numeric) AND ($1 = 150::numeric))))
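A rough way to see where the time goes in the plan above: EXPLAIN ANALYZE reports "actual time" as an average per loop, so multiplying the upper figure by "loops" approximates the total spent in a node. The short Python sketch below does only that arithmetic on numbers copied from the plan. The helper name `total_ms` and the choice of nodes are illustrative, and the result is an estimate, not an exact accounting.

```python
# (node label, average actual time per loop in ms, loops), copied from the plan.
nodes = [
    ("SubPlan 1: Seq Scan on attendance_completed", 5.411, 1764),
    ("Seq Scan on attendance_calendar", 1.129, 735),
    ("Seq Scan on school_gradelevels", 0.005, 245),
]

# Upper "actual time" bound of the top-level Unique node, in ms.
TOTAL_MS = 10751.118


def total_ms(per_loop_ms, loops):
    """Approximate total time of a node: per-loop average times loop count."""
    return per_loop_ms * loops


for label, per_loop, loops in nodes:
    spent = total_ms(per_loop, loops)
    print(f"{label}: {spent:8.1f} ms ({spent / TOTAL_MS:5.1%} of total)")
```

By this estimate the repeated sequential scan inside SubPlan 1 (the NOT IN subquery) accounts for most of the roughly 10.7 seconds.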

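【Comments】:

Instead of NOT IN, if I use AND NOT EXISTS, the whole thing runs really fast, so I assume something is wrong with the NOT IN statement. Any suggestions?

I have solved this by using NOT EXISTS instead of NOT IN. It is now super fast.

Do you really get the same result? I believe NOT EXISTS only checks whether the "inner" query returns any rows.

Just changing NOT IN to NOT EXISTS in the query should not actually work, because it would be a syntax error. Could you paste the output of EXPLAIN ANALYZE for your original query?

Thanks for the reply Petter, I have updated the question with the EXPLAIN ANALYZE results, and also included the NOT EXISTS statement that seems to have helped.

【Answer 1】: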

NOT EXISTS is a good choice. It is almost always better than NOT IN. More details here. I simplified your query a bit (it looks fine generally):

SELECT DISTINCT ON (c.class_id, a.school_date)
       c.class_id, c.class_name, c.grade_id
      ,g.linked_calendar, aa.calendar_id
      ,a.school_date, a.minutes
      ,t.staff_id, s.first_name, s.last_name  
FROM   school_classes                c
JOIN   teacher_join_classes_subjects t  USING (class_id)
JOIN   staff                         s  USING (staff_id)
JOIN   school_gradelevels            g  ON g.id = c.grade_id 
JOIN   attendance_calendars          aa ON aa.title = g.linked_calendar 
JOIN   attendance_calendar           a  ON a.calendar_id = aa.calendar_id 
WHERE  t.syear = 2013
AND    s.syear = 2013
AND    aa.syear = 2013
AND    t.does_attendance = 'Y'   -- looks like it should be boolean!
AND    t.subject_id IS NULL 
AND    a.school_date < CURRENT_DATE 
AND NOT EXISTS (
   SELECT 1
   FROM   attendance_completed x
   WHERE  x.class_id = c.class_id
   AND    x.school_date = a.school_date
   AND   (x.period_id = 101 AND a.minutes >= 151 OR  -- actually numbers?
          x.period_id =  95 AND a.minutes  = 150)
   )
ORDER BY c.class_id, a.school_date, ???

What seems to be missing is the ORDER BY which should accompany your DISTINCT ON. Add more ORDER BY items in place of ???. If there are duplicates to pick from, you probably want to define which one to pick.
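To make the point about ORDER BY concrete, here is a small Python model of what DISTINCT ON does: sort the rows, then keep the first row of each group. It is only a sketch of the semantics, not how PostgreSQL implements it. The rows and the helper `distinct_on` are hypothetical, and staff_id is just one possible tie-breaker for ???.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (class_id, school_date, staff_id).
# Two teachers are attached to class 1 on the same date.
rows = [
    (1, "2013-11-04", 7),
    (1, "2013-11-04", 3),
    (2, "2013-11-04", 5),
]


def distinct_on(rows, key, order):
    """Sort by `order`, then keep the first row of each `key` group."""
    ordered = sorted(rows, key=order)
    return [next(group) for _, group in groupby(ordered, key=key)]


key = itemgetter(0, 1)          # DISTINCT ON (class_id, school_date)
by_staff = itemgetter(0, 1, 2)  # ORDER BY class_id, school_date, staff_id

# With a tie-breaker the surviving row is well defined.
print(distinct_on(rows, key, by_staff))

# Without one, the winner depends on the order the rows happen to arrive in.
print(distinct_on(rows, key, key))
print(distinct_on(rows[::-1], key, key))
```

The last two calls pick different teachers for class 1 from the same data, which is the ambiguity the extra ORDER BY items are meant to remove.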

Numeric literals don't need single quotes, and boolean values should be coded as such. You may want to revisit the chapter about data types.
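On the earlier recommendation of NOT EXISTS over NOT IN: one well-known reason, not spelled out in this answer, is that the two differ when the subquery can return NULL. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which follows the same standard SQL NULL rules here. The tables `calendar` and `completed` are hypothetical and much simpler than the ones in the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE calendar  (school_date TEXT);
    CREATE TABLE completed (school_date TEXT);  -- nullable on purpose
    INSERT INTO calendar  VALUES ('2013-11-04'), ('2013-11-05');
    INSERT INTO completed VALUES ('2013-11-04'), (NULL);
""")

# A single NULL in the subquery makes NOT IN evaluate to NULL (not true)
# for every row that has no exact match, so those rows are filtered out.
not_in = con.execute("""
    SELECT school_date FROM calendar
    WHERE  school_date NOT IN (SELECT school_date FROM completed)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists; NULLs never match.
not_exists = con.execute("""
    SELECT a.school_date FROM calendar a
    WHERE  NOT EXISTS (
       SELECT 1 FROM completed x WHERE x.school_date = a.school_date)
""").fetchall()

print(not_in)
print(not_exists)
```

Here the NOT IN form returns no rows, while the NOT EXISTS form returns the one date without a completion record. This is also why it is worth checking that a rewrite from one form to the other returns the same result on the real data.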

【Comments】:

Thanks for taking the time to add the extra information and links; I didn't know this.
