DISTINCT INNER JOIN slow

【Posted】: 2013-11-11 11:25:54 【Question】:

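I have written the following PostgreSQL query, and it returns the correct results. However, it is very slow, sometimes taking up to 10 seconds. I'm sure something in my statement is causing the slowdown.

Can anyone help determine why this query is slow?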

SELECT DISTINCT ON (school_classes.class_id,attendance_calendar.school_date)
  school_classes.class_id, school_classes.class_name, school_classes.grade_id
, school_gradelevels.linked_calendar, attendance_calendars.calendar_id
, attendance_calendar.school_date, attendance_calendar.minutes
, teacher_join_classes_subjects.staff_id, staff.first_name, staff.last_name  

FROM school_classes 
INNER JOIN school_gradelevels ON school_gradelevels.id=school_classes.grade_id 
INNER JOIN teacher_join_classes_subjects ON teacher_join_classes_subjects.class_id=school_classes.class_id 
INNER JOIN staff ON staff.staff_id=teacher_join_classes_subjects.staff_id 
INNER JOIN attendance_calendars ON attendance_calendars.title=school_gradelevels.linked_calendar 
INNER JOIN attendance_calendar ON attendance_calendar.calendar_id=attendance_calendars.calendar_id 

WHERE teacher_join_classes_subjects.syear='2013' 
AND staff.syear='2013' 
AND attendance_calendars.syear='2013' 
AND teacher_join_classes_subjects.does_attendance='Y' 
AND teacher_join_classes_subjects.subject_id IS NULL 
AND attendance_calendar.school_date<CURRENT_DATE 

AND attendance_calendar.school_date NOT IN (

SELECT com.school_date FROM attendance_completed com
WHERE  com.class_id=school_classes.class_id
AND   (com.period_id='101' AND attendance_calendar.minutes>='151' OR
       com.period_id='95'  AND attendance_calendar.minutes='150') )

I replaced the NOT IN with the following:

AND NOT EXISTS (
    SELECT com.school_date
    FROM attendance_completed com
    WHERE com.class_id=school_classes.class_id
    AND com.school_date=attendance_calendar.school_date
    AND (com.period_id='101' AND attendance_calendar.minutes>='151' OR
         com.period_id='95'  AND attendance_calendar.minutes='150') )

Result of EXPLAIN ANALYZE:

Unique  (cost=2998.39..2998.41 rows=3 width=85) (actual time=10751.111..10751.118 rows=1 loops=1)
  ->  Sort  (cost=2998.39..2998.40 rows=3 width=85) (actual time=10751.110..10751.110 rows=2 loops=1)
        Sort Key: school_classes.class_id, attendance_calendar.school_date
        Sort Method: quicksort  Memory: 25kB
        ->  Hash Join  (cost=2.03..2998.37 rows=3 width=85) (actual time=6409.471..10751.045 rows=2 loops=1)
              Hash Cond: ((teacher_join_classes_subjects.class_id = school_classes.class_id) AND (school_gradelevels.id = school_classes.grade_id))
              Join Filter: (NOT (SubPlan 1))
              ->  Nested Loop  (cost=0.00..120.69 rows=94 width=81) (actual time=2.468..1187.397 rows=26460 loops=1)
                    Join Filter: (attendance_calendars.calendar_id = attendance_calendar.calendar_id)
                    ->  Nested Loop  (cost=0.00..42.13 rows=1 width=70) (actual time=0.087..3.247 rows=735 loops=1)
                          Join Filter: ((attendance_calendars.title)::text = (school_gradelevels.linked_calendar)::text)
                          ->  Nested Loop  (cost=0.00..40.80 rows=1 width=277) (actual time=0.077..1.005 rows=245 loops=1)
                                ->  Nested Loop  (cost=0.00..39.61 rows=1 width=27) (actual time=0.064..0.572 rows=49 loops=1)
                                      ->  Seq Scan on teacher_join_classes_subjects  (cost=0.00..10.48 rows=4 width=14) (actual time=0.022..0.143 rows=49 loops=1)
                                            Filter: ((subject_id IS NULL) AND (syear = 2013::numeric) AND ((does_attendance)::text = 'Y'::text))
                                      ->  Index Scan using staff_pkey on staff  (cost=0.00..7.27 rows=1 width=20) (actual time=0.006..0.007 rows=1 loops=49)
                                            Index Cond: (staff.staff_id = teacher_join_classes_subjects.staff_id)
                                            Filter: (staff.syear = 2013::numeric)
                                ->  Seq Scan on attendance_calendars  (cost=0.00..1.18 rows=1 width=250) (actual time=0.003..0.006 rows=5 loops=49)
                                      Filter: (attendance_calendars.syear = 2013::numeric)
                          ->  Seq Scan on school_gradelevels  (cost=0.00..1.15 rows=15 width=11) (actual time=0.001..0.005 rows=15 loops=245)
                    ->  Seq Scan on attendance_calendar  (cost=0.00..55.26 rows=1864 width=18) (actual time=0.003..1.129 rows=1824 loops=735)
                          Filter: (attendance_calendar.school_date < ...)
              ->  Hash  (cost=1.41..1.41 rows=41 width=18) (actual time=0.040..0.040 rows=41 loops=1)
                    ->  Seq Scan on school_classes  (cost=0.00..1.41 rows=41 width=18) (actual time=0.006..0.015 rows=41 loops=1)
              SubPlan 1
                ->  Seq Scan on attendance_completed com  (cost=0.00..958.28 rows=5 width=4) (actual time=0.228..5.411 rows=17 loops=1764)
                      Filter: ((class_id = $0) AND (((period_id = 101::numeric) AND ($1 >= 151::numeric)) OR ((period_id = 95::numeric) AND ($1 = 150::numeric))))

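【Comments】:

If I use AND NOT EXISTS instead of NOT IN, the whole thing runs really quickly, so I assume the problem is in the NOT IN clause. Any suggestions?

I have solved this by using NOT EXISTS instead of NOT IN. It is now super fast.

Do you really get the same result? I believe NOT EXISTS only checks whether the "inner" query returns any rows. Simply changing NOT IN to NOT EXISTS in your query should not actually work, because it would be a syntax error. Could you paste the EXPLAIN ANALYZE output for your original query?

Thanks for your reply, Petter. I have updated the question with the EXPLAIN ANALYZE results, and also included the NOT EXISTS statement that seems to help.

【Answer 1】: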

NOT EXISTS is a good choice, and almost always a better one than NOT IN. I simplified your query a bit (it generally looks fine):

SELECT DISTINCT ON (c.class_id, a.school_date)
       c.class_id, c.class_name, c.grade_id
      ,g.linked_calendar, aa.calendar_id
      ,a.school_date, a.minutes
      ,t.staff_id, s.first_name, s.last_name  
FROM   school_classes                c
JOIN   teacher_join_classes_subjects t  USING (class_id)
JOIN   staff                         s  USING (staff_id)
JOIN   school_gradelevels            g  ON g.id = c.grade_id 
JOIN   attendance_calendars          aa ON aa.title = g.linked_calendar 
JOIN   attendance_calendar           a  ON a.calendar_id = aa.calendar_id 
WHERE  t.syear = 2013
AND    s.syear = 2013
AND    aa.syear = 2013
AND    t.does_attendance = 'Y'   -- looks like it should be boolean!
AND    t.subject_id IS NULL 
AND    a.school_date < CURRENT_DATE 
AND NOT EXISTS (
   SELECT 1
   FROM   attendance_completed x
   WHERE  x.class_id = c.class_id
   AND    x.school_date = a.school_date
   AND   (x.period_id = 101 AND a.minutes >= 151 OR  -- actually numbers?
          x.period_id =  95 AND a.minutes  = 150)
   )
ORDER BY c.class_id, a.school_date, ???

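What seems to be missing is the ORDER BY that should accompany your DISTINCT ON. Add more ORDER BY items in place of ???. If there are duplicates to choose from, you probably want to define which one gets picked.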
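To see why the extra ORDER BY item matters, here is a small Python model of what DISTINCT ON does: sort the rows, then keep the first row of each (class_id, school_date) group. This is only a sketch of the semantics, not how PostgreSQL implements it; the sample rows, the distinct_on helper, and the choice of staff_id as the tiebreaker are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (class_id, school_date, staff_id). Two teachers share
# class 1 on the same date, so there is a duplicate to choose from.
rows = [
    (1, "2013-11-08", 7),
    (1, "2013-11-08", 3),
    (2, "2013-11-08", 5),
]

def distinct_on(rows, key, order):
    """Sort rows by `order`, then keep the first row of each `key` group."""
    ordered = sorted(rows, key=order)
    return [next(group) for _, group in groupby(ordered, key=key)]

# Models DISTINCT ON (class_id, school_date)
key = itemgetter(0, 1)

# Models ORDER BY class_id, school_date, staff_id
lowest_staff = distinct_on(rows, key, order=itemgetter(0, 1, 2))

# Models ORDER BY class_id, school_date, staff_id DESC
highest_staff = distinct_on(rows, key, order=lambda r: (r[0], r[1], -r[2]))

print(lowest_staff)
print(highest_staff)
```

The tiebreaker alone decides which teacher is reported for class 1. With only class_id and school_date in the ORDER BY, nothing fixes the order inside a group, so which of the duplicate rows is returned is not predictable.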

Numeric literals do not need single quotes, and boolean values should be stored as boolean. You may want to revisit the chapter about data types.
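On the question raised in the comments about whether NOT IN and NOT EXISTS return the same result: they do as long as the subquery never yields NULL, but they differ once it does. The sketch below shows this with Python's built-in sqlite3 module rather than PostgreSQL; the NULL handling of NOT IN is standard SQL behavior and is the same in both. The two tables are simplified, hypothetical stand-ins for attendance_calendar and attendance_completed. Note that NOT EXISTS needs the explicit correlation on school_date, just like the rewritten query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified stand-ins for the tables in the question.
    CREATE TABLE calendar  (school_date TEXT);
    CREATE TABLE completed (school_date TEXT);   -- nullable on purpose
    INSERT INTO calendar  VALUES ('2013-11-07'), ('2013-11-08');
    INSERT INTO completed VALUES ('2013-11-07'), (NULL);
""")

# NOT IN: a NULL in the subquery makes the test NULL (not TRUE) for every
# date that has no exact match, so those rows are filtered out as well.
not_in = conn.execute("""
    SELECT a.school_date
    FROM   calendar a
    WHERE  a.school_date NOT IN (SELECT school_date FROM completed)
""").fetchall()

# NOT EXISTS: only asks whether a matching row exists; the NULL row never
# matches, so the date without a completed entry is returned.
not_exists = conn.execute("""
    SELECT a.school_date
    FROM   calendar a
    WHERE  NOT EXISTS (
       SELECT 1 FROM completed x WHERE x.school_date = a.school_date)
""").fetchall()

print(not_in)       # []
print(not_exists)   # [('2013-11-08',)]
```

If attendance_completed.school_date can never be NULL, both forms return the same rows for this query, and the difference comes down to the execution plan.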

【Comments】:

Thanks for taking the time to add the extra information and links. I didn't know this.
