Why is NOT IN much slower than IN in a SQL query?

Posted 2023-03-29

Asked: 2018-04-06 21:21:33

Question:

I found something surprising (at least to me) about IN and NOT IN. When I run EXPLAIN on the first query against a PostgreSQL database:

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it gives me this:

Delete on audittaskimpl l  (cost=2794.48..6373.52 rows=50859 width=12)
  ->  Hash Semi Join  (cost=2794.48..6373.52 rows=50859 width=12)
        Hash Cond: (l.processinstanceid = spl.processinstanceid)
        ->  Seq Scan on audittaskimpl l  (cost=0.00..2005.59 rows=50859 width=14)
        ->  Hash  (cost=1909.24..1909.24 rows=50899 width=14)
              ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=14)
                    Filter: (status <> ALL ('{2,3}'::integer[]))

However, when I swap IN for NOT IN, which is just a negation:

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId NOT in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it gives me this:

Delete on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
  ->  Seq Scan on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Materialize  (cost=0.00..2362.73 rows=50899 width=8)
                ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=8)
                      Filter: (status <> ALL ('{2,3}'::integer[]))

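As you can see, with IN the planner uses a hash join, which is of course much faster, but with NOT IN it falls back to a plain sequential scan that evaluates the subplan row by row. Since NOT IN is just a negation, I would expect it to use a hash join again and simply do the opposite. With IN, a row is added to the result when its processInstanceId is found in the nested select and skipped otherwise. With NOT IN, a row is skipped when its processInstanceId is found in the nested select and added otherwise.

So can you explain why this happens? To clarify: AuditTaskImpl has a processInstanceId attribute that also exists in the ProcessInstanceLog table, although there is no foreign key relationship between them.

Thanks.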

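Comments on the question:

NOT IN has to account for NULL values that may appear in any row. Use NOT EXISTS instead.

NOT IN is not just a negation; it involves three-valued logic. See en.wikipedia.org/wiki/Three-valued_logic or, for example, modern-sql.com/concept/three-valued-logic

In short: NOT IN is evil.

Answer 1: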

The semantics of NOT IN require that nothing be returned if any value in the subquery is NULL. Postgres therefore has to look at all the values.
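The NULL rule described here can be reproduced on a tiny data set. The following is a minimal sketch, not the poster's schema: it assumes Python's built-in sqlite3 module with an in-memory SQLite database instead of PostgreSQL (the NULL semantics of NOT IN are the same), and the `audit` and `log` tables and the `not_in_rows` helper are hypothetical names made up for illustration.

```python
import sqlite3


def not_in_rows(log_values):
    """Return the audit ids that are NOT IN the given log ids.

    Uses a throwaway in-memory SQLite database; the tables are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE audit (pid INTEGER)")
    conn.execute("CREATE TABLE log (pid INTEGER)")
    conn.executemany("INSERT INTO audit VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO log VALUES (?)", [(v,) for v in log_values])
    rows = conn.execute(
        "SELECT pid FROM audit "
        "WHERE pid NOT IN (SELECT pid FROM log) "
        "ORDER BY pid"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


# Subquery yields only non-NULL values: ids 2 and 3 are returned.
without_null = not_in_rows([1])

# A single NULL in the subquery makes every non-matching comparison
# evaluate to NULL, so no row passes the WHERE clause.
with_null = not_in_rows([1, None])

print(without_null)
print(with_null)
```

Because one NULL anywhere in the subquery output changes the result for every outer row, the database cannot stop at the first non-match, which is consistent with the need to look at all values.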

I strongly recommend against using NOT IN with a subquery. Always use NOT EXISTS:

DELETE FROM AuditTaskImpl l
    WHERE NOT EXISTS (SELECT 1
                      FROM ProcessInstanceLog spl
                      WHERE l.processInstanceId = spl.processInstanceId AND
                            spl.status NOT IN (2, 3)
                     );
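To see that the two forms are not interchangeable once a NULL appears, the sketch below runs both against the same rows. It is an illustration under stated assumptions: SQLite through Python's sqlite3 module stands in for PostgreSQL, it uses SELECT instead of DELETE so the surviving ids can be inspected, and the `audit` and `log` tables and the `ids` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit (pid INTEGER);
    CREATE TABLE log (pid INTEGER, status INTEGER);
    INSERT INTO audit VALUES (1), (2), (3);
    -- One log row has a NULL pid; the row with status 3 is filtered out.
    INSERT INTO log VALUES (1, 1), (NULL, 1), (2, 3);
    """
)


def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Correlated NOT EXISTS: the NULL pid never equals anything,
# so it simply fails to match and does not affect other rows.
not_exists_ids = ids(
    """
    SELECT a.pid FROM audit a
    WHERE NOT EXISTS (SELECT 1 FROM log l
                      WHERE a.pid = l.pid AND l.status NOT IN (2, 3))
    ORDER BY a.pid
    """
)

# NOT IN: the subquery returns {1, NULL}, so no row can be proven absent.
not_in_ids = ids(
    """
    SELECT a.pid FROM audit a
    WHERE a.pid NOT IN (SELECT l.pid FROM log l
                        WHERE l.status NOT IN (2, 3))
    ORDER BY a.pid
    """
)

print(not_exists_ids)
print(not_in_ids)
```

NOT EXISTS only asks whether a matching row exists, so it maps naturally onto an anti join, while NOT IN has to preserve the NULL behavior shown in the second query.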

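Comments on the answer:

So you are saying that 1 IN (1,2,3,NULL) is true and 1 NOT IN (2,3,NULL) is false? I can clearly see that 1 is present in the first example and absent in the second.

@Xenon: 1 NOT IN (2,3,NULL) is not false, it is NULL. When an "unknown" value is present, you cannot conclude that 1 is not in the set.

@Marth OK, I read a few pages and yes, you are right: it is NULL/UNKNOWN, so it is not true, so the row will not be selected by the query, correct?

@Xenon . . . NULL is treated as false in WHERE clauses and CASE expressions (but not in CHECK constraints).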
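The expressions discussed in these comments can be evaluated directly. This is a small sketch under assumptions: it uses SQLite via Python's sqlite3 module, not PostgreSQL, and the table `t` is a hypothetical name. SQLite reports true as 1 and SQL NULL as Python's None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1 is found in the list, so the NULL does not matter.
in_result = conn.execute("SELECT 1 IN (1, 2, 3, NULL)").fetchone()[0]

# 1 is not found, but the NULL might be 1: the result is NULL, not false.
not_in_result = conn.execute("SELECT 1 NOT IN (2, 3, NULL)").fetchone()[0]

# In a WHERE clause a NULL condition filters the row out.
where_rows = conn.execute(
    "SELECT 'kept' WHERE 1 NOT IN (2, 3, NULL)"
).fetchall()

# In a CASE expression a NULL condition falls through to ELSE.
case_result = conn.execute(
    "SELECT CASE WHEN 1 NOT IN (2, 3, NULL) THEN 'yes' ELSE 'no' END"
).fetchone()[0]

# In a CHECK constraint a NULL result does not count as a violation.
conn.execute("CREATE TABLE t (x INTEGER CHECK (x NOT IN (2, 3, NULL)))")
conn.execute("INSERT INTO t VALUES (1)")  # check evaluates to NULL: accepted
try:
    conn.execute("INSERT INTO t VALUES (2)")  # check evaluates to false
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
check_rows = [row[0] for row in conn.execute("SELECT x FROM t")]

print(in_result, not_in_result, where_rows, case_result, rejected, check_rows)
```

The value 1 is accepted by the CHECK constraint even though the same expression excludes the row in a WHERE clause, which is the asymmetry the last comment points out.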
