通用表表达式实现 UPDATE/DELETE LIMIT

Posted 2021-05-29 不剪发的Tony老师

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了通用表表达式实现 UPDATE/DELETE LIMIT相关的知识，希望对你有一定的参考价值。

limit

文章目录

🍺少而好学，如日出之阳；壮而好学，如日中之光；志而好学，如炳烛之光。——刘向

大家好！我是只谈技术不剪发的 Tony 老师。

在 SQL 查询语句中，我们可以使用 FETCH（标准语法）、LIMIT（mysql、PostgreSQL、SQLite）、TOP（Microsoft SQL Server）、ROWNUM（Oracle）限定返回结果的数量，从而实现 Top-N 以及分页查询等功能。同时，MySQL 和 SQLite 还支持在 UPDATE 和 DELETE 语句中使用 LIMIT 子句限定更新和删除的行数。但是这种方法不适用于其他数据库，因此本文介绍一种更加通用的实现方法。

如果觉得文章有用，欢迎评论📝、点赞👍、推荐🎁

使用场景

在介绍具体方法之前，我们先来了解一下这种语法的使用场景。简单来说，它的主要作用是将大型事务分解为多个小的事务，包括：

批量更新和删除，减少锁资源的争用，限制对系统的影响。
减少数据文件和重做日志磁盘的使用。
分区之间的批量数据迁移。

通用表表达式

我们首先创建一个测试表和示例数据：

CREATE TABLE big_table (id INT PRIMARY KEY, v VARCHAR(10));

WITH RECURSIVE d(n) AS (
  SELECT 1
  UNION ALL 
  SELECT n + 1 FROM d WHERE n < 100000
)
INSERT INTO big_table
SELECT n, concat('id:', n) FROM d;

SELECT * FROM big_table ORDER BY id LIMIT 10;
id|v    |
--+-----+
 1|id:1 |
 2|id:2 |
 3|id:3 |
 4|id:4 |
 5|id:5 |
 6|id:6 |
 7|id:7 |
 8|id:8 |
 9|id:9 |
10|id:10|

我们使用递归通用表表达式（WITH 子句）为示例表 big_table 生成了 100000 条记录。

📝关于通用表表达式的语法和使用案例，可以参考这篇文章和这篇文章。

由于 SELECT 语句支持查询结果的限制，我们可以将一个这样的查询语句的结果用于 DELETE 或者 UPDATE 语句，作为删除或者更新的过滤条件。使用通用表表达式实现的语法如下：

WITH deleted_rows AS (
  SELECT id
  FROM big_table
  LIMIT 10
)
DELETE FROM big_table
WHERE id IN (SELECT id FROM deleted_rows);

我们首先使用 CTE 构造了一个临时表 deleted_rows，其中包含了想要删除的数据。然后在 DELETE 语句中使用子查询引用了这个临时表作为判断条件。我们可以查看一下以上语句的执行计划（以 PostgreSQL 为例）：

EXPLAIN
WITH deleted_rows AS (
  SELECT id
  FROM big_table
  LIMIT 10
)
DELETE FROM big_table
WHERE id IN (SELECT id FROM deleted_rows);

QUERY PLAN                                                                                               |
---------------------------------------------------------------------------------------------------------+
Delete on big_table  (cost=0.57..83.48 rows=0 width=0)                                                   |
  ->  Nested Loop  (cost=0.57..83.48 rows=10 width=34)                                                   |
        ->  HashAggregate  (cost=0.28..0.38 rows=10 width=32)                                            |
              Group Key: deleted_rows.id                                                                 |
              ->  Subquery Scan on deleted_rows  (cost=0.00..0.25 rows=10 width=32)                      |
                    ->  Limit  (cost=0.00..0.15 rows=10 width=4)                                         |
                          ->  Seq Scan on big_table big_table_1  (cost=0.00..1541.00 rows=100000 width=4)|
        ->  Index Scan using big_table_pkey on big_table  (cost=0.29..8.31 rows=1 width=10)              |
              Index Cond: (id = deleted_rows.id)                                                         |

CTE 的实现采用了全表顺序扫描（Seq Scan on big_table），因为我们只需要返回任意的 10 条记录。我们也可以使用关联子查询替代 IN 子句：

EXPLAIN 
WITH deleted_rows AS (
  SELECT id
  FROM big_table
  LIMIT 10
)
DELETE FROM big_table
WHERE EXISTS (SELECT * FROM deleted_rows d WHERE d.id = big_table.id);

QUERY PLAN                                                                                               |
---------------------------------------------------------------------------------------------------------+
Delete on big_table  (cost=0.57..83.48 rows=0 width=0)                                                   |
  ->  Nested Loop  (cost=0.57..83.48 rows=10 width=34)                                                   |
        ->  HashAggregate  (cost=0.28..0.38 rows=10 width=32)                                            |
              Group Key: d.id                                                                            |
              ->  Subquery Scan on d  (cost=0.00..0.25 rows=10 width=32)                                 |
                    ->  Limit  (cost=0.00..0.15 rows=10 width=4)                                         |
                          ->  Seq Scan on big_table big_table_1  (cost=0.00..1541.00 rows=100000 width=4)|
        ->  Index Scan using big_table_pkey on big_table  (cost=0.29..8.31 rows=1 width=10)              |
              Index Cond: (id = d.id)                                                                    |

两者的查询计划一样。

然后我们再为 CTE 增加一个基于 id 的排序操作：

EXPLAIN
WITH deleted_rows AS (
  SELECT id
  FROM big_table
  ORDER BY id
  LIMIT 10
)
DELETE FROM big_table
WHERE id IN (SELECT id FROM deleted_rows);

QUERY PLAN                                                                                                                           |
-------------------------------------------------------------------------------------------------------------------------------------+
Delete on big_table  (cost=0.97..83.88 rows=0 width=0)                                                                               |
  ->  Nested Loop  (cost=0.97..83.88 rows=10 width=34)                                                                               |
        ->  HashAggregate  (cost=0.68..0.78 rows=10 width=32)                                                                        |
              Group Key: deleted_rows.id                                                                                             |
              ->  Subquery Scan on deleted_rows  (cost=0.29..0.65 rows=10 width=32)                                                  |
                    ->  Limit  (cost=0.29..0.55 rows=10 width=4)                                                                     |
                          ->  Index Only Scan using big_table_pkey on big_table big_table_1  (cost=0.29..2604.29 rows=100000 width=4)|
        ->  Index Scan using big_table_pkey on big_table  (cost=0.29..8.31 rows=1 width=10)                                          |
              Index Cond: (id = deleted_rows.id)                                                                                     |

此时 CTE 的实现采用了索引扫描（Index Only Scan using big_table_pkey on big_table）。

对于 UPDATE 语句，我们同样可以使用这种方式模拟 LIMIT 子句。例如：

EXPLAIN ANALYZE 
WITH updated_rows AS (
  SELECT id
  FROM big_table
  ORDER BY id
  LIMIT 10
)
UPDATE big_table
SET v = 'new value'
WHERE EXISTS (SELECT * FROM updated_rows d WHERE d.id = big_table.id);

QUERY PLAN                                                                                                                                                                      |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Update on big_table  (cost=0.97..83.88 rows=0 width=0) (actual time=0.586..0.587 rows=0 loops=1)                                                                                |
  ->  Nested Loop  (cost=0.97..83.88 rows=10 width=72) (actual time=0.056..0.078 rows=10 loops=1)                                                                               |
        ->  HashAggregate  (cost=0.68..0.78 rows=10 width=32) (actual time=0.044..0.048 rows=10 loops=1)                                                                        |
              Group Key: d.id                                                                                                                                                   |
              Batches: 1  Memory Usage: 24kB                                                                                                                                    |
              ->  Subquery Scan on d  (cost=0.29..0.65 rows=10 width=32) (actual time=0.034..0.038 rows=10 loops=1)                                                             |
                    ->  Limit  (cost=0.29..0.55 rows=10 width=4) (actual time=0.019..0.021 rows=10 loops=1)                                                                     |
                          ->  Index Only Scan using big_table_pkey on big_table big_table_1  (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.018..0.019 rows=10 loops=1)|
                                Heap Fetches: 0                                                                                                                                 |
        ->  Index Scan using big_table_pkey on big_table  (cost=0.29..8.31 rows=1 width=10) (actual time=0.002..0.002 rows=1 loops=10)                                          |
              Index Cond: (id = d.id)                                                                                                                                           |
Planning Time: 0.232 ms                                                                                                                                                         |
Execution Time: 0.641 ms                                                                                                                                                        |

SELECT COUNT(*) 
FROM big_table 
WHERE v = 'new value';
count|
-----+
   10|

随机更新或者删除

如果我们想要更新或者删除随机的 N 条记录，可以在 CTE 中使用 ORDER BY RANDOM()，不过可能会影响操作的性能，因为查询需要对表中的全表数据进行排序。例如：

EXPLAIN ANALYZE 
WITH updated_rows AS (
  SELECT id
  FROM big_table
  ORDER BY random()
  LIMIT 10
)
UPDATE big_table
SET v = 'new value'
WHERE EXISTS (SELECT * FROM updated_rows d WHERE d.id = big_table.id);

QUERY PLAN                                                                                                                                      |
------------------------------------------------------------------------------------------------------------------------------------------------+
Update on big_table  (cost=3952.51..4035.42 rows=0 width=0) (actual time=49.711..49.713 rows=0 loops=1)                                         |
  CTE updated_rows                                                                                                                              |
    ->  Limit  (cost=3951.96..3951.99 rows=10 width=12) (actual time=49.438..49.441 rows=10 loops=1)                                            |
          ->  Sort  (cost=3951.96..4201.96 rows=100000 width=12) (actual time=49.433..49.434 rows=10 loops=1)                                   |
                Sort Key: (random())                                                                                                            |
                Sort Method: top-N heapsort  Memory: 25kB                                                                                       |
                ->  Seq Scan on big_table big_table_1  (cost=0.00..1791.00 rows=100000 width=12) (actual time=0.034..29.596 rows=100000 loops=1)|
  ->  Nested Loop  (cost=0.52..83.43 rows=10 width=72) (actual time=49.493..49.534 rows=10 loops=1)                                             |
        ->  HashAggregate  (cost=0.23..0.33 rows=10 width=32) (actual time=49.462..49.465 rows=10 loops=1)                                      |
              Group Key: d.id                                                                                                                   |
              Batches: 1  Memory Usage: 24kB                                                                                                    |
              ->  CTE Scan on updated_rows d  (cost=0.00..0.20 rows=10 width=32) (actual time=49.446..49.453 rows=10 loops=1)                   |
        ->  Index Scan using big_table_pkey on big_table  (cost=0.29..8.31 rows=1 width=10) (actual time=0.006..0.006 rows=1 loops=10)          |
              Index Cond: (id = d.id)                                                                                                           |
Planning Time: 0.333 ms                                                                                                                         |
Execution Time: 49.820 ms                                                                                                                       |

此时，数据库需要遍历所有的数据并创建随机数，我们的执行时间从 0.641 ms 上升到了 49.820 ms。

当我们基于非索引字段排序时的情况也是如此，例如：

EXPLAIN ANALYZE 
WITH updated_rows AS (
  SELECT id
  FROM big_table
  ORDER BY v
  LIMIT 10
)
UPDATE big_table
SET v = 'new value'
WHERE EXISTS (SELECT * FROM updated_rows d WHERE d.id = big_table.id);

QUERY PLAN                                                                                                                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------------+
Update on big_table  (cost=3702.21..5505.83 rows=0 width=0) (actual time=88.642..88.647 rows=0 loops=1)                                                        |
  ->  Hash Semi Join  (cost=3702.21..5505.83 rows=10 width=72) (actual time=19.764..41.442 rows=10 loops=1)                                                    |
        Hash Cond: (big_table.id = d.id)                                                                                                                       |
        ->  Seq Scan on big_table  (cost=0.00..1541.00 rows=100000 width=10) (actual time=0.044..13.495 rows=100000 loops=1)                                   |
        ->  Hash  (cost=3702.09..3702.09 rows=10 width=32) (actual time=19.652..19.656 rows=10 loops=1)                                                        |
              Buckets: 1024  Batches: 1  Memory Usage: 9kB                                                                                                     |
              ->  Subquery Scan on d  (cost=3701.96..3702.09 rows=10 width=32) (actual time=19.629..19.641 rows=10 loops=1)                                    |
                    ->  Limit  (cost=3701.96..3701.99 rows=10 width=12) (actual time=19.602..19以上是关于通用表表达式实现 UPDATE/DELETE LIMIT的主要内容，如果未能解决你的问题，请参考以下文章