Performance of max() vs ORDER BY DESC + LIMIT 1

Posted 2023-02-24

[Title] Performance of max() vs ORDER BY DESC + LIMIT 1 [Posted] 2016-03-18 17:11:03 [Question]:

I was troubleshooting a few slow SQL queries today and do not quite understand the performance difference below:

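When extracting max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp DESC LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.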

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5;
(1 row)
Time: 84.087 ms

There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for the two cases:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

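So my two questions are:

1. Where does this performance difference come from?
2. Is there a better way to get good performance in both cases (matching row and no matching row) than adding an EXISTS check?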

EDIT to address the questions in the comments below. I have kept the initial query plans above for future reference:

Table definitions:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I just added ind_station_id on sensors after @Erwin's suggestion below. The timings have not really changed much: still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.

Query plans:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

So, as explained earlier, the ORDER BY variant does an Index Scan using timestamp_ind on data (all 4710976 rows), which is not done in the MAX case.
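For readers who want to reproduce plans with actual timings and buffer counts like the ones above, a minimal sketch follows. It reuses the query from the question; note that ANALYZE really executes the statement, so the slow variant will take its full run time.

```sql
-- Show the chosen plan together with actual row counts, timings and buffer usage.
-- ANALYZE runs the query for real; BUFFERS adds the "shared hit/read" lines.
EXPLAIN (ANALYZE, BUFFERS)
SELECT timestamp
FROM   data
JOIN   sensors ON sensors.id = data.sensor_id
WHERE  sensors.station_id = 4
ORDER  BY timestamp DESC
LIMIT  1;
```

Replacing the SELECT with the MAX(timestamp) form gives the second plan for comparison.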

Postgres version: Postgres from the Ubuntu repositories: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that NOT NULL constraints are in place, so ORDER BY does not have to sort over NULL values.

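Note also that I am largely interested in where the difference comes from. While not ideal, I can retrieve the data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).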
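The two-step workaround mentioned above might look roughly like the sketch below. The exact EXISTS statement is not shown in the question, so its form here is an assumption; the second statement is the ORDER BY query from the top of the post.

```sql
-- Step 1 (assumed form): cheap check whether the station has any data rows at all.
SELECT EXISTS (
   SELECT 1
   FROM   data
   JOIN   sensors ON sensors.id = data.sensor_id
   WHERE  sensors.station_id = 4
   ) AS has_rows;

-- Step 2: run this only when has_rows is true, to avoid the slow "no match" case.
SELECT timestamp
FROM   data
JOIN   sensors ON sensors.id = data.sensor_id
WHERE  sensors.station_id = 4
ORDER  BY timestamp DESC
LIMIT  1;
```

This costs an extra round trip and leaves the branching to the application, which is why it is described as not ideal.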

[Comments]:

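Please show explain (buffers, analyze), not just plain explain, as well as the output of SELECT version();

While the query says WHERE sensor.station_id = 4, the EXPLAIN output has Filter: (installation_id = 4). Please clarify. Also, you say you have an index on (sensor_id, timestamp), but the index name in the EXPLAIN output is timestamp_sensor_ind, which suggests the opposite order. Please provide exact index and table definitions for data and sensors, including all constraints: what you get with \d tbl in psql, or complete CREATE TABLE scripts.

Sorry guys, I tried to clean up the naming a bit but obviously copied the wrong version :( The index actually has a cryptic name that I gave it during testing, and it is on (sensor_id, timestamp). The filter is on station_id. I will get the output of explain (buffers, analyze) tonight.

Don't forget table and index definitions and the Postgres version. We are not notified of your reply unless you @-reply.

@ErwinBrandstetter Edited above. It got rather long, but hopefully it helps to understand what is going on here.

[Answer 1]: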

There does not seem to be an index on sensors.station_id, which is very likely important here.

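There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a row with timestamp IS NULL if one exists, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp.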
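The NULL handling described here can be checked with a small throwaway table. This is only an illustrative sketch for PostgreSQL; the table name null_demo and its contents are made up for the demonstration and are not part of the question's schema.

```sql
-- Hypothetical demo table with one NULL timestamp (illustration only).
CREATE TEMP TABLE null_demo (ts timestamptz);
INSERT INTO null_demo VALUES
   ('2016-03-01 00:00+00'),
   (NULL),
   ('2016-03-18 00:00+00');

-- NULLs sort first in descending order, so this returns the NULL row.
SELECT ts FROM null_demo ORDER BY ts DESC LIMIT 1;

-- NULLS LAST pushes NULLs to the end, so this returns the 2016-03-18 row.
SELECT ts FROM null_demo ORDER BY ts DESC NULLS LAST LIMIT 1;

-- max() ignores NULLs and also returns the 2016-03-18 value.
SELECT max(ts) FROM null_demo;
```

Only the first query differs, which is why the two forms are interchangeable only when the column cannot be NULL or NULLS LAST is spelled out.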

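For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY of the LIMIT query should still serve you best. I suggest these indexes (my queries below build on the second one):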

sensors(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)
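Written out as DDL, the two suggested indexes could look like the sketch below. The index names are hypothetical, chosen only for illustration. The first index duplicates ind_station_id from the question's edit, so it is only needed where that index does not already exist.

```sql
-- Hypothetical index names; the column lists follow the suggestion above.
CREATE INDEX sensors_station_id_id_idx
   ON sensors (station_id, id);

-- "timestamp" is quoted here as in the \d output of the question.
CREATE INDEX data_sensor_id_timestamp_idx
   ON data (sensor_id, "timestamp" DESC NULLS LAST);
```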

You can drop the other index variants sensor_ind_timestamp and sensor_ind_timestamp_desc unless you have other queries that still need them (unlikely, but possible).

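Much more importantly, there is another difficulty: the filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output. The perfect technique would be a loose index scan on the second table data, which is not currently implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query to work around this limitation in various ways. Details: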

Optimize GROUP BY query to retrieve latest record per user

The best should be:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

Since the style of the outer query is mostly irrelevant, you can also just use:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

The max() variant should perform about as fast now:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even, shortest of all:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

Note the double parentheses!

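An additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).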
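As a sketch of that last point, the LATERAL form can return the whole latest row. The columns sensor_id and avg are taken from the table definition in the question; which columns are actually wanted is an assumption made for illustration.

```sql
-- Latest row (not just the latest timestamp) across all sensors of station 4.
SELECT d.sensor_id, d.timestamp, d.avg
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT sensor_id, timestamp, avg   -- any columns of data can be listed here
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;
```

The max() variants cannot do this directly, because an aggregate returns only the value itself and not the row it came from.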