How do I remove the nested loop from this multiple join query in PostgreSQL 10.3

Posted 2023-04-14

[Asked]: 2018-03-14 03:54:27

[Question]:

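I have a table called sources that contains prices, and another table called destinations that contains another set of values. I need every destination for every source, so I do a cross join that pairs each row of the sources table with each row of the destinations table and multiplies their prices. source_id and destination_id are primary keys. I then want to inner join this result with another table, and that currently gives me a nested loop.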

Approach 1

-- has a nested loop
EXPLAIN SELECT * FROM 
(select concat(s.source_id, ':', d.destination_id) AS pair_id, 
(s.price * d.price) AS pair_price 
FROM e1_sources s 
CROSS JOIN e1_destinations d) AS p
INNER JOIN e1_alerts a
ON a.pair=p.pair_id
WHERE 
(p.pair_price > a.value AND a.direction=true) OR
(p.pair_price <= a.value AND a.direction=false)

Approach 2

-- has a nested loop
EXPLAIN WITH pairs AS 
(
    SELECT 
    concat(s.source_id, ':', d.destination_id) AS pair_id,
    (s.price * d.price) AS pair_price
    FROM e1_sources s
    CROSS JOIN e1_destinations d
)
SELECT * from pairs p
INNER JOIN e1_alerts a
ON p.pair_id=a.pair
WHERE 
(p.pair_price > a.value AND a.direction=true) OR 
(p.pair_price <= a.value AND a.direction=false)

Approach 1 EXPLAIN output

"Hash Join  (cost=3697.72..210978.26 rows=1297875 width=114)"
"  Hash Cond: (concat(s.source_id, ':', d.destination_id) = (a.pair)::text)"
"  Join Filter: ((((s.price * d.price) > a.value) AND a.direction) OR (((s.price * d.price) <= a.value) AND (NOT a.direction)))"
"  ->  Nested Loop  (cost=0.00..19303.43 rows=1540440 width=70)"
"        ->  Seq Scan on e1_sources s  (cost=0.00..25.56 rows=1556 width=16)"
"        ->  Materialize  (cost=0.00..24.85 rows=990 width=54)"
"              ->  Seq Scan on e1_destinations d  (cost=0.00..19.90 rows=990 width=54)"
"  ->  Hash  (cost=2025.00..2025.00 rows=75098 width=50)"
"        ->  Seq Scan on e1_alerts a  (cost=0.00..2025.00 rows=75098 width=50)"
"              Filter: (direction OR (NOT direction))"

Approach 2 EXPLAIN output

"Hash Join  (cost=56349.38..649740.92 rows=7089424 width=114)"
"  Hash Cond: (p.pair_id = (a.pair)::text)"
"  Join Filter: (((p.pair_price > a.value) AND a.direction) OR ((p.pair_price <= a.value) AND (NOT a.direction)))"
"  CTE pairs"
"    ->  Nested Loop  (cost=0.00..19378.74 rows=1104760 width=64)"
"          ->  Seq Scan on e1_sources s  (cost=0.00..26.56 rows=1556 width=16)"
"          ->  Materialize  (cost=0.00..20.65 rows=710 width=54)"
"                ->  Seq Scan on e1_destinations d  (cost=0.00..17.10 rows=710 width=54)"
"  ->  CTE Scan on pairs p  (cost=0.00..22095.20 rows=1104760 width=64)"
"  ->  Hash  (cost=20248.06..20248.06 rows=751007 width=50)"
"        ->  Seq Scan on e1_alerts a  (cost=0.00..20248.06 rows=751007 width=50)"
"              Filter: (direction OR (NOT direction))"

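However, if I keep a separate table that stores the cross-join product keyed by pair_id and inner join against that, the plan contains only a hash join and the query takes only milliseconds.

Approach 3

I have a materialized view called pairs that holds the cross join of sources and destinations, with the concatenated pair_id as its primary key. The inner join now takes about 92 ms (see the execution time in the plan below) because no nested loop is performed.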

EXPLAIN ANALYZE 
SELECT * from pairs p 
INNER JOIN e1_alerts a
ON p.pair_id = a.pair 
WHERE 
(p.pair_price > a.value AND a.direction=true) OR
(p.pair_price <= a.value AND a.direction=false)

Approach 3 EXPLAIN ANALYZE output

"Hash Join  (cost=1459.32..4892.41 rows=30566 width=73) (actual time=14.048..92.158 rows=498 loops=1)"
"  Hash Cond: ((a.pair)::text = p.pair_id)"
"  Join Filter: (((p.pair_price > a.value) AND a.direction) OR ((p.pair_price <= a.value) AND (NOT a.direction)))"
"  Rows Removed by Join Filter: 99502"
"  ->  Seq Scan on e1_alerts a  (cost=0.00..2025.00 rows=75098 width=50) (actual time=0.010..16.658 rows=100000 loops=1)"
"        Filter: (direction OR (NOT direction))"
"  ->  Hash  (cost=836.92..836.92 rows=49792 width=23) (actual time=13.736..13.736 rows=49792 loops=1)"
"        Buckets: 65536  Batches: 1  Memory Usage: 3245kB"
"        ->  Seq Scan on pairs p  (cost=0.00..836.92 rows=49792 width=23) (actual time=0.005..5.029 rows=49792 loops=1)"
"Planning time: 0.494 ms"
"Execution time: 92.262 ms"

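A few questions

Do approaches 1 and 2 use a nested loop because the planner does not know that pair_id is unique? Is there any way to tell PostgreSQL that a particular column produced by the cross join is unique? Is there no other way than using a materialized view? My sources x destinations table will hold up to 80,000 values that need to be updated every x minutes, and I don't want to send that many updates to the database. If I only send the sources and destinations, about 2,000 values, I can generate the pairs table from the cross join.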
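For reference, the materialized view from Approach 3 could be maintained roughly as follows, so that only the roughly 2,000 source and destination rows are sent to the database. This is a sketch under assumptions: the view definition and the index name pairs_pair_id_idx are not shown in the question and are illustrative, and the ids are assumed not to contain ':' so that pair_id stays unique.

```sql
-- Assumed definition of the "pairs" materialized view from Approach 3
CREATE MATERIALIZED VIEW pairs AS
SELECT concat(s.source_id, ':', d.destination_id) AS pair_id,
       s.price * d.price                          AS pair_price
FROM e1_sources s
CROSS JOIN e1_destinations d;

-- A materialized view cannot carry a PRIMARY KEY constraint; a unique index
-- is the closest equivalent and is required for REFRESH ... CONCURRENTLY.
-- (pairs_pair_id_idx is an illustrative name.)
CREATE UNIQUE INDEX pairs_pair_id_idx ON pairs (pair_id);

-- After updating e1_sources / e1_destinations, rebuild the pairs inside
-- the database instead of sending 80,000 row updates from the client.
REFRESH MATERIALIZED VIEW CONCURRENTLY pairs;
```

The CONCURRENTLY option keeps the view readable during the refresh; a plain REFRESH MATERIALIZED VIEW also works but locks out readers while it runs.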

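[Comments]:

Note: most people don't like horizontal scrolling. (And I doubt you really need the subquery.)

@wildplasser It is because of the "select concat(s.source_id, ':', d.destination_id) as pair_id" part that I added a subquery. If you know a better way to do this, it would be super helpful if you could share it. Thanks.

It looks like you are using concat() to emulate a composite (primary) key: concat(s.source_id, ':', d.destination_id) as pair_id. Why?

The string "XYZ:ABC" is roughly what the concatenated column looks like. I have such a column in e1_alerts and I want to inner join against it.

Why doesn't e1_alerts have a composite key (or index)?

[Answer 1]: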

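OK, I found a solution that is about 100 times faster than what I tried above, but I don't know why. In approaches 1 and 2 I cross joined the two tables, which have no column in common. To turn this cross join into an inner join, I simply added a column holding the same repeated value to both tables and used it as the pretext for an inner join, and now the performance is drastically different!

Approach 4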

explain analyze SELECT * 
FROM 
(select concat(s.source_id, ':', d.destination_id) as pair_id, 
(s.price * d.price) as pair_price 
FROM e1_sources s 
INNER JOIN e1_destinations d 
ON s.destination_id=d.source_id) as p
INNER JOIN e1_alerts a
ON a.pair=p.pair_id
WHERE 
(p.pair_price > a.value AND a.direction=true) OR
(p.pair_price <= a.value AND a.direction=false)

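Is this a way of tricking the query optimizer into believing it is performing an inner join? Joining the same number of rows under the pretext of an inner join eliminated the nested loop entirely! I would appreciate it if anyone could explain why this happens.

Approach 4 EXPLAIN ANALYZE output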
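A likely explanation, offered as a sketch rather than a definitive diagnosis: hash joins and merge joins need an equality join condition, so a join with no condition at all (a CROSS JOIN) leaves the planner only the nested loop. Adding a dummy equality column makes the other strategies eligible without changing how many rows are produced. The session below assumes the e1_sources and e1_destinations tables from the question. enable_nestloop is a standard planner setting that discourages nested loops but cannot forbid them.

```sql
-- No join condition: the planner has no alternative to a nested loop here
EXPLAIN
SELECT s.source_id, d.destination_id
FROM e1_sources s
CROSS JOIN e1_destinations d;

-- Discourage nested loops for this session only
SET enable_nestloop = off;

-- The cross join is still expected to show a Nested Loop node, because
-- hash and merge joins are not applicable without an equality condition
EXPLAIN
SELECT s.source_id, d.destination_id
FROM e1_sources s
CROSS JOIN e1_destinations d;

-- Restore the default planner behaviour
RESET enable_nestloop;
```

Note that the plans in this post were taken against different e1_alerts sizes (100,000 rows in Approach 3, 10,000 rows in Approach 4), so their execution times are not directly comparable.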

"Hash Join  (cost=456.66..712.93 rows=1862 width=114) (actual time=4.702..67.509 rows=51 loops=1)"
"  Hash Cond: (concat(s.source_id, ':', d.destination_id) = (a.pair)::text)"
"  Join Filter: ((((s.price * d.price) > a.value) AND a.direction) OR (((s.price * d.price) <= a.value) AND (NOT a.direction)))"
"  Rows Removed by Join Filter: 9949"
"  ->  Merge Join  (cost=159.78..246.19 rows=5524 width=70) (actual time=0.630..13.783 rows=49792 loops=1)"
"        Merge Cond: ((d.source_id)::text = (s.destination_id)::text)"
"        ->  Sort  (cost=50.72..52.50 rows=710 width=86) (actual time=0.042..0.049 rows=32 loops=1)"
"              Sort Key: d.source_id"
"              Sort Method: quicksort  Memory: 27kB"
"              ->  Seq Scan on e1_destinations d  (cost=0.00..17.10 rows=710 width=86) (actual time=0.020..0.025 rows=32 loops=1)"
"        ->  Sort  (cost=109.06..112.95 rows=1556 width=20) (actual time=0.583..4.144 rows=49761 loops=1)"
"              Sort Key: s.destination_id"
"              Sort Method: quicksort  Memory: 167kB"
"              ->  Seq Scan on e1_sources s  (cost=0.00..26.56 rows=1556 width=20) (actual time=0.010..0.268 rows=1556 loops=1)"
"  ->  Hash  (cost=203.00..203.00 rows=7510 width=50) (actual time=3.507..3.507 rows=10000 loops=1)"
"        Buckets: 16384 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 949kB"
"        ->  Seq Scan on e1_alerts a  (cost=0.00..203.00 rows=7510 width=50) (actual time=0.013..1.771 rows=10000 loops=1)"
"              Filter: (direction OR (NOT direction))"
"Planning time: 0.251 ms"
"Execution time: 67.590 ms"

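[Comments]:

You are tricking the optimizer so badly that it has to resort to sort/merge + hash join. That may work for smaller result sets, but it will get ugly once you exceed (work_)mem. In real life you should really use (multiple, conditional) composite indexes on your junction table e1_alerts instead of a Cartesian product + join. And: to me, the two key columns of e1_alerts appear to be related.

Thanks for the suggestion, but I have a doubt: why use an index when the value of pair_price for each row in e1_alerts differs 99% of the time? Also, pair_id can have 50,000 distinct values in the worst case and only 1 in the best case. Since I have to check the WHERE condition against every row of e1_alerts, I think the query planner will skip the index anyway. What do you think?

I tried composite indexes, but the query planner did not use any of them. I tried all of their combinations.

You should match on the index keys, not the values. Once the matching record is found, the comparison value <--> value*value is almost free. ... You don't even have to compare values for keys that are not found!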
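The "match the keys, not the values" advice can be sketched as a query that starts from e1_alerts and looks up each side of the pair by key, so the full Cartesian product is never built. This is an illustration under assumptions not confirmed in the thread: a.pair always has the exact form 'source_id:destination_id', neither id contains ':', and both ids are text-like (add a cast if they are integers). Storing the two ids as separate columns in e1_alerts, as the comments suggest, would make the split_part calls unnecessary.

```sql
-- Drive the join from e1_alerts and resolve each half of the pair by key.
-- split_part(string, delimiter, n) returns the n-th field of the string.
SELECT a.*,
       s.price * d.price AS pair_price
FROM e1_alerts a
JOIN e1_sources s
  ON s.source_id = split_part(a.pair, ':', 1)        -- source half of the pair
JOIN e1_destinations d
  ON d.destination_id = split_part(a.pair, ':', 2)   -- destination half
WHERE (a.direction     AND s.price * d.price >  a.value)
   OR (NOT a.direction AND s.price * d.price <= a.value);
```

Because both joins now have equality conditions, the planner is free to choose hash joins or primary-key lookups; which plan it actually picks depends on the table statistics, so it is worth checking with EXPLAIN ANALYZE.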
