PG中的物理连接方式

Posted 2023-03-08 瀚高PG实验室

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了PG中的物理连接方式相关的知识，希望对你有一定的参考价值。

环境

文档用途

详细信息

环境

系统平台：Linux x86-64 Red Hat Enterprise Linux 7

版本：10.3

文档用途

PG物理连接方式的理论基础与相关实验性能对比

详细信息

一、物理连接方式概要

规划器和优化器中真正复杂的事情与表连接有关,根本上讲，两个表以任何方式连接都会生成相同的输出，区别在于执行效率。

PG支持以下3种表连接操作：

Nested Loop Join

Merge Join

Hash Join

1、Nested Loop Join（嵌套循环连接）

Nested Loop是扫描一个表（外表），每读到一条记录，就根据Join字段上的索引去另一张表（内表）里面查找，若Join字段上没有索引查询优化器一般就不会选择 Nested Loop；

内表在Join字段上建有索引；外表也叫“驱动表”，一般为小表，不仅相对其它表为小表，而且记录数的绝对值也较小，不要求有索引；

外表返回的每一行都要在内表中检索找到与它匹配的行，因此整个查询返回的结果集不能太大（大于1 万不适合）

若Join字段上没有索引查询优化器一般就不会选择 Nested Loop；

被连接的数据子集较小的情况，Nested Loop是个较好的选择。

嵌套循环的顺序扫描如下图：

嵌套循环的索引扫描如下图：

2、Merge Join（合并连接或归并连接）

Merge join的操作通常分三步：

1）对连接的每个表做table access full;

2）对table access full的结果进行排序；

3）进行merge join对排序结果进行合并；

在全表扫描比索引范围扫描再进行表访问更可取的情况下，Merge Join会比Nested Loop性能更佳；当表特别小或特别巨大的时候，实行全表访问可能会比索引范围扫描更有效；Merge Join的性能开销几乎都在前两步；Merge Join可适用于非等值Join（>，<，>=，<=，但是不包含!=，也即<>）；通常情况下Hash Join的效果都比排序合并连接要好，如果两表已经被排过序，在执行排序合并连接时不需要再排序了，这时Merge Join的性能会更优。

3、Hash Join（哈希连接或散列连接）

优化器使用两个比较的表，并利用连接键在内存中建立散列表，然后扫描较大的表并探测散列表，找出与散列表匹配的行，适用于较小的表完全可以放于内存中的情况，这样总成本就是访问两个表的成本之和；但如果表很大，不能完全放入内存，优化器会将它分割成若干不同的分区，把不能放入内存的部分写入磁盘的临时段，此时要有较大的临时段以便提高I/O的性能。只能应用于等值连接(如WHERE A.COL3 = B.COL4)，这是由Hash的特点决定的。哈希连接是归并连接的主要替代方案，哈希连接并不会对输入进行排序；能够很好的工作于没有索引的大表和并行查询的环境中，并提供最好的性能。是否比其他连接方式高效，取决于输入的内容是否经过排序。

二、物理连接的应用对比

类别	Nested Loop	Hash Join	Merge Join
使用条件	任何条件	等值连接（=）	等值或非等值连接(>，<，=，>=，<=)，‘<>’除外
相关资源	CPU、磁盘I/O	内存、临时空间	内存、临时空间
特点	当有高选择性索引或进行限制性搜索时效率比较高，能够快速返回第一次的搜索结果。	当缺乏索引或者索引条件模糊时，Hash Join比Nested Loop有效。通常比Merge Join快。在数据仓库环境下，如果表的记录数多，效率高。	当缺乏索引或者索引条件模糊时，Merge Join比Nested Loop有效。非等值连接时，Merge Join比Hash Join更有效
缺点	当索引丢失或者查询条件限制不够时，效率很低；当表的纪录数多时，效率低。	为建立哈希表，需要大量内存。第一次的结果返回较慢。	所有的表都需要排序。它为最优化的吞吐量而设计，并且在结果没有全部找到前不返回数据

三、实验：不同物理连接查询时间消耗对比

#构建测试数据库和表

create database db16;

\\c db16

You are now connected to database "db16" as user "postgres".

create table orders (

id serial,

placed date

);

create table items (

id serial,

order_id integer,

amount money

);

insert into orders(placed)

select current_date

from generate_series(1,1000000); #构造百万数据行

insert into items(order_id,amount)

select mod(s.id,1000000)+1, random()*100::money

from generate_series(1,10000000) as s(id); #构造千万数据行

alter table orders add constraint orders_pkey primary key(id);

alter table items add constraint items_pkey primary key(id);

alter table items add constraint items_id_fkey foreign key (order_id) references orders(id);

create index on items(order_id);

analyze orders;

analyze items;

#嵌套查询消耗（最佳执行计划）

explain analyze

select i.*

from orders o join items i on (o.id = i.order_id)

where o.id between 1 and 100;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=0.86..4532.11 rows=1021 width=16) (actual time=0.020..2.421 rows=1000 loops=1)

-> Index Only Scan using orders_pkey on orders o (cost=0.42..10.69 rows=113 width=4) (actual time=0.011..0.039 rows=100 loops=1) --外表

Index Cond: ((id >= 1) AND (id <= 100))

Heap Fetches: 100

-> Index Scan using items_order_id_idx on items i (cost=0.43..39.92 rows=9 width=16) (actual time=0.002..0.020 rows=10 loops=100) #内表items

Index Cond: (order_id = o.id)

Planning time: 2.937 ms

Execution time: 3.037 ms

(8 rows)

#对比哈希连接查询消耗

set enable_nestloop=off;

explain analyze

select i.* from orders o join items i on (o.id = i.order_id)

where o.id between 1 and 100;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------

Hash Join (cost=12.10..191576.23 rows=1021 width=16) (actual time=0.086..4659.744 rows=1000 loops=1)

Hash Cond: (i.order_id = o.id)

-> Seq Scan on items i (cost=0.00..154054.22 rows=9999922 width=16) (actual time=0.011..2297.875 rows=10000000 loops=1)

-> Hash (cost=10.69..10.69 rows=113 width=4) (actual time=0.067..0.067 rows=100 loops=1)

Buckets: 1024 Batches: 1 Memory Usage: 7kB

-> Index Only Scan using orders_pkey on orders o (cost=0.42..10.69 rows=113 width=4) (actual time=0.009..0.041 rows=100 loops=1)

Index Cond: ((id >= 1) AND (id <= 100))

Heap Fetches: 100

Planning time: 0.268 ms

Execution time: 4659.941 ms

(10 rows)

#hash连接查询消耗时间非常多，此查询使用hash连接很低效，适用循环嵌套连接方式。

#另一个查询计划和时间消耗对比

#合并连接查询聚合消耗

explain analyze

select o.id, sum(i.amount)

from orders o join items i on (o.id = i.order_id)

group by o.id;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------

GroupAggregate (cost=2.58..653492.70 rows=1000000 width=12) (actual time=0.050..14463.784 rows=1000000 loops=1)

Group Key: o.id

-> Merge Join (cost=2.58..598293.68 rows=9039804 width=12) (actual time=0.026..10862.823 rows=10000000 loops=1)

Merge Cond: (o.id = i.order_id)

-> Index Only Scan using orders_pkey on orders o (cost=0.42..28220.42 rows=1000000 width=4) (actual time=0.015..424.372 rows=1000000 loops=1)

Heap Fetches: 1000000

-> Index Scan using items_order_id_idx on items i (cost=0.43..452205.19 rows=9999922 width=12) (actual time=0.007..5723.775 rows=10000000 loops=1)

Planning time: 0.897 ms

Execution time: 14653.730 ms

(9 rows)

#嵌套连接查询聚合消耗

set enable_nestloop=on;

set enable_mergejoin=off;

set enable_hashjoin=off;

explain analyze

select o.id, sum(i.amount)

from orders o join items i on (o.id = i.order_id)

group by o.id;

更多详细信息等登陆【瀚高支持平台查看】瀚高技术支持平台

以上是关于PG中的物理连接方式的主要内容，如果未能解决你的问题，请参考以下文章