PostgreSQL Table Join Methods (with Oracle Comparison Tests)

Posted 2021-12-10, Highgo PG Lab (瀚高PG实验室)

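Author: Yang Yunlong (杨云龙), core member of Highgo PG Lab and senior database engineer, specializing in HGDB, PostgreSQL, Oracle, and other mainstream databases.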

An introduction to table join methods (Nested Loop / Hash Join / Merge Join) and join types. Database versions: Oracle 11.2.0.4 and PostgreSQL 13.1.

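Environment Setup
Nested Loop
- Oracle execution example
- PostgreSQL execution example
Hash Join
- Oracle example
- PostgreSQL example
Merge Join
- Oracle example
- PostgreSQL example
Pros and cons of the three join methods
Join types
Introduction to UNION
Self-join
References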

Environment Setup

-- Oracle 11.2.0.4
-- city and country tables
drop table country purge;
CREATE TABLE country (
country_id int primary key,
country_name VARCHAR(50) NOT NULL
);

drop table city purge;
CREATE TABLE city (
city_id int primary key,
city_name VARCHAR(50) NOT NULL,
country_id int NOT NULL
);

-- populate country with 10 rows
begin
  for i in 1 .. 10 loop
    insert into country values(i,'country'||i);
  end loop;
  commit;
end;
/

-- populate city with 10000 rows, 1000 cities per country
begin
  for i in 1 .. 10000 loop
    insert into city values(i,'city'||i,ceil(i/1000));
  end loop;
  commit;
end;
/

execute dbms_stats.gather_table_stats(ownname =>'SCOTT',tabname => 'CITY' ,estimate_percent=> 100 ,cascade => true);
execute dbms_stats.gather_table_stats(ownname =>'SCOTT',tabname => 'COUNTRY' ,estimate_percent=> 100 ,cascade => true);

-- PostgreSQL 13.1
drop table if exists country;
CREATE TABLE country (
country_id integer primary key,
country_name text NOT NULL
);

drop table if exists city;
CREATE TABLE city (
city_id integer primary key,
city_name text NOT NULL,
country_id integer NOT NULL
);

insert into country values (generate_series(1,10),'country'||generate_series(1,10));

insert into city values(generate_series(1,10000),'city'||generate_series(1,10000),ceil(random()*(10-1)+1));

analyze city;
analyze country;

Nested Loop

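The figure above illustrates the Nested Loop method. The algorithm: the driving table returns one row at a time and passes its join-column value to the driven table, so the driven table is scanned once for every row the driving table returns. The driven table is normally accessed through an index unique scan or an index range scan. A nested loop is a good choice when only a small subset of the driven table's data is needed, that is, for queries with small result sets. As a rule of thumb, more than about 10,000 rows counts as large, and the method becomes inefficient. The Oracle and PostgreSQL examples in this section illustrate this.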
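To make the "one probe per driving row" behaviour concrete, here is a minimal Python sketch of an index nested loop. It is an illustration only, not how either database implements the join: the tiny tables, the `index_nested_loop` function, and the use of a dict as a stand-in for the primary-key index are all assumptions made for this example.

```python
# Toy data shaped like the article's tables (hypothetical and much smaller).
country = [(1, "country1"), (2, "country2"), (3, "country3")]        # (country_id, country_name)
city = [(1, "city1", 1), (2, "city2", 1), (3, "city3", 2), (4, "city4", 3)]  # (city_id, city_name, country_id)


def index_nested_loop(outer_rows, inner_index, outer_key):
    """Join each driving (outer) row to the driven table through an index lookup."""
    probes = 0
    result = []
    for outer in outer_rows:                        # one pass over the driving table
        probes += 1                                 # the driven table is probed once per outer row
        inner = inner_index.get(outer_key(outer))   # stands in for an index unique scan
        if inner is not None:
            result.append((outer, inner))
    return result, probes


# A dict keyed on country_id plays the role of the primary-key index on country.
country_pk = {row[0]: row for row in country}

# Rough equivalent of:
#   ... WHERE city.country_id = country.country_id AND city.city_id >= 2
driving = [row for row in city if row[0] >= 2]
joined, probes = index_nested_loop(driving, country_pk, lambda c: c[2])
names = [(c[1], k[1]) for c, k in joined]
print(probes, names)
```

The probe count equals the number of driving rows that survive the filter, which is why the cost grows linearly once the driving side returns thousands of rows.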

Oracle execution example

SQL> explain plan for select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id=99;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2738185913

---------------------------------------------------------------------------------------------
| Id  | Operation		     | Name	    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	     |		    |	  1 |	 93 |	  2   (0)| 00:00:01 |
|   1 |  NESTED LOOPS		     |		    |	  1 |	 93 |	  2   (0)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| CITY	    |	  1 |	 53 |	  1   (0)| 00:00:01 |
|*  3 |    INDEX UNIQUE SCAN	     | SYS_C0018242 |	  1 |	    |	  1   (0)| 00:00:01 |
|   4 |   TABLE ACCESS BY INDEX ROWID| COUNTRY	    |	  1 |	 40 |	  1   (0)| 00:00:01 |
|*  5 |    INDEX UNIQUE SCAN	     | SYS_C0018239 |	  1 |	    |	  0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("CITY"."CITY_ID"=99)
   5 - access("CITY"."COUNTRY_ID"="COUNTRY"."COUNTRY_ID")

18 rows selected.

Statement comparison

-- For the following statement Oracle chooses a hash join with full table scans
explain plan for select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id>=10;

-- Force a nested loop join
explain plan for select /*+ leading(city) use_nl(country) */ city_name,country_name from city,country
 where city.country_id=country.country_id and city.city_id>=10;

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 103883790

---------------------------------------------------------------------------------------------
| Id  | Operation		     | Name	    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	     |		    |  9991 |	907K| 10005   (1)| 00:02:01 |
|   1 |  NESTED LOOPS		     |		    |  9991 |	907K| 10005   (1)| 00:02:01 |
|   2 |   NESTED LOOPS		     |		    |  9991 |	907K| 10005   (1)| 00:02:01 |
|*  3 |    TABLE ACCESS FULL	     | CITY	    |  9991 |	517K|	 11   (0)| 00:00:01 |
|*  4 |    INDEX UNIQUE SCAN	     | SYS_C0018239 |	  1 |	    |	  0   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| COUNTRY	    |	  1 |	 40 |	  1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("CITY"."CITY_ID">=10)
   4 - access("CITY"."COUNTRY_ID"="COUNTRY"."COUNTRY_ID")

Note
-----
   - dynamic sampling used for this statement (level=2)

22 rows selected.
-- As shown above, the driven table is probed 9991 times (10000 - 9), and resource consumption rises noticeably

PostgreSQL execution example

mydb=# explain analyze select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id=99;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.44..16.49 rows=1 width=40) (actual time=0.031..0.036 rows=1 loops=1)
   ->  Index Scan using city_pkey on city  (cost=0.29..8.30 rows=1 width=12) (actual time=0.015..0.017 rows=1 loops=1)
         Index Cond: (city_id = 99)
   ->  Index Scan using country_pkey on country  (cost=0.15..8.17 rows=1 width=36) (actual time=0.006..0.006 rows=1 loops=1)
         Index Cond: (country_id = city.country_id)
 Planning Time: 0.621 ms
 Execution Time: 0.098 ms
(7 rows)

-- Returning many rows: forcing a nested loop takes about 30 ms longer than the hash join chosen by default
mydb=# set enable_hashjoin=off;
SET
mydb=# set enable_mergejoin=off;
SET
mydb=# explain analyze select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id>=10;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.15..1934.29 rows=9991 width=40) (actual time=0.050..43.354 rows=9991 loops=1)
   ->  Seq Scan on city  (cost=0.00..188.00 rows=9991 width=12) (actual time=0.025..4.870 rows=9991 loops=1)
         Filter: (city_id >= 10)
         Rows Removed by Filter: 9
   ->  Index Scan using country_pkey on country  (cost=0.15..0.17 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=9991)
         Index Cond: (country_id = city.country_id)
 Planning Time: 0.393 ms
 Execution Time: 45.366 ms
(8 rows)

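Notes on Nested Loop in PostgreSQL

When PostgreSQL executes a query with a nested loop, it iterates over every entry in table 1 and, for each one, over every entry in table 2, emitting a row whenever the pair of rows satisfies the join condition. The nested loop is the only join algorithm PostgreSQL can use for any kind of join condition. For the statement below, for example, Oracle chooses a Merge Join while PostgreSQL chooses a Nested Loop; see the Merge Join section for details.

-- Even with a large amount of data, PostgreSQL still chooses Nested Loop for this statement
-- (note: city2 is not created in the setup above)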
explain  select a.city_name,b.city_id from city a,city2 b where a.country_id<b.country_id;
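The reason a nested loop can serve any join condition is that it simply evaluates the predicate on every pair of rows. The following Python sketch shows this for a `<` condition like the one in the statement above. The function name `nested_loop_join` and the sample rows are hypothetical, and the real executor is considerably more elaborate.

```python
def nested_loop_join(left, right, predicate):
    """Emit every (l, r) pair for which predicate(l, r) holds.

    Nothing here depends on the predicate being an equality test,
    so any comparison (or any boolean expression) can be used.
    """
    out = []
    comparisons = 0
    for l in left:            # iterate over all entries of table 1
        for r in right:       # ... and, for each, all entries of table 2
            comparisons += 1
            if predicate(l, r):
                out.append((l, r))
    return out, comparisons


a = [("city1", 1), ("city2", 2), ("city3", 3)]   # (city_name, country_id)
b = [(10, 2), (11, 3)]                            # (city_id, country_id)

# Rough equivalent of: select a.city_name, b.city_id ... where a.country_id < b.country_id
pairs, comparisons = nested_loop_join(a, b, lambda l, r: l[1] < r[1])
rows = [(l[0], r[0]) for l, r in pairs]
print(comparisons, rows)
```

The number of comparisons is the product of the two input sizes, which is the price paid for this generality.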

Hash Join

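A hash join obtains the result set mainly through hashing and supports only equality join conditions. The algorithm, for an equi-join of two tables that returns a large amount of data: the smaller table is chosen as the driving table, and its relevant columns are read into the work area in the PGA (PostgreSQL keeps them in memory governed by work_mem). The driving table's join column is hashed to build a hash table. The driven table is then read, its join column is hashed, and the hash table in the PGA is probed to find matching rows. If the hash table is too large to be built in memory in one pass, it is split into several partitions that are written to a temporary segment on disk; this adds write cost and lowers efficiency.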

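The figure above outlines the hash join flow. If statistics and related information are accurate, the database automatically chooses the best execution plan. In Oracle, when the ORDERED hint is used, the first table in the FROM clause is used to build the hash table.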
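The build and probe phases described above can be sketched in a few lines of Python. This is a simplified in-memory illustration under stated assumptions: the `hash_join` function and the sample rows are made up for this example, a Python dict stands in for the hash table, and the partition-and-spill-to-disk case is not modelled.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """In-memory equi-join: build a hash table on one input, probe it with the other."""
    # Build phase: hash the join column of the (smaller) driving table.
    table = defaultdict(list)
    for b in build_rows:
        table[build_key(b)].append(b)

    # Probe phase: read the driven table once and look up each join value.
    out = []
    for p in probe_rows:
        for b in table.get(probe_key(p), ()):
            out.append((p, b))
    return out


country = [(1, "country1"), (2, "country2")]                 # small table: build side
city = [(1, "city1", 1), (2, "city2", 1), (3, "city3", 2),
        (4, "city4", 9)]                                     # large table: probe side

matches = hash_join(country, city, lambda k: k[0], lambda c: c[2])
names = [(c[1], k[1]) for c, k in matches]
print(names)
```

Each input is read only once, which is why this method tends to win when both sides are large; the lookup by hash value is also the reason only equality conditions are supported.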

Oracle example

-- country is the driving table; whichever join type is used (left/right/full), Oracle chooses the small table as the driving table.
SQL> explain plan for select city_name,country_name from city,country where city.country_id=country.country_id;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 114462077

------------------------------------------------------------------------------
| Id  | Operation	   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	     | 10000 |	 781K|	  14   (0)| 00:00:01 |
|*  1 |  HASH JOIN	   |	     | 10000 |	 781K|	  14   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| COUNTRY |	  10 |	 400 |	   3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| CITY    | 10000 |	 390K|	  11   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("CITY"."COUNTRY_ID"="COUNTRY"."COUNTRY_ID")

Note
-----
   - dynamic sampling used for this statement (level=2)

19 rows selected.

PostgreSQL example

mydb=# explain analyze select city_name,country_name from city,country where city.country_id=country.country_id;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=38.58..227.91 rows=10000 width=40) (actual time=0.185..12.086 rows=10000 loops=1)
   Hash Cond: (city.country_id = country.country_id)
   ->  Seq Scan on city  (cost=0.00..163.00 rows=10000 width=12) (actual time=0.026..3.735 rows=10000 loops=1)
   ->  Hash  (cost=22.70..22.70 rows=1270 width=36) (actual time=0.064..0.067 rows=10 loops=1)
         Buckets: 2048  Batches: 1  Memory Usage: 17kB
         ->  Seq Scan on country  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.025..0.031 rows=10 loops=1)
 Planning Time: 1.953 ms
 Execution Time: 13.983 ms
(8 rows)

So that the hash table can be built from the small table, the optimizer may transform the join; for example, a right join is rewritten as a left join.

mydb=# explain analyze select * from country right join city on city.country_id=country.country_id;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=38.58..227.91 rows=10000 width=52) (actual time=0.087..11.883 rows=10000 loops=1)
   Hash Cond: (city.country_id = country.country_id)
   ->  Seq Scan on city  (cost=0.00..163.00 rows=10000 width=16) (actual time=0.020..3.035 rows=10000 loops=1)
   ->  Hash  (cost=22.70..22.70 rows=1270 width=36) (actual time=0.044..0.047 rows=10 loops=1)
         Buckets: 2048  Batches: 1  Memory Usage: 17kB
         ->  Seq Scan on country  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.024..0.028 rows=10 loops=1)
 Planning Time: 0.413 ms
 Execution Time: 13.588 ms
(8 rows)

Merge Join

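A merge join (sort-merge join) obtains the result set by sorting the two tables and then merging them. In Oracle it is used mainly for non-equality join conditions; PostgreSQL's Merge Join handles only equality conditions, which is why PostgreSQL falls back to a Nested Loop for the `<` join shown in the Nested Loop section. The algorithm: first sort each table on its join column, then merge the two sorted inputs. A nested loop matches rows through the driven table's index, whereas a sort-merge join matches rows in memory (the work area in the PGA). Strictly speaking there is no driving table, although the smaller table can be regarded as one. A hash join only needs to place the driving table in the PGA, but a sort-merge join needs to place the result sets of both tables in the PGA.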

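The figure above outlines the merge join flow. If statistics and related information are accurate, the database automatically chooses the best execution plan.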
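The sort-then-merge idea can be sketched in Python as follows. This sketch covers the equality form of the join only (the inequality variant that Oracle also supports is not shown), and the `merge_join` function and sample rows are hypothetical, chosen just to show how two sorted inputs are walked in step.

```python
def merge_join(left, right, left_key, right_key):
    """Sort both inputs on the join key, then merge them in a single pass."""
    left = sorted(left, key=left_key)      # sort phase, one sort per input
    right = sorted(right, key=right_key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1                         # left key too small: advance left
        elif lk > rk:
            j += 1                         # right key too small: advance right
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole right run.
            while i < len(left) and left_key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


city = [(3, "city3", 2), (1, "city1", 1), (4, "city4", 3), (2, "city2", 1)]
country = [(2, "country2"), (1, "country1"), (4, "country4")]

merged = merge_join(city, country, lambda c: c[2], lambda k: k[0])
names = [(c[1], k[1]) for c, k in merged]
print(names)
```

Both inputs must be fully sorted before merging starts, which matches the point above that both result sets have to be held in the work area, not just one.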

Oracle example

col PLAN_TABLE_OUTPUT for a100
set lines 200 pages 999
explain plan for select a.city_name,b.country_name from city a,country b where a.country_id<b.country_id;

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1026867539

------------------------------------------------------------------------------------------------------
| Id  | Operation		      | Name	     | Rows  | Bytes |TempSpc| Cost (%CPU

   