PostgreSQL Table Join Methods (with Oracle Comparison Tests)
Posted by 瀚高PG实验室
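Author: 杨云龙, core member of 瀚高PG实验室 and senior database engineer, specializing in HGDB, PostgreSQL, Oracle, and other mainstream databases.
This article introduces the table join methods (Nested Loop / Hash Join / Merge Join). Database versions: Oracle 11.2.0.4 and PostgreSQL 13.1.
Environment setup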
-- Oracle 11.2.0.4
-- cities and countries
drop table country purge;
CREATE TABLE country (
country_id int primary key,
country_name VARCHAR(50) NOT NULL
);
drop table city purge;
CREATE TABLE city (
city_id int primary key,
city_name VARCHAR(50) NOT NULL,
country_id int NOT NULL
);
begin
  for i in 1 .. 10 loop
    insert into country values(i,'country'||i);
  end loop;
  commit;
end;
/
begin
  for i in 1 .. 10000 loop
    insert into city values(i,'city'||i,ceil(i/1000));
  end loop;
  commit;
end;
/
execute dbms_stats.gather_table_stats(ownname =>'SCOTT',tabname => 'CITY' ,estimate_percent=> 100 ,cascade => true);
execute dbms_stats.gather_table_stats(ownname =>'SCOTT',tabname => 'COUNTRY' ,estimate_percent=> 100 ,cascade => true);
-- PostgreSQL 13.1
drop table country;
CREATE TABLE country (
country_id integer primary key,
country_name text NOT NULL
);
drop table city;
CREATE TABLE city (
city_id integer primary key,
city_name text NOT NULL,
country_id integer NOT NULL
);
insert into country values (generate_series(1,10),'country'||generate_series(1,10));
insert into city values(generate_series(1,10000),'city'||generate_series(1,10000),ceil(random()*(10-1)+1));
analyze city;
analyze country;
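The two setups populate city.country_id differently: the Oracle script uses ceil(i/1000), while the PostgreSQL script uses ceil(random()*(10-1)+1). The following is a minimal Python sketch of both expressions, useful when comparing the plans later. Python's random module stands in for PostgreSQL's random() here, which is an assumption made purely for illustration.

```python
import math
import random
from collections import Counter

N_CITIES = 10_000

# Oracle setup: country_id = ceil(i / 1000), fully deterministic.
oracle_ids = Counter(math.ceil(i / 1000) for i in range(1, N_CITIES + 1))

# PostgreSQL setup: country_id = ceil(random() * (10 - 1) + 1).
# random.Random stands in for PostgreSQL's random(); both return values in [0, 1).
rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
pg_ids = Counter(math.ceil(rng.random() * (10 - 1) + 1) for _ in range(N_CITIES))

print("oracle:", sorted(oracle_ids.items()))
print("pg    :", sorted(pg_ids.items()))
```

In the deterministic case every country gets exactly 1000 cities. In the random case the expression before ceil lies in [1, 10), so country 1 is produced only when random() returns (almost) exactly 0; in practice the cities are spread unevenly over countries 2 through 10.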
Nested Loop
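The figure above illustrates the Nested Loop method. The algorithm: the driving table returns one row at a time and passes its join-column value to the driven table, so the driven table is scanned once for every row the driving table returns. The driven table is normally accessed through an index, by either a unique scan or a range scan. A nested loop is a good choice when the matching subset of the driven table is small, that is, for queries with small result sets; as a rule of thumb, more than about 10,000 rows counts as large, and the method becomes inefficient. Examples follow:
Oracle example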
SQL> explain plan for select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id=99;
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2738185913
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 93 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 93 | 2 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| CITY | 1 | 53 | 1 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN | SYS_C0018242 | 1 | | 1 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| COUNTRY | 1 | 40 | 1 (0)| 00:00:01 |
|* 5 | INDEX UNIQUE SCAN | SYS_C0018239 | 1 | | 0 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("CITY"."CITY_ID"=99)
5 - access("CITY"."COUNTRY_ID"="COUNTRY"."COUNTRY_ID")
18 rows selected.
Statement comparison
-- For the statement below, Oracle chooses a hash join with full table scans
explain plan for select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id>=10;
-- Force the nested loop method
explain plan for select /*+ leading(city) use_nl(country) */ city_name,country_name from city,country
where city.country_id=country.country_id and city.city_id>=10;
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 103883790
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9991 | 907K| 10005 (1)| 00:02:01 |
| 1 | NESTED LOOPS | | 9991 | 907K| 10005 (1)| 00:02:01 |
| 2 | NESTED LOOPS | | 9991 | 907K| 10005 (1)| 00:02:01 |
|* 3 | TABLE ACCESS FULL | CITY | 9991 | 517K| 11 (0)| 00:00:01 |
|* 4 | INDEX UNIQUE SCAN | SYS_C0018239 | 1 | | 0 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| COUNTRY | 1 | 40 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("CITY"."CITY_ID">=10)
4 - access("CITY"."COUNTRY_ID"="COUNTRY"."COUNTRY_ID")
Note
-----
- dynamic sampling used for this statement (level=2)
22 rows selected.
-- As shown above, the driven table is probed 9991 times, i.e. (10000-9), and resource consumption rises noticeably
PostgreSQL example
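The probe counts can be reproduced outside the database. The following is a minimal Python sketch, not Oracle's or PostgreSQL's implementation: a list stands in for the city table, a dict stands in for the unique index on country.country_id, and the function name nested_loop_join is made up for illustration. It also models the city_id = 99 case as a filtered scan, whereas the plan above reaches that row through an index.

```python
# Driving table: (city_id, city_name, country_id), mirroring the Oracle setup
# where country_id = ceil(city_id / 1000).
city = [(i, f"city{i}", -(-i // 1000)) for i in range(1, 10_001)]
# Driven table: a dict plays the role of the unique index on country_id.
country_index = {i: f"country{i}" for i in range(1, 11)}

def nested_loop_join(outer_rows, inner_index, outer_filter):
    """Return (joined rows, number of probes into the inner index)."""
    result = []
    probes = 0
    for city_id, city_name, country_id in outer_rows:
        if not outer_filter(city_id):
            continue                  # row rejected before the join
        probes += 1                   # one index lookup per surviving outer row
        country_name = inner_index.get(country_id)
        if country_name is not None:
            result.append((city_name, country_name))
    return result, probes

one_row, one_probe = nested_loop_join(city, country_index, lambda cid: cid == 99)
many_rows, many_probes = nested_loop_join(city, country_index, lambda cid: cid >= 10)
print(one_probe, many_probes)  # 1 9991
```

The first call probes the inner side once; the second probes it 9991 times, which is the behavior the forced nested loop plan shows.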
mydb=# explain analyze select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id=99;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.44..16.49 rows=1 width=40) (actual time=0.031..0.036 rows=1 loops=1)
-> Index Scan using city_pkey on city (cost=0.29..8.30 rows=1 width=12) (actual time=0.015..0.017 rows=1 loops=1)
Index Cond: (city_id = 99)
-> Index Scan using country_pkey on country (cost=0.15..8.17 rows=1 width=36) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: (country_id = city.country_id)
Planning Time: 0.621 ms
Execution Time: 0.098 ms
(7 rows)
-- Returning many rows: forcing a nested loop takes about 30 ms longer than the default hash join
mydb=# set enable_hashjoin=off;
SET
mydb=# set enable_mergejoin=off;
SET
mydb=# explain analyze select city_name,country_name from city,country where city.country_id=country.country_id and city.city_id>=10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.15..1934.29 rows=9991 width=40) (actual time=0.050..43.354 rows=9991 loops=1)
-> Seq Scan on city (cost=0.00..188.00 rows=9991 width=12) (actual time=0.025..4.870 rows=9991 loops=1)
Filter: (city_id >= 10)
Rows Removed by Filter: 9
-> Index Scan using country_pkey on country (cost=0.15..0.17 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=9991)
Index Cond: (country_id = city.country_id)
Planning Time: 0.393 ms
Execution Time: 45.366 ms
(8 rows)
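Notes on PG Nested Loop
When PostgreSQL executes a query with a nested loop, it can iterate over all entries in table 1 and, for each of them, over all entries in table 2, emitting a row whenever a pair of rows from table 1 and table 2 satisfies the filter condition. The nested loop is the only join algorithm PostgreSQL can use to process any join. For the statement below, for example, Oracle chooses a Merge Join while PostgreSQL chooses a Nested Loop; see the Merge Join section for details.
-- Even with a fairly large data volume, PG still chooses Nested Loop for the following statement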
explain select a.city_name,b.city_id from city a,city2 b where a.country_id<b.country_id;
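The reason a nested loop can handle any join condition is that it simply tests every pair of rows. A minimal Python sketch of that idea follows; the function name and the tiny row sets are hypothetical, and the predicate mirrors the a.country_id < b.country_id condition of the statement above.

```python
def nested_loop_theta_join(left, right, predicate):
    """Join two row lists on an arbitrary predicate by testing every pair."""
    for left_row in left:
        for right_row in right:
            if predicate(left_row, right_row):
                yield left_row, right_row

# Hypothetical rows: a = (city_name, country_id), b = (city_id, country_id).
a = [("city1", 1), ("city2", 2), ("city3", 3)]
b = [(101, 1), (102, 2), (103, 3)]

# Same shape as: a.country_id < b.country_id
pairs = [
    (left_row[0], right_row[0])
    for left_row, right_row in nested_loop_theta_join(
        a, b, lambda l_row, r_row: l_row[1] < r_row[1]
    )
]
print(pairs)  # [('city1', 102), ('city1', 103), ('city2', 103)]
```

The predicate is an ordinary function, so equality, inequality, or any other condition works; the price is that the number of comparisons is the product of the two input sizes.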
Hash Join
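Hash Join is a join method that relies mainly on hash computation to produce the result set when two tables are joined; it supports equality joins only. The algorithm: for an equi-join of two tables that returns a large amount of data, the smaller table is chosen as the driving table. The relevant columns of the driving table are read into the work area in the PGA (in PG, into memory bounded by work_mem), and a hash table is built by hashing the driving table's join column. The driven table is then read, its join column is hashed, and the hash table in the PGA is probed to find the matching rows. If the hash table is too large to be built in memory in one pass, it is split into several partitions that are written to temporary segments on disk; this adds write cost and reduces efficiency.
The figure above shows the general flow of a hash join. If statistics and related information are accurate, the database automatically chooses the best execution plan. When the ORDERED hint is used, the first table in the FROM clause is used to build the hash table.
Oracle example
-- country is the driving table; whichever join type is used (left/right/full), Oracle chooses the small table as the driving table.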
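The build and probe phases described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not either database's implementation: a dict plays the role of the in-memory hash table, and the function name hash_join and the generated rows are made up for the example.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: hash the smaller input, then probe it with the larger one."""
    # Build phase: one pass over the smaller (driving) input.
    table = defaultdict(list)
    for row in build_rows:
        table[build_key(row)].append(row)
    # Probe phase: one pass over the larger input, one hash lookup per row.
    for row in probe_rows:
        for match in table.get(probe_key(row), ()):
            yield row, match

country = [(i, f"country{i}") for i in range(1, 11)]               # small table
city = [(i, f"city{i}", -(-i // 1000)) for i in range(1, 10_001)]  # large table

joined = [
    (city_row[1], country_row[1])
    for city_row, country_row in hash_join(
        country, city, build_key=lambda r: r[0], probe_key=lambda r: r[2]
    )
]
print(len(joined), joined[0], joined[-1])
```

Each input is read exactly once, which is why this method suits large equi-joins; it also shows why only equality works, since a hash lookup can only find rows with the same key.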
SQL> explain plan for select city_name,country_name from city,country where city.country_id=country.country_id;
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 114462077
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10000 | 781K| 14 (0)| 00:00:01 |
|* 1 | HASH JOIN | | 10000 | 781K| 14 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| COUNTRY | 10 | 400 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| CITY | 10000 | 390K| 11 (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("CITY"."COUNTRY_ID"="COUNTRY"."COUNTRY_ID")
Note
-----
- dynamic sampling used for this statement (level=2)
19 rows selected.
PostgreSQL example
mydb=# explain analyze select city_name,country_name from city,country where city.country_id=country.country_id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Hash Join (cost=38.58..227.91 rows=10000 width=40) (actual time=0.185..12.086 rows=10000 loops=1)
Hash Cond: (city.country_id = country.country_id)
-> Seq Scan on city (cost=0.00..163.00 rows=10000 width=12) (actual time=0.026..3.735 rows=10000 loops=1)
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.064..0.067 rows=10 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on country (cost=0.00..22.70 rows=1270 width=36) (actual time=0.025..0.031 rows=10 loops=1)
Planning Time: 1.953 ms
Execution Time: 13.983 ms
(8 rows)
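The "Batches: 1" line in the plan above indicates that the hash table was processed as a single batch. When the build side does not fit in memory, the inputs are split into partitions, as described in the Hash Join introduction. The following Python sketch illustrates that partitioning idea only; in-memory lists stand in for the temporary files a real engine would write, and the function name and n_batches parameter are hypothetical.

```python
from collections import defaultdict
from operator import itemgetter

def partitioned_hash_join(build_rows, probe_rows, build_key, probe_key, n_batches):
    """Hash join that processes the inputs in n_batches partitions."""
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    # Partition both inputs with the same function so matching keys share a batch.
    # A real engine would write these partitions to temporary files on disk.
    for row in build_rows:
        build_parts[hash(build_key(row)) % n_batches].append(row)
    for row in probe_rows:
        probe_parts[hash(probe_key(row)) % n_batches].append(row)

    result = []
    for batch in range(n_batches):
        # Only one partition's hash table is held in memory at a time.
        table = defaultdict(list)
        for row in build_parts.get(batch, ()):
            table[build_key(row)].append(row)
        for row in probe_parts.get(batch, ()):
            for match in table.get(probe_key(row), ()):
                result.append((row, match))
    return result

country = [(i, f"country{i}") for i in range(1, 11)]
city = [(i, f"city{i}", -(-i // 1000)) for i in range(1, 10_001)]

one_batch = partitioned_hash_join(country, city, itemgetter(0), itemgetter(2), n_batches=1)
four_batches = partitioned_hash_join(country, city, itemgetter(0), itemgetter(2), n_batches=4)
print(len(one_batch), len(four_batches))
```

Both calls return the same 10000 joined rows; the multi-batch version trades extra passes over the data (and, in a real database, disk writes) for a smaller memory footprint.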
So that the small table can be used to build the hash table, the optimizer transforms the join; for example, a right join is rewritten as a Left Join.
mydb=# explain analyze select * from country right join city on city.country_id=country.country_id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=38.58..227.91 rows=10000 width=52) (actual time=0.087..11.883 rows=10000 loops=1)
Hash Cond: (city.country_id = country.country_id)
-> Seq Scan on city (cost=0.00..163.00 rows=10000 width=16) (actual time=0.020..3.035 rows=10000 loops=1)
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.044..0.047 rows=10 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on country (cost=0.00..22.70 rows=1270 width=36) (actual time=0.024..0.028 rows=10 loops=1)
Planning Time: 0.413 ms
Execution Time: 13.588 ms
(8 rows)
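country RIGHT JOIN city keeps every city row, which is the same result as city LEFT JOIN country, so the planner is free to hash the small country table and probe it with city. A minimal Python sketch of a hash left join follows; the function name and the three sample city rows are hypothetical, with one city deliberately pointing at a country that does not exist.

```python
def hash_left_join(outer_rows, inner_rows, outer_key, inner_key):
    """Left join: keep every outer row; hash the inner (smaller) input."""
    table = {}
    for row in inner_rows:
        table.setdefault(inner_key(row), []).append(row)
    for row in outer_rows:
        matches = table.get(outer_key(row))
        if matches:
            for match in matches:
                yield row, match
        else:
            yield row, None  # no partner: the inner side is padded with NULL

country = [(1, "country1"), (2, "country2")]
# Hypothetical city rows; country_id 9 has no matching country.
city = [(1, "city1", 1), (2, "city2", 2), (3, "city3", 9)]

rows = list(hash_left_join(city, country, lambda r: r[2], lambda r: r[0]))
print(rows)
```

The unmatched city is still returned, paired with None, which corresponds to the NULL-extended row of an outer join.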
Merge Join
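Merge Join is a join method that uses a sort operation and a merge operation to produce the result set when two tables are joined. Sort-merge mainly handles non-equality joins. The algorithm: first sort each of the two tables on the join column, then merge the sorted results. Whereas a nested loop matches rows through the driven table's index, a sort-merge join matches rows in memory (the work area in the PGA). Strictly speaking there is no driving table, although the smaller table can be regarded as one. A HASH JOIN only needs to place the driving table in the PGA, but a sort-merge join needs to place the result sets of both tables in the PGA.
The figure above shows the general flow of a merge join. If statistics and related information are accurate, the database automatically chooses the best execution plan.
Oracle example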
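To make the sort-then-merge idea concrete for a non-equality condition, here is a minimal Python sketch for a "less than" join. It is an illustration only, not Oracle's implementation; the function name sort_merge_less_than and the sample rows are hypothetical.

```python
def sort_merge_less_than(left, right, left_key, right_key):
    """Return all (left_row, right_row) pairs with left_key < right_key."""
    # Sort phase: each input is sorted independently on its join column.
    left_sorted = sorted(left, key=left_key)
    right_sorted = sorted(right, key=right_key)

    result = []
    start = 0  # index of the first right row whose key exceeds the current left key
    # Merge phase: because both inputs are sorted, `start` only ever moves forward.
    for left_row in left_sorted:
        while (start < len(right_sorted)
               and right_key(right_sorted[start]) <= left_key(left_row)):
            start += 1
        for right_row in right_sorted[start:]:
            result.append((left_row, right_row))
    return result

# Hypothetical, deliberately unsorted rows.
city = [("city3", 3), ("city1", 1), ("city2", 2)]               # (city_name, country_id)
country = [(2, "country2"), (3, "country3"), (1, "country1")]   # (country_id, country_name)

pairs = [
    (city_row[0], country_row[1])
    for city_row, country_row in sort_merge_less_than(
        city, country, lambda r: r[1], lambda r: r[0]
    )
]
print(pairs)  # [('city1', 'country2'), ('city1', 'country3'), ('city2', 'country3')]
```

Once both sides are sorted, the matching rows for each left row form a contiguous tail of the right side, so no index and no pairwise test of non-matching rows is needed; both sorted inputs do have to be held, which is the memory cost noted above.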
col PLAN_TABLE_OUTPUT for a100
set lines 200 pages 999
explain plan for select a.city_name,b.country_name from city a,country b where a.country_id<b.country_id;
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1026867539
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU