Why does this more complex query perform better than the simpler one?
Posted: 2021-03-17 08:43:22
I have a shipment order table that contains two JSON arrays of objects: declared packages and actual packages. What I want is the sum of the weights of all the declared packages and of all the actual packages.
The simpler SQL:
explain analyse
select
    id,
    sum((tbl.decl)::double precision) as total_gross_weight,
    sum((tbl.act)::double precision) as total_actual_weight
from (
    select
        id,
        json_array_elements(declared_packages) ->> 'weight' as decl,
        json_array_elements(actual_packages) ->> 'weight' as act
    from "shipment-order" so
) tbl
group by id
order by total_gross_weight desc;
returns:
Sort  (cost=162705.01..163957.01 rows=500800 width=32) (actual time=2350.293..2350.850 rows=4564 loops=1)
  Sort Key: (sum(((((json_array_elements(so.declared_packages)) ->> 'weight'::text)))::double precision)) DESC
  Sort Method: quicksort  Memory: 543kB
  ->  GroupAggregate  (cost=88286.58..103310.58 rows=500800 width=32) (actual time=2085.907..2348.947 rows=4564 loops=1)
        Group Key: so.id
        ->  Sort  (cost=88286.58..89538.58 rows=500800 width=80) (actual time=2085.895..2209.717 rows=1117847 loops=1)
              Sort Key: so.id
              Sort Method: external merge  Disk: 28520kB
              ->  Result  (cost=0.00..13615.16 rows=500800 width=80) (actual time=0.063..1744.941 rows=1117847 loops=1)
                    ->  ProjectSet  (cost=0.00..3599.16 rows=500800 width=80) (actual time=0.060..856.075 rows=1117847 loops=1)
                          ->  Seq Scan on "shipment-order" so  (cost=0.00..1045.08 rows=5008 width=233) (actual time=0.023..6.551 rows=5249 loops=1)
Planning time: 0.379 ms
Execution time: 2359.042 ms
Whereas the more complex SQL, which basically goes through several stages of cross joins and left joins back against the original table:
explain analyse
select
    so.id,
    total_gross_weight,
    total_actual_weight
from "shipment-order" so
left join (
    select
        so_1.id,
        sum((d_packages.value ->> 'weight'::text)::double precision) as total_gross_weight
    from "shipment-order" so_1,
         lateral json_array_elements(so_1.declared_packages) d_packages(value)
    group by so_1.id
) declared_packages_info on so.id = declared_packages_info.id
left join (
    select
        so_1.id,
        sum((a_packages.value ->> 'weight'::text)::double precision) as total_actual_weight
    from "shipment-order" so_1,
         lateral json_array_elements(so_1.actual_packages) a_packages(value)
    group by so_1.id
) actual_packages_info on so.id = actual_packages_info.id
order by total_gross_weight desc;
performs better:
Sort  (cost=35509.14..35521.66 rows=5008 width=32) (actual time=1823.049..1823.375 rows=5249 loops=1)
  Sort Key: declared_packages_info.total_gross_weight DESC
  Sort Method: quicksort  Memory: 575kB
  ->  Hash Left Join  (cost=34967.97..35201.40 rows=5008 width=32) (actual time=1819.214..1822.000 rows=5249 loops=1)
        Hash Cond: (so.id = actual_packages_info.id)
        ->  Hash Left Join  (cost=17484.13..17704.40 rows=5008 width=24) (actual time=1805.038..1806.996 rows=5249 loops=1)
              Hash Cond: (so.id = declared_packages_info.id)
              ->  Index Only Scan using "PK_bcd4a660acbe66f71749270d38a" on "shipment-order" so  (cost=0.28..207.40 rows=5008 width=16) (actual time=0.032..0.695 rows=5249 loops=1)
                    Heap Fetches: 146
              ->  Hash  (cost=17421.24..17421.24 rows=5008 width=24) (actual time=1804.955..1804.957 rows=4553 loops=1)
                    Buckets: 8192  Batches: 1  Memory Usage: 312kB
                    ->  Subquery Scan on declared_packages_info  (cost=17321.08..17421.24 rows=5008 width=24) (actual time=1802.980..1804.261 rows=4553 loops=1)
                          ->  HashAggregate  (cost=17321.08..17371.16 rows=5008 width=24) (actual time=1802.979..1803.839 rows=4553 loops=1)
                                Group Key: so_1.id
                                ->  Nested Loop  (cost=0.00..11061.08 rows=500800 width=48) (actual time=0.033..902.972 rows=1117587 loops=1)
                                      ->  Seq Scan on "shipment-order" so_1  (cost=0.00..1045.08 rows=5008 width=149) (actual time=0.009..4.149 rows=5249 loops=1)
                                      ->  Function Scan on json_array_elements d_packages  (cost=0.00..1.00 rows=100 width=32) (actual time=0.121..0.145 rows=213 loops=5249)
        ->  Hash  (cost=17421.24..17421.24 rows=5008 width=24) (actual time=14.158..14.160 rows=1362 loops=1)
              Buckets: 8192  Batches: 1  Memory Usage: 138kB
              ->  Subquery Scan on actual_packages_info  (cost=17321.08..17421.24 rows=5008 width=24) (actual time=13.550..13.904 rows=1362 loops=1)
                    ->  HashAggregate  (cost=17321.08..17371.16 rows=5008 width=24) (actual time=13.549..13.783 rows=1362 loops=1)
                          Group Key: so_1_1.id
                          ->  Nested Loop  (cost=0.00..11061.08 rows=500800 width=48) (actual time=0.036..9.922 rows=1837 loops=1)
                                ->  Seq Scan on "shipment-order" so_1_1  (cost=0.00..1045.08 rows=5008 width=100) (actual time=0.008..4.161 rows=5249 loops=1)
                                ->  Function Scan on json_array_elements a_packages  (cost=0.00..1.00 rows=100 width=32) (actual time=0.001..0.001 rows=0 loops=5249)
Planning time: 0.210 ms
Execution time: 1824.286 ms
Should I use the more complex query, or should I try to optimize the simpler one? I can see that the simple query spends a long time in an external merge sort...
【Comments】:
What version of PostgreSQL are you using?
PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by x86_64-unknown-linux-gnu-gcc (GCC) 4.9.4, 64-bit

【Answer 1】:
You can do two things to speed up the simple query:
- Don't use a big jsonb; store weight in a regular table column instead (a hedged schema sketch follows this list)
- Increase work_mem until you get the cheaper hash aggregate
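A minimal sketch of the first suggestion. The child table, its name, the uuid key type, and the kind column are illustrative assumptions, not part of the original schema:

-- Hypothetical normalized layout: one row per package instead of a JSON array.
create table shipment_package (
    shipment_order_id uuid not null references "shipment-order"(id),
    kind              text not null check (kind in ('declared', 'actual')),
    weight            double precision not null
);

-- The aggregation then needs no JSON parsing and no row explosion:
select
    shipment_order_id as id,
    sum(weight) filter (where kind = 'declared') as total_gross_weight,
    sum(weight) filter (where kind = 'actual')   as total_actual_weight
from shipment_package
group by shipment_order_id
order by total_gross_weight desc;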
【Discussion】:
Thanks for your answer. Changing the schema would be a bit difficult, and so would increasing work_mem, since I don't have admin access to the database. Is there an indexing strategy I could implement to work around this?
If you need all the data in the table, an index doesn't help (much), so no, an index won't help here either. But you can change work_mem with SET in the current session or transaction, without needing any special privileges.
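A minimal sketch of that; the 64MB value is illustrative, so raise it until EXPLAIN ANALYSE no longer shows the external merge sort on disk:

-- Session-scoped; reverts when the connection closes:
set work_mem = '64MB';

-- Or scoped to a single transaction:
begin;
set local work_mem = '64MB';
-- ... run the simple aggregate query here ...
commit;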
【Answer 2】:
Assuming "id" is a primary or unique key, you can keep the simpler query and speed it up with a helper function: process each row as a single unit instead of exploding, pooling, and re-aggregating the elements.
create function sum_weigh(json) returns double precision language sql as $$
select sum((t->>'weight')::double precision) from json_array_elements($1) f(t)
$$ immutable parallel safe;
select id, sum_weigh(declared_packages), sum_weigh(actual_packages) from "shipment-order";
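A usage sketch that reproduces the output shape of the original queries; the aliases and ordering are carried over from the question:

select
    id,
    sum_weigh(declared_packages) as total_gross_weight,
    sum_weigh(actual_packages)   as total_actual_weight
from "shipment-order"
order by total_gross_weight desc;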
【Discussion】:
Thank you. The sum function turned out to be both faster and simpler than the simple query. It still hasn't managed to beat the complex query, though; I'll keep testing and see.