Why does this more complex query perform better than the simpler one?

Posted: 2021-03-17 08:43:22

I have a shipment order table containing two JSON arrays of objects: declared packages and actual packages. What I want is, for every order, the sum of the weights of all declared packages and of all actual packages.
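For context, the queries below imply a table shaped roughly like this (a sketch only; the actual column types and JSON layout are assumptions, not shown in the question):

```sql
-- Hypothetical shape of "shipment-order"; types and sample values are guesses.
create table "shipment-order" (
    id                uuid primary key,
    declared_packages json,  -- e.g. [{"weight": "1.5"}, {"weight": "2.0"}]
    actual_packages   json   -- same layout as declared_packages
);
```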

The simpler SQL:

explain analyse select
    id,
    sum((tbl.decl)::double precision) as total_gross_weight,
    sum((tbl.act)::double precision) as total_actual_weight
from
    (
    select
        id,
        json_array_elements(declared_packages)->> 'weight' as decl,
        json_array_elements(actual_packages)->> 'weight' as act
    from
        "shipment-order" so) tbl
group by
    id
order by total_gross_weight desc

returns:

Sort  (cost=162705.01..163957.01 rows=500800 width=32) (actual time=2350.293..2350.850 rows=4564 loops=1)
  Sort Key: (sum(((((json_array_elements(so.declared_packages)) ->> 'weight'::text)))::double precision)) DESC
  Sort Method: quicksort  Memory: 543kB
  ->  GroupAggregate  (cost=88286.58..103310.58 rows=500800 width=32) (actual time=2085.907..2348.947 rows=4564 loops=1)
        Group Key: so.id
        ->  Sort  (cost=88286.58..89538.58 rows=500800 width=80) (actual time=2085.895..2209.717 rows=1117847 loops=1)
              Sort Key: so.id
              Sort Method: external merge  Disk: 28520kB
              ->  Result  (cost=0.00..13615.16 rows=500800 width=80) (actual time=0.063..1744.941 rows=1117847 loops=1)
                    ->  ProjectSet  (cost=0.00..3599.16 rows=500800 width=80) (actual time=0.060..856.075 rows=1117847 loops=1)
                          ->  Seq Scan on "shipment-order" so  (cost=0.00..1045.08 rows=5008 width=233) (actual time=0.023..6.551 rows=5249 loops=1)
Planning time: 0.379 ms
Execution time: 2359.042 ms

while the more complex SQL, which essentially left joins the original table, in several stages, against subqueries that cross join it with its unnested JSON arrays:

explain analyse
select
    so.id,
    total_gross_weight,
    total_actual_weight
from
    ("shipment-order" so
left join (
    select
        so_1.id,
        sum((d_packages.value ->> 'weight'::text)::double precision) as total_gross_weight
    from
        "shipment-order" so_1,
        lateral json_array_elements(so_1.declared_packages) d_packages(value)
    group by
        so_1.id) declared_packages_info on
    so.id = declared_packages_info.id
left join (
    select
        so_1.id,
        sum((a_packages.value ->> 'weight'::text)::double precision) as total_actual_weight
    from
        "shipment-order" so_1,
        lateral json_array_elements(so_1.actual_packages) a_packages(value)
    group by
        so_1.id) actual_packages_info on
    so.id = actual_packages_info.id)
order by
    total_gross_weight desc

performs better:

Sort  (cost=35509.14..35521.66 rows=5008 width=32) (actual time=1823.049..1823.375 rows=5249 loops=1)
  Sort Key: declared_packages_info.total_gross_weight DESC
  Sort Method: quicksort  Memory: 575kB
  ->  Hash Left Join  (cost=34967.97..35201.40 rows=5008 width=32) (actual time=1819.214..1822.000 rows=5249 loops=1)
        Hash Cond: (so.id = actual_packages_info.id)
        ->  Hash Left Join  (cost=17484.13..17704.40 rows=5008 width=24) (actual time=1805.038..1806.996 rows=5249 loops=1)
              Hash Cond: (so.id = declared_packages_info.id)
              ->  Index Only Scan using "PK_bcd4a660acbe66f71749270d38a" on "shipment-order" so  (cost=0.28..207.40 rows=5008 width=16) (actual time=0.032..0.695 rows=5249 loops=1)
                    Heap Fetches: 146
              ->  Hash  (cost=17421.24..17421.24 rows=5008 width=24) (actual time=1804.955..1804.957 rows=4553 loops=1)
                    Buckets: 8192  Batches: 1  Memory Usage: 312kB
                    ->  Subquery Scan on declared_packages_info  (cost=17321.08..17421.24 rows=5008 width=24) (actual time=1802.980..1804.261 rows=4553 loops=1)
                          ->  HashAggregate  (cost=17321.08..17371.16 rows=5008 width=24) (actual time=1802.979..1803.839 rows=4553 loops=1)
                                Group Key: so_1.id
                                ->  Nested Loop  (cost=0.00..11061.08 rows=500800 width=48) (actual time=0.033..902.972 rows=1117587 loops=1)
                                      ->  Seq Scan on "shipment-order" so_1  (cost=0.00..1045.08 rows=5008 width=149) (actual time=0.009..4.149 rows=5249 loops=1)
                                      ->  Function Scan on json_array_elements d_packages  (cost=0.00..1.00 rows=100 width=32) (actual time=0.121..0.145 rows=213 loops=5249)
        ->  Hash  (cost=17421.24..17421.24 rows=5008 width=24) (actual time=14.158..14.160 rows=1362 loops=1)
              Buckets: 8192  Batches: 1  Memory Usage: 138kB
              ->  Subquery Scan on actual_packages_info  (cost=17321.08..17421.24 rows=5008 width=24) (actual time=13.550..13.904 rows=1362 loops=1)
                    ->  HashAggregate  (cost=17321.08..17371.16 rows=5008 width=24) (actual time=13.549..13.783 rows=1362 loops=1)
                          Group Key: so_1_1.id
                          ->  Nested Loop  (cost=0.00..11061.08 rows=500800 width=48) (actual time=0.036..9.922 rows=1837 loops=1)
                                ->  Seq Scan on "shipment-order" so_1_1  (cost=0.00..1045.08 rows=5008 width=100) (actual time=0.008..4.161 rows=5249 loops=1)
                                ->  Function Scan on json_array_elements a_packages  (cost=0.00..1.00 rows=100 width=32) (actual time=0.001..0.001 rows=0 loops=5249)
Planning time: 0.210 ms
Execution time: 1824.286 ms

Should I use the more complex query, or should I try to optimize the simpler one? I see that the simple query spends a long time in that external merge sort...

Comments:

What version of PostgreSQL are you using?

PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by x86_64-unknown-linux-gnu-gcc (GCC) 4.9.4, 64-bit

Answer 1:

There are two things you can do to speed up the simple query:

Don't use big jsonb values; store weight in regular table columns instead

Increase work_mem until you get a cheaper hash aggregate
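As a sketch, the two suggestions might look like this (the column names and the work_mem value are illustrative, not prescribed by the answer):

```sql
-- 1) Raise work_mem for the current session only; pick a value large
--    enough that the planner switches from the external merge sort
--    to an in-memory HashAggregate.
SET work_mem = '64MB';

-- 2) Alternatively, keep the totals in plain columns maintained at
--    write time, so no JSON unnesting is needed when querying.
--    (Hypothetical column names.)
alter table "shipment-order"
    add column total_gross_weight double precision,
    add column total_actual_weight double precision;
```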

Comments:

Thanks for your answer. Changing the schema would be a bit difficult, and so would increasing work_mem, since I don't have admin access to the database. Is there an indexing strategy I could apply to solve this?

If you need all the data in the table, an index doesn't help (much), so no, an index won't help here either. But you can change work_mem with SET in your current session/transaction without any special privileges.

Answer 2:

Assuming "id" is a primary or unique key, you can speed this up using the simpler query plus a helper function. Process each row as a unit, rather than exploding, pooling, and re-aggregating.

-- Sum the "weight" fields of a JSON array of objects.
create function sum_weigh(json) returns double precision language sql as $$
    select sum((t->>'weight')::double precision) from json_array_elements($1) f(t)
$$ immutable parallel safe;

select id, sum_weigh(declared_packages), sum_weigh(actual_packages) from "shipment-order";

Comments:

Thanks. The sum function turned out to be both faster and simpler than the simple query. Still haven't managed to beat the complex query, though. I'll test some more and see.
