Why does this more complex query perform better than the simpler one?
Posted: 2021-03-17 08:43:22
I have a shipment order table that contains two JSON arrays of objects: declared packages and actual packages. What I want is the sum of the weights of all the declared packages and of all the actual packages.
The simpler SQL:
explain analyse
select
    id,
    sum((tbl.decl)::double precision) as total_gross_weight,
    sum((tbl.act)::double precision) as total_actual_weight
from (
    select
        id,
        json_array_elements(declared_packages) ->> 'weight' as decl,
        json_array_elements(actual_packages) ->> 'weight' as act
    from "shipment-order" so
) tbl
group by id
order by total_gross_weight desc;
returns:
Sort  (cost=162705.01..163957.01 rows=500800 width=32) (actual time=2350.293..2350.850 rows=4564 loops=1)
  Sort Key: (sum(((((json_array_elements(so.declared_packages)) ->> 'weight'::text)))::double precision)) DESC
  Sort Method: quicksort  Memory: 543kB
  ->  GroupAggregate  (cost=88286.58..103310.58 rows=500800 width=32) (actual time=2085.907..2348.947 rows=4564 loops=1)
        Group Key: so.id
        ->  Sort  (cost=88286.58..89538.58 rows=500800 width=80) (actual time=2085.895..2209.717 rows=1117847 loops=1)
              Sort Key: so.id
              Sort Method: external merge  Disk: 28520kB
              ->  Result  (cost=0.00..13615.16 rows=500800 width=80) (actual time=0.063..1744.941 rows=1117847 loops=1)
                    ->  ProjectSet  (cost=0.00..3599.16 rows=500800 width=80) (actual time=0.060..856.075 rows=1117847 loops=1)
                          ->  Seq Scan on "shipment-order" so  (cost=0.00..1045.08 rows=5008 width=233) (actual time=0.023..6.551 rows=5249 loops=1)
Planning time: 0.379 ms
Execution time: 2359.042 ms
Whereas the more complex SQL, which basically goes through several stages of cross joins and left joins back against the original table:
explain analyse
select
    so.id,
    total_gross_weight,
    total_actual_weight
from "shipment-order" so
left join (
    select
        so_1.id,
        sum((d_packages.value ->> 'weight'::text)::double precision) as total_gross_weight
    from "shipment-order" so_1,
         lateral json_array_elements(so_1.declared_packages) d_packages(value)
    group by so_1.id
) declared_packages_info on so.id = declared_packages_info.id
left join (
    select
        so_1.id,
        sum((a_packages.value ->> 'weight'::text)::double precision) as total_actual_weight
    from "shipment-order" so_1,
         lateral json_array_elements(so_1.actual_packages) a_packages(value)
    group by so_1.id
) actual_packages_info on so.id = actual_packages_info.id
order by total_gross_weight desc;
performs better:
Sort  (cost=35509.14..35521.66 rows=5008 width=32) (actual time=1823.049..1823.375 rows=5249 loops=1)
  Sort Key: declared_packages_info.total_gross_weight DESC
  Sort Method: quicksort  Memory: 575kB
  ->  Hash Left Join  (cost=34967.97..35201.40 rows=5008 width=32) (actual time=1819.214..1822.000 rows=5249 loops=1)
        Hash Cond: (so.id = actual_packages_info.id)
        ->  Hash Left Join  (cost=17484.13..17704.40 rows=5008 width=24) (actual time=1805.038..1806.996 rows=5249 loops=1)
              Hash Cond: (so.id = declared_packages_info.id)
              ->  Index Only Scan using "PK_bcd4a660acbe66f71749270d38a" on "shipment-order" so  (cost=0.28..207.40 rows=5008 width=16) (actual time=0.032..0.695 rows=5249 loops=1)
                    Heap Fetches: 146
              ->  Hash  (cost=17421.24..17421.24 rows=5008 width=24) (actual time=1804.955..1804.957 rows=4553 loops=1)
                    Buckets: 8192  Batches: 1  Memory Usage: 312kB
                    ->  Subquery Scan on declared_packages_info  (cost=17321.08..17421.24 rows=5008 width=24) (actual time=1802.980..1804.261 rows=4553 loops=1)
                          ->  HashAggregate  (cost=17321.08..17371.16 rows=5008 width=24) (actual time=1802.979..1803.839 rows=4553 loops=1)
                                Group Key: so_1.id
                                ->  Nested Loop  (cost=0.00..11061.08 rows=500800 width=48) (actual time=0.033..902.972 rows=1117587 loops=1)
                                      ->  Seq Scan on "shipment-order" so_1  (cost=0.00..1045.08 rows=5008 width=149) (actual time=0.009..4.149 rows=5249 loops=1)
                                      ->  Function Scan on json_array_elements d_packages  (cost=0.00..1.00 rows=100 width=32) (actual time=0.121..0.145 rows=213 loops=5249)
        ->  Hash  (cost=17421.24..17421.24 rows=5008 width=24) (actual time=14.158..14.160 rows=1362 loops=1)
              Buckets: 8192  Batches: 1  Memory Usage: 138kB
              ->  Subquery Scan on actual_packages_info  (cost=17321.08..17421.24 rows=5008 width=24) (actual time=13.550..13.904 rows=1362 loops=1)
                    ->  HashAggregate  (cost=17321.08..17371.16 rows=5008 width=24) (actual time=13.549..13.783 rows=1362 loops=1)
                          Group Key: so_1_1.id
                          ->  Nested Loop  (cost=0.00..11061.08 rows=500800 width=48) (actual time=0.036..9.922 rows=1837 loops=1)
                                ->  Seq Scan on "shipment-order" so_1_1  (cost=0.00..1045.08 rows=5008 width=100) (actual time=0.008..4.161 rows=5249 loops=1)
                                ->  Function Scan on json_array_elements a_packages  (cost=0.00..1.00 rows=100 width=32) (actual time=0.001..0.001 rows=0 loops=5249)
Planning time: 0.210 ms
Execution time: 1824.286 ms
Should I use the more complex query, or should I try to optimize the simpler one? I can see that the simple query spends a long time in an external merge sort...
【Comments】:
What version of PostgreSQL are you using?
PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by x86_64-unknown-linux-gnu-gcc (GCC) 4.9.4, 64-bit

【Answer 1】:
You can do two things to speed up the simple query:
- Don't use a big jsonb; store weight in a regular table column instead (a hedged schema sketch follows this list)
- Increase work_mem until you get the cheaper hash aggregate
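A minimal sketch of the first suggestion. The child table, its name, the uuid key type, and the kind column are illustrative assumptions, not part of the original schema:

-- Hypothetical normalized layout: one row per package instead of a JSON array.
create table shipment_package (
    shipment_order_id uuid not null references "shipment-order"(id),
    kind              text not null check (kind in ('declared', 'actual')),
    weight            double precision not null
);

-- The aggregation then needs no JSON parsing and no row explosion:
select
    shipment_order_id as id,
    sum(weight) filter (where kind = 'declared') as total_gross_weight,
    sum(weight) filter (where kind = 'actual')   as total_actual_weight
from shipment_package
group by shipment_order_id
order by total_gross_weight desc;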
【Discussion】:
Thanks for your answer. Changing the schema would be a bit difficult, and so would increasing work_mem, since I don't have admin access to the database. Is there an indexing strategy I could implement to work around this?
If you need all the data in the table, an index doesn't help (much), so no, an index won't help here either. But you can change work_mem with SET in the current session or transaction, without needing any special privileges.
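A minimal sketch of that; the 64MB value is illustrative, so raise it until EXPLAIN ANALYSE no longer shows the external merge sort on disk:

-- Session-scoped; reverts when the connection closes:
set work_mem = '64MB';

-- Or scoped to a single transaction:
begin;
set local work_mem = '64MB';
-- ... run the simple aggregate query here ...
commit;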
【Answer 2】:
Assuming "id" is a primary or unique key, you can keep the simpler query and speed it up with a helper function: process each row as a single unit instead of exploding, pooling, and re-aggregating the elements.
create function sum_weigh(json) returns double precision language sql as $$
select sum((t->>'weight')::double precision) from json_array_elements($1) f(t)
$$ immutable parallel safe;
select id, sum_weigh(declared_packages), sum_weigh(actual_packages) from "shipment-order";
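A usage sketch that reproduces the output shape of the original queries; the aliases and ordering are carried over from the question:

select
    id,
    sum_weigh(declared_packages) as total_gross_weight,
    sum_weigh(actual_packages)   as total_actual_weight
from "shipment-order"
order by total_gross_weight desc;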
【Discussion】:
Thank you. The sum function turned out to be both faster and simpler than the simple query. It still hasn't managed to beat the complex query, though; I'll keep testing and see.