How to optimize PostgreSQL COUNT GROUP BY query?

【Posted】: 2015-10-18 16:13:24

【Question】:

I have a table, parameters_products, with about 300,000 records. Is it possible to optimize this query?

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
    (SELECT product_id
     FROM parameters_products
     WHERE parameter_id IN ('2'))
GROUP BY parameter_id

Query output:

2;274669

EXPLAIN ANALYZE VERBOSE... output:

HashAggregate  (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
  Output: parameters_products.parameter_id, count(parameters_products.product_id)
  Group Key: parameters_products.parameter_id
  ->  Hash Semi Join  (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
        Output: parameters_products.parameter_id, parameters_products.product_id
        Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
        ->  Seq Scan on public.parameters_products  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
              Output: parameters_products.parameter_id, parameters_products.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
              Output: parameters_products_1.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products parameters_products_1  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
                    Output: parameters_products_1.product_id
                    Filter: (parameters_products_1.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms

I am on PostgreSQL 9.4.1, and VACUUM is enabled.

I just tried this query, but it is also slow:

SELECT pp1.parameter_id,
       count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id

--

HashAggregate  (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
  Output: pp1.parameter_id, count(pp1.product_id)
  Group Key: pp1.parameter_id
  ->  Hash Join  (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
        Output: pp1.parameter_id, pp1.product_id
        Hash Cond: (pp1.product_id = pp2.product_id)
        ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
              Output: pp1.parameter_id, pp1.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
              Output: pp2.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products pp2  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
                    Output: pp2.product_id
                    Filter: (pp2.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms

Indexes:

CREATE INDEX parameters_products_parameter_id_idx
  ON parameters_products
  USING btree
  (parameter_id);

CREATE INDEX parameters_products_product_id_idx
  ON parameters_products
  USING btree
  (product_id);

CREATE INDEX parameters_products_product_id_parameter_id_idx
  ON parameters_products
  USING btree
  (product_id, parameter_id);

EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id

-

Hash Left Join  (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
  Output: pp1.parameter_id
  Hash Cond: (pp1.product_id = pp2.product_id)
  ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
        Output: pp1.parameter_id, pp1.product_id
  ->  Hash  (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
        Output: pp2.product_id
        Buckets: 16384  Batches: 4  Memory Usage: 2644kB
        ->  Seq Scan on public.parameters_products pp2  (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
              Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms

SET enable_seqscan = OFF;

This reduced the execution time, but not significantly.
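
A planner toggle like this is usually scoped to a single transaction so that it does not stay in effect for the rest of the session. The following is a minimal sketch of that pattern, reusing the first query from this question. The BEGIN/ROLLBACK wrapper and the use of SET LOCAL are assumptions added for illustration; the question does not say how the setting was applied.

```sql
-- Sketch only: scope the planner toggle to one transaction.
BEGIN;

-- SET LOCAL reverts automatically when the transaction ends.
SET LOCAL enable_seqscan = off;

EXPLAIN ANALYZE
SELECT parameter_id AS id,
       COUNT(product_id) AS count
FROM parameters_products
WHERE product_id IN
    (SELECT product_id
     FROM parameters_products
     WHERE parameter_id = 2)
GROUP BY parameter_id;

-- Nothing was modified, so simply end the transaction.
ROLLBACK;
```

Comparing this plan with the one produced under the default settings shows whether the planner switched from sequential scans to the existing indexes, and what that change cost or saved.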

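【Comments】:

Replace WHERE IN with JOIN.

@lad2025 Execution time: 2361.735 ms

Isn't 2361 ms just 2.36 seconds? For processing 300k records, isn't that already quite good?

Note the ratio of Rows Removed by Filter to the total number of rows.

It looks like no matter what you do, about 90% of the records have to be counted. That is going to take some effort. If optimizing this type of query really matters, you may need to implement triggers that pre-aggregate the data.

【Answer 1】: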

The first thing I would try is replacing IN with EXISTS:

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
              FROM parameters_products pp2
              WHERE pp2.product_id = pp.product_id AND
                    pp2.parameter_id = 2
             ) 
GROUP BY parameter_id;

Also, make sure you have an index on parameters_products(product_id, parameter_id).

Another idea is to use a window function:

select parameter_id, count(*)
from (select pp.*,
             sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
      from parameters_products pp
     ) pp
where cnt2 > 0
group by parameter_id;

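【Comments】:

The index already exists: CREATE INDEX parameters_products_product_id_parameter_id_idx ON parameters_products USING btree (product_id, parameter_id); The first query gives "Execution time: 2239.944 ms" and the second gives "Execution time: 2526.269 ms".

You could try an index with the columns in the opposite order... INDEX on parameters_products(parameter_id, product_id)

@wildplasser Still the same time.

Using your first query together with SET enable_seqscan = OFF; reduced the time! Thanks.

【Answer 2】: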

Try:

SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
  parameters_products pp2
ON
  pp2.parameter_id = 2
AND
  pp1.product_id = pp2.product_id
GROUP BY
  pp1.parameter_id

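Moving the filter condition from the WHERE clause to the ON clause reduces the total number of rows involved in the JOIN. Hopefully this shows the same savings you saw in the comments and brings the execution time under 1 second.

【Comments】:

Sorry, this is the wrong query. The JOIN makes no sense. The result will be the same as: SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM parameters_products pp1 GROUP BY pp1.parameter_id

@nanolab I have updated my answer to use an inner join instead of a left join, so that it reproduces the results of the WHERE clause exactly. My apologies for overlooking that.

Yes, it is correct now. But "Execution time: 2249.975 ms". It seems it cannot be optimized.

@nanolab: One more test. Try SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM (SELECT product_id FROM parameters_products WHERE parameter_id = 2) AS pp2 JOIN parameters_products pp1 ON pp1.product_id = pp2.product_id GROUP BY pp1.parameter_id This makes the smaller table the driving table of the JOIN and may get us back to the sub-second performance we saw earlier.

Execution time: 2240.112 ms

【Answer 3】: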

RhodiumToad in #postgresql on freenode recommended the following window function. Note that it differs from Gordon Linoff's window function in that it uses bool_or instead of sum(case...):

SELECT parameter_id, count(product_id)
FROM
  (SELECT *, bool_or(parameter_id = 2)
   OVER
   (partition by product_id) AS matching
   FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

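RhodiumToad also mentioned that the work_mem parameter may be too small for any query of this size, whether it uses a window function, a join, or a subselect. He suggested increasing work_mem so that the sort routines do not spill to disk.

If any of this helps you, all credit goes to RhodiumToad.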
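
One way to check whether work_mem is the limiting factor is to raise it for a single transaction and read the resulting plan. The sketch below assumes the window-function query above; the value 64MB is an arbitrary illustrative figure, not a recommendation from RhodiumToad or from this thread.

```sql
-- Sketch only: 64MB is an arbitrary value chosen for illustration.
BEGIN;

-- SET LOCAL only takes effect inside a transaction block
-- and reverts automatically at COMMIT or ROLLBACK.
SET LOCAL work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT parameter_id, count(product_id)
FROM
  (SELECT *, bool_or(parameter_id = 2)
   OVER
   (partition by product_id) AS matching
   FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

ROLLBACK;
```

In the output, a Sort node reporting "Sort Method: external merge  Disk: ..." spilled to disk, while "Sort Method: quicksort  Memory: ..." stayed in memory. For the hash-based plans shown in the question, "Batches: 4" indicates the hash table was split into several batches, whereas "Batches: 1" would mean it fit within work_mem.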

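【Comments】:

I used SET LOCAL work_mem = '500MB'; but the query was even slower than the others: "Execution time: 3079.735 ms"

@nanolab Can you post the EXPLAIN ANALYZE output?

@nanolab What is the data type of parameter_id?

product_id integer NOT NULL, parameter_id integer NOT NULL,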
