How to optimize PostgreSQL COUNT GROUP BY query?
Asked: 2015-10-18 16:13:24
I have a table parameters_products with about 300k records. Is it possible to optimize this query?
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
(SELECT product_id
FROM parameters_products
WHERE parameter_id IN ('2'))
GROUP BY parameter_id
Query output:
2;274669
EXPLAIN ANALYZE VERBOSE ... output:
HashAggregate  (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
  Output: parameters_products.parameter_id, count(parameters_products.product_id)
  Group Key: parameters_products.parameter_id
  ->  Hash Semi Join  (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
        Output: parameters_products.parameter_id, parameters_products.product_id
        Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
        ->  Seq Scan on public.parameters_products  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
              Output: parameters_products.parameter_id, parameters_products.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
              Output: parameters_products_1.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products parameters_products_1  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
                    Output: parameters_products_1.product_id
                    Filter: (parameters_products_1.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms
This is PostgreSQL 9.4.1, and VACUUM is enabled.
I just tried this query, but it is also slow:
SELECT pp1.parameter_id,
count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id
--
HashAggregate  (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
  Output: pp1.parameter_id, count(pp1.product_id)
  Group Key: pp1.parameter_id
  ->  Hash Join  (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
        Output: pp1.parameter_id, pp1.product_id
        Hash Cond: (pp1.product_id = pp2.product_id)
        ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
              Output: pp1.parameter_id, pp1.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
              Output: pp2.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products pp2  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
                    Output: pp2.product_id
                    Filter: (pp2.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms
Indexes:
CREATE INDEX parameters_products_parameter_id_idx
ON parameters_products
USING btree
(parameter_id);
CREATE INDEX parameters_products_product_id_idx
ON parameters_products
USING btree
(product_id);
CREATE INDEX parameters_products_product_id_parameter_id_idx
ON parameters_products
USING btree
(product_id, parameter_id);
EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
--
Hash Left Join  (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
  Output: pp1.parameter_id
  Hash Cond: (pp1.product_id = pp2.product_id)
  ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
        Output: pp1.parameter_id, pp1.product_id
  ->  Hash  (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
        Output: pp2.product_id
        Buckets: 16384  Batches: 4  Memory Usage: 2644kB
        ->  Seq Scan on public.parameters_products pp2  (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
              Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms
SET enable_seqscan = OFF;
This reduced the execution time, but not significantly.
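For anyone repeating this experiment: enable_seqscan is a planner setting that is mostly useful for diagnosis, so it may be safer to scope it to a single transaction with SET LOCAL rather than change it for the whole session. A minimal sketch, reusing the first query from this question (the BUFFERS option is an addition here, not something used in the original test):

```sql
BEGIN;

-- SET LOCAL only lasts until the end of the current transaction,
-- so the rest of the session keeps the default planner behaviour.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT parameter_id AS id,
       COUNT(product_id) AS count
FROM parameters_products
WHERE product_id IN (SELECT product_id
                     FROM parameters_products
                     WHERE parameter_id = 2)
GROUP BY parameter_id;

-- Nothing was modified; ending the transaction discards the setting.
ROLLBACK;
```

Comparing this plan with the default one shows which scans the planner switches to and whether the change is really faster on your data.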
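Comments:
Replace WHERE IN with JOIN.
@lad2025 Execution time: 2361.735 ms
Isn't 2361 ms just 2.36 seconds? For processing 300k records, isn't that already quite good?
Note Rows Removed by Filter relative to the total number of rows.
It looks like, no matter what you do, about 90% of the records have to be counted. That is going to take some effort. If optimizing this type of query is really important, you may need to implement triggers that pre-aggregate the data.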
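The trigger idea from the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the summary table parameter_counts, the function and the trigger names are all hypothetical, and it tracks plain per-parameter row counts, which is simpler than the semi-join in the question. It avoids INSERT ... ON CONFLICT because that is not available in 9.4, so two sessions inserting the first row for a new parameter_id at the same time could still collide on the primary key.

```sql
-- Hypothetical summary table: one row per parameter_id.
CREATE TABLE parameter_counts (
    parameter_id integer PRIMARY KEY,
    cnt          bigint  NOT NULL
);

-- Seed the summary once from the existing data.
INSERT INTO parameter_counts (parameter_id, cnt)
SELECT parameter_id, count(*)
FROM parameters_products
GROUP BY parameter_id;

-- Keep the summary in step with every row-level change.
CREATE FUNCTION parameters_products_count_trg() RETURNS trigger AS $$
BEGIN
    -- A deleted row, or the old version of an updated row, leaves its group.
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE parameter_counts
           SET cnt = cnt - 1
         WHERE parameter_id = OLD.parameter_id;
    END IF;

    -- An inserted row, or the new version of an updated row, joins its group.
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE parameter_counts
           SET cnt = cnt + 1
         WHERE parameter_id = NEW.parameter_id;
        IF NOT FOUND THEN
            INSERT INTO parameter_counts (parameter_id, cnt)
            VALUES (NEW.parameter_id, 1);
        END IF;
    END IF;

    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER parameters_products_count
AFTER INSERT OR UPDATE OR DELETE ON parameters_products
FOR EACH ROW EXECUTE PROCEDURE parameters_products_count_trg();
```

Reading a count then becomes a primary-key lookup such as SELECT cnt FROM parameter_counts WHERE parameter_id = 2; at the cost of extra work on every write. TRUNCATE is not covered by this sketch.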
Answer 1:
The first thing I would try is replacing IN with EXISTS:
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
FROM parameters_products pp2
WHERE pp2.product_id = pp.product_id AND
pp2.parameter_id = 2
)
GROUP BY parameter_id;
Also, make sure you have an index on parameters_products(product_id, parameter_id).
Another idea is to use window functions:
select parameter_id, count(*)
from (select pp.*,
sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
from parameters_products pp
) pp
where cnt2 > 0
group by parameter_id;
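Comments:
The index already exists: CREATE INDEX parameters_products_product_id_parameter_id_idx ON parameters_products USING btree (product_id, parameter_id); With the first query I get "Execution time: 2239.944 ms" and with the second "Execution time: 2526.269 ms".
You could try an index with the columns in the opposite order: an index on parameters_products(parameter_id, product_id).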
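Spelled out, the reversed-order index from the comment above might be created as follows. The index name is made up here to match the naming of the existing indexes; the ANALYZE afterwards is an extra step that refreshes the table's planner statistics.

```sql
-- Same two columns as the existing composite index, but with
-- parameter_id leading, so rows with parameter_id = 2 are adjacent.
CREATE INDEX parameters_products_parameter_id_product_id_idx
    ON parameters_products
    USING btree (parameter_id, product_id);

-- Refresh statistics before re-running EXPLAIN ANALYZE.
ANALYZE parameters_products;
```

Because both columns are in the index, the subquery that selects product_id for one parameter_id could in principle be answered by an index-only scan, although with roughly 90% of rows matching the planner may still prefer a sequential scan.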
@wildplasser Still the same time.
Using your first query together with SET enable_seqscan = OFF; reduced the time! Thanks.
Answer 2:
Try:
SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
parameters_products pp2
ON
pp2.parameter_id = 2
AND
pp1.product_id = pp2.product_id
GROUP BY
pp1.parameter_id
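Moving the filter condition from the WHERE clause to the ON clause reduces the total number of rows involved in the JOIN. Hopefully this gives the same savings you saw in the comments and brings the execution time under 1 second.
Comments:
Sorry, this is the wrong query. The JOIN makes no sense. The result will be the same as: SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM parameters_products pp1 GROUP BY pp1.parameter_id
@nanolab I have updated my answer to use an inner join instead of a left join, so that it reproduces the result of the WHERE clause exactly. I apologize for overlooking that.
Yes, now it is correct. But "Execution time: 2249.975 ms". It seems it cannot be optimized.
@nanolab: One more test. Try SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM (SELECT product_id FROM parameters_products WHERE parameter_id = 2) AS pp2 JOIN parameters_products pp1 ON pp1.product_id = pp2.product_id GROUP BY pp1.parameter_id
This makes the smaller table the driving table of the JOIN and may get us back to the sub-second performance we saw earlier.
Execution time: 2240.112 ms
Answer 3:
RhodiumToad in #postgresql on freenode recommended the window function below. Note that it differs from Gordon Linoff's window function in that it uses bool_or instead of sum(case...):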
SELECT parameter_id, count(product_id)
FROM
(SELECT *, bool_or(parameter_id = 2)
OVER
(partition by product_id) AS matching
FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;
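RhodiumToad also mentioned that the work_mem parameter may be too small for any query of this size, whether it uses window functions, joins, or subselects. He suggested increasing work_mem so that the sort routines do not spill to disk.
If any of this helps you, all credit goes to RhodiumToad.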
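One way to try the work_mem suggestion is to raise it for a single transaction and check the plan for signs of spilling. This is a sketch, not a tuning recommendation: the 64MB value is an arbitrary assumption. The "Batches: 4" lines in the question's plans suggest the hash table did not fit in work_mem at the time.

```sql
-- Check the current value first.
SHOW work_mem;

BEGIN;

-- Assumed value for the experiment; applies to this transaction only.
SET LOCAL work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT parameter_id, count(product_id)
FROM (SELECT *,
             bool_or(parameter_id = 2)
                 OVER (PARTITION BY product_id) AS matching
      FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

ROLLBACK;
```

In the output, a sort reporting "Sort Method: external merge  Disk: ..." is still spilling, while "Sort Method: quicksort  Memory: ..." fits in memory; for hash joins, "Batches: 1" means the hash table fit.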
Comments:
I used SET LOCAL work_mem = '500MB'; but the query is even slower than the others: "Execution time: 3079.735 ms"
@nanolab Can you post the EXPLAIN ANALYZE output?
@nanolab What is the data type of parameter_id?
product_id integer NOT NULL, parameter_id integer NOT NULL