Slow postgres query when joining large tables
Posted: 2013-03-27 17:12:53

I have a query that is performing very slowly. I believe the problem is that I am joining several large tables, but I would still expect better performance. The query and its EXPLAIN ANALYZE output are below:
SELECT
"m_advertsnapshot"."id",
"m_advertsnapshot"."created",
"m_advertsnapshot"."modified",
"m_advertsnapshot"."snapshot_timestamp",
"m_advertsnapshot"."source_name",
COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height",
COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NOT NULL and m_advert.colour_id IS NOT NULL and m_advert.ctype IS NOT NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_has_height_plate_ctype",
COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL and m_advert.colour_id is NULL and m_advert.ctype is NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height_and_missing_plate_c268",
COUNT("m_adverthistory"."id") AS "adh_count",
COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL and m_advert.colour_id is NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height_and_missing_plate",
COUNT("m_advert"."widget_listing_id") AS "adh_count_with_wl"
FROM "m_advertsnapshot"
LEFT OUTER JOIN "m_adverthistory" ON ("m_advertsnapshot"."id" = "m_adverthistory"."advert_snapshot_id")
LEFT OUTER JOIN "m_advert" ON ("m_adverthistory"."advert_id" = "m_advert"."id")
GROUP BY
"m_advertsnapshot"."id",
"m_advertsnapshot"."created",
"m_advertsnapshot"."modified",
"m_advertsnapshot"."snapshot_timestamp",
"m_advertsnapshot"."source_name"
ORDER BY
"m_advertsnapshot"."snapshot_timestamp" DESC
"Sort (cost=796180.41..796180.90 rows=196 width=72) (actual time=18051.504..18051.519 rows=196 loops=1)"
" Sort Key: m_advertsnapshot.snapshot_timestamp"
" Sort Method: quicksort Memory: 60kB"
" -> HashAggregate (cost=796170.99..796172.95 rows=196 width=72) (actual time=18051.330..18051.396 rows=196 loops=1)"
" -> Hash Right Join (cost=227052.68..622950.33 rows=6298933 width=72) (actual time=2082.551..12166.226 rows=6298933 loops=1)"
" Hash Cond: (m_adverthistory.advert_snapshot_id = m_advertsnapshot.id)"
" -> Hash Left Join (cost=227045.27..536332.59 rows=6298933 width=24) (actual time=2082.483..9971.996 rows=6298933 loops=1)"
" Hash Cond: (m_adverthistory.advert_id = m_advert.id)"
" -> Seq Scan on m_adverthistory (cost=0.00..121858.33 rows=6298933 width=12) (actual time=0.003..1644.060 rows=6298933 loops=1)"
" -> Hash (cost=202575.12..202575.12 rows=1332812 width=20) (actual time=2080.897..2080.897 rows=1332812 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage: 525kB"
" -> Seq Scan on m_advert (cost=0.00..202575.12 rows=1332812 width=20) (actual time=0.007..1564.220 rows=1332812 loops=1)"
" -> Hash (cost=4.96..4.96 rows=196 width=52) (actual time=0.062..0.062 rows=196 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 17kB"
" -> Seq Scan on m_advertsnapshot (cost=0.00..4.96 rows=196 width=52) (actual time=0.004..0.030 rows=196 loops=1)"
"Total runtime: 18051.730 ms"
The query takes 18 seconds on postgres 9.2. The table sizes are:
m_advertsnapshot - 196 rows
m_adverthistory - 6,298,933 rows
m_advert - 1,332,812 rows
DDL:
-- m_advertsnapshot
CREATE TABLE m_advertsnapshot
(
id serial NOT NULL,
snapshot_timestamp timestamp with time zone NOT NULL,
source_name character varying(50),
CONSTRAINT m_advertsnapshot_pkey PRIMARY KEY (id),
CONSTRAINT m_advertsnapshot_source_name_6a9a437077520191_uniq UNIQUE (source_name, snapshot_timestamp)
)
WITH (
OIDS=FALSE
);
CREATE INDEX m_advertsnapshot_snapshot_timestamp
ON m_advertsnapshot
USING btree
(snapshot_timestamp);
-- m_adverthistory
CREATE TABLE m_adverthistory
(
id serial NOT NULL,
advert_id integer NOT NULL,
advert_snapshot_id integer NOT NULL,
observed_timestamp timestamp with time zone NOT NULL,
CONSTRAINT m_adverthistory_pkey PRIMARY KEY (id),
CONSTRAINT advert_id_refs_id_30735d9eef85241c FOREIGN KEY (advert_id)
REFERENCES m_advert (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT advert_snapshot_id_refs_id_55d3986f4f270624 FOREIGN KEY (advert_snapshot_id)
REFERENCES m_advertsnapshot (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT m_adverthistory_advert_id_13fa0dae39e78983_uniq UNIQUE (advert_id, advert_snapshot_id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX m_adverthistory_advert_id
ON m_adverthistory
USING btree
(advert_id);
CREATE INDEX m_adverthistory_advert_snapshot_id
ON m_adverthistory
USING btree
(advert_snapshot_id);
-- m_advert
CREATE TABLE m_advert
(
id serial NOT NULL,
widget_listing_id integer,
height integer,
ctype integer,
colour_id integer,
CONSTRAINT m_advert_pkey PRIMARY KEY (id),
CONSTRAINT "colour_id_refs_id_1e4e2dac0183b419" FOREIGN KEY (colour_id)
REFERENCES colour ("id") MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT widget_listing_id_refs_id_5a7e62d0d4f48013 FOREIGN KEY (widget_listing_id)
REFERENCES m_widgetlisting (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED
)
WITH (
OIDS=FALSE
);
CREATE INDEX m_advert_advert_seller_id
ON m_advert
USING btree
(advert_seller_id);
CREATE INDEX m_advert_colour_id
ON m_advert
USING btree
(colour_id);
CREATE INDEX m_advert_widget_listing_id
ON m_advert
USING btree
(widget_listing_id);
Any ideas on how to improve performance would be greatly appreciated.
Thanks!
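As an aside on the COUNT(CASE ... WHEN True THEN 1 ELSE null END) pattern above: on PostgreSQL 9.4 and later the same conditional counts can be written with the FILTER clause, which is easier to read (the question uses 9.2, which predates it, so this is a readability sketch rather than a fix; aliases and the trimmed column list are illustrative):

```sql
-- Equivalent conditional counts using FILTER (PostgreSQL 9.4+);
-- semantics match COUNT(CASE ... WHEN True THEN 1 ELSE null END)
SELECT
    s.id,
    COUNT(*) FILTER (WHERE a.widget_listing_id IS NULL
                       AND a.height IS NULL)  AS adh_count_with_no_wl_and_missing_height,
    COUNT(h.id)                               AS adh_count,
    COUNT(a.widget_listing_id)                AS adh_count_with_wl
FROM m_advertsnapshot s
LEFT OUTER JOIN m_adverthistory h ON s.id = h.advert_snapshot_id
LEFT OUTER JOIN m_advert a ON h.advert_id = a.id
GROUP BY s.id;
```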
Comments:
It seems to me that id (or at least advert_id) should be part of the primary key of m_advert and m_advertsnapshot. Do you have any primary or foreign keys in your schema? Please show us the DDL.
I've added the DDL. The joins are on primary/foreign keys. These were generated by Django (though I don't think that makes a difference).
You may get a boost by increasing work_mem for this query, giving it more room for hashing and sorting. Try SET work_mem = '50MB' before the query and see whether the plan or performance changes. Don't set it in postgresql.conf.
Have you tried changing the two indexes on the history table so that each contains both fields, advert_id and advert_snapshot_id? Having two indexes, on (advert_id, advert_snapshot_id) and (advert_snapshot_id, advert_id), might help, since the second key can then be fetched from the index itself.
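The two suggestions above can be sketched as follows (the '50MB' value and the index names are illustrative, not tuned recommendations; note the plan above shows "Batches: 128" on the m_advert hash, meaning it spilled across batches for lack of memory):

```sql
-- Raise work_mem for this session only, giving the hash join and
-- aggregate more room; do not change it globally in postgresql.conf
SET work_mem = '50MB';
-- then re-run EXPLAIN ANALYZE on the query and compare plans

-- Two-column indexes on the junction table, so whichever key is the
-- second column can be read back from the index itself
CREATE INDEX m_adverthistory_advert_id_snapshot_id
    ON m_adverthistory (advert_id, advert_snapshot_id);
CREATE INDEX m_adverthistory_snapshot_id_advert_id
    ON m_adverthistory (advert_snapshot_id, advert_id);
```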
Answer 1:
The schema looks reasonable (for a query that doesn't actually need indexes, and some of the indexes are already covered by FK constraints).
A junction table doesn't need a surrogate key (but it does no harm).
The real reason the query is slow is that it needs all rows from all tables to compute the aggregates. If you need 100% of the data, indexes can't help much.
Adding an extra restriction (e.g. on snapshot_timestamp >= some_date) could lead to a different plan that does use the indexes.
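As a sketch of that last point, restricting the query to recent snapshots gives the planner a chance to use the index on advert_snapshot_id instead of scanning every row (the 30-day interval and trimmed column list are arbitrary examples):

```sql
-- Only aggregate snapshots from the last 30 days; with few
-- m_advertsnapshot rows qualifying, index scans become attractive
SELECT s.id, s.snapshot_timestamp, COUNT(h.id) AS adh_count
FROM m_advertsnapshot s
LEFT OUTER JOIN m_adverthistory h ON s.id = h.advert_snapshot_id
WHERE s.snapshot_timestamp >= now() - interval '30 days'
GROUP BY s.id, s.snapshot_timestamp
ORDER BY s.snapshot_timestamp DESC;
```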