Slow postgres query when joining large tables

Posted: 2013-03-27 17:12:53

Question:

I have a query that is performing slowly. I believe the problem is that I'm joining several large tables, but I would still expect better performance. The query and EXPLAIN ANALYZE output are below:

SELECT
    "m_advertsnapshot"."id",
    "m_advertsnapshot"."created",
    "m_advertsnapshot"."modified",
    "m_advertsnapshot"."snapshot_timestamp",
    "m_advertsnapshot"."source_name",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NOT NULL and m_advert.colour_id IS NOT NULL and m_advert.ctype IS NOT NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_has_height_plate_ctype",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL and m_advert.colour_id is NULL and m_advert.ctype is NULL  WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height_and_missing_plate_c268",
    COUNT("m_adverthistory"."id") AS "adh_count",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL and m_advert.colour_id is NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height_and_missing_plate",
    COUNT("m_advert"."widget_listing_id") AS "adh_count_with_wl"
FROM "m_advertsnapshot"
    LEFT OUTER JOIN "m_adverthistory" ON ("m_advertsnapshot"."id" = "m_adverthistory"."advert_snapshot_id")
    LEFT OUTER JOIN "m_advert" ON ("m_adverthistory"."advert_id" = "m_advert"."id")
GROUP BY
    "m_advertsnapshot"."id",
    "m_advertsnapshot"."created",
    "m_advertsnapshot"."modified",
    "m_advertsnapshot"."snapshot_timestamp",
    "m_advertsnapshot"."source_name"
ORDER BY
    "m_advertsnapshot"."snapshot_timestamp" DESC



"Sort  (cost=796180.41..796180.90 rows=196 width=72) (actual time=18051.504..18051.519 rows=196 loops=1)"
"  Sort Key: m_advertsnapshot.snapshot_timestamp"
"  Sort Method: quicksort  Memory: 60kB"
"  ->  HashAggregate  (cost=796170.99..796172.95 rows=196 width=72) (actual time=18051.330..18051.396 rows=196 loops=1)"
"        ->  Hash Right Join  (cost=227052.68..622950.33 rows=6298933 width=72) (actual time=2082.551..12166.226 rows=6298933 loops=1)"
"              Hash Cond: (m_adverthistory.advert_snapshot_id = m_advertsnapshot.id)"
"              ->  Hash Left Join  (cost=227045.27..536332.59 rows=6298933 width=24) (actual time=2082.483..9971.996 rows=6298933 loops=1)"
"                    Hash Cond: (m_adverthistory.advert_id = m_advert.id)"
"                    ->  Seq Scan on m_adverthistory  (cost=0.00..121858.33 rows=6298933 width=12) (actual time=0.003..1644.060 rows=6298933 loops=1)"
"                    ->  Hash  (cost=202575.12..202575.12 rows=1332812 width=20) (actual time=2080.897..2080.897 rows=1332812 loops=1)"
"                          Buckets: 2048  Batches: 128  Memory Usage: 525kB"
"                          ->  Seq Scan on m_advert  (cost=0.00..202575.12 rows=1332812 width=20) (actual time=0.007..1564.220 rows=1332812 loops=1)"
"              ->  Hash  (cost=4.96..4.96 rows=196 width=52) (actual time=0.062..0.062 rows=196 loops=1)"
"                    Buckets: 1024  Batches: 1  Memory Usage: 17kB"
"                    ->  Seq Scan on m_advertsnapshot  (cost=0.00..4.96 rows=196 width=52) (actual time=0.004..0.030 rows=196 loops=1)"
"Total runtime: 18051.730 ms"
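One detail in the plan above is worth calling out: the hash built on m_advert reports `Buckets: 2048  Batches: 128`, which means the hash table did not fit in work_mem and was processed in 128 batches spilled to disk. A minimal experiment (the 50MB figure is an assumption to be tuned, not a recommendation from the original post) is to raise work_mem for the current session only and re-run the plan:

```sql
-- Session-local only; do not put this in postgresql.conf.
-- The 50MB value is a guess -- adjust it and watch the "Batches" count.
SET work_mem = '50MB';

EXPLAIN ANALYZE
SELECT ... ;  -- the query above; if "Batches" drops toward 1, the hash now fits in memory
```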

The query takes 18 seconds on Postgres 9.2. The table sizes are:

m_advertsnapshot - 196 rows
m_adverthistory - 6,298,933 rows
m_advert - 1,332,812 rows
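As an aside, the `COUNT(CASE … WHEN True THEN 1 ELSE null END)` construction in the query is the usual 9.2-era idiom for conditional counts. On PostgreSQL 9.4 and later the same aggregates can be written with the `FILTER` clause, which reads more clearly; a sketch using the column names above (this changes readability, not the plan in any material way):

```sql
-- FILTER requires PostgreSQL 9.4+; on 9.2 keep the CASE form shown above.
SELECT
    s.id,
    s.snapshot_timestamp,
    COUNT(h.id)                               AS adh_count,
    COUNT(a.widget_listing_id)                AS adh_count_with_wl,
    COUNT(*) FILTER (WHERE a.widget_listing_id IS NULL
                       AND a.height IS NULL)  AS adh_count_with_no_wl_and_missing_height
FROM m_advertsnapshot s
LEFT JOIN m_adverthistory h ON h.advert_snapshot_id = s.id
LEFT JOIN m_advert a        ON a.id = h.advert_id
GROUP BY s.id, s.snapshot_timestamp
ORDER BY s.snapshot_timestamp DESC;
```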

DDL:

-- m_advertsnapshot

CREATE TABLE m_advertsnapshot
(
  id serial NOT NULL,
  snapshot_timestamp timestamp with time zone NOT NULL,
  source_name character varying(50),
  CONSTRAINT m_advertsnapshot_pkey PRIMARY KEY (id),
  CONSTRAINT m_advertsnapshot_source_name_6a9a437077520191_uniq UNIQUE (source_name, snapshot_timestamp)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX m_advertsnapshot_snapshot_timestamp
  ON m_advertsnapshot
  USING btree
  (snapshot_timestamp);

-- m_adverthistory

CREATE TABLE m_adverthistory
(
  id serial NOT NULL,
  advert_id integer NOT NULL,
  advert_snapshot_id integer NOT NULL,
  observed_timestamp timestamp with time zone NOT NULL,
  CONSTRAINT m_adverthistory_pkey PRIMARY KEY (id),
  CONSTRAINT advert_id_refs_id_30735d9eef85241c FOREIGN KEY (advert_id)
      REFERENCES m_advert (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT advert_snapshot_id_refs_id_55d3986f4f270624 FOREIGN KEY (advert_snapshot_id)
      REFERENCES m_advertsnapshot (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT m_adverthistory_advert_id_13fa0dae39e78983_uniq UNIQUE (advert_id, advert_snapshot_id)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX m_adverthistory_advert_id
  ON m_adverthistory
  USING btree
  (advert_id);

CREATE INDEX m_adverthistory_advert_snapshot_id
  ON m_adverthistory
  USING btree
  (advert_snapshot_id);

-- m_advert

CREATE TABLE m_advert
(
  id serial NOT NULL,
  widget_listing_id integer,
  height integer,
  ctype integer,
  colour_id integer,
  CONSTRAINT m_advert_pkey PRIMARY KEY (id),
  CONSTRAINT "colour_id_refs_id_1e4e2dac0183b419" FOREIGN KEY (colour_id)
      REFERENCES colour ("id") MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT widget_listing_id_refs_id_5a7e62d0d4f48013 FOREIGN KEY (widget_listing_id)
      REFERENCES m_widgetlisting (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED
)
WITH (
  OIDS=FALSE
);

CREATE INDEX m_advert_advert_seller_id
  ON m_advert
  USING btree
  (advert_seller_id);

CREATE INDEX m_advert_colour_id
  ON m_advert
  USING btree
  (colour_id);

CREATE INDEX m_advert_widget_listing_id
  ON m_advert
  USING btree
  (widget_listing_id);

Any ideas on how to improve the performance would be much appreciated.

Thanks!

Comments:

It seems to me that id (or at least advert_id) should be part of the primary key of m_advert / m_advertsnapshot. Do you have any primary or foreign keys in your schema at all? Please show us the DDL.

I've added the DDL. The joins are on primary/foreign keys. These were generated by Django (though I don't think that makes any difference).

The schema looks reasonable (for a query that doesn't actually need indexes, and some of the indexes are already covered by the FK constraints). A junction table doesn't need a surrogate key (but it does no harm). The real reason the query is slow is that it needs every row from every table to compute the aggregates. If you need 100% of the data, indexes won't help. Adding an extra restriction (e.g. on snapshot_timestamp >= some_date) could result in a different plan that does use the indexes.

You may get a boost by increasing work_mem for this query, giving it more room for the hashes and the sort. Try SET work_mem = '50MB' before the query and see whether the plan or the performance changes. Don't set it in postgresql.conf.

Have you tried changing the two indexes on the history table so that each contains both columns? Having two indexes, on (advert_id, advert_snapshot_id) and on (advert_snapshot_id, advert_id), may help, because the second key can then be fetched from the index itself.

Answer 1:

The schema looks reasonable (for a query that doesn't actually need indexes, and some of the indexes are already covered by the FK constraints). A junction table doesn't need a surrogate key (but it does no harm). The real reason the query is slow is that it needs every row from every table to compute the aggregates. If you need 100% of the data, indexes won't help. Adding an extra restriction (e.g. on snapshot_timestamp >= some_date) could result in a different plan that does use the indexes.
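The index and date-restriction suggestions above can be sketched as follows (the index names and the cutoff date are illustrative, not from the original post):

```sql
-- Two-column indexes on the junction table, as suggested in the comments;
-- the second column can then be read straight out of the index.
CREATE INDEX m_adverthistory_snapshot_advert
  ON m_adverthistory USING btree (advert_snapshot_id, advert_id);
CREATE INDEX m_adverthistory_advert_snapshot
  ON m_adverthistory USING btree (advert_id, advert_snapshot_id);

-- Restricting the date range may let the planner use the existing
-- snapshot_timestamp index instead of scanning every row:
SELECT ...
FROM m_advertsnapshot
LEFT JOIN m_adverthistory ON ...
WHERE m_advertsnapshot.snapshot_timestamp >= '2013-01-01'  -- illustrative cutoff
GROUP BY ...;
```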

