Creating an index for a Django Full Text Search
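【Title】: Creating an index for a Django Full Text Search
【Posted】: 2021-11-13 02:55:18
【Question】: I am implementing full-text search on a blog using Django 3.2 and PostgreSQL 12.8. I have a database with 3,000 posts, and my search bar searches post_title, post_subtitle and post_text. The search uses weights, ranking and pagination. It works like a charm, but it is a bit slow. The exact query Django executes is: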
SELECT "core_post"."id", "core_post"."blog_name",
"core_post"."post_url", "core_post"."post_title", "core_post"."post_subtitle",
"core_post"."post_text",
ts_rank(((setweight(to_tsvector(COALESCE("core_post"."post_title", '')), 'A') ||
setweight(to_tsvector(COALESCE("core_post"."post_subtitle", '')), 'B')) ||
setweight(to_tsvector(COALESCE("core_post"."post_text", '')), 'C')),
plainto_tsquery('Angel'))
AS "rank" FROM "core_post" WHERE
ts_rank(((setweight(to_tsvector(COALESCE("core_post"."post_title", '')), 'A') ||
setweight(to_tsvector(COALESCE("core_post"."post_subtitle", '')), 'B')) ||
setweight(to_tsvector(COALESCE("core_post"."post_text", '')), 'C')),
plainto_tsquery('Angel')) >= 0.3
ORDER BY "rank" DESC LIMIT 15
When I run explain analyse on it, I get this:
Limit (cost=26321.90..26323.63 rows=15 width=256) (actual time=662.709..664.002 rows=15 loops=1)
-> Gather Merge (cost=26321.90..26998.33 rows=5882 width=256) (actual time=662.706..663.998 rows=15 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Sort (cost=25321.89..25336.60 rows=5882 width=256) (actual time=656.142..656.144 rows=12 loops=2)
Sort Key: (ts_rank(((setweight(to_tsvector((COALESCE(post_title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector(COALESCE(post_subtitle, ''::text)), 'B'::"char")) || setweight(to_tsvector(COALESCE(post_text, ''::text)), 'C'::"char")), plainto_tsquery('Angel'::text))) DESC
Sort Method: top-N heapsort Memory: 33kB
Worker 0: Sort Method: top-N heapsort Memory: 32kB
-> Parallel Seq Scan on core_post (cost=0.00..25177.58 rows=5882 width=256) (actual time=6.758..655.854 rows=90 loops=2)
Filter: (ts_rank(((setweight(to_tsvector((COALESCE(post_title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector(COALESCE(post_subtitle, ''::text)), 'B'::"char")) || setweight(to_tsvector(COALESCE(post_text, ''::text)), 'C'::"char")), plainto_tsquery('Angel'::text)) >= '0.3'::double precision)
Rows Removed by Filter: 14910
Planning Time: 0.345 ms
Execution Time: 664.065 ms
I am not very good at SQL or PostgreSQL, but based on the docs I created an index as follows:
create index search_view_idx
on core_post
using gin(
to_tsvector('english', COALESCE("core_post"."post_title", '') ||
to_tsvector('english', COALESCE("core_post"."post_subtitle", '') ||
to_tsvector('english', COALESCE("core_post"."post_text", '')
))));
But when I execute the Django query, it is still slow and does not use the index at all! This is the explain analyse of the query after search_view_idx was created:
Limit (cost=26321.90..26323.63 rows=15 width=256) (actual time=620.819..622.468 rows=15 loops=1)
-> Gather Merge (cost=26321.90..26998.33 rows=5882 width=256) (actual time=620.818..622.465 rows=15 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Sort (cost=25321.89..25336.60 rows=5882 width=256) (actual time=618.137..618.139 rows=12 loops=2)
Sort Key: (ts_rank(((setweight(to_tsvector((COALESCE(post_title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector(COALESCE(post_subtitle, ''::text)), 'B'::"char")) || setweight(to_tsvector(COALESCE(post_text, ''::text)), 'C'::"char")), plainto_tsquery('Angel'::text))) DESC
Sort Method: top-N heapsort Memory: 33kB
Worker 0: Sort Method: top-N heapsort Memory: 33kB
-> Parallel Seq Scan on core_post (cost=0.00..25177.58 rows=5882 width=256) (actual time=2.856..617.963 rows=90 loops=2)
Filter: (ts_rank(((setweight(to_tsvector((COALESCE(post_title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector(COALESCE(post_subtitle, ''::text)), 'B'::"char")) || setweight(to_tsvector(COALESCE(post_text, ''::text)), 'C'::"char")), plainto_tsquery('Angel'::text)) >= '0.3'::double precision)
Rows Removed by Filter: 14910
Planning Time: 0.122 ms
Execution Time: 622.500 ms
My guess is that I do not know how to create the index correctly.
How can I create an index in PostgreSQL for this Django query?
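【Comments】:
'english' needs to be an argument to to_tsvector, not to setweight. Also, the index needs to be on a tsvector, not on the rank (which is just a number). And you need to test whether the tsvector matches with @@ before trying to rank it.
@jjanes I got the part about the 'english' argument, and I edited the question to show the index I created after applying your corrections. However, when I execute the Django query, it does not use the index at all! Please check my edits to the question above.
【Answer 1】:
The index supports @@ queries, not ts_rank. That is fine, because you really should test for an @@ match before trying to compute the rank.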
SELECT "core_post"."id", "core_post"."blog_name",
"core_post"."post_url", "core_post"."post_title", "core_post"."post_subtitle",
"core_post"."post_text",
ts_rank(((setweight(to_tsvector(COALESCE("core_post"."post_title", '')), 'A') ||
setweight(to_tsvector(COALESCE("core_post"."post_subtitle", '')), 'B')) ||
setweight(to_tsvector(COALESCE("core_post"."post_text", '')), 'C')),
plainto_tsquery('Angel'))
AS "rank" FROM "core_post" WHERE
-- match on the same expression that search_view_idx is built on
to_tsvector('english', COALESCE("core_post"."post_title", '') ||
to_tsvector('english', COALESCE("core_post"."post_subtitle", '') ||
to_tsvector('english', COALESCE("core_post"."post_text", '')
))) @@ plainto_tsquery('Angel')
ORDER BY "rank" DESC LIMIT 15
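This will return rows ranked below 0.3 whenever fewer than 15 rows rank above it. You could add more code to filter those out, but 0.3 is really quite high, so I think you are better off not doing that. Otherwise you might find nothing at all even when you have reasonable matches.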
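For the query above to use an expression index, the tsvector expression in the WHERE clause has to be the same expression the index was created on, including the explicit 'english' configuration. One way to keep the two in sync is to generate both from a single helper. The sketch below is only an illustration: the helper names are hypothetical, and it builds a flat concatenation of three tsvectors, which is an assumption and not the exact expression from the post.

```python
# Hypothetical helpers (not from the post). Column and table names are
# interpolated directly, so they must be trusted, hard-coded identifiers.

def tsvector_expr(columns, config="english"):
    """Return one SQL tsvector expression covering all given columns."""
    parts = [
        f"to_tsvector('{config}', COALESCE(\"{col}\", ''))"
        for col in columns
    ]
    return "(" + " || ".join(parts) + ")"


def create_index_sql(index_name, table, columns, config="english"):
    """Return a CREATE INDEX statement for a GIN index on the expression."""
    expr = tsvector_expr(columns, config)
    return f'CREATE INDEX {index_name} ON "{table}" USING gin ({expr});'


def match_clause(columns, config="english"):
    """Return a WHERE fragment; %s is the driver placeholder for the term."""
    expr = tsvector_expr(columns, config)
    return f"{expr} @@ plainto_tsquery('{config}', %s)"


COLUMNS = ["post_title", "post_subtitle", "post_text"]
print(create_index_sql("search_view_idx", "core_post", COLUMNS))
print(match_clause(COLUMNS))
```

Because both statements come from tsvector_expr, a change to the column list or the configuration cannot leave the index and the query out of step. The ts_rank call in the SELECT list can keep its weighted vector, since only the @@ filter is served by the index.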
By the way, it looks like you have more than 3,000 posts. It is probably actually 30,000.
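The 30,000 figure can be read off the Parallel Seq Scan node in the question's plan. The short check below assumes, as EXPLAIN ANALYZE normally does for nodes with loops greater than 1, that the row counts shown are per-loop averages; the variable names are only for illustration.

```python
# Figures copied from the "Parallel Seq Scan on core_post" node in the question.
rows_kept_per_loop = 90         # rows=90
rows_removed_per_loop = 14910   # Rows Removed by Filter: 14910
loops = 2                       # loops=2 (the leader plus one worker)

# Every scanned row is either kept or removed by the filter.
total_rows_scanned = (rows_kept_per_loop + rows_removed_per_loop) * loops
print(total_rows_scanned)  # 30000
```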