Efficiency of indexes for a field with low cardinality

Posted 2023-02-16

【Title】: Efficiency of indexes for a field with low cardinality 【Asked】: 2021-09-19 04:17:13 【Question】:

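Suppose a postgres database has a nullable field that stores an enum value, and the enum has only two values, A and B.

All of my select queries have a where clause on this field.

My question: is adding an index on this field a good approach, or will it bring no performance gain, since every row contains either A, B, or null?

Is there any way to improve the performance of all the get calls?

Please help.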

【Answer 1】:

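If NULL does not occur often in the column, you can partition the table on that field. When the field is used in a condition, only the required partition is processed automatically, without any additional index.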
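The partitioning idea can be sketched as follows. This is a minimal illustration that assumes PostgreSQL declarative list partitioning; the type `item_status`, the table `items_part`, and all of its column names are hypothetical and do not come from the question.

```sql
-- Hypothetical enum with the two values from the question.
CREATE TYPE item_status AS ENUM ('A', 'B');

-- Parent table, partitioned by the low-cardinality column.
CREATE TABLE items_part (
    id          bigint NOT NULL,
    status      item_status,      -- nullable, as in the question
    other_field integer
) PARTITION BY LIST (status);

-- One partition per value, plus one for the (rare) NULL rows.
CREATE TABLE items_part_a    PARTITION OF items_part FOR VALUES IN ('A');
CREATE TABLE items_part_b    PARTITION OF items_part FOR VALUES IN ('B');
CREATE TABLE items_part_null PARTITION OF items_part FOR VALUES IN (NULL);

-- With partition pruning, the plan should touch only items_part_a.
EXPLAIN
SELECT * FROM items_part WHERE status = 'A';
```

Whether this pays off depends on table size and query mix; checking the `EXPLAIN` output on realistic data is the safest way to confirm that only one partition is scanned.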

【Answer 2】:

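An index on just that column is unlikely to be useful unless the distribution of values is very skewed (for example 99% A, 0.99% NULL, 0.01% B). But in that case you would probably be better off with a partial index on some other field, with the predicate WHERE this_field='B'.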
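A partial index of that kind might look like the sketch below. The table `items_skewed` and the index name are hypothetical; only `this_field`, `other_field`, and the predicate come from the answer.

```sql
-- Hypothetical table with a heavily skewed this_field (mostly 'A').
CREATE TABLE items_skewed (
    id          bigint PRIMARY KEY,
    this_field  text,             -- 'A', 'B' or NULL
    other_field integer
);

-- Partial index: holds entries only for the rare 'B' rows,
-- so it stays small no matter how many 'A' rows exist.
CREATE INDEX items_skewed_other_b_idx
    ON items_skewed (other_field)
    WHERE this_field = 'B';

-- The query has to imply the index predicate for the planner
-- to consider the partial index.
EXPLAIN
SELECT * FROM items_skewed
WHERE this_field = 'B' AND other_field = 7945;
```

Queries for `this_field = 'A'` cannot use this index, which is fine in the skewed case: they match almost every row, so a sequential scan is the sensible plan anyway.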

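Even with a more uniform distribution (33.33% A, 33.33% NULL, 33.33% B), it can be useful to include that column as the leading column of some multicolumn indexes. For example, for WHERE this_field='A' and other_field=7945, an index on (this_field, other_field) would generally be about 3 times more efficient than one on just (other_field), if the values are evenly distributed.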

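Where it can make a huge difference is with something like WHERE this_field='A' ORDER BY other_field LIMIT 5. With an index on (this_field, other_field), the scan can jump straight to the right spot in the index, read off the first 5 rows that pass the visibility checks, already in order, and then stop. If the index were only on (other_field) and the two columns were not statistically independent of each other, it might have to skip over an arbitrary number of 'B' or NULL rows before finding 5 rows with 'A'.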
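The two query shapes discussed in this answer can be tried with a sketch like the following. The table `events` and the index name are hypothetical; the column names and predicates follow the answer.

```sql
-- Hypothetical table with an evenly distributed this_field.
CREATE TABLE events (
    id          bigint PRIMARY KEY,
    this_field  text,             -- 'A', 'B' or NULL
    other_field integer
);

-- Low-cardinality column first, selective column second.
CREATE INDEX events_this_other_idx
    ON events (this_field, other_field);

-- Equality on both columns: a single index descent.
EXPLAIN
SELECT * FROM events
WHERE this_field = 'A' AND other_field = 7945;

-- Equality on the leading column plus ORDER BY on the second:
-- the index already yields rows in other_field order within 'A',
-- so the scan can stop after 5 visible rows.
EXPLAIN
SELECT * FROM events
WHERE this_field = 'A'
ORDER BY other_field
LIMIT 5;
```

On an empty or tiny table the planner may still prefer a sequential scan, so the plans are only meaningful after loading representative data and running `ANALYZE events;`.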

【Answer 3】:

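No. In most cases, an index on a low-cardinality column (or on a set of low-cardinality columns) is useless. Instead, you could use a conditional (partial) index. As an example, here is my tweets table, which has several boolean columns: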

twitters=# \d tweets
                           Table "public.tweets"
     Column     |           Type           | Collation | Nullable | Default 
----------------+--------------------------+-----------+----------+---------
 seq            | bigint                   |           | not null | 
 id             | bigint                   |           | not null | 
 user_id        | bigint                   |           | not null | 
 in_reply_to_id | bigint                   |           | not null | 0
 parent_seq     | bigint                   |           | not null | 0
 sucker_id      | integer                  |           | not null | 0
 created_at     | timestamp with time zone |           |          | 
 fetch_stamp    | timestamp with time zone |           | not null | now()
 is_dm          | boolean                  |           | not null | false
 is_reply_to_me | boolean                  |           | not null | false
 is_retweet     | boolean                  |           | not null | false
 did_resolve    | boolean                  |           | not null | false
 is_stuck       | boolean                  |           | not null | false
 need_refetch   | boolean                  |           | not null | false
 is_troll       | boolean                  |           | not null | false
 body           | text                     |           |          | 
 zoek           | tsvector                 |           |          | 
Indexes:
    "tweets_pkey" PRIMARY KEY, btree (seq)
    "tweets_id_key" UNIQUE CONSTRAINT, btree (id)
    "tweets_stamp_idx" UNIQUE, btree (fetch_stamp, seq)
    "tweets_du_idx" btree (created_at, user_id)
    "tweets_id_idx" btree (id) WHERE need_refetch = true
    "tweets_in_reply_to_id_created_at_idx" btree (in_reply_to_id, created_at) WHERE is_retweet = false AND did_resolve = false AND in_reply_to_id > 0
    "tweets_in_reply_to_id_fp" btree (in_reply_to_id)
    "tweets_parent_seq_fk" btree (parent_seq)
    "tweets_ud_idx" btree (user_id, created_at)
    "tweets_userid_id" btree (user_id, id)
    "tweets_zoek" gin (zoek)
Foreign-key constraints:
...

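The "tweets_in_reply_to_id_created_at_idx" index only has entries for rows that satisfy its condition. Once the references have been refetched (or the refetch has failed), the rows are removed from the index. So this index usually holds only a few pending records.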
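A query that such an index could serve might look like this. The query itself is an assumption for illustration (the answer does not show the one actually used); it relies only on the columns and the index predicate listed in the table definition above.

```sql
-- Hypothetical "work queue" query: fetch pending replies whose
-- parent tweet has not been resolved yet. The WHERE clause repeats
-- the predicate of tweets_in_reply_to_id_created_at_idx, so the
-- planner is able to consider that small partial index.
SELECT seq, id, in_reply_to_id, created_at
FROM tweets
WHERE is_retweet = false
  AND did_resolve = false
  AND in_reply_to_id > 0
ORDER BY in_reply_to_id, created_at
LIMIT 100;
```

Because resolved rows no longer satisfy the predicate, the index shrinks as the queue is worked off, which keeps both scans and index maintenance cheap.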

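A different example: a gender column. You would expect a 50/50 male/female distribution. Assuming a row size of about 100 bytes, there are about 70 rows on an 8K page. There will probably be both males and females on the same page, so even a search for only males or only females needs to read all the pages. (Having to read the index as well makes this worse, but the optimiser will wisely decide to ignore the index.) A clustered index might help, but it needs a lot of maintenance work. Not worth it.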
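To put a rough number on "probably", assume that each row's gender is independent of where the row is physically stored (an assumption added here, not stated in the answer). The probability that a page of 70 rows contains only one gender is then:

```latex
P(\text{page holds a single gender})
  = 2 \cdot \left(\tfrac{1}{2}\right)^{70}
  = 2^{-69}
  \approx 1.7 \times 10^{-21}
```

Under that assumption essentially every page contains both values, so filtering on gender cannot skip any heap pages.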

【Comments】:

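I agree with wildplasser. In my view, an index exists to help find data quickly. Save indexes for fields whose values have a useful distribution, so that a search using the index quickly narrows the search area to a much smaller subset of rows. An index on a two-valued field will never "pay the freight". Indexes help find pages; the records are extracted from them afterwards.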
