MariaDB - why are the primary keys not being used for joins on a specific table?

Posted 2023-02-24


【Published】: 2021-04-17 21:02:48
【Question】:

I'm trying to understand why these two queries, both of which join on primary keys, are handled differently.

This query, which joins on icd_codes, completes in 56 ms (when run as a plain SELECT, without the EXPLAIN, of course):

EXPLAIN
SELECT var.Var_ID,
       var.Gene,
       var.HGVSc,
       pVCF_145K.PT_ID,
       pVCF_145K.AD_ALT,
       pVCF_145K.AD_REF,
       icd_codes.ICD_NM,
       icd_codes.PT_AGE
FROM public.variants_145K var
         INNER JOIN public.pVCF_145K USING (Var_ID)
         INNER JOIN public.icd_codes using (PT_ID)
#          INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
  AND Canonical
  AND impact = 'high'
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
| id   | select_type | table     | type  | possible_keys                                                    | key                             | key_len | ref                    | rows | Extra                              |
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
|    1 | SIMPLE      | var       | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125     | NULL                   | 280  | Using index condition; Using where |
|    1 | SIMPLE      | pVCF_145K | ref   | PRIMARY,pVCF_145K_PT_ID_index                                    | PRIMARY                         | 326     | public.var.Var_ID      | 268  |                                    |
|    1 | SIMPLE      | icd_codes | ref   | PRIMARY                                                          | PRIMARY                         | 38      | public.pVCF_145K.PT_ID | 29   |                                    |
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+

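This query, which joins on demographics, takes 11 minutes, and I'm not sure how to interpret the differences in the EXPLAIN output. Why is a join buffer being used? How can this be optimized further?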

EXPLAIN
SELECT variants_145K.Var_ID,
       variants_145K.Gene,
       variants_145K.HGVSc,
       pVCF_145K.PT_ID,
       pVCF_145K.AD_ALT,
       pVCF_145K.AD_REF,
       demographics.Sex,
       demographics.Age
FROM public.variants_145K
         INNER JOIN public.pVCF_145K USING (Var_ID)
#          inner join public.icd_codes using (PT_ID)
         INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
  AND Canonical
  AND impact = 'high'
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| id   | select_type | table         | type   | possible_keys                                                    | key                             | key_len | ref                                                   | rows    | Extra                              |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
|    1 | SIMPLE      | variants_145K | range  | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125     | NULL                                                  | 280     | Using index condition; Using where |
|    1 | SIMPLE      | demographics  | ALL    | PRIMARY                                                          | NULL                            | NULL    | NULL                                                  | 1916393 | Using join buffer (flat, BNL join) |
|    1 | SIMPLE      | pVCF_145K     | eq_ref | PRIMARY,pVCF_145K_PT_ID_index                                    | PRIMARY                         | 364     | public.variants_145K.Var_ID,public.demographics.PT_ID | 1       |                                    |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+

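Adding a further filter on demographics (WHERE demographics.Platform IS NOT NULL) brings the time down to 38 seconds, as shown below. However, some of our queries don't use such a filter, so ideally the join would use the PT_ID primary key.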

+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
| id   | select_type | table         | type   | possible_keys                                                    | key                             | key_len | ref                                                   | rows   | Extra                                                                  |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
|    1 | SIMPLE      | variants_145K | range  | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125     | NULL                                                  | 280    | Using index condition; Using where                                     |
|    1 | SIMPLE      | demographics  | range  | PRIMARY,Demographics_PLATFORM_index                              | Demographics_PLATFORM_index     | 17      | NULL                                                  | 258544 | Using index condition; Using where; Using join buffer (flat, BNL join) |
|    1 | SIMPLE      | pVCF_145K     | eq_ref | PRIMARY,pVCF_145K_PT_ID_index                                    | PRIMARY                         | 364     | public.variants_145K.Var_ID,public.demographics.PT_ID | 1      |                                                                        |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+

Tables:

create table public.demographics  # 1,916,393 rows
(
    PT_ID varchar(9) not null
        primary key,
    Age float(3,1) null,
    Status varchar(8) not null,
    Sex varchar(7) not null,
    Race_1 varchar(41) not null,
    Race_2 varchar(41) not null,
    Ethnicity varchar(22) not null,
    Smoker_flag tinyint(1) not null,
    Platform char(4) null,
    MyCode_Consent tinyint(1) not null,
    MR_ENC_DT date null,
    Birthday date null,
    Deathday date null,
    max_unrelated_145K tinyint unsigned null
);
create index Demographics_PLATFORM_index
    on public.demographics (Platform);

create table public.icd_codes  # 116,220,141 rows
(
    PT_ID varchar(9) not null,
    ICD_CD varchar(8) not null,
    ICD_NM varchar(217) not null,
    DX_DT date not null,
    PT_AGE float(3,1) unsigned not null,
    CODE_SYSTEM char(7) not null,
    primary key (PT_ID, ICD_CD, DX_DT)
);

create table public.pVCF_145K  # 10,113,244,082 rows
(
    Var_ID varchar(81) not null,
    PT_ID varchar(9) not null,
    GT tinyint unsigned not null,
    GQ smallint unsigned not null,
    AD_REF smallint unsigned not null,
    AD_ALT smallint unsigned not null,
    DP smallint unsigned not null,
    FT varchar(30) null,
    primary key (Var_ID, PT_ID)
);
create index pVCF_145K_PT_ID_index
    on public.pVCF_145K (PT_ID);

create table public.variants_145K  # 151,314,917 rows
(
    Var_ID varchar(81) not null,
    Gene varchar(22) null,
    Feature varchar(18) not null,
    Feature_type varchar(10) null,
    HIGH_INF_POS tinyint(1) null,
    Consequence varchar(26) not null,
    rsid varchar(34) null,
    Impact varchar(8) not null,
    Canonical tinyint(1) not null,
    Exon smallint unsigned null,
    Intron smallint unsigned null,
    HGVSc varchar(323) null,
    HGVSp varchar(196) null,
    AA_position smallint unsigned null,
    gnomAD_NFE_MAF float null,
    SIFT varchar(14) null,
    PolyPhen varchar(17) null,
    GHS_Hom mediumint(5) unsigned null,
    GHS_Het mediumint(5) unsigned null,
    GHS_WT mediumint(5) unsigned null,
    IDT_MAF float null,
    VCR_MAF float null,
    UKB_MAF float null,
    Chr tinyint unsigned not null,
    Pos int(9) unsigned not null,
    Ref varchar(298) not null,
    Alt varchar(306) not null,
    primary key (Var_ID, Feature)
);
create index variants_145K_Chr_Pos_Ref_Alt_index
    on public.variants_145K (Chr, Pos, Ref, Alt);

create index variants_145K_Gene_index
    on public.variants_145K (Gene);

create index variants_145K_Impact_Gene_index
    on public.variants_145K (Impact, Gene);

create index variants_145K_rsid_index
    on public.variants_145K (rsid);

This is on MariaDB 10.5.8 (InnoDB).

Thanks!

【Comments】:

Added the definition of variants_145K.

【Answer 1】:

INDEX(impact, canonical, gene) or INDEX(canonical, impact, gene) would be a better fit for var.
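As a rough sketch of the index suggestion above: the index name below is made up for illustration, and rewriting the bare `Canonical` test as `Canonical = 1` is an assumption about what helps the optimizer use all three columns, not something confirmed in this thread. Compare the EXPLAIN output before and after on your own data.

```sql
-- Hypothetical index name; the column order follows the first suggestion above.
CREATE INDEX variants_145K_Impact_Canonical_Gene_index
    ON public.variants_145K (Impact, Canonical, Gene);

-- Assumption: an explicit equality on Canonical gives the optimizer a better
-- chance of using every index column than the bare boolean test does.
EXPLAIN
SELECT Var_ID, Gene, HGVSc
FROM public.variants_145K
WHERE Impact = 'high'
  AND Canonical = 1
  AND Gene IN ('SLC9A6', 'SLC9A7');
```

If key_len in the new plan is larger than the 125 reported for variants_145K_Impact_Gene_index, more of the index is likely being used.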

If you don't need it, remove INNER JOIN public.icd_codes USING (PT_ID). Accessing that table is costly, and all it does is filter out rows that fail the JOIN.

Ditto for demographics.

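The "join buffer" is not always something the optimizer merely "resorts to"; it is often the faster approach, especially when most of the table is needed and the join_buffer is big enough.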
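To see what the join buffer has to work with, the relevant settings can be inspected and changed for the current session only. This is a minimal sketch: join_buffer_size and join_cache_level are MariaDB system variables, but the 64 MiB value is an arbitrary illustration, not a recommendation.

```sql
-- Inspect the current join-buffer settings.
SHOW VARIABLES LIKE 'join_buffer_size';
SHOW VARIABLES LIKE 'join_cache_level';

-- Session-only experiment: raise the buffer, then re-run the EXPLAIN
-- (and the query itself) to see whether the timing changes.
-- 64 MiB is an illustrative value only.
SET SESSION join_buffer_size = 64 * 1024 * 1024;
```

Because the change is scoped to the session, it disappears when the connection closes and does not affect other clients.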

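Note that demographics has a single-column PRIMARY KEY(PT_ID), whereas the other table has a composite PK. That may affect whether the optimizer considers using the "join buffer".

Depending on many things (in the query and in the data), the optimizer may make the wrong choice between the join_buffer and repeated lookups.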
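One way to test whether the optimizer picked the wrong strategy is to compare against a forced join order. In the slow plan, demographics is read before pVCF_145K, so no PT_ID value is available yet for a primary-key lookup. The sketch below refreshes statistics and then uses STRAIGHT_JOIN to join the tables in the order written (variants_145K, then pVCF_145K, then demographics). Whether this is actually faster on this data is an assumption to verify, not a guaranteed fix.

```sql
-- Refresh statistics first; stale statistics can skew the join-order choice.
ANALYZE TABLE public.demographics;

-- STRAIGHT_JOIN makes the optimizer join the tables in the order listed,
-- so demographics is reached after pVCF_145K has supplied a PT_ID.
EXPLAIN
SELECT STRAIGHT_JOIN
       var.Var_ID,
       var.Gene,
       var.HGVSc,
       pVCF_145K.PT_ID,
       pVCF_145K.AD_ALT,
       pVCF_145K.AD_REF,
       demographics.Sex,
       demographics.Age
FROM public.variants_145K var
         INNER JOIN public.pVCF_145K USING (Var_ID)
         INNER JOIN public.demographics USING (PT_ID)
WHERE var.Gene IN ('SLC9A6', 'SLC9A7')
  AND var.Canonical
  AND var.Impact = 'high';
```

If the demographics row in the resulting plan shows type eq_ref with key PRIMARY, the primary key is being used for the join; drop the EXPLAIN to time the query itself.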

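【Comments】:

Thanks! We often include other columns from the icd_codes and demographics tables in our queries; I only trimmed them here for brevity, since the issue is that the PT_ID key is not used for the demographics join while it is used for the icd_codes join. Why is (impact, gene) better? My thinking was that we often filter on gene without impact, and when we do include an impact filter it is always accompanied by gene. My main confusion is: why is the PT_ID primary key used when joining to icd_codes but not when joining to demographics?

@JonLuo - Since you also search on gene without impact, have both (impact, gene) and (gene). Here is a related discussion (although IN is not quite the same as a "range"): ***.com/questions/50239658/…

@JonLuo - and... The columns of an index are used left to right, with no skipping. If you need to skip one, the rest of the index is of no use (for the query in question).

@JonLuo - and... For your last question, I have added to my answer.

Thanks Rick, I changed the index to (impact, gene). How would you suggest optimizing the join on demographics? The example query with icd_codes completes in 56 ms, while the one with demographics takes about 11 minutes. I wonder whether this was caused by the upgrade from MariaDB 10.3 to 10.5... similar queries using demographics used to be very fast.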
