How to recursively create pedigrees and find matches for inbreeding detection (Oracle)

Posted 2023-05-09


[Asked]: 2020-05-05 13:36:01

[Question]:

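I am having a hard time creating a function in Oracle that determines whether mating 2 animals would result in inbreeding. The function takes 3 parameters: the male ID, the female ID, and the depth to look at.

At first I thought I should build two pedigrees from the data in a table structured like this: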

    TABLE animal
    +-----+---------+--------+
    | ID  | SIRE_ID | DAM_ID |
    +-----+---------+--------+
    | 111 | 112     | 212    |
    | 112 | 113     | 213    |
    | 212 | 116     | 216    |
    +-----+---------+--------+

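(Not entirely relevant, but in this and the following examples I use IDs of the form 1?? for males and 2?? for females.) To build the pedigrees I would use the depth parameter, probably recursively.

Here is what I have so far: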

function animal_pedigree (p_id number,
    p_max_pedigree_level number,
    p_pedigree_level number := 0,
    p_position varchar2 := '') return animal_ancestors_table
    pipelined
is
    v_sire_id number;
    v_dam_id number;
    v_row animal_ancestor;
begin
    -- Emit the current animal with its level and its path ('s' = sire, 'd' = dam).
    v_row.id := p_id;
    v_row.pedigree_level := p_pedigree_level;
    v_row.position := p_position;
    pipe row (v_row);
    -- Recurse into the parents until the requested depth is reached.
    if p_pedigree_level < p_max_pedigree_level then
        select sire_id, dam_id
        into v_sire_id, v_dam_id
        from arc.animal
        where id = p_id;
        if v_sire_id is not null then
            -- Re-pipe every row produced for the sire's branch.
            for rec in (select id, pedigree_level, position
                from table(animal_pedigree (v_sire_id, p_max_pedigree_level, p_pedigree_level+1, p_position || 's'))) loop
                v_row.id := rec.id;
                v_row.pedigree_level := rec.pedigree_level;
                v_row.position := rec.position;
                pipe row (v_row);
            end loop;
        end if;
        if v_dam_id is not null then
            -- Re-pipe every row produced for the dam's branch.
            for rec in (select id, pedigree_level, position
                from table(animal_pedigree (v_dam_id, p_max_pedigree_level, p_pedigree_level+1, p_position || 'd'))) loop
                v_row.id := rec.id;
                v_row.pedigree_level := rec.pedigree_level;
                v_row.position := rec.position;
                pipe row (v_row);
            end loop;
        end if;
    end if;
    return;
end;
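
To make the traversal order easier to follow, here is a minimal Python sketch of the same recursion. It is not part of the original post: the `ANIMALS` dictionary is a hypothetical in-memory stand-in for the `animal` table, holding only the three sample rows shown above.

```python
# Hypothetical stand-in for the animal table: id -> (sire_id, dam_id).
ANIMALS = {
    111: (112, 212),
    112: (113, 213),
    212: (116, 216),
}

def animal_pedigree(animal_id, max_level, level=0, position=""):
    """Yield (id, pedigree_level, position) rows, current animal first."""
    yield (animal_id, level, position)
    if level >= max_level:
        return
    # Animals without a row in ANIMALS are treated as having unknown parents.
    sire_id, dam_id = ANIMALS.get(animal_id, (None, None))
    if sire_id is not None:
        yield from animal_pedigree(sire_id, max_level, level + 1, position + "s")
    if dam_id is not None:
        yield from animal_pedigree(dam_id, max_level, level + 1, position + "d")

rows = list(animal_pedigree(111, 2))
for row in rows:
    print(row)
```

The `position` string records the path from the starting animal, so `"sd"` means the sire's dam.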

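Next comes the part I find trickiest: comparing the pedigrees to find matching IDs (and remembering the depth at which each match was found).

Ultimately I want to return the smallest depth at which inbreeding is found, or 0 if none is found.

Note! I only want to compare the two pedigrees with each other, not the IDs within a single pedigree. (If inbreeding already exists inside one pedigree, I want it ignored; I am only interested in newly formed inbreeding.)

To illustrate further, I have added 3 examples. Matches are marked with an * (asterisk).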

Example 1:

Male pedigree

Depth           1       2       3

                            |--114
                    |--113--|
                    |       |--214
            |--112--|       
            |       |       |--115
            |       |--213--|
            |               |--215
       111--|
            |               |--117
            |       |--116--|
            |       |       |--217
            |--212--|
                    |       |--118
                    |--216--|
                            |--218

Female pedigree

Depth           1       2       3

                            |--124
                    |--123--|
                    |       |--224
            |--122--|       
            |       |       |--125
            |       |--223--|
            |               |--225
       211--|
            |               |--127
            |       |--126--|
            |       |       |--227
            |--222--|
                    |       |--128
                    |--226--|
                            |--228

[RETURN 0] No identical IDs found.

Example 2:

Male pedigree

Depth           1       2       3

                            |--114*
                    |--113--|
                    |       |--214
            |--112--|       
            |       |       |--115
            |       |--213--|
            |               |--215
       111--|
            |               |--117
            |       |--116--|
            |       |       |--217
            |--212--|
                    |       |--114*
                    |--216--|
                            |--218

Female pedigree

Depth           1       2       3

                            |--124
                    |--123--|
                    |       |--224
            |--122--|       
            |       |       |--125
            |       |--223--|
            |               |--225
       211--|
            |               |--127
            |       |--126--|
            |       |       |--227
            |--222--|
                    |       |--128
                    |--226--|
                            |--228

[RETURN 0] Both matching IDs were found in the male pedigree. Ignored.

Example 3:

Male pedigree

Depth           1       2       3

                            |--114*
                    |--113--|
                    |       |--214
            |--112--|       
            |       |       |--115
            |       |--213--|
            |               |--215
       111--|
            |               |--117
            |       |--116--|
            |       |       |--217
            |--212--|
                    |       |--118
                    |--216--|
                            |--218

Female pedigree

Depth           1       2       3

                            |--124
                    |--123--|
                    |       |--224
            |--122--|       
            |       |       |--125
            |       |--223--|
            |               |--225
       211--|
            |               |--127
            |       |--114*-|
            |       |       |--227
            |--222--|
                    |       |--128
                    |--226--|
                            |--228

[RETURN 2] Matching ID found at depth 3 in the male pedigree and at depth 2 in the female pedigree.
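
The rule illustrated by the three examples can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Oracle solution: `PARENTS_EX3`, `ancestor_depths` and `inbreeding_depth` are hypothetical names, the dictionary encodes the Example 3 trees by hand, and the two mated animals themselves (depth 0) are not treated as ancestors.

```python
# Example 3 encoded by hand: child id -> (sire_id, dam_id). Hypothetical data holder.
PARENTS_EX3 = {
    111: (112, 212), 112: (113, 213), 212: (116, 216),
    113: (114, 214), 213: (115, 215), 116: (117, 217), 216: (118, 218),
    211: (122, 222), 122: (123, 223), 222: (114, 226),
    123: (124, 224), 223: (125, 225), 114: (127, 227), 226: (128, 228),
}

def ancestor_depths(parents, animal_id, max_depth):
    """Map each ancestor id to the shallowest depth (1..max_depth) where it occurs."""
    depths = {}
    frontier = [animal_id]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for child in frontier:
            for parent in parents.get(child, (None, None)):
                if parent is not None:
                    # Level-by-level walk: the first time we see an id is its shallowest depth.
                    depths.setdefault(parent, depth)
                    next_frontier.append(parent)
        frontier = next_frontier
    return depths

def inbreeding_depth(parents, male_id, female_id, max_depth):
    """Smallest depth of an ancestor shared by both pedigrees, or 0 if there is none."""
    male = ancestor_depths(parents, male_id, max_depth)
    female = ancestor_depths(parents, female_id, max_depth)
    # Only ids present in BOTH pedigrees count; repeats inside one pedigree are ignored.
    common = male.keys() & female.keys()
    if not common:
        return 0
    return min(min(male[a], female[a]) for a in common)

print(inbreeding_depth(PARENTS_EX3, 111, 211, 3))
```

Because each pedigree is collapsed to a dictionary keyed by ancestor ID before the comparison, an ID that appears twice within one pedigree (as in Example 2) never produces a match on its own.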

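[Comments]:

Your examples help in understanding the problem. To help all of us test our tentative solutions, it would also help to add create table as select ... statements that reproduce the sample data in an actual table.

You can find the result with a recursive CTE. It should be much simpler than a procedure.

Sorry, I forgot to mention that I am running Oracle 10(g).

[Answer 1]:

Since you are on 10g rather than a newer version, you need to use Oracle's hierarchical queries instead of the recursive common table expression shown by The Impaler. For my solution to work, it helps to have the animal's gender encoded as a separate column rather than embedded in the animal ID, so I will use the following table definition. (Note: I don't have a 10g instance to try this on, so I'm not sure whether deferrable constraints are available in 10g. If not, just remove those clauses. They only make loading the sample data easier.):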

CREATE TABLE animal
    ( ID number not null primary key
    , GENDER varchar2(1) not null
    , SIRE_ID number
    , DAM_ID number
    , constraint animal_gender check (gender in ('M','F'))
    , constraint animal_sire_fk FOREIGN KEY (sire_id) REFERENCES animal(id) DEFERRABLE INITIALLY DEFERRED
    , constraint animal_dam_fk FOREIGN KEY (dam_id) REFERENCES animal(id) DEFERRABLE INITIALLY DEFERRED
    );

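From there it helps to generate a mapping of every animal to all of its ancestors. This is called a closure table; you can search for more about closure tables if needed. It can be built with recursive SQL or, in our case since you are on 10g, with an Oracle hierarchical query. In this example I name it Ancestry: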

with ancestry as (
select CONNECT_BY_ROOT id id
     , CONNECT_BY_ROOT gender gender
     , id ancestor_id
     , gender ancestor_gender
     , level-1 lvl
  from animal
  connect by id in (prior sire_id, prior dam_id)
)
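
As a rough picture of what the closure contains, the following Python sketch builds the same kind of rows from a small hand-made parent map. The names `PARENTS` and `ancestry_closure` are hypothetical, the sketch assumes an acyclic pedigree, and it collects rows in a set, so an ancestor reachable by several paths at the same level appears once here, whereas the hierarchical query would return one row per path.

```python
# Hypothetical sample data: every animal is a key; founders have unknown parents.
PARENTS = {
    111: (112, 212),
    112: (113, 213),
    212: (None, None),
    113: (None, None),
    213: (None, None),
}

def ancestry_closure(parents):
    """Return a set of (id, ancestor_id, lvl) rows; each animal is its own row at lvl 0."""
    rows = set()
    for root in parents:          # no START WITH: every animal acts as a root
        frontier = [root]
        lvl = 0
        while frontier:
            next_frontier = []
            for node in frontier:
                rows.add((root, node, lvl))
                for parent in parents.get(node, (None, None)):
                    if parent is not None:
                        next_frontier.append(parent)
            frontier = next_frontier
            lvl += 1
    return rows

closure = ancestry_closure(PARENTS)
for row in sorted(closure):
    print(row)
```

The `lvl` value plays the role of `level-1` in the query: 0 for the animal itself, 1 for its parents, 2 for its grandparents.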

From there you can find all animals with common ancestors via a moderately simple join:

select m.id sire_id
     , f.id dam_id
     , m.ancestor_id
     , m.ancestor_gender
     , m.lvl sire_lvl
     , f.lvl dam_lvl
  from ancestry m
  join ancestry f
    on m.ancestor_id = f.ancestor_id
   and m.gender = 'M'
   and f.gender = 'F';

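This query lists every pairing of a male and a female animal together with all of their common ancestors. That is a bit much; we want to narrow it down to the pairings you are interested in and to only the first common ancestor. To do that, we add a where clause that restricts the result to the pairings of interest and use aggregation so that only the first ancestor is reported: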

select m.id sire_id
     , f.id dam_id
     , max(m.ancestor_id) keep (dense_rank first order by least(m.lvl,f.lvl)) first_ancestor
     , max(m.ancestor_gender) keep (dense_rank first order by least(m.lvl,f.lvl)) ancestor_gnder
     , min(m.lvl) sire_lvl
     , min(f.lvl) dam_lvl
  from ancestry m
  join ancestry f
    on m.ancestor_id = f.ancestor_id
   and m.gender = 'M'
   and f.gender = 'F'
 where (m.id, f.id) in ((111,211))
 group by m.id, f.id;
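
The `max(...) keep (dense_rank first order by least(m.lvl,f.lvl))` expressions can be read as "among the matches with the smallest `least(lvl)` value, take the largest ID". A small Python sketch of that selection rule, using hypothetical dictionaries that map ancestor ID to level for one sire and one dam:

```python
def first_common_ancestor(male_levels, female_levels):
    """Pick the shared ancestor with the smallest min(level); ties go to the largest id."""
    matches = [
        (min(m_lvl, female_levels[anc]), anc)
        for anc, m_lvl in male_levels.items()
        if anc in female_levels
    ]
    if not matches:
        return None
    best = min(key for key, _ in matches)                 # DENSE_RANK FIRST
    return max(anc for key, anc in matches if key == best)  # MAX within that rank

# Hypothetical levels: 114 and 214 are shared and tie on min(level) == 2.
male_levels = {113: 2, 114: 3, 214: 3}
female_levels = {114: 2, 214: 2, 999: 1}
print(first_common_ancestor(male_levels, female_levels))
```

Note that `min(m.lvl)` and `min(f.lvl)` in the query are computed over all common ancestors of the pair, so they are not guaranteed to describe the same ancestor that `first_ancestor` reports.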

Putting it all together, the final query is:

with ancestry as (
select CONNECT_BY_ROOT id id
     , CONNECT_BY_ROOT gender gender
     , id ancestor_id
     , gender ancestor_gender
     , level-1 lvl
  from animal
  connect by id in (prior sire_id, prior dam_id)
)
select m.id sire_id
     , f.id dam_id
     , max(m.ancestor_id) keep (dense_rank first order by least(m.lvl,f.lvl)) first_ancestor
     , max(m.ancestor_gender) keep (dense_rank first order by least(m.lvl,f.lvl)) ancestor_gnder
     , min(m.lvl) sire_lvl
     , min(f.lvl) dam_lvl
  from ancestry m
  join ancestry f
    on m.ancestor_id = f.ancestor_id
   and m.gender = 'M'
   and f.gender = 'F'
 where (m.id, f.id) in ((111,211))
 group by m.id, f.id;

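You can see it in action in this db<>fiddle, where I have added separate lineages for examples 2 and 3: example 2 adds only a male line and example 3 adds only a female line, yielding the pairings (1111, 1211), (2111, 1211) and (1111, 3211) for examples 1, 2 and 3 respectively.

This only returns a record when the pair of animals has a common ancestor. It also pre-generates the entire ancestry closure, which can be time-consuming for a large adjacency list. For a more efficient query, the ancestry closure can be restricted to just the animals of interest with a START WITH condition. In addition, the search depth can be limited in either of two places (or both): a where clause in the output query, or a where clause in the ancestry query.

Finally, to meet your requirement that a pairing without common ancestors still returns a row showing level zero, a few minor modifications are needed. First, the ancestry CTE has to be modified so that the self row (depth zero) carries a null level. This is important for making the aggregation work. Then the aggregated columns and the join condition need slight updates to allow records without common ancestors. Here is the modified query: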

with ancestry as (
select CONNECT_BY_ROOT id id
     , CONNECT_BY_ROOT gender gender
     , id ancestor_id
     , gender ancestor_gender
     , case level when 1 then null else level-1 end lvl
     , level-1 lvl0
  from animal

 -- Limit depth to 3 generations
 where level-1 <= 3 

 connect by id in (prior sire_id, prior dam_id)

 -- Only build ancestry closure for these animals
 start with id in (1111,1211,2111,3211)
)
select m.id sire_id
     , f.id dam_id
     , max(nvl2(m.lvl,m.ancestor_id,null)) keep (dense_rank first order by least(m.lvl,f.lvl) nulls last) first_ancestor
     , max(nvl2(m.lvl,m.ancestor_gender,null)) keep (dense_rank first order by least(m.lvl,f.lvl) nulls last) ancestor_gnder
     , nvl(min(m.lvl),0) sire_lvl
     , nvl(min(f.lvl),0) dam_lvl
  from ancestry m
  join ancestry f
    on (m.ancestor_id = f.ancestor_id or (m.id, f.id) in ((m.ancestor_id, f.ancestor_id)))
   and m.gender = 'M'
   and f.gender = 'F'
 where (m.id, f.id) in ((1111,1211) -- First example no common ancestors
                       ,(2111,1211) -- 2nd ex common ancestors in male line
                       ,(1111,3211))-- 3rd ex common ancestry of sire & dam

   -- Limit to at most 3 generations
   and greatest(m.lvl0, f.lvl0) <= 3

 group by m.id, f.id;

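[Comments]:

I don't get it. This doesn't give any results in SQL Fiddle or in my database. It just keeps running... Could the reason be that it tries to build the whole ancestry? My database is quite large (millions of animals) and the ancestry goes 10 or even more levels deep. Maybe I am misreading the query, but I don't see a limit on the depth of the ancestry being built. Does it create the ancestry for all animals before selecting from it?

I'm not sure why SQL Fiddle isn't working; I'm having problems with it now too, so maybe it is an issue on their end. Beyond that, yes, the current query builds the entire ancestry closure and does not limit the search depth either. You can optimize the ancestry by adding a depth limit and by restricting the animals it searches. I have updated the last query above with those enhancements. Another option is to create the ancestry closure as a materialized view so that it doesn't need to be rebuilt for every query.

I moved the example from SQL Fiddle to dbfiddle, which seems more stable, at least for now.

Thanks Sentinel. The solution works, but I think I will have to use your idea of building a materialized view: even with a depth of 4 and the query limited to just 2 animals, it takes about 10 seconds to get a result. In my case I need to fill a list of about a hundred pairings on one page with the resulting numbers. So I will accept your answer, since it gives the expected results. Thanks for pointing me in the right direction :)

[Answer 2]:

You can use a recursive CTE to find the matching ancestors.

This example is untested, since you didn't provide a script to create the sample data. In any case, a query along these lines should work: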

with
l (id, aid, sire_id, dam_id, lvl) as (
  select id, id, sire_id, dam_id, 0 from animal where id = 111 -- male ID
  union all
  -- carry the new ancestor's own parents so the recursion can keep climbing
  select l.id, a.id, a.sire_id, a.dam_id, l.lvl + 1
  from l
  join animal a on a.id in (l.sire_id, l.dam_id)
),
r (id, aid, sire_id, dam_id, lvl) as (
  select id, id, sire_id, dam_id, 0 from animal where id = 211 -- female ID
  union all
  -- same step for the female side
  select r.id, a.id, a.sire_id, a.dam_id, r.lvl + 1
  from r
  join animal a on a.id in (r.sire_id, r.dam_id)
)
select
  l.id as male_id, l.aid as male_ancestor_id, l.lvl as male_ancestor_depth,
  r.id as female_id, r.aid as female_ancestor_id, r.lvl as female_ancestor_depth
from l
join r on r.aid = l.aid;

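This query returns all matches (there can be several) in all their combinations. You can add further changes to remove duplicate matches, since one animal can be an ancestor of several animals that are already present in each tree.

In addition, the main query shows all the details of each match, including the matching ancestor and its depth on each side. You can easily modify it to show only the depth (as you asked). Or... you can extend it to show the "full path" to each ancestor. It depends on the exact output you prefer. I bet that once you see the result, you will want to know more about it.

[Comments]:

Sorry, I forgot to mention that I am running Oracle 10(g).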
