Building a nested family tree (parent/child relationships) in R

Posted: 2018-04-02 07:56:30

Question:

I am working on a family tree.

I adapted Bob Horton's sqldf example from https://www.r-bloggers.com/exploring-recursive-ctes-with-sqldf/

My data:

      person            father
      Guillou Arthur    NA          
      Cleach Marc       NA          
      Guillou Eric      Guillou Arthur          
      Guillou Jacques   Guillou Arthur          
      Cleach Franck     Cleach Marc         
      Cleach Leo        Cleach Marc         
      Cleach Herbet     Cleach Leo          
      Cleach Adele      Cleach Herbet           
      Guillou Jean      Guillou Eric            
      Guillou Alan      Guillou Eric

My result: the descendants of "Guillou Arthur" (the top-level person, who has no father), ordered by level:

  name              parent_name       level
  Guillou Arthur    NA                1
  Guillou Eric      Guillou Arthur    2
  Guillou Jacques   Guillou Arthur    2
  Guillou Alan      Guillou Eric      3
  Guillou Jean      Guillou Eric      3

This table can be built with a recursive sqldf query.

Data:

person <- c("Guillou Arthur",
            "Cleach Marc",
            "Guillou Eric",
            "Guillou Jacques",
            "Cleach Franck",
            "Cleach Leo",
            "Cleach Herbet",
            "Cleach Adele",
            "Guillou Jean",
            "Guillou Alan")
father <- c(NA, NA, "Guillou Arthur", "Guillou Arthur", "Cleach Marc",
            "Cleach Marc", "Cleach Leo", "Cleach Herbet", "Guillou Eric",
            "Guillou Eric")

family <- data.frame(person, father)

Wide-to-long format conversion:

    library(tidyr)

    long_family <- gather(family, parent, parent_name, -person)

    long_family
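
In recent tidyr releases `gather()` is marked as superseded by `pivot_longer()`. The following is a minimal sketch of the same reshaping, assuming tidyr 1.0.0 or later is installed; it uses a three-row subset of the example data so that it runs on its own.

```r
library(tidyr)

# Three-row subset of the example data, for illustration only
family <- data.frame(
  person = c("Guillou Arthur", "Cleach Marc", "Guillou Eric"),
  father = c(NA, NA, "Guillou Arthur"),
  stringsAsFactors = FALSE
)

# Every column except `person` is stacked into a (parent, parent_name) pair
long_family <- pivot_longer(family, cols = -person,
                            names_to = "parent", values_to = "parent_name")
long_family
```

With a single parent column the result has one row per person, just like the `gather()` call; the difference would only show up in row order once a second parent column (for example a mother) is added.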

Recursive query that finds the descendants of "Guillou Arthur" (the top-level person, who has no father):

    library(sqldf)
      descendants_sql <- "
      WITH RECURSIVE descendants (name, parent_name, level) AS (
        SELECT person, parent_name, 1 FROM long_family 
          WHERE person = '%s'
          AND parent = '%s'

          UNION ALL
          SELECT F.person, F.parent_name, D.level + 1 
              FROM descendants D
              JOIN long_family F
              ON F.parent_name = D.name)

      SELECT * FROM descendants ORDER BY level, name
      "
      fam <- sqldf(sprintf(descendants_sql, 'Guillou Arthur', 'father'))
      fam   

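My question: how can I create, directly in R (rather than in SQL), a data.frame containing all the family trees? Each tree starts with a patriarch (a person with no father) such as "Cleach Marc". (Either an R approach or a sqldf approach is fine.)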

Answer 1:

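We build a recursive function that returns a person's paternal line; from there everything else is easy.

First, we define the data with stringsAsFactors = FALSE so that the names are stored as character strings rather than factors, which makes reshaping smoother.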

family <- data.frame(person, father,stringsAsFactors = FALSE)

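The function:

father_line <- function(x) {
  # Look up the father of person x
  dad <- subset(family, person == x)$father
  # A patriarch has no father: the line ends here
  if (is.na(dad)) return(x)
  # Otherwise prepend x to the father's own line
  c(x, father_line(dad))
}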


father_line ("Guillou Alan")
# [1] "Guillou Alan"   "Guillou Eric"   "Guillou Arthur"
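
To see how the recursive calls nest and unwind, a traced variant can print a message on every call and return. This is only an illustrative sketch: `father_line_traced` and its `depth` argument are hypothetical names that are not part of the answer, and it assumes each person appears exactly once in `family`.

```r
# Small three-generation subset, for illustration only
family <- data.frame(
  person = c("Guillou Arthur", "Guillou Eric", "Guillou Alan"),
  father = c(NA, "Guillou Arthur", "Guillou Eric"),
  stringsAsFactors = FALSE
)

father_line_traced <- function(x, depth = 0) {
  indent <- strrep("  ", depth)
  message(indent, "call:   ", x)
  dad <- family$father[family$person == x]
  if (is.na(dad)) {
    # Base case: a patriarch has no father, so the recursion stops here
    message(indent, "return: ", x)
    return(x)
  }
  # Recursive case: the father's line is computed first, then x is prepended
  line <- c(x, father_line_traced(dad, depth + 1))
  message(indent, "return: ", paste(line, collapse = " <- "))
  line
}

line <- father_line_traced("Guillou Alan")
```

The messages show that the outermost call (for "Guillou Alan") is the last one to return, while the innermost call (for the patriarch) returns first.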

Use it to get the level, the patriarch, and so on:

family$father_line <- lapply(family$person,father_line)
family$level       <- lengths(family$father_line)
family$patriarch   <- sapply(family$father_line,tail,1)

#             person         father                                          father_line level      patriarch
# 1   Guillou Arthur           <NA>                                       Guillou Arthur     1 Guillou Arthur
# 2      Cleach Marc           <NA>                                          Cleach Marc     1    Cleach Marc
# 3     Guillou Eric Guillou Arthur                         Guillou Eric, Guillou Arthur     2 Guillou Arthur
# 4  Guillou Jacques Guillou Arthur                      Guillou Jacques, Guillou Arthur     2 Guillou Arthur
# 5    Cleach Franck    Cleach Marc                           Cleach Franck, Cleach Marc     2    Cleach Marc
# 6       Cleach Leo    Cleach Marc                              Cleach Leo, Cleach Marc     2    Cleach Marc
# 7    Cleach Herbet     Cleach Leo               Cleach Herbet, Cleach Leo, Cleach Marc     3    Cleach Marc
# 8     Cleach Adele  Cleach Herbet Cleach Adele, Cleach Herbet, Cleach Leo, Cleach Marc     4    Cleach Marc
# 9     Guillou Jean   Guillou Eric           Guillou Jean, Guillou Eric, Guillou Arthur     3 Guillou Arthur
# 10    Guillou Alan   Guillou Eric           Guillou Alan, Guillou Eric, Guillou Arthur     3 Guillou Arthur

For example, to get the expected output given in the question:

subset(family,patriarch == "Guillou Arthur",select=c(person,father,level))
#             person         father level
# 1   Guillou Arthur           <NA>     1
# 3     Guillou Eric Guillou Arthur     2
# 4  Guillou Jacques Guillou Arthur     2
# 9     Guillou Jean   Guillou Eric     3
# 10    Guillou Alan   Guillou Eric     3 

A tidyverse version looks like this:

library(tidyverse)
family %>%
  mutate(family_line = map(person,father_line),
         level = lengths(family_line),
         patriarch = map_chr(family_line, last)) %>%  # map_chr() returns a character vector rather than a list
  filter(patriarch == "Guillou Arthur") %>%
  select(person,father,level)

#            person         father level
# 1  Guillou Arthur           <NA>     1
# 2    Guillou Eric Guillou Arthur     2
# 3 Guillou Jacques Guillou Arthur     2
# 4    Guillou Jean   Guillou Eric     3
# 5    Guillou Alan   Guillou Eric     3
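
Because `father_line()` walks the whole paternal line again for every person, the same ancestors are looked up many times. For larger tables, one alternative is to fill in levels generation by generation. The following is a sketch of that idea in base R, not the answer's method; the `all_trees` name is hypothetical, and the code assumes each person appears exactly once and every non-missing father is also listed as a person.

```r
person <- c("Guillou Arthur", "Cleach Marc", "Guillou Eric", "Guillou Jacques",
            "Cleach Franck", "Cleach Leo", "Cleach Herbet", "Cleach Adele",
            "Guillou Jean", "Guillou Alan")
father <- c(NA, NA, "Guillou Arthur", "Guillou Arthur", "Cleach Marc",
            "Cleach Marc", "Cleach Leo", "Cleach Herbet", "Guillou Eric",
            "Guillou Eric")
family <- data.frame(person, father, stringsAsFactors = FALSE)

# Patriarchs start at level 1 and are their own patriarch
family$level     <- ifelse(is.na(family$father), 1L, NA_integer_)
family$patriarch <- ifelse(is.na(family$father), family$person, NA_character_)

# Row index of each person's father (NA for patriarchs)
idx <- match(family$father, family$person)

# Each pass resolves the children of people resolved in earlier passes
repeat {
  ready <- is.na(family$level) & !is.na(family$level[idx])
  if (!any(ready)) break
  family$level[ready]     <- family$level[idx[ready]] + 1L
  family$patriarch[ready] <- family$patriarch[idx[ready]]
}

# One data.frame holding every tree, grouped by patriarch
all_trees <- family[order(family$patriarch, family$level, family$person), ]
all_trees
```

The loop runs once per generation rather than once per ancestor of every person, and it stops as soon as a pass resolves nobody, so rows with an unknown father simply keep `NA` in `level` and `patriarch`.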

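Comments:

Thanks for your help.

You're welcome. How big is your data? My approach is simple but not efficient, because I compute the whole line for every person (so the same thing is computed many times).

1000 - 5000 people! (I am working on historical data on 18th-century workers.) I would also like to find brothers and to create a tree according to level (data.tree package).

Thanks for the tidyverse version, which is clearer to me.

The logic of a recursive function is not straightforward at first: calls to the same function are nested inside one another, so the first call is the last to return and the last call (the one at the "patriarch") is the first to return. Inserting print statements can help you see what is going on. Don't hesitate to ask if you have specific questions.

Answer 2: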

You can probably do this with graph tools. With igraph, you can use the ego function to get each vertex's neighbors.

Quick sketch (needs checking!):

library(igraph)

family[] = lapply(family, factor, levels=unique(unlist(family)))

g = graph_from_adjacency_matrix(table(family))

cg = connect.neighborhood(g, order=length(V(g)), mode="out")

cbind( V(cg)$name, 
       sapply(ego(g, mode="out", mindist=1), function(x) replace(names(x), length(names(x))==0, NA)),
       ego_size(cg, mode="out") )[grep("Guillou", V(cg)$name),]

     [,1]              [,2]             [,3]
[1,] "Guillou Arthur"  NA               "1" 
[2,] "Guillou Eric"    "Guillou Arthur" "2" 
[3,] "Guillou Jacques" "Guillou Arthur" "2" 
[4,] "Guillou Jean"    "Guillou Eric"   "3" 
[5,] "Guillou Alan"    "Guillou Eric"   "3"

In fact, you may not need to build the neighborhood graph at all and can instead do:

cbind( V(g)$name, 
       sapply(ego(g, mode="out", mindist=1), function(x) replace(names(x), length(names(x))==0, NA)),
       ego_size(g, mode="out", order=length(V(g))) )[grep("Cleach", V(g)$name),]
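
To cover every tree at once rather than filtering by surname, one option is to measure the distance from each patriarch to every person. The sketch below is an alternative to the code above, not a rewrite of it: it assumes igraph is installed, builds the graph from a father-to-child edge list with `graph_from_data_frame()`, and uses `distances()`; the `result` name is hypothetical.

```r
library(igraph)

person <- c("Guillou Arthur", "Cleach Marc", "Guillou Eric", "Guillou Jacques",
            "Cleach Franck", "Cleach Leo", "Cleach Herbet", "Cleach Adele",
            "Guillou Jean", "Guillou Alan")
father <- c(NA, NA, "Guillou Arthur", "Guillou Arthur", "Cleach Marc",
            "Cleach Marc", "Cleach Leo", "Cleach Herbet", "Guillou Eric",
            "Guillou Eric")

# Edge list father -> child; patriarchs (NA father) contribute no edge
edges <- data.frame(from = father, to = person, stringsAsFactors = FALSE)
edges <- edges[!is.na(edges$from), ]
g <- graph_from_data_frame(
  edges, directed = TRUE,
  vertices = data.frame(name = person, stringsAsFactors = FALSE)
)

# Rows: patriarchs; columns: all persons; Inf means "not a descendant"
roots <- person[is.na(father)]
d <- distances(g, v = roots, to = V(g), mode = "out")

result <- data.frame(
  person    = colnames(d),
  patriarch = rownames(d)[apply(d, 2, which.min)],  # the only root with a finite distance
  level     = apply(d, 2, min) + 1,                 # patriarchs are at distance 0, level 1
  row.names = NULL
)
result[order(result$patriarch, result$level), ]
```

This relies on every person descending from exactly one patriarch, which holds when each person has at most one recorded father and there are no cycles.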

Comments:

Thanks for the quick sketch. It works well. Do you think I could do this the dplyr/tidyr way?

igraph does work with %>%, so maybe? But I don't use tidyr syntax, sorry.
