dplyr, lapply, or Map to identify information from one data.frame and place it into another [duplicate]

Posted 2023-04-18

Asked: 2016-06-30 02:20:38

Question:

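Edit:

Sorry, I did not mean to repost a question. My problem is not simply joining two tables: the columns I need to join on are not exactly identical in the two tables (I updated the sample data to illustrate this). That is, I want to pmatch or str_detect the strings in the Test.Takers$First column against the Every.Student.In.The.Country$First column. I am not sure how to incorporate pmatch or str_detect into a left_join. If you can point me to an SO post that covers this, I would be very grateful. My coding vocabulary is still poor, so none of the queries I typed led me to anything useful.

In any case, I eventually figured out how to use lapply on my data.frames: it turns out all I had to do was convert each row of the data.frame into a separate list item. I had to adjust the "matching_name_one_row" function so that it takes only one input to make this work. It is actually much slower than the other two versions: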

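matching_name_one_row <- function(student_df) {
    require(dplyr)
    require(stringr)

    # Narrow the candidates to rows with the same two surnames
    indexmp <- Every.Student.In.The.Country %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))
    # Keep the ids whose first name contains the student's first name
    id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
    # Return only the first hit (NA when there is none)
    return(id_num[1])
}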


rowlist <- list()
for(i in 1:nrow(Test.Takers)) rowlist[[i]]<- Test.Takers[i,]
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))

Original question (with updated data):

Test.Takers <- data.frame(
    Paternal = c('Last', 'Last','Last', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
    First = c('First', 'Name', 'First', 'Name', 'First', 'BEE'),
    id_num = NA,
    stringsAsFactors = F)

Every.Student.In.The.Country <- data.frame(
    Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
    First = c('First', 'Name', 'First', 'Name', 'First', 'Something Else', 'BEEMYFRIEND'),
    id_num = c(123, 456, 789, 234, 567, 890, 101),
    stringsAsFactors = F)

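I have two similar data.frames. The first contains about 30,000 names without id_nums, plus many other variables that I have omitted. The second contains about 12,000,000 names with id_nums. I want to fill in the id_nums of the first data.frame by matching the names (Paternal, Maternal, and First) across the two data.frames.

I came up with two solutions, but both are slow. The slowest, but easiest to read, is: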

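matching_name_one_row <- function(student_df, citizen_df) {
    require(dplyr)
    require(stringr)

    # Narrow the candidates to rows with the same two surnames
    indexmp <- citizen_df %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))
    # Keep the ids whose first name contains the student's first name
    id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
    # Return only the first hit (NA when there is none)
    return(id_num[1])
}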

for(i in 1:nrow(Test.Takers)) Test.Takers$id_num[i] = matching_name_one_row(Test.Takers[i,],Every.Student.In.The.Country)

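The function above (matching_name_one_row) takes a single row from the Test.Takers data.frame. I wrote it that way because I thought it would be easier to use inside Map() or lapply(). However, I still do not understand Map or lapply well, so I had to fall back on the loop above. It is far too slow.

Below is the (slightly) faster, but more cumbersome, code: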
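As an illustration of the Map() idea mentioned above, here is a minimal base-R sketch. The toy data frames `takers` and `citizens` and the helper `find_id` are hypothetical names invented for this example, and `grepl(..., fixed = TRUE)` is swapped in for `str_detect` so that the first name is treated as a literal substring.

```r
# Hypothetical toy data mirroring the question's column names.
takers <- data.frame(
  Paternal = c("Last", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Last", "Mother's Name"),
  First    = c("First", "First", "BEE"),
  stringsAsFactors = FALSE
)
citizens <- data.frame(
  Paternal = c("Last", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Last", "Mother's Name"),
  First    = c("First", "First", "BEEMYFRIEND"),
  id_num   = c(123, 567, 101),
  stringsAsFactors = FALSE
)

# Look up one student: exact match on both surnames, substring match on
# the first name. Returns the first matching id, or NA when none is found.
find_id <- function(paternal, maternal, first) {
  hits <- citizens$id_num[citizens$Paternal == paternal &
                          citizens$Maternal == maternal &
                          grepl(first, citizens$First, fixed = TRUE)]
  if (length(hits) == 0) NA_real_ else hits[1]
}

# Map() walks the three columns in parallel, taking one element from each
# per call, so no per-row list has to be built first.
takers$id_num <- unlist(
  Map(find_id, takers$Paternal, takers$Maternal, takers$First),
  use.names = FALSE
)
```

Note that this only changes how the loop is written: every call still scans the whole lookup table, so it should not be expected to run noticeably faster than the for loop.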

adding_id <- function(student_df, citizen_df) {

  require(dplyr)
  require(stringr)

  #Will hold subsets of last names
  indexp <- data.frame(Paternal='name')
  indexm <- data.frame(Maternal='name')

  for(i in 1:nrow(student_df)) {

    #Last names of current observation
    namep <- student_df$Paternal[i]
    namem <- student_df$Maternal[i]

    #Prevents from iterating through the entire citizen_df unnecessarily
    if(is.na(as.character(indexp$Paternal[1])) || as.character(indexp$Paternal[1]) != namep) {
      indexp <- citizen_df %>% filter(Paternal == as.character(student_df$Paternal[i]))
    }

    #Error occurs when a name does not exist in the citizen file
    if(!is.na(indexp$Paternal[1])) {

      #Prevents from iterating through the entire citizen_df unnecessarily
      #Note: indexm is not reset when indexp changes, so a repeated maternal
      #name under a new paternal name reuses the previous (stale) subset
      if(is.na(as.character(indexm$Maternal[1])) || as.character(indexm$Maternal[1]) != namem) {
        indexm <- indexp %>% filter(Maternal == as.character(student_df$Maternal[i]))
      }

      #Attach id_num if there is a partial string match for the first name
      student_df$id_num[i] <- indexm$id_num[str_detect(indexm$First, as.character(student_df$First[i]))][1]
    }
  }

  #creates a df for students with id_num found and not found
  #(<<- writes both results into the global environment)
  id_found <<- student_df %>% filter(!is.na(id_num))
  id_not_found <<- student_df %>% filter(is.na(id_num))
}

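Both versions work, but they take at least 11 hours to finish. I am fairly sure there is a faster way to do the same thing with dplyr, lapply, or Map. For example, I know dplyr has two-table verbs that could be used for this kind of variable matching; I just do not know how to apply them. Please help.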

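Answer 1:

You are on the right track. dplyr is designed for exactly this kind of problem. You will want to look into the join functions; for what you describe, left_join should be the right one.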

library(dplyr)
left_join(Test.Takers, Every.Student.In.The.Country, by=c("Paternal", "Maternal", "First"))

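This adds the id column from the Every.Student.In.The.Country data frame to the Test.Takers data frame.

Comments:

This is very useful and definitely helps a lot! Thank you! However, sometimes the "first names" in the two data frames do not match exactly. I would like to match the first names with a function like pmatch or str_detect. Is there a way to put such a function into left_join's by argument?

I do not know whether that is possible, and pmatch can return multiple results, which is a challenge. Since this question has been marked as a duplicate, it is unlikely to receive additional comments. I suggest creating a new question using this as a starting point.

I made an edit. If I do not get any more responses, I will write a new question. Thank you for the advice! (I was hoping to resolve the multiple-results problem by using only the first result of pmatch/str_detect :)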
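One detail worth checking when trying the exact join on the question's data: Test.Takers already carries an empty id_num column, so dplyr would suffix the two copies as id_num.x and id_num.y. A minimal sketch, assuming hypothetical toy frames `takers` and `citizens` shaped like the question's data, drops the placeholder column first:

```r
library(dplyr)

# Hypothetical toy data; takers has an all-NA id_num placeholder, as in the question.
takers <- data.frame(
  Paternal = c("Last", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Last", "Mother's Name"),
  First    = c("First", "First", "BEE"),
  id_num   = NA,
  stringsAsFactors = FALSE
)
citizens <- data.frame(
  Paternal = c("Last", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Last", "Mother's Name"),
  First    = c("First", "First", "BEEMYFRIEND"),
  id_num   = c(123, 567, 101),
  stringsAsFactors = FALSE
)

# Drop the placeholder so the joined result has a single id_num column.
joined <- takers %>%
  select(-id_num) %>%
  left_join(citizens, by = c("Paternal", "Maternal", "First"))
```

Rows whose three name columns match exactly receive their id_num; the "BEE" row stays NA because "BEE" is not equal to "BEEMYFRIEND", which is the limitation of an exact join on these names.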
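The partial-match idea discussed in these comments can be sketched without putting a function into `by`: join exactly on the two surnames, filter the candidate pairs with `str_detect`, and keep the first hit per student. This is only one possible approach, not something proposed in the thread; the toy data and the names `takers`, `citizens`, `row_id`, and `First_full` are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; the last student has no counterpart at all.
takers <- data.frame(
  Paternal = c("Last", "Paternal", "Father's Name", "Nobody"),
  Maternal = c("Maternal", "Last", "Mother's Name", "Here"),
  First    = c("Name", "First", "BEE", "X"),
  stringsAsFactors = FALSE
)
citizens <- data.frame(
  Paternal = c("Last", "Last", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Maternal", "Last", "Mother's Name"),
  First    = c("First", "Name", "First", "BEEMYFRIEND"),
  id_num   = c(123, 456, 567, 101),
  stringsAsFactors = FALSE
)

# Step 1: exact join on the surnames yields every candidate pair.
candidates <- takers %>%
  mutate(row_id = row_number()) %>%
  inner_join(rename(citizens, First_full = First),
             by = c("Paternal", "Maternal"))

# Step 2: keep pairs where the citizen's first name contains the student's
# first name (fixed() means a literal match), then the first hit per student.
best <- candidates %>%
  filter(str_detect(First_full, fixed(First))) %>%
  group_by(row_id) %>%
  slice(1) %>%
  ungroup() %>%
  select(row_id, id_num)

# Step 3: attach the ids back; students without a hit get NA.
result <- takers %>%
  mutate(row_id = row_number()) %>%
  left_join(best, by = "row_id") %>%
  select(-row_id)
```

Because the equality part runs as a single join rather than one scan of the lookup table per student, this is likely to scale much better than the row-by-row loops, although the intermediate `candidates` table can become large when many people share the same pair of surnames.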
