Subsetting a data frame containing factors, NA values, and wildcards

Posted 2023-02-14

Asked: 2022-01-17 01:57:13

Question:

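I have a large data frame with several categories; a simplified example is below. (The real data set has more than 10 tissues, more than 15 unique cell types per tissue with names of variable length, and thousands of genes.) The tissue columns are stored as factors.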

GENENAME    Tissue1     Tissue2     Tissue3
Gene1       CellType_AA CellType_BB CellType_G
Gene2       CellType_AA CellType_BB       <NA>
Gene3       CellType_AA       <NA>        <NA>
Gene4       CellType_AA CellType_BB CellType_G
Gene5             <NA>        <NA>  CellType_G
Gene6             <NA>  CellType_BB CellType_H
Gene7       CellType_AC CellType_BD CellType_H
Gene8             <NA>        <NA>  CellType_H
Gene9       CellType_AC CellType_BD       <NA>
Gene10            <NA>  CellType_BB       <NA>
Gene11            <NA>  CellType_BD CellType_H
Gene12      CellType_AC       <NA>        <NA>
Gene13            <NA>  CellType_E  CellType_I
Gene14      CellType_F  CellType_E  CellType_I
Gene15      CellType_F  CellType_E        <NA>

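What I want is a subset based on the cell types present in several tissues, ignoring the columns I don't care about. I would also like to use wildcards (in the example below, CellType_A*, so that both CellType_AA and CellType_AB are selected), and to ignore the remaining columns when I specify only some of them. I want the function to be easy to reuse for different combinations of cell types, so I added a separate argument for each column.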

To do this I wrote the function below, with each argument defaulting to "*", on the assumption that a column I don't specify would then accept any value.

Find_CoEnrich <- function(T1="*", T2="*", T3="*")
  subset(dataset, 
         grepl(T1, dataset$Tissue1)
         &grepl(T2, dataset$Tissue2)
         &grepl(T3, dataset$Tissue3)
         ,select = GENENAME
  )

However, when I test the function on a single column

Find_CoEnrich(T1="CellType_AA")

it returns only the following:

   GENENAME
1     Gene1
4     Gene4

instead of

1     Gene1
2     Gene2
3     Gene3
4     Gene4

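It skips every row that has an NA in one of the other columns. Even more puzzling, if I try a wildcard it seems to ignore the rest of the string and returns only the rows that have a value in every column, even when they don't match the rest of the string, such as Gene14: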

Find_CoEnrich(T1="CellType_A*")

   GENENAME
1     Gene1
4     Gene4
7     Gene7
14   Gene14

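I'm fairly sure the NAs in the table are causing the problem, but I have spent a long time trying to fix this and have run out of patience. Any help would be much appreciated.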

Comments:

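Should c"*" be c("*")? Please make sure you have tested your code before posting it in a question. Syntax errors caused by simple typos in the question can be frustrating, and it is not always clear that they are not bugs in your real code.

There was a copy-paste error between versions when I copied the example data, sorry; it is fixed now.

It only returns those rows because the other rows have missing values (NAs)!

Yes, I know. I want to know how to tell the code to look only at the columns I specify. I thought that setting the default to the wildcard * would make it accept anything in those columns, so that it would subset only on the arguments I specify, but I don't know how to make the wildcard apply to NA as well.

If you expect Genes 2 and 3, that suggests an NA in those fields should be allowed to match. But by that logic, Genes 5, 6, 8, 10, 11, and 13 should match too. I think you need to think through, and/or communicate better, how NA values should be treated in your logic.

Answer 1: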

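The * wildcard you intend to use has a specific meaning in a regular expression, which is how you tell grepl which values to accept: it means the preceding character is repeated zero or more times. Also, I believe you want a Boolean OR (|) between the grepl expressions, since you want any row in which one of the columns matches the pattern.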
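To make the point about the asterisk concrete, here is a small base R check using values from the example table. It is only an illustration, not part of the original answer; the variable names `loose`, `strict`, and `any_value` are made up for this sketch.

```r
# Values taken from the example table in the question.
x <- c("CellType_AA", "CellType_AC", "CellType_F", NA)

# "A*" means "zero or more A", so "CellType_A*" only requires "CellType_".
loose <- grepl("CellType_A*", x)

# Without the asterisk, the pattern requires a literal "A" after the underscore.
strict <- grepl("CellType_A", x)

# grepl() returns FALSE (not NA) for missing values, even with a
# match-anything pattern such as ".*".
any_value <- grepl(".*", x)

print(data.frame(x, loose, strict, any_value))
```

This would explain both symptoms in the question: the starred pattern accepts CellType_F, and any row with an NA in another column is dropped because grepl() yields FALSE there and the conditions are combined with &.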

Here is a possibly simpler solution using the tidyverse, with separate "row-based filtering" and "column selection" steps:

library(tidyverse)

dataset <-  # small subset of your data, rows 1-4 should match but not 5
  tribble(
    ~GENENAME,    ~Tissue1,     ~Tissue2,     ~Tissue3,
    "Gene1", "CellType_AA", "CellType_BB", "CellType_G",
    "Gene2", "CellType_AA", "CellType_BB", NA,
    "Gene3", "CellType_AA", NA, NA,
    "Gene4", "CellType_AA", "CellType_BB", "CellType_G",
    "Gene5", NA, NA, "CellType_G"
    )

desired_pattern <- "CellType_A"  # note that this already implies that any other character can follow, e.g. this will match CellType_AA, CellType_AB, etc.

dataset %>%
  select(all_of(c("GENENAME","Tissue1","Tissue2","Tissue3"))) %>%  # the column selection
  filter(if_any(  # this is a tad confusing: return the row if any of the specified columns matches the condition...
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = desired_pattern)  # specify the condition...str_detect() is basically grepl() under the hood
  ))

To match cell types that start with either A or B, change the pattern accordingly:

desired_pattern <- "CellType_[AB]"  # e.g. a character class; this will match any cell type that starts with A or B
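One way to express "starts with A or B" is a regular-expression character class. The pattern `CellType_[AB]` below is an assumption chosen for illustration, checked here with base R's grepl() on values from the example table:

```r
# Hypothetical pattern: "[AB]" matches exactly one character, either A or B.
ab_pattern <- "CellType_[AB]"

cells <- c("CellType_AA", "CellType_BD", "CellType_G", NA)
ab_match <- grepl(ab_pattern, cells)

print(data.frame(cells, ab_match))
```

The alternation `CellType_(A|B)` should behave the same way for these values.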

Edit:

To find rows that match CellType_A in one column and CellType_B in another, you can use two consecutive filter steps:

dataset %>%
  select(all_of(c("GENENAME","Tissue1","Tissue2","Tissue3"))) %>%  # the column selection
  filter(if_any(  # in this step, keep only rows that contain at least one `CellType_A`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_A")
  )) %>%
  filter(if_any(  # in this step, keep only rows that contain at least one `CellType_B`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_B")
  ))

The order of the two filter steps above does not matter (try swapping them to convince yourself!).
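If the per-column arguments from the question are preferred, a base R variant is also possible. The sketch below is not the answer's method: `find_coenrich` is a hypothetical rewrite of the asker's function in which a `NULL` argument means "do not filter on this column", so NA values in unspecified columns no longer remove rows. It assumes the column names Tissue1 to Tissue3 and uses a few rows from the example table.

```r
# A few rows from the question's table; tissue columns become factors.
dataset <- data.frame(
  GENENAME = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene7", "Gene14"),
  Tissue1  = c("CellType_AA", "CellType_AA", "CellType_AA", "CellType_AA",
               NA, "CellType_AC", "CellType_F"),
  Tissue2  = c("CellType_BB", "CellType_BB", NA, "CellType_BB",
               NA, "CellType_BD", "CellType_E"),
  Tissue3  = c("CellType_G", NA, NA, "CellType_G",
               "CellType_G", "CellType_H", "CellType_I"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: NULL (the default) leaves a column unconstrained.
find_coenrich <- function(data, T1 = NULL, T2 = NULL, T3 = NULL) {
  patterns <- list(Tissue1 = T1, Tissue2 = T2, Tissue3 = T3)
  keep <- rep(TRUE, nrow(data))
  for (col in names(patterns)) {
    pattern <- patterns[[col]]
    if (is.null(pattern)) next  # column not specified: skip it entirely
    # as.character() makes the factor-to-string conversion explicit;
    # grepl() returns FALSE for NA, so NA never matches a specified pattern.
    keep <- keep & grepl(pattern, as.character(data[[col]]))
  }
  data[keep, "GENENAME", drop = FALSE]
}

print(find_coenrich(dataset, T1 = "CellType_AA"))
print(find_coenrich(dataset, T1 = "CellType_A", T2 = "CellType_B"))
```

Unlike the if_any() approach, this ties each pattern to a specific column and combines the specified columns with AND, which is the behaviour the question's function was aiming for.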

Comments:

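Thanks, this seems to work! How would I change the pattern if I only want rows that have, for example, both CellType_AA and CellType_BB? I also tried the | OR operator myself, but I keep getting the error ‘|’ not meaningful for factors.

I added something about matching several types, e.g. A or B. As for the factor problem, that is a bit trickier: you first need to convert the factors to their character values, e.g. by wrapping them in as.character() inside grepl, like this: grepl(T1, as.character(dataset$Tissue1)) | grepl(T2, as.character(dataset$Tissue2))

Thanks, that helps a lot and it works. One thing, though: the pattern I need identifies rows that have CellType_A and CellType_B, not OR.

Ah, I see. I was focused on getting the logic right for the same condition across several columns. In that case I would do it in two steps, first a filter for CellType_A and then a filter for CellType_B (or the other way round, the order doesn't matter). That leaves you with the rows that contain at least one of each.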
