How to subset a data frame based on a "not equal to" criterion applied to a large number of columns?


【Title】: How to subset dataframe based on a "not equal to" criteria applied to a large number of columns? 【Posted】: 2019-08-20 09:46:37 【Question】:

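I'm new to R and am currently trying to subset my data according to predefined exclusion criteria. Specifically, I'm trying to remove all cases that have dementia, as coded by ICD-10. The problem is that multiple variables (~70) hold information about each individual's disease status. They are all coded the same way, though, so the same condition can be applied to all of them.

Some simulated data: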

#Create dataframe containing simulated data
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005,1006,1007,1008,1009,1010,1011),
                    disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
                    disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
                    disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'))

#data is structured as below:

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
3  1003           G560            G20             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
6  1006           F011           A049           A481
7  1007           F023             NA             NA
8  1008           C761             NA             NA
9  1009           H653           G300             NA
10 1010           A049           G308             NA
11 1011           J679           A045           D352


Here, I'm trying to remove every case that has a "dementia code" in any of the "disease_code" variables.

#Remove cases with dementia from dataframe (e.g. F023, G20)
Newdata_df <- subset(df, (2:4 != "F023"|"G20"|"F009"|"F002"|"F001"|"F000"|"F00"|    
                    "G309"| "G308"|"G301"|"G300"|"G30"| "F01"|"F018"|"F013"|
                    "F012"| "F011"| "F010"|"F01"))

The error I get is:

Error in 2:4 != "F023" | "G20" : 
  operations are possible only for numeric, logical or complex types

Ideally, the subsetted data frame should look like this:

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352

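I know there's a mistake in my code, but I'm not sure how to fix it. I've tried a few other approaches (using dplyr), but no luck so far.

Any help is much appreciated!

【Comments】:

You should reshape your data to long format. It will make your life (and your analysis) much easier.
Also keep the CRAN package icd in mind for your sanity. Many problems like this one benefit from, or require, applying comorbidity maps, which icd does very carefully and quickly using well-validated, widely cited disease maps. This doesn't answer your question, but using it might avoid the problem, depending on what you've already done and what you plan to do with the data.

【Answer 1】:

We can use melt/dcast from data.table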

library(data.table)
dcast(melt(setDT(df), id.var = 'ID')[,
     if(!any(value %in% dementia_codes)) .SD, .(ID)], ID ~ variable)
#    ID disease_code_1 disease_code_2 disease_code_3
#1: 1001           I802           A071           H250
#2: 1002           H356             NA             NA
#3: 1004           D235             NA           I802
#4: 1005           B178             NA             NA
#5: 1008           C761             NA             NA
#6: 1011           J679           A045           D352

Or this can be done more compactly in base R, without reshaping

df[!Reduce(`|`, lapply(df[-1], `%in%` , dementia_codes)),]
 #   ID disease_code_1 disease_code_2 disease_code_3
#1  1001           I802           A071           H250
#2  1002           H356             NA             NA
#4  1004           D235             NA           I802
#5  1005           B178             NA             NA
#8  1008           C761             NA             NA
#11 1011           J679           A045           D352

Data

dementia_codes <- c("F023", "G20", "F009", "F002", "F001", "F000", 
  "F00", "G309", "G308", "G301", "G300", "G30", "F01", "F018", "F013", 
   "F012", "F011", "F010", "F01")

【Comments】:

【Answer 2】:

How about this:

> dementia <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
+               "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
> 
> has_dementia <- apply(sapply(df[, -1], function(x) x %in% dementia), 1, any)
> 
> df[!has_dementia,]
     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352
> 

Edit:

A more elegant solution, thanks to @Ronan Shah:

> df[apply(df[-1], 1, function(x) !any(x %in% dementia)),]
     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352

Hope this helps.

【Comments】:

@Ronan Shah Nice! It is a more elegant solution. You should post it.

【Answer 3】:

One possibility with dplyr:

library(dplyr)

df %>%
 filter_at(vars(2:4), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

    ID disease_code_1 disease_code_2 disease_code_3
1 1001           I802           A071           H250
2 1002           H356             NA             NA
3 1004           D235             NA           I802
4 1005           B178             NA             NA
5 1008           C761             NA             NA
6 1011           J679           A045           D352

In this case, it keeps only the rows in which none of columns 2:4 contains any of the given codes.

Or:

df %>%
 filter_at(vars(contains("disease_code")), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",    
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

In this case, it applies the same check to every column whose name contains disease_code.
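
The dplyr documentation marks the scoped verbs such as filter_at as superseded. A possible equivalent with if_any is sketched below, assuming dplyr 1.0.4 or later is installed; `toy` and `toy_codes` are illustrative stand-ins and not part of the original answer.

```r
library(dplyr)

# Small stand-in data; the values are illustrative only.
toy <- data.frame(ID = 1:4,
                  disease_code_1 = c("I802", "F011", "H356", "G560"),
                  disease_code_2 = c("A071", "A049", "NA", "G20"),
                  stringsAsFactors = FALSE)
toy_codes <- c("F011", "G20")

# if_any() is TRUE for a row when at least one selected column matches,
# so negating it keeps the rows with no listed code in any disease_code column.
kept <- toy %>%
  filter(!if_any(contains("disease_code"), ~ .x %in% toy_codes))

print(kept)
```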

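【Comments】:

Thanks everyone for the suggestions! Thanks also for explaining what your suggested code does @tmfmnk - very useful!

【Answer 4】:

A for-loop version in base R, if you prefer.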

df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005,1006,1007,1008,1009,1010,1011),
                disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
                disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
                disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'), stringsAsFactors = FALSE)

dementia_codes <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308", "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")

new_df <- df[0,]

for(i in 1:nrow(df)) {
  currRow <- df[i,]
  # keep the row only if none of its values is a dementia code
  if(any(dementia_codes %in% as.character(currRow)) == FALSE) {
    new_df <- rbind(new_df, currRow)
  }
}

new_df
#      ID disease_code_1 disease_code_2 disease_code_3
# 1  1001           I802           A071           H250
# 2  1002           H356             NA             NA
# 4  1004           D235             NA           I802
# 5  1005           B178             NA             NA
# 8  1008           C761             NA             NA
# 11 1011           J679           A045           D352

【Comments】:

【Answer 5】:

As @docendo discimus mentioned in the comments, we can convert the data frame to long format with gather, group_by ID, keep only the IDs that have no dementia_code, and then spread them back to wide format.

library(tidyverse)

df %>%
   gather(key, value, -ID) %>%
   group_by(ID) %>%
   filter(!any(value %in% dementia_code)) %>%
   spread(key, value)

#   ID disease_code_1 disease_code_2 disease_code_3
#  <dbl> <chr>          <chr>          <chr>         
#1  1001 I802           A071           H250          
#2  1002 H356           NA             NA            
#3  1004 D235           NA             I802          
#4  1005 B178           NA             NA            
#5  1008 C761           NA             NA            
#6  1011 J679           A045           D352          

Data

dementia_code <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", 
"G308","G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
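
The tidyr documentation describes gather and spread as superseded by pivot_longer and pivot_wider. The same long-format round trip could be sketched as follows, assuming tidyr 1.0.0 or later; `toy` and `toy_codes` are illustrative stand-ins and not the answer's own objects.

```r
library(dplyr)
library(tidyr)

# Small stand-in data; the values are illustrative only.
toy <- data.frame(ID = 1:4,
                  disease_code_1 = c("I802", "F011", "H356", "G560"),
                  disease_code_2 = c("A071", "A049", "NA", "G20"),
                  stringsAsFactors = FALSE)
toy_codes <- c("F011", "G20")

kept <- toy %>%
  # wide -> long: one row per (ID, code column) pair
  pivot_longer(-ID, names_to = "key", values_to = "value") %>%
  group_by(ID) %>%
  # drop every ID that has at least one listed code
  filter(!any(value %in% toy_codes)) %>%
  ungroup() %>%
  # long -> wide: back to one row per ID
  pivot_wider(names_from = key, values_from = value)

print(kept)
```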

【Comments】:

Why load all of tidyverse? Isn't this just tidyr and dplyr?
@Dunois Yes, it is. I have a habit of loading everything by default :P
We could also use anti_join, e.g. Newdata_df <- df %>% anti_join(df %>% gather(DiseaseCodeNumber, CodeValue, -ID) %>% filter(CodeValue %in% c("F023","G20","F009","F002","F001","F000","F00", "G309", "G308","G301","G300","G30","F01","F018","F013", "F012", "F011","F010","F01")), by = "ID")

【Answer 6】:

We can create a vector of the codes to remove and use rowSums to drop the matching rows, i.e.

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
                "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")

df[rowSums(sapply(df[-1], `%in%`, codes_to_remove)) == 0,]

which gives,

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352
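
With around 70 code columns, as in the question, it may be convenient to pick the columns by name instead of by position. The sketch below wraps the rowSums idea in a helper; the function name `drop_rows_with_codes`, its `pattern` argument and the toy data are all hypothetical and not taken from the original answer.

```r
# Hypothetical helper: drop rows where any column whose name matches
# `pattern` holds one of the listed codes.
drop_rows_with_codes <- function(data, codes, pattern = "^disease_code_") {
  code_cols <- grep(pattern, names(data), value = TRUE)
  stopifnot(length(code_cols) > 0)
  # Logical matrix: one row per record, one column per code column.
  # cbind() keeps the matrix shape even when `data` has a single row.
  hits <- do.call(cbind, lapply(data[code_cols], `%in%`, codes))
  # rowSums() counts the matches per row; zero matches means the row is kept.
  data[rowSums(hits) == 0, , drop = FALSE]
}

# Small stand-in data; the values are illustrative only.
toy <- data.frame(ID = 1:4,
                  disease_code_1 = c("I802", "F011", "H356", "G560"),
                  disease_code_2 = c("A071", "A049", "NA", "G20"),
                  stringsAsFactors = FALSE)
toy_codes <- c("F011", "G20")

kept <- drop_rows_with_codes(toy, toy_codes)
print(kept)
```

Matching on the column-name prefix means the ID column never has to be excluded by position.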

【Comments】:
