How to subset dataframe based on a "not equal to" criteria applied to a large number of columns?
【Posted】: 2019-08-20 09:46:37
【Question】:
I am new to R and am trying to subset my data according to predefined exclusion criteria. Specifically, I want to remove every case with dementia, as coded in ICD-10. The difficulty is that the disease status of each person is spread across many variables (about 70). They are all coded the same way, though, so the same condition can be applied to each of them.
Some simulated data:
#Create dataframe containing simulated data
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005,1006,1007,1008,1009,1010,1011),
disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'))
#data is structured as below:
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
3 1003 G560 G20 NA
4 1004 D235 NA I802
5 1005 B178 NA NA
6 1006 F011 A049 A481
7 1007 F023 NA NA
8 1008 C761 NA NA
9 1009 H653 G300 NA
10 1010 A049 G308 NA
11 1011 J679 A045 D352
Here, I am trying to remove every case that has a "dementia code" in any of the "disease_code" variables.
#Remove cases with dementia from dataframe (e.g. F023, G20)
Newdata_df <- subset(df, (2:4 != "F023"|"G20"|"F009"|"F002"|"F001"|"F000"|"F00"|
"G309"| "G308"|"G301"|"G300"|"G30"| "F01"|"F018"|"F013"|
"F012"| "F011"| "F010"|"F01"))
The error I receive is:
Error in 2:4 != "F023" | "G20" :
operations are possible only for numeric, logical or complex types
Ideally, the subsetted dataframe would look like this:
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
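I know there is an error in my code, but I am not sure how to fix it. I have tried a few other approaches (using dplyr), but have had no luck so far.
Any help is much appreciated!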
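【Comments】:
You should reshape your data to long format. That will make your life (and analysis) much easier.
And keep the CRAN package icd in mind to preserve your sanity. Many problems like this one benefit from, or require, applying comorbidity maps, and icd does this carefully and quickly using well-validated, widely cited disease maps. This doesn't answer your question, but using it might sidestep the problem, depending on what you have already done and what you plan to do with the data.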
【Answer 1】:
We can use melt/dcast from data.table
library(data.table)
dcast(melt(setDT(df), id.var = 'ID')[,
if(!any(value %in% dementia_codes)) .SD, .(ID)], ID ~ variable)
# ID disease_code_1 disease_code_2 disease_code_3
#1: 1001 I802 A071 H250
#2: 1002 H356 NA NA
#3: 1004 D235 NA I802
#4: 1005 B178 NA NA
#5: 1008 C761 NA NA
#6: 1011 J679 A045 D352
Or this can be done more compactly in base R, without reshaping
df[!Reduce(`|`, lapply(df[-1], `%in%` , dementia_codes)),]
# ID disease_code_1 disease_code_2 disease_code_3
#1 1001 I802 A071 H250
#2 1002 H356 NA NA
#4 1004 D235 NA I802
#5 1005 B178 NA NA
#8 1008 C761 NA NA
#11 1011 J679 A045 D352
Data
dementia_codes <- c("F023", "G20", "F009", "F002", "F001", "F000",
"F00", "G309", "G308", "G301", "G300", "G30", "F01", "F018", "F013",
"F012", "F011", "F010", "F01")
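For readers wondering why the original attempt fails: `2:4` is just the integer vector 2, 3, 4 rather than a reference to columns, and `|` cannot combine character strings, which is what raises the error. The base R one-liner above can be unpacked into steps as in the minimal sketch below. The three-row data frame and the shortened code list are illustrative assumptions, not the question's real data.

```r
# Toy data (assumption): two code columns, real NA for missing codes
df <- data.frame(
  ID = c(1, 2, 3),
  disease_code_1 = c("I802", "F011", "H356"),
  disease_code_2 = c("A071", NA, "G20"),
  stringsAsFactors = FALSE
)
dementia_codes <- c("F011", "G20")  # shortened list, for illustration only

# One logical vector per code column: TRUE where the value is a dementia code
hits_per_column <- lapply(df[-1], `%in%`, dementia_codes)

# Element-wise OR across columns: TRUE if any column matched in that row
has_dementia <- Reduce(`|`, hits_per_column)

# Keep only the rows without a match
clean_df <- df[!has_dementia, ]
clean_df
```

Because the check is applied with `lapply` over `df[-1]`, the same code works unchanged whether there are 3 code columns or 70.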
【Comments】:
【Answer 2】:
How about this:
> dementia <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
+ "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
>
> has_dementia <- apply(sapply(df[, -1], function(x) x %in% dementia), 1, any)
>
> df[!has_dementia,]
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
>
Edit:
A more elegant solution, thanks to @Ronan Shah:
> df[apply(df[-1], 1, function(x) !any(x %in% dementia)),]
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
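One reason `%in%` is a safer choice here than `!=` is how the two treat missing values. The simulated data stores missing codes as the string 'NA', but real data will usually contain true `NA`s. A small sketch, using a made-up vector `x`:

```r
codes <- c("F023", "G20")   # shortened list, for illustration only
x <- c("F023", NA, "I802")  # hypothetical column with a true missing value

# Comparison propagates NA: the result is FALSE, NA, TRUE
not_equal <- x != "F023"

# Membership never returns NA: the result is FALSE, TRUE, TRUE
not_in <- !(x %in% codes)

not_equal
not_in
```

Indexing a data frame with a logical vector that contains `NA` yields rows filled with `NA`, so the `%in%` form avoids an extra `is.na()` check.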
Hope this helps.
【Comments】:
@Ronan Shah Nice! It is a more elegant solution. You should post it.
【Answer 3】:
One possibility with dplyr is:
library(dplyr)
df %>%
filter_at(vars(2:4), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
"G309", "G308","G301","G300","G30", "F01","F018","F013",
"F012", "F011", "F010","F01")))
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
3 1004 D235 NA I802
4 1005 B178 NA NA
5 1008 C761 NA NA
6 1011 J679 A045 D352
Here, it keeps only the rows in which none of columns 2:4 contains any of the given codes.
Or:
df %>%
filter_at(vars(contains("disease_code")), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
"G309", "G308","G301","G300","G30", "F01","F018","F013",
"F012", "F011", "F010","F01")))
Here, the same check is applied to every column whose name contains disease_code.
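In newer dplyr releases, `filter_at()` and `all_vars()` are superseded. A possible equivalent, assuming dplyr 1.0.4 or later (where `if_any()` is available), is sketched below on a small made-up data frame with a shortened code list:

```r
library(dplyr)

# Toy data (assumption), not the question's real data
df <- data.frame(
  ID = c(1001, 1002, 1003),
  disease_code_1 = c("I802", "F011", "H356"),
  disease_code_2 = c("A071", NA, "G20"),
  stringsAsFactors = FALSE
)
dementia_codes <- c("F011", "G20")  # shortened list, for illustration only

# Drop a row if ANY disease_code column holds one of the codes
clean_df <- df %>%
  filter(!if_any(starts_with("disease_code"), ~ .x %in% dementia_codes))
clean_df
```

Selecting columns with `starts_with()` rather than by position keeps the filter valid if columns are later added or reordered.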
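【Comments】:
Thanks everyone for the suggestions! Thanks also for explaining what your suggested code does @tmfmnk - very useful!
【Answer 4】:
A version with a base R for loop, if you prefer.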
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005,1006,1007,1008,1009,1010,1011),
disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'), stringsAsFactors = FALSE)
dementia_codes <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308", "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
new_df <- df[0,]
# Check each row in turn and keep it only if it holds no dementia code
for (i in seq_len(nrow(df))) {
  currRow <- df[i, ]
  if (!any(dementia_codes %in% as.character(currRow))) {
    new_df <- rbind(new_df, currRow)
  }
}
new_df
# ID disease_code_1 disease_code_2 disease_code_3
# 1 1001 I802 A071 H250
# 2 1002 H356 NA NA
# 4 1004 D235 NA I802
# 5 1005 B178 NA NA
# 8 1008 C761 NA NA
# 11 1011 J679 A045 D352
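Growing a data frame with `rbind` inside a loop copies it on every iteration, which can get slow with many rows. A variant of the same loop that records a keep/drop flag per row and subsets once at the end might look like this (toy data and a shortened code list, both assumptions):

```r
# Toy data (assumption), not the question's real data
df <- data.frame(
  ID = c(1001, 1002, 1003),
  disease_code_1 = c("I802", "F011", "H356"),
  disease_code_2 = c("A071", NA, "G20"),
  stringsAsFactors = FALSE
)
dementia_codes <- c("F011", "G20")  # shortened list, for illustration only

code_cols <- setdiff(names(df), "ID")  # every column except the identifier
keep <- logical(nrow(df))              # preallocated, all FALSE

for (i in seq_len(nrow(df))) {
  row_codes <- unlist(df[i, code_cols])          # codes of row i as a vector
  keep[i] <- !any(row_codes %in% dementia_codes) # TRUE if no dementia code
}

new_df <- df[keep, ]
new_df
```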
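【Comments】:
【Answer 5】:
As @docendo discimus mentioned in the comments, we can convert the data frame to long format with gather, group_by ID, keep only those IDs that have no dementia_code among their values, and then spread them back to wide format.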
library(tidyverse)
df %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
filter(!any(value %in% dementia_code)) %>%
spread(key, value)
# ID disease_code_1 disease_code_2 disease_code_3
# <dbl> <chr> <chr> <chr>
#1 1001 I802 A071 H250
#2 1002 H356 NA NA
#3 1004 D235 NA I802
#4 1005 B178 NA NA
#5 1008 C761 NA NA
#6 1011 J679 A045 D352
Data
dementia_code <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309",
"G308","G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
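Since tidyr 1.0.0, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The same long-format approach could be written as follows, again on a small made-up data frame with a shortened code list:

```r
library(dplyr)
library(tidyr)

# Toy data (assumption), not the question's real data
df <- data.frame(
  ID = c(1001, 1002, 1003),
  disease_code_1 = c("I802", "F011", "H356"),
  disease_code_2 = c("A071", NA, "G20"),
  stringsAsFactors = FALSE
)
dementia_code <- c("F011", "G20")  # shortened list, for illustration only

clean_df <- df %>%
  pivot_longer(-ID, names_to = "key", values_to = "value") %>%  # wide -> long
  group_by(ID) %>%
  filter(!any(value %in% dementia_code)) %>%  # drop every row of a flagged ID
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value)            # long -> wide
clean_df
```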
【Comments】:
Why load all of tidyverse? Isn't this just tidyr and dplyr?
@Dunois Yes, it is. I have a habit of loading everything by default :P
We could also use anti_join, e.g. Newdata_df <- df %>% anti_join(df %>% gather(DiseaseCodeNumber, CodeValue, -ID) %>% filter(CodeValue %in% c("F023","G20","F009","F002","F001","F000","F00", "G309", "G308","G301","G300","G30","F01","F018","F013", "F012", "F011","F010","F01")), by = "ID")
【Answer 6】:
We can create a vector with the codes to remove and use rowSums to drop the matching rows, i.e.
codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
"G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")
df[rowSums(sapply(df[-1], `%in%`, codes_to_remove)) == 0,]
which gives,
ID disease_code_1 disease_code_2 disease_code_3
1 1001 I802 A071 H250
2 1002 H356 NA NA
4 1004 D235 NA I802
5 1005 B178 NA NA
8 1008 C761 NA NA
11 1011 J679 A045 D352
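In a real data set, the roughly 70 code columns will probably sit next to unrelated variables, so `df[-1]` may pick up too much. One option is to select the code columns by name before applying the same `rowSums` idea. The sketch below assumes the columns share the prefix `disease_code_`; the `age` column and the shortened code list are made up for illustration.

```r
# Toy data (assumption): an extra non-code column alongside the code columns
df <- data.frame(
  ID = c(1001, 1002, 1003),
  age = c(71, 64, 80),
  disease_code_1 = c("I802", "F011", "H356"),
  disease_code_2 = c("A071", NA, "G20"),
  stringsAsFactors = FALSE
)
codes_to_remove <- c("F011", "G20")  # shortened list, for illustration only

# Names of all columns that start with the shared prefix
code_cols <- grep("^disease_code_", names(df), value = TRUE)

# Logical matrix: one row per case, one column per code variable
hits <- sapply(df[code_cols], `%in%`, codes_to_remove)

# Keep the cases with zero matches across all code columns
clean_df <- df[rowSums(hits) == 0, ]
clean_df
```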
【Comments】: