Check if characters are all equal in a group using dplyr - R
Posted: 2018-10-14 11:26:03
Question: In the following data frame, how can I group by the first two columns and check whether all values in the fourth column are the same? If they are the same, I want to replace them with ''.
In this example, the group combinations 'Embryonated + Protein' and 'Hatching + Lipid' are the only two groups whose letters are not all a.
df
         Stage variable Temperature letters       Mean
30 Embryonated Moisture          30       a  808.70882
31 Embryonated      NFE          20       a   53.28806
32 Embryonated      NFE          25       a   45.38572
33 Embryonated      NFE          30       a   84.56113
34 Embryonated  Protein          20      ab  118.53608
35 Embryonated  Protein          25       b  127.29849
36 Embryonated  Protein          30       a   84.55175
37    Hatching      Ash          20       a   16.95345
38    Hatching      Ash          25       a   14.54980
39    Hatching      Ash          30       a   13.38510
40    Hatching   Energy          20       a 4931.18857
41    Hatching   Energy          25       a 4187.27213
42    Hatching   Energy          30       a 4314.61171
43    Hatching    Lipid          20       b   26.44363
44    Hatching    Lipid          25       a   19.90928
45    Hatching    Lipid          30      ab   22.27561
46    Hatching Moisture          20       a  785.63062
47    Hatching Moisture          25       a  818.69860
48    Hatching Moisture          30       a  815.32070
49    Hatching      NFE          20       a   60.34359
50    Hatching      NFE          25       a   43.02979
I have tried using dplyr, to no avail.
library(dplyr)
grp_cols <- names(df)[c(1, 2)]  # group by Stage and variable
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
res <- df %>% group_by(.dots = dots) %>%
  do(k = all(letters == 'a'))  # returns FALSE for every group
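A likely reason the attempt returns FALSE for every group: inside do() the current group is available as `.`, and columns are not looked up by bare name, so `letters` probably resolves to base R's built-in `letters` vector (a to z) rather than the column. A minimal sketch of a per-group check with summarise() follows. The data frame `toy` is a made-up stand-in for the question's df, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for the question's df
toy <- data.frame(
  Stage    = c("Embryonated", "Embryonated", "Hatching", "Hatching"),
  variable = c("NFE", "NFE", "Lipid", "Lipid"),
  letters  = c("a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# One row per group; all_same is TRUE when the group's letters column
# holds a single distinct value
res <- toy %>%
  group_by(Stage, variable) %>%
  summarise(all_same = n_distinct(letters) == 1, .groups = "drop")

print(res)
```

Here summarise() evaluates `letters` against the grouped data, so the column takes precedence over the built-in vector of the same name.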
Data:
dput(df)
structure(list(Stage = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Developing",
"Embryonated", "Hatching", "Laid"), class = "factor"), variable = structure(c(1L,
5L, 5L, 5L, 2L, 2L, 2L, 4L, 4L, 4L, 6L, 6L, 6L, 3L, 3L, 3L, 1L,
1L, 1L, 5L, 5L), .Label = c("Moisture", "Protein", "Lipid", "Ash",
"NFE", "Energy"), class = "factor"), Temperature = c("30", "20",
"25", "30", "20", "25", "30", "20", "25", "30", "20", "25", "30",
"20", "25", "30", "20", "25", "30", "20", "25"), letters = c("a",
"a", "a", "a", "ab", "b", "a", "a", "a", "a", "a", "a", "a",
"b", "a", "ab", "a", "a", "a", "a", "a"), Mean = c(808.708818349727,
53.2880626188374, 45.3857220182952, 84.5611267892406, 118.536080769588,
127.298486932385, 84.5517498179938, 16.9534468121571, 14.5497954869813,
13.3850951354759, 4931.18857123979, 4187.27213494545, 4314.61171127083,
26.4436265667305, 19.9092762683653, 22.2756088142943, 785.630624024365,
818.698598619779, 815.320702070777, 60.3435858953567, 43.0297881562102
)), .Names = c("Stage", "variable", "Temperature", "letters",
"Mean"), row.names = 30:50, class = "data.frame")
Answer 1: Group the data, count the distinct values in each group with n_distinct, and replace them with '' when there is only one:
library(dplyr)
df %>%
  group_by(Stage, variable) %>%
  mutate(letters = replace(letters, n_distinct(letters) == 1, ''))
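To make the mechanics of the dplyr version concrete, here is a small sketch on made-up data (`toy` and `out` are hypothetical names, not from the thread). It shows how replace() behaves with a length-one logical index, and that mutate() returns a new data frame rather than changing its input.

```r
library(dplyr)

# Hypothetical toy data standing in for the question's df
toy <- data.frame(
  Stage    = c("Embryonated", "Embryonated", "Hatching", "Hatching"),
  variable = c("NFE", "NFE", "Lipid", "Lipid"),
  letters  = c("a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# A length-one TRUE index is recycled, so every element is replaced;
# a length-one FALSE index selects nothing, so the vector is unchanged
all_replaced <- replace(c("a", "a"), TRUE, "")   # "" ""
unchanged    <- replace(c("b", "a"), FALSE, "")  # "b" "a"

# mutate() returns a new data frame; assign the result to keep it
out <- toy %>%
  group_by(Stage, variable) %>%
  mutate(letters = replace(letters, n_distinct(letters) == 1, "")) %>%
  ungroup()

out$letters  # "" "" "b" "a"
toy$letters  # still "a" "a" "b" "a"
```

Without the assignment to `out`, the result is only printed and `toy` keeps its original letters.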
Similar logic also works with data.table:
library(data.table)
setDT(df)
df[, letters := if(uniqueN(letters)==1) '' else letters, by=.(Stage,variable)]
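One difference worth noting: data.table's `:=` updates the table by reference, so no reassignment is needed to see the change. The sketch below runs the same expression on a made-up table (`toy` is a hypothetical name, not the question's data).

```r
library(data.table)

# Hypothetical toy data standing in for the question's df
toy <- data.table(
  Stage    = c("Embryonated", "Embryonated", "Hatching", "Hatching"),
  variable = c("NFE", "NFE", "Lipid", "Lipid"),
  letters  = c("a", "a", "b", "a")
)

# := modifies toy in place, group by group; the single '' is recycled
# across all rows of a group whose letters are identical
toy[, letters := if (uniqueN(letters) == 1) "" else letters,
    by = .(Stage, variable)]

toy$letters  # "" "" "b" "a"
```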
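Comments:
Thanks, that's great. But only the second option works for me?
@J.Con - I just tested the dplyr code again on the example df you provided and it works fine for me. Does the code throw an error?
No, it just produces df as before, with all the letters still there. Strange.
@J.Con - Did you overwrite the original df with the new result, i.e. df <- df %>% ... and so on?
Do you have another package masking mutate from dplyr? I have run into problems where, for example, plyr also has a mutate, which can lead to unexpected results.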
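If masking is suspected, one way to rule it out is to qualify each call with its namespace, so that it does not matter which package was attached last. This is a sketch on made-up data (`toy` and `out` are hypothetical names), not a diagnosis of the asker's session.

```r
library(dplyr)

# Hypothetical toy data standing in for the question's df
toy <- data.frame(
  Stage    = c("Embryonated", "Embryonated", "Hatching", "Hatching"),
  variable = c("NFE", "NFE", "Lipid", "Lipid"),
  letters  = c("a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# List the attached packages that define mutate, in search-path order;
# the first entry is the one a bare mutate() call would use
find("mutate")

# Namespace-qualified calls always use dplyr's versions
out <- toy %>%
  dplyr::group_by(Stage, variable) %>%
  dplyr::mutate(letters = replace(letters, dplyr::n_distinct(letters) == 1, "")) %>%
  dplyr::ungroup()

out$letters
```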