Reshaping a dataframe in R by sorting just some fields in a row alphabetically
Posted: 2022-01-05 14:50:48
Question: I have some large data frames in RStudio with this structure:
Original data structure
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("GAT", "TGC", "AGC", "T"), ALT_2 = c("CAG",
"TGA", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("AGT",
NA, NA, NA), AO_1 = c(13, 4, 67, 120), AO_2 = c(12, 5, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(101, NA, NA, NA), AOF_1 = c(8.55263157894737,
17.3913043478261, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
21.7391304347826, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(66.4473684210526, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
But for my analysis I need it to look like this:
Desired output
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("AGT", "TGA", "AGC", "T"), ALT_2 = c("CAG",
"TGC", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("GAT",
NA, NA, NA), AO_1 = c(101, 5, 67, 120), AO_2 = c(12, 4, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(13, NA, NA, NA), AOF_1 = c(66.4473684210526,
21.7391304347826, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
17.3913043478261, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(8.55263157894737, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
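So what I want to do is rearrange the contents of each row so that the columns ALT_1, ALT_2, ALT_3, ALT_4 are sorted alphabetically, while also rearranging the corresponding AO and AOF columns so that the values still match. (The value of AO_1 should still match the sequence in ALT_1, so if ALT_1 becomes ALT_2 in the sorted data frame, AO_1 should also become AO_2.)
What I have tried so far, without success:
Pasting the values of ALT_1, AO_1, AOF_1 together into one field, so that they stay together: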
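To make the requirement concrete, here is a minimal base R sketch for a single row (row 1 of the example data, typed out by hand with rounded AOF values). The vector names `alt`, `ao`, `aof` and `idx` are only illustrative. The idea is that `order()` yields one permutation, and applying that same permutation to all three vectors keeps the values paired.

```r
# Row 1 of the example data, written out by hand (AOF rounded for brevity).
alt <- c("GAT", "CAG", "G", "AGT")
ao  <- c(13, 12, 6, 101)
aof <- c(8.55, 7.89, 3.95, 66.45)

# order() returns the permutation that sorts alt; NAs go last by default.
# method = "radix" sorts in the C locale, independent of the session locale.
idx <- order(alt, method = "radix")

alt[idx]  # "AGT" "CAG" "G" "GAT"
ao[idx]   # 101 12 6 13
aof[idx]  # 66.45 7.89 3.95 8.55
```

The reordered values match row 1 of the desired output; the open question is how to do this for every row of a large data frame.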
if (is.na(X[i,6]) == FALSE)
X[i,6] <- paste(X[i,6],X[i,10],X[i,14],sep=" ")
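Then I wanted to extract each row as a vector, sort the values and put them back into the data frame, but I did not manage to do that.
So the question is: how can I reorder the data frame to get the desired output? (I need to apply this to 32 data frames, each with more than 100,000 values.)
Comments:
Please do not share data as images; use dput() instead. Thanks.
Answer 1: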
Here is a data.table approach:
library(data.table)
# Set to data.table format
setDT(mydata)
# Melt to long format
DT.melt <- melt(mydata, measure.vars = patterns(ALT = "^ALT_", AO = "^AO_", AOF = "^AOF_"))
# order by groups, na's at the end
setorderv(DT.melt, cols = c("CHROM", "POS", "ALT"), na.last = TRUE)
# cast to wide again, use rowid() for numbering
dcast(DT.melt, CHROM + POS + REF + RO ~ rowid(REF), value.var = list("ALT", "AO", "AOF"))
# CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
# 1: scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.44737 7.894737 3.947368 8.552632
# 2: scaffold1000|size223437 1332 TA 14 TGA TGC <NA> <NA> 5 4 NA NA 21.73913 17.391304 NA NA
# 3: scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA <NA> 67 34 18 NA 52.34375 26.562500 14.062500 NA
# 4: scaffold1000|size223437 4336 GCTA 25 T <NA> <NA> <NA> 120 NA NA NA 82.75862 NA NA NA
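Comments:
Thank you very much. I understand the first three steps, but what exactly does rowid(REF) do in the last step (casting to wide again)? You wrote that it is for numbering, but if I just run rowid(REF), the output is 1 1 1 1?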
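rowid(REF) creates row numbers using REF as the grouping column. So the first row in each group gets 1, the second gets 2, and so on. This number is used as the suffix of the value columns. See ?rowid() and the output of rowid(DT.melt$REF) for more information.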
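A small sketch of this behaviour, using a hand-made vector `ref` (an illustrative name) that mimics the REF column of the melted table, where each REF value occurs four times:

```r
library(data.table)

# REF as it looks after melting: every original row is repeated once per ALT slot.
ref <- c("A", "A", "A", "A", "TA", "TA", "TA", "TA")

# rowid() counts occurrences within each distinct value.
rowid(ref)
# 1 2 3 4 1 2 3 4

# On the original wide table every REF value occurs only once,
# which is why running rowid(REF) there gives all ones.
rowid(c("A", "TA", "CTTGA", "GCTA"))
# 1 1 1 1
```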
Perhaps more readable: replace the last line (the one with dcast) with the following two lines: DT.melt[, number := rowid(REF)]; dcast(DT.melt, CHROM + POS + REF + RO ~ number, value.var = list("ALT", "AO", "AOF"))
Answer 2:
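One caveat that may matter on the full data sets: because rowid(REF) numbers within each REF value, the numbering keeps counting if the same REF (for example "A") appears at more than one position. The sketch below is a variation on the answer, not the answer's own code: it numbers within CHROM and POS instead, and wraps the steps in a hypothetical helper `sort_alts()` so it can be applied to a list of data frames. It assumes that CHROM and POS together identify a row; the toy data frame `df` is made up for illustration.

```r
library(data.table)

# Hypothetical helper (not from the answer): sort ALT_* within each row and
# carry the matching AO_* / AOF_* values along.
sort_alts <- function(df) {
  dt <- as.data.table(df)
  # Melt the three column families in parallel into ALT / AO / AOF.
  long <- melt(
    dt,
    measure.vars = patterns("^ALT_", "^AO_", "^AOF_"),
    value.name   = c("ALT", "AO", "AOF")
  )
  # Sort ALT within each position, NAs last.
  setorderv(long, c("CHROM", "POS", "ALT"), na.last = TRUE)
  # Number within CHROM + POS (assumed unique per row), not within REF.
  long[, number := rowid(CHROM, POS)]
  wide <- dcast(long, CHROM + POS + REF + RO ~ number,
                value.var = c("ALT", "AO", "AOF"))
  as.data.frame(wide)
}

# Toy data: the same REF ("A") occurs at two different positions.
df <- data.frame(
  CHROM = c("s1", "s1"), POS = c(10, 20), REF = c("A", "A"), RO = c(5, 7),
  ALT_1 = c("T", "G"),   ALT_2 = c("C", NA),
  AO_1  = c(1, 3),       AO_2  = c(2, NA),
  AOF_1 = c(10, 30),     AOF_2 = c(20, NA)
)

res <- sort_alts(df)
res$ALT_1  # "C" "G"
res$AO_1   # 2 3

# Applying it to several data frames held in a list:
all_res <- lapply(list(first = df, second = df), sort_alts)
```

Note that dcast returns the rows sorted by the left-hand side columns (CHROM, POS, REF, RO), so the row order can differ from the input if the input was not already sorted that way.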
Here is a dplyr solution. It took me some time, and I needed some help from the question "pivot_wider dissolves arrange":
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = row_number()) %>%
unite("conc1", c(ALT_1, AO_1, AOF_1), sep = "_") %>%
unite("conc2", c(ALT_2, AO_2, AOF_2), sep = "_") %>%
unite("conc3", c(ALT_3, AO_3, AOF_3), sep = "_") %>%
unite("conc4", c(ALT_4, AO_4, AOF_4), sep = "_") %>%
pivot_longer(
starts_with("conc")
) %>%
mutate(value = ifelse(value=="NA_NA_NA", NA_character_, value)) %>%
group_by(id) %>%
mutate(value = sort(value, na.last = TRUE)) %>%
ungroup() %>%
pivot_wider(
names_from = name,
values_from = value,
values_fill = "0"
) %>%
separate(conc1, c("ALT_1", "AO_1", "AOF_1"), sep = "_") %>%
separate(conc2, c("ALT_2", "AO_2", "AOF_2"), sep = "_") %>%
separate(conc3, c("ALT_3", "AO_3", "AOF_3"), sep = "_") %>%
separate(conc4, c("ALT_4", "AO_4", "AOF_4"), sep = "_") %>%
select(CHROM, POS, REF, RO, starts_with("ALT"), starts_with("AO_"), starts_with("AOF_")) %>%
type.convert(as.is=TRUE)
CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
<chr> <int> <chr> <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.4 7.89 3.95 8.55
2 scaffold1000|size223437 1332 TA 14 TGA TGC NA NA 5 4 NA NA 21.7 17.4 NA NA
3 scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA NA 67 34 18 NA 52.3 26.6 14.1 NA
4 scaffold1000|size223437 4336 GCTA 25 T NA NA NA 120 NA NA NA 82.8 NA NA NA
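A possibly shorter tidyverse variant, offered only as a sketch and not part of the answer above: `pivot_longer()` with the special `".value"` name splits `ALT_1`, `AO_1`, `AOF_1` into three typed columns directly, which avoids uniting and separating strings. It assumes every measure column has the form prefix, underscore, number; the toy data frame `toy` and the helper column `id` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data with the same column layout as the question (two ALT slots only).
toy <- data.frame(
  CHROM = c("s1", "s1"), POS = c(10, 20), REF = c("A", "A"), RO = c(5, 7),
  ALT_1 = c("T", "G"),   ALT_2 = c("C", NA),
  AO_1  = c(1, 3),       AO_2  = c(2, NA),
  AOF_1 = c(10, 30),     AOF_2 = c(20, NA)
)

res <- toy %>%
  mutate(id = row_number()) %>%                 # remember the original row
  pivot_longer(
    matches("^(ALT|AO|AOF)_\\d+$"),
    names_to  = c(".value", "idx"),             # prefix -> column, suffix -> idx
    names_sep = "_"
  ) %>%
  arrange(id, ALT) %>%                          # sort ALT within each row, NAs last
  group_by(id) %>%
  mutate(idx = row_number()) %>%                # renumber the slots after sorting
  ungroup() %>%
  pivot_wider(names_from = idx, values_from = c(ALT, AO, AOF)) %>%
  select(-id)

res$ALT_1  # "C" "G"
res$AO_1   # 2 3
```

Because AO and AOF never pass through a character string here, they keep their numeric type and no `type.convert()` step is needed.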
Comments:
Thank you very much, it works perfectly and I understand the steps!