Extract elements from a file into a list in R

Posted 2023-02-14


【Title】: Extract elements from a file into a list in R 【Posted】: 2022-01-19 21:46:40 【Question】:

I have a file such as:

>Nscaffold_033778.1_22 [24674 - 24880] some information  
ACCATTAAGAGAGAAAAGAGAGGAGAGAGAGAGAGGAGAGAGAGAGAGAGGagAGGAAGA
AGGAGAGAGA
>NC_0337652.1_23 [26291 - 26443] some other informations boring
MMDOODODODODNJBCIOICICVOCICVCPCCM
>contig_033652.1_24 [25507 - 26529] species with informations 
AJGSIVPDYVPDYVDPYVDPYDVDYVPVYVIYVDPIDVYPDIVYDPIYVDPIDVYPDIVP
PUDVPIYDVPDIVPDIDVPDVPDIVDPVDIVPDIVPDIVDPIDVIDVPDDIVPDDVPDVD
DDGGDDGDDIDIDDFDUDUDTTUDDUCDUDCDCC

I would like to extract only the following information, as a list:

[[1]]
[1] "Nscaffold_033778.1_22" "24674"        "24880"       

[[2]]
[1] "NC_0337652.1_23" "26291"        "26443"       

[[3]]
[1] "contig_033652.1_24" "25507"         "26529"

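Does anyone have an idea?

The first element of each list entry is the part after the ">" symbol, the second element is the first number inside the [], and the third element is the second number inside the [].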

【Comments】:

What is the file/data format? What have you tried that did not work?

【Answer 1】:

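We can read the file with readLines, keep the header lines with grep, extract the relevant fields with sub, and split the result with strsplit: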

strsplit(sub("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\].*", "\\1,\\2,\\3", 
    grep(">", lines, value = TRUE)), ",")

Output:

[[1]]
[1] "Nscaffold_033778.1_22" "24674"                 "24880"                

[[2]]
[1] "NC_0337652.1_23" "26291"           "26443"          

[[3]]
[1] "contig_033652.1_24" "25507"              "26529"

Data:

lines <- readLines('file.txt')
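
The one-liner above can be unpacked into three steps. The following is a minimal sketch, assuming the sample headers are inlined in place of file.txt; the intermediate names headers, fields, and result are illustrative and not from the answer, and grep is anchored with "^>" here as a small tightening of the original ">".

```r
# Sample input inlined so the sketch runs without file.txt (assumption)
lines <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  "ACCATTAAGAGAGAAAAGAGAGG",
  ">NC_0337652.1_23 [26291 - 26443] some other informations boring",
  "MMDOODODODODNJBCIOICICVOCICVCPCCM"
)

# Step 1: keep only the header lines (those starting with ">")
headers <- grep("^>", lines, value = TRUE)

# Step 2: rewrite each header as "id,start,end" using three capture groups
fields <- sub("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\].*", "\\1,\\2,\\3", headers)

# Step 3: split on the comma to get one character vector per header
result <- strsplit(fields, ",")
print(result)
```

Because the comma is used as a temporary separator, this approach presumes that the identifiers themselves never contain a comma.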

【Answer 2】:

If vec holds the file contents,

vec <- readLines(...)

then

strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
           vec[grepl("^>", vec)],
           list(x="",y="",z=""))
#                        x     y     z
# 1 Nscaffold_033778.1_22  24674 24880
# 2       NC_0337652.1_23  26291 26443
# 3    contig_033652.1_24  25507 26529

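I know this is not strictly the requested format. I offer it as an alternative because it keeps everything readily accessible in one data frame. Also, if you intend to convert columns y and z to integers, strcapture can do that for you: replace the third argument (proto=) with list(x="", y=1L, z=1L), as in the second call here: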

str(
  strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
             vec[grepl("^>", vec)],
             list(x="",y="",z=""))
)
# 'data.frame': 3 obs. of  3 variables:
#  $ x: chr  "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
#  $ y: chr  "24674" "26291" "25507"
#  $ z: chr  "24880" "26443" "26529"

str(
  strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
             vec[grepl("^>", vec)],
             list(x="",y=1L,z=1L))
)
# 'data.frame': 3 obs. of  3 variables:
#  $ x: chr  "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
#  $ y: int  24674 26291 25507
#  $ z: int  24880 26443 26529
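
The str() output above shows a trailing space in column x, because the greedy "(.*)" also captures the space before the bracket. The following is a sketch of one possible adjustment, not part of the original answer: it assumes identifiers contain no whitespace, swaps "(.*)" for "(\\S+)", and then turns each row into a character vector to get the list layout the question asked for. The names df and as_list are illustrative.

```r
# Sample headers inlined so the sketch is self-contained (assumption)
vec <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  ">NC_0337652.1_23 [26291 - 26443] some other informations boring"
)

# "\\S+" stops at the first whitespace, so the id carries no trailing space
df <- strcapture("^>(\\S+) *\\[(\\d+)\\D*(\\d+).*",
                 vec[grepl("^>", vec)],
                 list(x = "", y = "", z = ""))

# One unnamed character vector per row, matching the layout asked for
as_list <- lapply(seq_len(nrow(df)), function(i) unname(unlist(df[i, ])))
print(as_list)
```

Calling trimws(df$x) on the original result would be another way to drop the trailing space without changing the pattern.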

【Answer 3】:

You can use unglue:

data <- c(">Nscaffold_033778.1_22 [24674 - 24880] some information",
  ">NC_0337652.1_23 [26291 - 26443] some other information",
  ">contig_033652.1_24 [25507 - 26529] species with informations")

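# {id}, {n1}, {n2} are named captures; {=.*} matches the trailing text without extracting it
unglue::unglue_data(data, ">{id} [{n1} - {n2}]{=.*}")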
#>                      id    n1    n2
#> 1 Nscaffold_033778.1_22 24674 24880
#> 2       NC_0337652.1_23 26291 26443
#> 3    contig_033652.1_24 25507 26529

Created on 2021-12-17 by the reprex package (v2.0.1)
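
If installing unglue is not an option, a similar field extraction can be sketched in base R with regexec and regmatches. This is a swapped-in technique rather than anything from the answer above, and it assumes the headers always follow the ">id [n1 - n2]" shape with single spaces; the names m and parts are illustrative.

```r
# Sample headers inlined so the sketch is self-contained (assumption)
data <- c(">Nscaffold_033778.1_22 [24674 - 24880] some information",
          ">NC_0337652.1_23 [26291 - 26443] some other information")

# regexec returns the positions of the full match and of each capture group
m <- regexec("^>(\\S+) \\[(\\d+) - (\\d+)\\]", data)

# regmatches extracts them; drop element 1 (the full match) to keep the groups
parts <- lapply(regmatches(data, m), function(p) p[-1])
print(parts)
```

This yields a list of character vectors directly, which is the layout requested in the question.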
