String split on a data.table column produces NA

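This is my first question on SO, so please let me know if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table containing test cases. Here is a simplified example: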

texts.dt <- data.table(string = c("one", 
                                  "two words",
                                  "three words here",
                                  "four useless words here", 
                                  "five useless meaningless words here", 
                                  "six useless meaningless words here just",
                                  "seven useless meaningless words here just to",
                                  "eigth useless meaningless words here just to fill",
                                  "nine useless meaningless words here just to fill up",
                                  "ten useless meaningless words here just to fill up space"),
                       word.count = 1:10,
                       stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5))

This returns the data.table we will be working with:

                                                          string word.count stop.at.word
 1:                                                      one          1            0
 2:                                                two words          2            1
 3:                                         three words here          3            2
 4:                                  four useless words here          4            2
 5:                      five useless meaningless words here          5            4
 6:                  six useless meaningless words here just          6            3
 7:             seven useless meaningless words here just to          7            3
 8:        eigth useless meaningless words here just to fill          8            6
 9:      nine useless meaningless words here just to fill up          9            7
10: ten useless meaningless words here just to fill up space         10            5

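In the real application, the values in the stop.at.word column are determined randomly (upper bound = word.count - 1). Also, the strings are not sorted by length, but that should not make a difference.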
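As an illustration only, values respecting that upper bound could be drawn as in the sketch below. Uniform sampling, the seed, and the names strings, wc and stops are assumptions made for the example, not part of the real application.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

strings <- c("one", "two words", "three words here")

# Number of space-separated words in each string
wc <- lengths(strsplit(strings, " ", fixed = TRUE))

# One draw per string from 0 .. (word count - 1)
stops <- vapply(wc, function(n) sample.int(n, 1L) - 1L, integer(1))
```

A one-word string can only ever receive 0 here, which is the case that row 1 of the example table represents.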

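The code should add two columns, input and output, where input contains the substring from word 1 up to word stop.at.word, and output contains the word that follows (a single word), like so: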

> desired_result
                                                      string word.count stop.at.word                                       input output
 1:                                                      one          1            0
 2:                                                two words          2            1                                         two  words
 3:                                         three words here          3            2                                 three words   here
 4:                                  four useless words here          4            2                                four useless  words
 5:                      five useless meaningless words here          5            4              five useless meaningless words   here
 6:                  six useless meaningless words here just          6            3                     six useless meaningless  words
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless  words
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just     to
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to   fill
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here   just

Unfortunately, what I get instead is:

                                                      string word.count stop.at.word input output
 1:                                                      one          1            0             
 2:                                                two words          2            1    NA     NA
 3:                                         three words here          3            2    NA     NA
 4:                                  four useless words here          4            2    NA     NA
 5:                      five useless meaningless words here          5            4    NA     NA
 6:                  six useless meaningless words here just          6            3    NA     NA
 7:             seven useless meaningless words here just to          7            3    NA     NA
 8:        eigth useless meaningless words here just to fill          8            6    NA     NA
 9:      nine useless meaningless words here just to fill up          9            7    NA     NA
10: ten useless meaningless words here just to fill up space         10            5  ten      NA

Note that the results are inconsistent: row 1 gets empty strings, and row 10 returns "ten" as input.

This is the code I am using:

    texts.dt[, c("input", "output") := .(
        substr(string, 
               1, 
               sapply(gregexpr(" ", string),"[", stop.at.word) - 1),
        substr(string, 
               sapply(gregexpr(" ", string),"[", stop.at.word), 
               sapply(gregexpr(" ", string),"[", stop.at.word + 1) - 1)
    )]

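I have run a lot of tests, and the substr calls work fine when I try individual strings in the console, but they fail when applied to the data.table. I suspect I am missing something related to scoping in data.table, but I have not been using this package for long, so I am quite confused.

I would really appreciate some help. Thanks in advance!

Answer

I would probably do it like this: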

texts.dt[stop.at.word > 0, c("input","output") := {
  sp = strsplit(string, " ")
  list( 
    mapply(function(p,n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
    mapply(`[`, sp, stop.at.word+1L)
  )
}]

# partial result
head(texts.dt, 4)

                    string word.count stop.at.word        input output
1:                     one          1            0           NA     NA
2:               two words          2            1          two  words
3:        three words here          3            2  three words   here
4: four useless words here          4            2 four useless  words

Alternatively:

library(stringi)
texts.dt[stop.at.word > 0, c("input","output") := {
  patt = paste0("((\\w+ ){", stop.at.word-1, "}\\w+) (.*)")
  m    = stri_match(string, regex = patt)
  list(m[, 2], m[, 4])
}]
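
As for why the original attempt produces NA, a likely explanation is that sapply(gregexpr(" ", string), "[", stop.at.word) passes the whole stop.at.word column as the index for every string, rather than one value per string. The result is a matrix, which substr then recycles column by column, so positions computed for the first string end up applied to later rows. A minimal sketch with two of the example strings (the names strings, stops, pos, wrong and right are made up for the illustration):

```r
strings <- c("two words", "three words here")
stops   <- c(1, 2)

# Space positions per string: 4 for the first, 6 and 12 for the second
pos <- gregexpr(" ", strings)

# sapply hands the entire stops vector to "[" for each string,
# so every string is indexed with c(1, 2): the result is a 2 x 2 matrix
wrong <- sapply(pos, "[", stops)

# mapply pairs the i-th string with the i-th stop value: one position each
right <- mapply(`[`, pos, stops)
```

Here wrong contains an NA because the first string has only one space, while right holds exactly one space position per string. This is the reason the approaches above rely on mapply or on grouping rather than on sapply with a vector index.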

Another answer

texts.dt[, `:=`(input  = sub(paste0('((\\s*\\w+){', stop.at.word, '}).*'), '\\1', string),
                output = sub(paste0('(\\s*\\w+){', stop.at.word, '}\\s*(\\w+).*'), '\\2', string))
         , by = stop.at.word][]
#                                                      string word.count stop.at.word
# 1:                                                      one          1            0
# 2:                                                two words          2            1
# 3:                                         three words here          3            2
# 4:                                  four useless words here          4            2
# 5:                      five useless meaningless words here          5            4
# 6:                  six useless meaningless words here just          6            3
# 7:             seven useless meaningless words here just to          7            3
# 8:        eigth useless meaningless words here just to fill          8            6
# 9:      nine useless meaningless words here just to fill up          9            7
#10: ten useless meaningless words here just to fill up space         10            5
#                                          input output
# 1:                                                one
# 2:                                         two  words
# 3:                                 three words   here
# 4:                                four useless  words
# 5:              five useless meaningless words   here
# 6:                     six useless meaningless  words
# 7:                   seven useless meaningless  words
# 8:   eigth useless meaningless words here just     to
# 9: nine useless meaningless words here just to   fill
#10:          ten useless meaningless words here   just

I am not sure I understand the logic of output being empty for the first row, but if that is really needed, the trivial fix is left to the OP.

Another answer
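
One possible form of that fix, as a sketch only: overwrite output with an empty string wherever stop.at.word is 0. The small table res below is a hypothetical stand-in for the first two rows of the result shown above.

```r
library(data.table)

# Hypothetical stand-in for the first two rows of the result above
res <- data.table(string       = c("one", "two words"),
                  stop.at.word = c(0, 1),
                  input        = c("", "two"),
                  output       = c("one", "words"))

# Blank out output wherever no input words were requested
res[stop.at.word == 0, output := ""]
```

Rows with a positive stop.at.word are left untouched, since the update is restricted to the subset selected in i.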

An alternative to @Frank's mapply solution is to use by = 1:nrow(texts.dt) together with strsplit and paste:

library(data.table)
texts.dt[, `:=` (input = paste(strsplit(string, ' ')[[1]][1:stop.at.word][stop.at.word>0],
                               collapse = " "),
                 output = strsplit(string, ' ')[[1]][stop.at.word + 1]),
         by = 1:nrow(texts.dt)]

This gives:

> texts.dt
                                                      string word.count stop.at.word                                       input output
 1:                                                      one          1            0                                                one
 2:                                                two words          2            1                                         two  words
 3:                                         three words here          3            2                                 three words   here
 4:                                  four useless words here          4            2                                four useless  words
 5:                      five useless meaningless words here          5            4              five useless meaningless words   here
 6:                  six useless meaningless words here just          6            3                     six useless meaningless  words
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless  words
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just     to
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to   fill
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here   just

Instead of using [[1]], you can also wrap strsplit in unlist, like so: unlist(strsplit(string, ' ')) (instead of strsplit(string, ' ')[[1]]). This will give you the same result.
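
A quick check of that equivalence on a single string (the names s, a and b are made up for the sketch). It relies on there being exactly one string per call, which is what by = 1:nrow(texts.dt) arranges:

```r
s <- "three words here"

a <- strsplit(s, " ")[[1]]     # first (and only) element of the list
b <- unlist(strsplit(s, " "))  # flatten the one-element list

same <- identical(a, b)
```

With several strings in one call the two forms differ: [[1]] keeps only the first string's words, while unlist concatenates the words of all strings.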


Two other options:

1) Using the stringi package:

library(stringi)
texts.dt[, `:=`(input = paste(stri_extract_all_words(string[stop.at.word>0],
                                                     simplify = TRUE)[1:stop.at.word],
                              collapse = " "),
                output = stri_extract_all_words(string[stop.at.word>0],
                                                simplify = TRUE)[stop.at.word+1]),
         1:nrow(texts.dt)]

2) Or an adaptation of this answer:

texts.dt[stop.at.word>0, 
         c('input','output') := tstrsplit(string, 
                                          split = paste0("(?=(?>\\s+\\S*){",
                                                         word.count - stop.at.word,
                                                         "}$)\\s"), 
                                          perl = TRUE)
         ][, output := sub('(\\w+).*','\\1',output)]

Both give:

> texts.dt
                                                      string word.count stop.at.word                                       input output
 1:                                                      one          1            0                                          NA     NA
 2:                                                two words          2            1                                         two  words
 3:                                         three words here          3            2                                 three words   here
 4:                                  four useless words here          4            2                                four useless  words
 5:                      five useless meaningless words here          5            4              five useless meaningless words   here
 6:                  six useless meaningless words here just          6            3                     six useless meaningless  words
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless  words
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just     to
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to   fill
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here   just
