unnest_tokens and its error ("")

Posted 2023-04-12

【Title】: unnest_tokens and its error("") 【Posted】: 2017-07-20 16:37:54 【Question】:

I am using tidytext. When I call unnest_tokens, R returns the error

Please supply column name

How can I fix this error?

library(tidytext)
library(tm)
library(dplyr)
library(stats)
library(base)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
  #Build a corpus: a collection of statements
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
f <-Corpus(DirSource("C:/Users/Boon/Desktop/Dissertation/F"))
doc_dir <- "C:/Users/Boon/Desktop/Dis/F/f.csv"
doc <- read.csv(file_loc, header = TRUE)
docs<- Corpus(DataframeSource(doc))
dtm <- DocumentTermMatrix(docs)
text_df<-data_frame(line=1:115,docs=docs)

#This is the output from the code above,which is fine!: 
# text_df
# A tibble: 115 x 2
#line          docs
#<int> <S3: VCorpus>
# 1      1 <S3: VCorpus>
#2      2 <S3: VCorpus>
#3      3 <S3: VCorpus>
#4      4 <S3: VCorpus>
#5      5 <S3: VCorpus>
#6      6 <S3: VCorpus>
#7      7 <S3: VCorpus>
#8      8 <S3: VCorpus>
#9      9 <S3: VCorpus>
#10    10 <S3: VCorpus>
# ... with 105 more rows

unnest_tokens(word, docs)

# Error: Please supply column name

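【Comments】:

***.com/help/mcve

You need to pass the data as the first argument, like this: unnest_tokens(tbl = text_df, output = words, input = docs)

Dear Nate, thank you very much for your help. It seems to work. However, it produces the following error: Error in unnest_tokens_(tbl, output_col, input_col, token = token, to_lower = to_lower, : unnest_tokens expects all columns of input to be atomic vectors (not lists)

This happens because your tibble holds a corpus in the docs column, so unnest_tokens treats that column as a list. As the error message says, your docs column needs to be an atomic vector.

【Answer 1】: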

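If you want to convert your text data to a tidy format, you do not need to transform it into a corpus, a document-term matrix, or anything else first. This is one of the main ideas behind using a tidy data format for text; you do not use those other formats unless you need them for modeling.

You just put the raw text in a data frame and then use unnest_tokens() to tidy it. (I am making some assumptions here about what your CSV looks like; it would be more helpful to post a reproducible example next time.)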

library(dplyr)

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
docs <- tibble(line = 1:4,
               document = c("This is an excellent document.",
                            "Wow, what a great set of words!",
                            "Once upon a time...",
                            "Happy birthday!"))

docs
#> # A tibble: 4 x 2
#>    line                        document
#>   <int>                           <chr>
#> 1     1  This is an excellent document.
#> 2     2 Wow, what a great set of words!
#> 3     3             Once upon a time...
#> 4     4                 Happy birthday!

library(tidytext)

docs %>%
    unnest_tokens(word, document)
#> # A tibble: 18 x 2
#>     line      word
#>    <int>     <chr>
#>  1     1      this
#>  2     1        is
#>  3     1        an
#>  4     1 excellent
#>  5     1  document
#>  6     2       wow
#>  7     2      what
#>  8     2         a
#>  9     2     great
#> 10     2       set
#> 11     2        of
#> 12     2     words
#> 13     3      once
#> 14     3      upon
#> 15     3         a
#> 16     3      time
#> 17     4     happy
#> 18     4  birthday
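
To connect this back to the CSV in the question, here is a minimal sketch of reading the file straight into a data frame and tokenizing it. The question does not show the file, so the column name `text` and the inline CSV content are assumptions made for illustration. Passing `stringsAsFactors = FALSE` keeps the column a plain character vector, which is the kind of atomic input the error message in the comments asks for.

```r
library(dplyr)
library(tidytext)

# Hypothetical CSV content; the real f.csv from the question is not shown.
csv_text <- "text
This is an excellent document.
Happy birthday!"

# read.csv() accepts inline content via `text =`;
# with a real file, pass the file path as the first argument instead.
doc <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# The input column must be an atomic character vector, not a list or a corpus.
stopifnot(is.character(doc$text))

tidy_words <- doc %>%
  mutate(line = row_number()) %>%   # keep track of the source row
  unnest_tokens(word, text)         # data comes first, via the pipe

tidy_words
```

Because the data frame arrives through the pipe as the first argument, the call no longer fails with "Please supply column name".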

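【Comments】:

If you really do already have your data in a document-term matrix (for example, from tm), then what you want to do is tidy() it rather than use unnest_tokens().

Thank you very much, Julia :)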
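
The tidy() route mentioned in the comments could look roughly like the sketch below. The two sample sentences are made up for illustration, and the matrix is built with tm's default settings; the real data in the question may need different preprocessing.

```r
library(tm)
library(tidytext)

# Made-up sample documents standing in for the question's data.
texts <- c("This is an excellent document.", "Happy birthday!")

# Build a corpus and a document-term matrix with tm, as in the question.
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# tidy() turns the matrix into a one-row-per-document-per-term table
# with the columns document, term, and count.
tidy_dtm <- tidy(dtm)
tidy_dtm
```

The result is already in tidy form, so unnest_tokens() is not needed at this stage.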
