How to use unnest_tokens on Twitter text data?

Posted: 2020-04-11 17:13:22

Question:

I am trying to run the following code, and it gives me an error message.

data <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation.  . . Video:  . . -  #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
      "#Copingwiththelockdown... Festac town, Lagos.  #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
      "Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma  . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- as.data.frame(data)
remove_reg <- "&amp;|&lt;|&gt;"
tidy_data <- data_df %>% 
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "data_df") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

It gives me the following error message:

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : argument `str` should be a character vector (or an object coercible to)

How can I fix this?


Answer 1:

The main problem is that you named your text column data but then refer to it as text. Try something like this:

library(tidyverse)
library(tidytext)

text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation.  . . Video:  . . -  #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
          "#Copingwiththelockdown... Festac town, Lagos.  #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
          "Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma  . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)

remove_reg <- "&amp;|&lt;|&gt;"

data_df %>% 
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#>    word      
#>    <chr>     
#>  1 said      
#>  2 cant      
#>  3 lil       
#>  4 dance     
#>  5 party     
#>  6 stuck     
#>  7 quarantine
#>  8 happy     
#>  9 friday    
#> 10 cousins   
#> # … with 95 more rows
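
The question's original stop-word filtering also works once the column is actually called text and the invalid token = "data_df" argument is dropped. Here is a minimal sketch of that variant (an adaptation, not part of the original answer), assuming tidytext's built-in stop_words data frame:

# Keep the asker's filtering logic, but reference the correct column and
# use the default word tokenizer instead of the invalid token = "data_df"
data_df %>% 
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,                       # drop standard stop words
         !word %in% str_remove_all(stop_words$word, "'"),  # and their apostrophe-free forms
         str_detect(word, "[a-z]"))                        # keep tokens containing letters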

If you are specifically interested in Twitter data, consider using token = "tweets":

data_df %>% 
  unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#>    word 
#>    <chr>
#>  1 who  
#>  2 said 
#>  3 we   
#>  4 cant 
#>  5 have 
#>  6 a    
#>  7 lil  
#>  8 dance
#>  9 party
#> 10 while
#> # … with 111 more rows

Created on 2020-04-12 by the reprex package (v0.3.0)

This option handles hashtags and usernames nicely.
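
As a quick check (a sketch under the same setup as above, not part of the original answer), the preserved "#" prefix makes it easy to pull out and count the hashtags directly:

data_df %>% 
  unnest_tokens(word, text, token = "tweets") %>%  # tweet-aware tokenizer keeps # and @
  filter(str_detect(word, "^#")) %>%               # keep only the hashtags
  count(word, sort = TRUE)                         # most frequent hashtags first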

