How to use unnest_token on twitter text data?
[Posted] 2020-04-11 17:13:22
[Question] I am trying to run the following code, but it gives me an error message.
data <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- as.data.frame(data)
remove_reg <- "&amp;|&lt;|&gt;"
tidy_data <- data_df %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "data_df") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))
It gives me the following error message:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : argument `str` should be a character vector (or an object coercible to)
How can I fix this?
[Answer 1] The main problem is that you named the text column `data` but later refer to it as `text`. Try something like this:
library(tidyverse)
library(tidytext)
text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)
remove_reg <- "&amp;|&lt;|&gt;"
data_df %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#> word
#> <chr>
#> 1 said
#> 2 cant
#> 3 lil
#> 4 dance
#> 5 party
#> 6 stuck
#> 7 quarantine
#> 8 happy
#> 9 friday
#> 10 cousins
#> # … with 95 more rows
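To see where the original error likely comes from, it can help to check the column names before tokenizing. The sketch below uses base R only and two made-up sample strings; `fixed_df` is a hypothetical name for illustration. Because the data frame built with as.data.frame(data) has no `text` column, `text` inside mutate() presumably falls back to an object of that name in the search path (typically the graphics::text function), which would explain why stringr complains that its argument is not a character vector. Note also that the question passes token = "data_df", which is not a tokenizer name; the answer's code simply drops that argument.

```r
# Made-up sample strings standing in for the tweets in the question
data <- c("Happy Friday #everydayafrica", "Peace Everywhere #peace #amani")

# as.data.frame() names the single column after the variable: "data"
data_df <- as.data.frame(data, stringsAsFactors = FALSE)
print(names(data_df))

# There is no `text` column, so `text` refers to whatever else is visible;
# with the default packages attached this is usually graphics::text
print(is.function(get("text")))

# Naming the column explicitly avoids the mismatch (hypothetical name fixed_df)
fixed_df <- data.frame(text = data, stringsAsFactors = FALSE)
print(names(fixed_df))
```

Either renaming the column to `text`, as the answer does, or referring to `data` consistently in mutate() and unnest_tokens() should remove the mismatch.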
If you are specifically interested in Twitter data, consider using token = "tweets":
data_df %>%
  unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#> word
#> <chr>
#> 1 who
#> 2 said
#> 3 we
#> 4 cant
#> 5 have
#> 6 a
#> 7 lil
#> 8 dance
#> 9 party
#> 10 while
#> # … with 111 more rows
The token = "tweets" option handles hashtags and usernames nicely.
Created on 2020-04-12 by the reprex package (v0.3.0)
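Whether token = "tweets" is available depends on the installed package versions; to my knowledge, newer tidytext releases (0.4.0 and later) no longer include that tokenizer. A rough alternative, sketched here under that assumption, is to split with token = "regex" on a custom pattern that treats `#` and `@` as word characters. The sample strings, `split_pattern`, and `hashtag_words` are hypothetical names for illustration, and the pattern is a simplification rather than a full replacement for the tweet tokenizer (it does not handle URLs, for example).

```r
library(dplyr)
library(tidytext)

# Made-up sample strings; the real data would be the tweets from the question
tweets_df <- tibble(text = c("Peace Everywhere. #peace #amani @someone",
                             "Stay safe! #covid19"))

# Assumed pattern: split on runs of characters that are not letters,
# digits, underscores, '#', '@' or apostrophes
split_pattern <- "[^A-Za-z0-9_#@']+"

hashtag_words <- tweets_df %>%
  unnest_tokens(word, text, token = "regex", pattern = split_pattern)

print(hashtag_words)
```

With this pattern, hashtags and mentions stay attached to their leading symbol, so they can be filtered afterwards with something like str_detect(word, "^#").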