Remove languages other than English from corpus or data frame in R
Posted: 2018-08-26 13:43:28
Question: I am currently performing some text mining on 25,000 YouTube comments, which I collected using the tuber package. I am very new to coding, and with all the different information out there it can be a bit overwhelming at times. So far I have cleaned the corpus I created as follows:
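# Load the packages used below: tm for the corpus functions, textclean for replace_word_elongation
library(tm)
library(textclean)
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))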
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than letters or whitespace
# (note: [:alpha:] also matches non-English letters such as accented characters)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Add extra stopwords
myStopwords <- c(stopwords('english'),"im", "just", "one","youre",
"hes","shes","its","were","theyre","ive","youve","weve","theyve","id")
# Remove stopwords from corpus
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Replace every run of characters other than a-z, A-Z, 0-9 or whitespace with a space.
# Note: this strips accented letters but does not remove non-English words.
corpus <- tm_map(corpus, content_transformer(function(s)
  gsub(pattern = '[^a-zA-Z0-9\\s]+',
       x = s,
       replacement = " ",
       ignore.case = TRUE,
       perl = TRUE)))
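# Replace word elongations using the textclean package by Tyler Rinker.
# Wrapped in content_transformer for consistency with the other custom transformations.
corpus <- tm_map(corpus, content_transformer(replace_word_elongation))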
# Create a data frame from the corpus
corpus_asdataframe <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
# Pre-processing leaves some rows empty; drop rows that are NA or contain no text
# (whitespace-only rows are treated as empty here).
keep <- !is.na(corpus_asdataframe$text) & trimws(corpus_asdataframe$text) != ""
# drop = FALSE keeps the result a data frame with its "text" column
corpus_asdataframe <- corpus_asdataframe[keep, , drop = FALSE]
# Create a corpus from the cleaned data frame
corpus <- Corpus(VectorSource(corpus_asdataframe$text))
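My problem now is that my corpus contains many Spanish or German comments, which I would like to exclude. I thought I could perhaps download an English dictionary and use an inner join to detect English words and remove all other languages. However, I am very new to coding (I am studying business administration and have never had to do anything with computer science), so my skills are not sufficient to apply this idea to my corpus (or data frame). I really hope to find a little help here. I would appreciate it very much! Thank you, and greetings from Germany!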
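The dictionary idea described above can be sketched in base R. This is only a minimal illustration under stated assumptions: `english_words` is a tiny hypothetical word list standing in for a real dictionary, `english_share` is a hypothetical helper, and the 0.5 threshold is arbitrary. Matching tokens with `%in%` plays the role of the inner join.

```r
# Hypothetical mini dictionary; a real English word list would be much larger.
english_words <- c("this", "is", "a", "great", "video", "i", "love", "it",
                   "the", "song", "so", "good")

# Share of a comment's tokens that appear in the dictionary.
english_share <- function(text, dictionary) {
  tokens <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(0)
  mean(tokens %in% dictionary)
}

# Made-up example comments (not from the original data).
comments <- data.frame(
  text = c("This is a great video",
           "Das ist ein tolles Video",
           "Me gusta mucho esta musica"),
  stringsAsFactors = FALSE
)

comments$share <- vapply(comments$text, english_share, numeric(1),
                         dictionary = english_words, USE.NAMES = FALSE)

# Keep comments where at least half of the tokens are dictionary words
# (0.5 is an arbitrary, assumed threshold).
english_only <- comments[comments$share >= 0.5, , drop = FALSE]
```

On real comments this approach is likely to be fragile, because short comments and words shared between languages (such as "video") blur the signal; a dedicated language detector is usually the more robust choice.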
Answer 1:
dftest <- data.frame(
  id = 1:3,
  text = c(
    "Holla this is a spanish word",
    "English online here",
    "Bonjour, comment ça va?"
  ),
  stringsAsFactors = FALSE
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
##   id                         text
## 1  1 Holla this is a spanish word
## 2  2          English online here
Credit: Ken Benoit, "Find in a dfm non-english tokens and remove them"
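To apply the same filter to the question's data, a sketch along these lines may help. It assumes that language detection runs on the raw comments (`comments_final$textOriginal`) before lowercasing and stop-word removal, since a detector generally has more to work with in unprocessed text. `keep_english` and its `detector` argument are hypothetical helpers, not part of cld3. `detect_language()` returns `NA` when it cannot determine a language, so those rows are dropped explicitly.

```r
# Keep only rows whose text is detected as English.
# `detector` defaults to cld3::detect_language (requires the cld3 package);
# it is a parameter only so the filter can be tried with a different detector.
keep_english <- function(df, text_col, detector = cld3::detect_language) {
  lang <- detector(df[[text_col]])
  # Undetermined languages come back as NA; treat them as non-English.
  df[!is.na(lang) & lang == "en", , drop = FALSE]
}

# Usage with the question's data (assumes comments_final and tm are loaded):
# comments_english <- keep_english(comments_final, "textOriginal")
# corpus <- Corpus(VectorSource(comments_english$textOriginal))
```

Filtering the data frame first means the rest of the cleaning pipeline only ever sees comments that were classified as English.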
Comments:
Thank you, Stanislav Ivanov. This is my first post on Stack Overflow. Even though I followed the instructions, it really did look ugly. Thanks for tidying it up. I will try to do better next time.