R Tm package dictionary matching leads to higher frequency than actual words of text

Posted 2023-02-19

【Original post time】: 2021-07-15 18:33:03 【Question】:

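I have been using the code below to load text as a corpus and clean it with the tm package. As a next step, I load a dictionary and clean it as well. Then I match the words in the text against the dictionary to compute a score. However, the matching yields more matches than there are words in the text (for example, a competence score of 1500 when the text contains only 1000 words).

I think this is related to stemming the text and the dictionary, because the number of matches is lower when no stemming is done.

Do you have any idea why this happens?

Thank you very much.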

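R code

Step 1: Store the data as a corpus

# Packages used throughout: tm (corpus handling), here (paths), readxl (read_excel)
library(tm)
library(here)
library(readxl)

# Named data.dir rather than file.path to avoid shadowing base::file.path
data.dir <- file.path(here("Generated Files", "Data Preparation"))
corpus <- Corpus(DirSource(data.dir))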

Step 2: Clean the data

#Removing special characters
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "@")
corpus <- tm_map(corpus, toSpace, "\\|") 

#Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
#Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Remove your own stop words
#specify your stopwords as a character vector
corpus <- tm_map(corpus, removeWords, c("view", "pdf")) 
#Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
#Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
#Text stemming
corpus <- tm_map(corpus, stemDocument)
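#Unique words (caution: unique() is applied to the document content as a whole; it does not deduplicate individual words inside a text)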
corpus <- tm_map(corpus, unique)

Step 3: DTM

dtm <- DocumentTermMatrix(corpus)

Step 4: Load the dictionary

dic.competence <- read_excel(here("Raw Data", "6. Dictionaries", "Brand.xlsx"))
dic.competence <- tolower(dic.competence$COMPETENCE)
dic.competence <- stemDocument(dic.competence)
dic.competence <- unique(dic.competence)

Step 5: Count frequencies

corpus.terms = colnames(dtm)
competence = match(corpus.terms, dic.competence, nomatch=0)
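One thing worth checking at this step: base R's match() returns the position of each term within the dictionary (or the nomatch value), not a 0/1 indicator. A minimal sketch of the difference, using made-up stemmed terms as stand-ins for colnames(dtm) and the dictionary (the values are assumptions for illustration, not the asker's data):

```r
# Toy stand-ins for colnames(dtm) and the stemmed dictionary (hypothetical values)
corpus.terms   <- c("brand", "compet", "reliabl", "market", "success")
dic.competence <- c("intellig", "success", "reliabl", "compet")

# match() gives the index of each corpus term inside the dictionary, 0 if absent
positions <- match(corpus.terms, dic.competence, nomatch = 0)
print(positions)

# %in% gives TRUE/FALSE membership, which is what a count of matches needs
is.hit <- corpus.terms %in% dic.competence
print(is.hit)

# Summing positions adds up dictionary indices; summing is.hit counts matched terms
position.sum <- sum(positions)
hit.count    <- sum(is.hit)
print(c(position.sum = position.sum, hit.count = hit.count))
```

If the dictionary is long, the sum of positions can easily exceed the number of words in a text, which would be consistent with the symptom described in the question.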

Step 6: Compute the score

competence.score = sum(competence) / rowSums(as.matrix(dtm))
competence.score.df = data.frame(scores = competence.score)
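Note that sum(competence) here is a single number shared by every document, while rowSums() is per document. One way to get a per-document share of dictionary words is sketched below with a small hand-made count matrix standing in for as.matrix(dtm); the matrix, document names, and dictionary entries are all hypothetical, and this is only one possible reading of the intended score.

```r
# Toy document-term count matrix standing in for as.matrix(dtm) (hypothetical values)
m <- matrix(c(2, 1, 0, 3,
              0, 4, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1.txt", "doc2.txt"),
                            c("brand", "compet", "reliabl", "market")))

# Hypothetical stemmed dictionary
dic.competence <- c("compet", "reliabl", "success")

# Logical mask of the columns (terms) that appear in the dictionary
hit.cols <- colnames(m) %in% dic.competence

# Per-document count of dictionary words, then share of all words in the document
competence.count <- rowSums(m[, hit.cols, drop = FALSE])
competence.score <- competence.count / rowSums(m)

competence.score.df <- data.frame(scores = competence.score)
print(competence.score.df)
```

Because the numerator is a subset of the counts in the denominator, each score computed this way stays between 0 and 1.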

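【Comments】:

【Answer 1】:

What does competence return when you run that line? I'm not sure how your dictionary is set up, so I can't say for certain what is happening there. I brought in my own random corpus text as the main text and a separate corpus as the dictionary, and your code ran fine. The row names of competence.score.df were the names of the different txt files in my corpus, and the scores were all in the range 0 to 1.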

# this is my 'dictionary' of terms:
tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
                          control = list(removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         removePunctuation = TRUE))

# then I used your programming and it worked as I think you were expecting

# notice what I used here for the dictionary    
(competence = match(colnames(dtm), 
                    Terms(tdm)[1:10], # I only used the first 10 in my test of your code
                    nomatch = 0))

(competence.score = sum(competence)/rowSums(as.matrix(dtm)))
(competence.score.df = data.frame(scores = competence.score))
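A possible reason this test stayed in the 0 to 1 range: match() returns positions within the dictionary, so with only the first 10 terms the sum of positions cannot exceed 55, which is small next to a typical document's word count. The base R sketch below shows how that ceiling grows with dictionary length; the size of 200 is an assumed figure, not taken from the question.

```r
# Largest possible value of sum(match(...)) when each of n dictionary
# entries is matched exactly once: 1 + 2 + ... + n
max.position.sum <- function(n) sum(seq_len(n))

small <- max.position.sum(10)    # dictionary truncated to 10 terms, as in the test above
large <- max.position.sum(200)   # 200 is a hypothetical full dictionary size
print(c(small = small, large = large))
```

With a full-length dictionary the same code could therefore produce scores well above 1, even though the truncated test looks fine.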

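【Comments】:

Dear Kat, thank you very much for your proposed solution. My dictionary is set up as a plain csv with one column of terms. I tried running your suggested solution, but it still gave me a higher competence count, so some double matching is still going on that I can't figure out. I also had not read the dictionary in as a corpus in the first place, so that was a great tip.

Can you provide a sample of the structure of what is in the dictionary csv? Even if it is not the same data, it might lead me or someone else to another idea for solving the problem.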
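For anyone following along, a one-column dictionary of the kind described might look like the sketch below. The column name COMPETENCE comes from the question's code; the terms themselves are invented for illustration and are not the asker's data. The sketch reads the CSV from an inline string and applies the kind of checks (case, whitespace, duplicates, empty entries) that are worth running before matching.

```r
# Hypothetical one-column dictionary CSV (contents invented for illustration)
csv.text <- "COMPETENCE\nReliable\nintelligent\nSuccessful\nreliable\n"

# Read it the way a plain CSV file would be read
dic <- read.csv(text = csv.text, stringsAsFactors = FALSE)
str(dic)

# Normalise case and whitespace, then drop duplicates and empty/missing entries
terms <- unique(tolower(trimws(dic$COMPETENCE)))
terms <- terms[!is.na(terms) & nzchar(terms)]
print(terms)
```

Sharing the output of str() on the real dictionary, plus the first few cleaned terms, would usually be enough for others to reproduce the matching step.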
