Math of tm::findAssocs: how does this function work?

Posted 2023-02-19

【Asked】: 2012-12-25 09:18:43

【Question】:

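I have been using findAssocs() for text mining (the tm package), but I realized that something seems off with my dataset.

My dataset is 1500 open-ended answers saved in one column of a csv file. So I load the dataset like this and use the typical tm_map calls to turn it into a corpus.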

library(tm)
Q29 <- read.csv("favoritegame2.csv")
corpus <- Corpus(VectorSource(Q29$Q29))
# tm >= 0.6 needs content_transformer() around base functions such as tolower
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

findAssocs(dtm, "like", .2)
> cousin  fill  ....
  0.28    0.20

Q1. When I find the terms associated with like, I don't see like = 1 as part of the output. However,

dtm.df <-as.data.frame(inspect(dtm))

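This data frame consists of 1500 obs. of 1689 variables. (Or is it because the data is saved in a row of the csv file?)

Q2. Even though cousin and fill each showed up once when the target term like showed up once, their scores are different. Shouldn't they be the same?

I am trying to find the math behind findAssocs(), but have had no success so far. Any advice is highly appreciated!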

【Comments】:

There is no "text mining" package on CRAN. Please include the library() or require() call you used.

@Dwin - it seems to be in package 'tm' - inside-r.org/packages/cran/tm/docs/findAssocs

@thelatemail - thanks for the edit!

【Answer 1】:

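I don't think anyone has answered your final question.

"I am trying to find the math behind findAssocs(), but have had no success so far. Any advice is highly appreciated!"

The math of findAssocs() is based on the standard function cor() in R's stats package. Given two numeric vectors, cor() computes their covariance divided by the product of their two standard deviations.

So, given a DocumentTermMatrix dtm containing the terms "word1" and "word2", if findAssocs(dtm, "word1", 0) returns "word2" with value x, then the correlation between the term vectors of "word1" and "word2" is x (rounded to two decimal places).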
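To make that definition concrete, here is a minimal sketch in base R (tm is not needed). The vectors x and y are made-up term counts across six documents, used only for illustration.

```r
# Hypothetical counts of two terms across six documents
x <- c(0, 1, 1, 1, 1, 1)
y <- c(0, 0, 1, 1, 1, 1)

# Pearson correlation by hand: covariance divided by the
# product of the two standard deviations
manual <- cov(x, y) / (sd(x) * sd(y))

# The same quantity from stats::cor()
builtin <- cor(x, y)

print(c(manual = manual, builtin = builtin))
```

Both numbers should agree up to floating-point error, which is all that findAssocs() relies on before it rounds and filters.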

A long-winded example:

> data <-  c("", "word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5") 
> dtm <- DocumentTermMatrix(VCorpus(VectorSource(data)))
> as.matrix(dtm)
    Terms
Docs word1 word2 word3 word4 word5
   1     0     0     0     0     0
   2     1     0     0     0     0
   3     1     1     0     0     0
   4     1     1     1     0     0
   5     1     1     1     1     0
   6     1     1     1     1     1
> findAssocs(dtm, "word1", 0) 
$word1
word2 word3 word4 word5 
 0.63  0.45  0.32  0.20 

> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word2"])
[1] 0.6324555
> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word3"])
[1] 0.4472136

And so on for words 4 and 5.
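The whole findAssocs() result above can be approximated with base R alone. The sketch below rebuilds the count matrix by hand and uses a helper named assoc_sketch, which is a hypothetical name introduced here and not part of tm; it correlates the target column with all other columns, keeps values above the limit, rounds to two decimals, and sorts.

```r
# Plain count matrix with the same values as as.matrix(dtm) above
m <- matrix(
  c(0, 0, 0, 0, 0,
    1, 0, 0, 0, 0,
    1, 1, 0, 0, 0,
    1, 1, 1, 0, 0,
    1, 1, 1, 1, 0,
    1, 1, 1, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(Docs = as.character(1:6), Terms = paste0("word", 1:5))
)

# assoc_sketch is a hypothetical helper, not a tm function
assoc_sketch <- function(m, term, corlimit) {
  is_term <- colnames(m) == term
  # Correlate the target column with every other column;
  # the result is a 1 x k matrix, so take its single row
  r <- cor(m[, is_term], m[, !is_term, drop = FALSE])[1, ]
  # Keep correlations above the limit, round, and sort
  sort(round(r[r > corlimit], 2), decreasing = TRUE)
}

res <- assoc_sketch(m, "word1", 0)
print(res)
```

Working on a dense matrix keeps the sketch short; for a real corpus the matrix would come from as.matrix(dtm) instead of being typed in.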

See also http://r.789695.n4.nabble.com/findAssocs-tt3845751.html#a4637248

【Comments】:

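One caveat I found is that findAssocs requires a correlation limit that is >= 0. The underlying cor can return negative values to indicate the direction of the relationship, but that does not seem to be possible through findAssocs.

【Answer 2】:

By the way, if your term-document matrix is very large, you may want to try this version of findAssocs: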

# u is a term document matrix (transpose of a DTM)
# term is your term
# corlimit is a value -1 to 1

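findAssocsBig <- function(u, term, corlimit) {
  # Correlate every other term (rows of the TDM) with the target term
  suppressWarnings(x.cor <- gamlr::corr(t(u[!u$dimnames$Terms == term, ]),
                                        as.matrix(t(u[u$dimnames$Terms == term, ]))))
  # Keep correlations above the limit, round to two decimals, sort descending
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}

The advantage is that it converts the TDM to a matrix with a different method than tm::findAssocs does. This method uses memory more efficiently, which means you can work with larger TDMs (or DTMs) than tm::findAssocs can handle. Of course, with a large enough TDM/DTM you will get a memory allocation error with this function as well.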

【Comments】:

【Answer 3】:

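Your dtm has 1689 variables because that is the number of unique words in your observations (excluding stop words and numbers). The word "like" probably appears more than once across your 1500 observations, and it is not always accompanied by "cousin" and "fill". Have you counted how many times "like" appears?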
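A minimal sketch of that check in base R follows. The count matrix is invented for illustration and is not the asker's data; with a real DTM, the matrix would come from as.matrix(dtm).

```r
# Hypothetical toy counts: rows are documents, columns are terms
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 0, 0,
    2, 0, 0,
    0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(Docs = as.character(1:5),
                  Terms = c("like", "cousin", "fill"))
)

total_count <- colSums(m)      # total occurrences of each term
doc_freq    <- colSums(m > 0)  # number of documents containing each term

# Both terms co-occur with "like" exactly once, yet their
# correlations with "like" differ because their overall
# distributions across documents differ
r_cousin <- cor(m[, "like"], m[, "cousin"])
r_fill   <- cor(m[, "like"], m[, "fill"])

print(total_count)
print(doc_freq)
print(c(cousin = r_cousin, fill = r_fill))
```

The point of the sketch is that the score depends on every document, including those where "like" appears without the other term and those where the other term appears without "like".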

【Comments】:

【Answer 4】:

 findAssocs
#function (x, term, corlimit) 
#UseMethod("findAssocs", x)
#<environment: namespace:tm>

methods(findAssocs )
#[1] findAssocs.DocumentTermMatrix* findAssocs.matrix*   findAssocs.TermDocumentMatrix*

 getAnywhere(findAssocs.DocumentTermMatrix)
#-------------
A single object matching ‘findAssocs.DocumentTermMatrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
{
    ind <- term == Terms(x)
    suppressWarnings(x.cor <- cor(as.matrix(x[, ind]), as.matrix(x[, 
        !ind])))
    # This is where the self-reference is removed.
    findAssocs(x.cor, term, corlimit)
}
<environment: namespace:tm>
#-------------
 getAnywhere(findAssocs.matrix)
#-------------
A single object matching ‘findAssocs.matrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
sort(round(x[term, which(x[term, ] > corlimit)], 2), decreasing = TRUE)
<environment: namespace:tm>
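The effect of splitting the columns with ind and !ind can be seen in a small base R sketch. The matrix below is made up for illustration; it only mirrors the indexing step of the method shown above and is not tm's own code.

```r
# Hypothetical count matrix with three terms
m <- cbind(like   = c(1, 0, 2, 1),
           cousin = c(1, 0, 0, 0),
           fill   = c(0, 0, 1, 1))

# Full correlation matrix: the diagonal holds each term's
# correlation with itself, which is always 1
full <- cor(m)

# Same split as in the method above: target column vs. all others
ind <- colnames(m) == "like"
without_self <- cor(m[, ind], m[, !ind, drop = FALSE])

print(full["like", "like"])
print(colnames(without_self))
```

Because the target column is excluded before cor() is called, like = 1 never reaches the output, which addresses Q1.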

【Comments】:
