Using ngrams with hash_vectorizer in text2vec
I was trying to create ngrams with the hash_vectorizer function in text2vec when I noticed that changing the ngram values does not change the dimensions of my dtm.
h_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L))
dtm_train = create_dtm(it_train, h_vectorizer)
dim(dtm_train)
In the code above, the dimensions do not change, whether I use 2-10 or 9-10.
vocab = create_vocabulary(it_train, ngram = c(1L, 4L))
ngram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, ngram_vectorizer)
In the code above, the dimensions do change, but I also want to use hash_vectorizer because it saves space. How can I do that?
When using hashing, you preset the size of the output matrix in advance; you did so by setting hash_size = 2 ^ 14. This size is independent of the ngram window specified in the model. What does change, however, are the counts inside the output matrix.
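You can check this directly (a minimal sketch; the strings and the 2 ^ 14 hash size are just illustrative values): whatever ngram window you pass, the number of columns is fixed by hash_size, while the number of non-empty columns reflects the ngrams actually produced.

```r
library(text2vec)

txt <- c("a b c d", "b c d e")
it <- itoken(txt, progressbar = FALSE)

# the column count is fixed by hash_size, regardless of the ngram window
dtm_uni  <- create_dtm(it, hash_vectorizer(hash_size = 2 ^ 14, ngram = c(1L, 1L)))
dtm_wide <- create_dtm(it, hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L)))

dim(dtm_uni)   # 2 x 16384
dim(dtm_wide)  # 2 x 16384

# the counts differ: count the non-empty columns to see the effect of the ngram window
sum(Matrix::colSums(dtm_uni) > 0)
sum(Matrix::colSums(dtm_wide) > 0)
```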
(In response to the comment below:) Below you find a minimal example with two very simple strings that demonstrates the different output of hash_vectorizer for two different ngram windows. For the bigrams case I have added the output matrix of vocab_vectorizer for comparison. You will notice that you have to set a hash size large enough to account for all terms; if it is too small, the hash values of individual terms may collide.
Your comment that you would always have to compare the output of the vocab_vectorizer approach and the hash_vectorizer approach leads in the wrong direction, because you would then lose the efficiency/memory advantage that the hashing approach can produce by avoiding the generation of a vocabulary. Depending on your data and the desired output, hashing may trade the interpretability of the terms in the dtm against efficiency. Hence, whether hashing is reasonable depends on your use case (it is especially useful for document-level classification tasks on large collections).
I hope this gives you a rough idea about hashing and what you can or cannot expect from it. You might also check some posts on hashing on Quora or Wikipedia (or here). Or also see the detailed original sources listed at text2vec.org.
library(text2vec)
txt <- c("a string string", "and another string")
it = itoken(txt, progressbar = F)
#the following four examples demonstrate the effect of the size of the hash
#and the use of signed hashes (i.e. the use of a secondary hash function to reduce risk of collisions)
vectorizer_small = hash_vectorizer(2 ^ 2, c(1L, 1L)) #unigrams only
hash_dtm_small = create_dtm(it, vectorizer_small)
as.matrix(hash_dtm_small)
# [,1] [,2] [,3] [,4]
# 1 2 0 0 1
# 2 1 2 0 0 #collision of the hash values of and / another
vectorizer_small_signed = hash_vectorizer(2 ^ 2, c(1L, 1L), signed_hash = TRUE) #unigrams only
hash_dtm_small_signed = create_dtm(it, vectorizer_small_signed)
as.matrix(hash_dtm_small_signed)
# [,1] [,2] [,3] [,4]
# 1 2 0 0 1
# 2    1    0    0    0 #the colliding terms (and / another) receive opposite signs and cancel to zero
vectorizer_medium = hash_vectorizer(2 ^ 3, c(1L, 1L)) #unigrams only
hash_dtm_medium = create_dtm(it, vectorizer_medium)
as.matrix(hash_dtm_medium)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# 1 0 0 0 1 2 0 0 0
# 2 0 1 0 0 1 1 0 0 #no collision, all terms represented by hash values
vectorizer_medium_signed = hash_vectorizer(2 ^ 3, c(1L, 1L), signed_hash = TRUE) #unigrams only
hash_dtm_medium_signed = create_dtm(it, vectorizer_medium_signed)
as.matrix(hash_dtm_medium_signed)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# 1 0 0 0 1 2 0 0 0
# 2 0 -1 0 0 1 1 0 0 #no collision, all terms represented as hash values
#in addition second hash function generated a negative hash value
#the following examples demonstrate the difference between
#two hash vectorizers (one with unigrams only, one allowing for bigrams)
#and one vocab vectorizer with bigrams
vectorizer = hash_vectorizer(2 ^ 4, c(1L, 1L)) #unigrams only
hash_dtm = create_dtm(it, vectorizer)
as.matrix(hash_dtm)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# 1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0
# 2 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0
vectorizer2 = hash_vectorizer(2 ^ 4, c(1L, 2L)) #unigrams + bigrams
hash_dtm2 = create_dtm(it, vectorizer2)
as.matrix(hash_dtm2)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# 1 1 0 0 1 0 0 0 0 0 0 0 1 2 0 0 0
# 2 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 0
v <- create_vocabulary(it, c(1L, 2L))
vectorizer_v = vocab_vectorizer(v) #unigrams + bigrams
v_dtm = create_dtm(it, vectorizer_v)
as.matrix(v_dtm)
# a_string and_another a another and string_string another_string string
# 1 1 0 1 0 0 1 0 2
# 2 0 1 0 1 1 0 1 1
sum(Matrix::colSums(hash_dtm) > 0)
#[1] 4 - these are the four unigrams a, string, and, another
sum(Matrix::colSums(hash_dtm2) > 0)
#[1] 8 - these are the four unigrams as above plus the 4 bigrams string_string, a_string, and_another, another_string
sum(Matrix::colSums(v_dtm) > 0)
#[1] 8 - same as hash_dtm2