Issues in getting trigrams using Gensim

Posted 2023-02-19

【Title】: Issues in getting trigrams using Gensim 【Posted】: 2018-02-19 06:02:16 【Question】:

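I want to get bigrams and trigrams from the example sentences below.

My code works fine for bigrams. However, it does not capture the trigrams in the data (e.g., "human computer interaction", which appears in 5 of my sentences).

Approach 1: Below is my code using Phrases in Gensim.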

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Approach 2: I even tried using both Phraser and Phrases, but it didn't work.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
trigram = Phrases(bigram_phraser[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram_phraser[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

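Please help me fix this issue of getting trigrams.

I am following the example documentation of Gensim.

【Question comments】:

【Answer 1】:

With a few modifications to your code, I was able to get both bigrams and trigrams: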

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]

    print(bigrams_)
    print(trigrams_)

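I removed the threshold = 1 parameter from the bigram Phrases because otherwise it seems to form odd bigrams, which in turn allow odd trigrams to be built (note that bigram is used to build the trigram Phrases); this parameter will probably be more useful when you have more data. For the trigrams, the min_count parameter also needs to be specified, because it defaults to 5 if not provided.

To retrieve the bigrams and trigrams of each document, you can use this list comprehension trick to filter out the elements that are not made of two or three words, respectively.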
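As a small illustration of that filter, here is a sketch on a made-up token list. The list `tokens` and the helper name `ngrams_of_length` are assumptions for illustration only, not Gensim output or Gensim API; the sketch relies only on merged phrases being single strings joined by the delimiter.

```python
# Hypothetical result of trigram[bigram[sent]] for one sentence (made up for
# illustration): merged phrases are single strings joined by the delimiter.
tokens = ["human computer interaction", "is", "a", "great", "and", "new subject"]


def ngrams_of_length(tokens, n, delimiter=" "):
    """Keep only the tokens made of exactly n words (n - 1 delimiters)."""
    return [t for t in tokens if t.count(delimiter) == n - 1]


print(ngrams_of_length(tokens, 2))  # ['new subject']
print(ngrams_of_length(tokens, 3))  # ['human computer interaction']
```

Counting delimiters only works if the delimiter cannot occur inside a single word, which holds here because the sentences were split on spaces.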

Edit - a few details about the threshold parameter:

The estimator uses this parameter to decide whether two words a and b form a phrase, which happens only if:

(count(a followed by b) - min_count) * N/(count(a) * count(b)) > threshold

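where N is the total vocabulary size. By default the parameter value is 10 (see the docs). So, the higher the threshold, the stricter the constraint for words to form a phrase.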
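To make the formula concrete, here is a plain-Python sketch that evaluates it on the example sentences. It does not call Gensim; the helper names `count_vocab` and `phrase_score` are made up for illustration, and taking N as the number of distinct words plus distinct adjacent pairs is an assumption about how the vocabulary is counted.

```python
from collections import Counter

documents = [
    "the mayor of new york was there",
    "human computer interaction and machine learning has now become a trending research area",
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "human computer interaction is a great and new subject",
    "machine learning can be useful sometimes",
    "new york mayor was present",
    "I love machine learning because it is a new subject area",
    "human computer interaction helps people to get user friendly applications",
]
sentences = [doc.split(" ") for doc in documents]


def count_vocab(sentences):
    """Count single words and adjacent word pairs (pairs are keyed as 'a b')."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent)
        counts.update(a + " " + b for a, b in zip(sent, sent[1:]))
    return counts


def phrase_score(a, b, counts, min_count):
    """(count(a followed by b) - min_count) * N / (count(a) * count(b))."""
    denom = counts[a] * counts[b]
    if denom == 0:
        return float("-inf")  # an unseen word can never form a phrase
    n = len(counts)  # assumption: N = distinct words + distinct pairs
    return (counts[a + " " + b] - min_count) * n / denom


counts = count_vocab(sentences)
print(phrase_score("human", "computer", counts, min_count=1))  # 14.4
print(phrase_score("interaction", "is", counts, min_count=1))  # 9.0
print(phrase_score("human", "computer", counts, min_count=5))  # 0.0
```

Under these assumptions, "human computer" clears a threshold of either 1 or 10, while "interaction is" clears only 1. The last line shows why min_count matters: a pair seen exactly five times scores 0 when min_count is 5, so it can never exceed a positive threshold.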

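For example, in your first approach you were using threshold = 1, so you would get ['human computer', 'interaction is'] as the bigrams for 3 of your 5 sentences that begin with "human computer interaction"; that odd second bigram is a result of the more relaxed threshold.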
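The effect of the threshold on one of those sentences can be reproduced with a simplified greedy left-to-right merge in plain Python. This is only a sketch and not Gensim's implementation: the helpers `score` and `merge_pairs` are hypothetical names, and N is assumed to be the number of distinct words plus distinct adjacent pairs.

```python
from collections import Counter

documents = [
    "the mayor of new york was there",
    "human computer interaction and machine learning has now become a trending research area",
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "human computer interaction is a great and new subject",
    "machine learning can be useful sometimes",
    "new york mayor was present",
    "I love machine learning because it is a new subject area",
    "human computer interaction helps people to get user friendly applications",
]
sentences = [doc.split(" ") for doc in documents]

# Counts of single words and of adjacent pairs (pairs are keyed as 'a b').
counts = Counter()
for sent in sentences:
    counts.update(sent)
    counts.update(a + " " + b for a, b in zip(sent, sent[1:]))


def score(a, b, min_count=1):
    """Phrase score from the formula above, with N assumed to be len(counts)."""
    denom = counts[a] * counts[b]
    if denom == 0:
        return float("-inf")
    return (counts[a + " " + b] - min_count) * len(counts) / denom


def merge_pairs(sent, threshold):
    """Scan left to right and join adjacent words whose score beats threshold."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
            out.append(sent[i] + " " + sent[i + 1])
            i += 2  # both words are consumed by the merged phrase
        else:
            out.append(sent[i])
            i += 1
    return out


sent = sentences[2]  # "human computer interaction is interesting"
print(merge_pairs(sent, threshold=1))
# ['human computer', 'interaction is', 'interesting']
print(merge_pairs(sent, threshold=10))
# ['human computer', 'interaction', 'is', 'interesting']
```

With the relaxed threshold, "interaction is" is merged before a second pass ever sees "human computer" next to "interaction", which is why the later trigram step has nothing clean to build on.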

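Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining two (filtered out by the threshold); and because that is a 4-gram rather than a trigram, it would also be filtered out by if t.count(' ') == 2. If, for example, you lower the trigram threshold to 1, you can get ['human computer interaction'] as the trigram for the remaining two sentences. It doesn't seem easy to get a good combination of parameters; here's more about it.

I'm not an expert, so take this conclusion with a grain of salt: I think it's better to first get good bigram results (unlike "interaction is") before moving on, since odd bigrams add confusion to the later trigrams, 4-grams...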

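【Comments】:

Thanks a lot for your very valuable answer. Cheers! :) By the way, could you please tell me what the threshold value does, as I am not very clear about it?

You're welcome! Yes, I edited the answer; hopefully it's clearer now.

Thanks a lot! I found your answer very useful :)

It is not obvious in gensim that delimiter=b' ' has to be in binary format. Thank you.

How can this be used for training and test data? It doesn't have any fit and transform methods like the scikit-learn vectorizers.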
