Processing an English corpus with word2vec word vectors using Python gensim

Posted 2020-07-10 竹聿Simon

Introduction to word2vec

word2vec official site: https://code.google.com/p/word2vec/

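word2vec is an open-source tool from Google that learns a vector for each word in the input corpus, so that the distance between any two words can be computed.
It converts each term into a vector, which reduces the processing of text content to vector operations in a vector space; similarity in that space stands in for semantic similarity in the text.
word2vec measures similarity as the cosine of the angle between two vectors. The value lies between -1 and 1 (scores for related words are usually positive), and a larger value means the two words are more closely related.
Word vectors: words are represented with a Distributed Representation, also commonly called a "Word Representation" or "Word Embedding".

In short: a word-vector representation places related or similar words closer together.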
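
To make the cosine measure concrete, here is a minimal sketch in plain Python. The three vectors are invented for illustration and do not come from any trained model; the point is only that vectors pointing in a similar direction score close to 1, while vectors pointing away from each other can score below 0.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "word vectors", made up for this example.
king = [0.9, 0.1, 0.3]
queen = [0.8, 0.2, 0.4]
banana = [-0.5, 0.9, -0.2]

print(cosine_similarity(king, queen))   # close to 1: similar direction
print(cosine_similarity(king, banana))  # negative: pointing away
```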

Usage

Collecting a corpus

This article uses:
An English corpus available online: http://mattmahoney.net/dc/text8.zip
Training log for this corpus: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s

The corpus is UTF-8 encoded and stored as a single, very long line, as shown below:
(Figure: sample of the corpus text)
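
Because the whole corpus is one line, it cannot be read line by line as sentences. gensim's Text8Corpus deals with this by cutting the word stream into fixed-length pseudo-sentences. The sketch below illustrates that idea only; it is not gensim's implementation, and the function name iter_chunks, its parameters, and the sample sentence are all made up for the example.

```python
import os
import tempfile


def iter_chunks(path, max_words=10000, read_size=65536):
    """Yield lists of at most max_words tokens from a whitespace-separated file."""
    buffer = []
    leftover = ""
    with open(path, encoding="utf-8") as handle:
        while True:
            text = handle.read(read_size)
            if not text:
                break
            tokens = (leftover + text).split()
            # If the read stopped in the middle of a word, hold that piece back.
            if tokens and not text[-1].isspace():
                leftover = tokens.pop()
            else:
                leftover = ""
            buffer.extend(tokens)
            while len(buffer) >= max_words:
                yield buffer[:max_words]
                buffer = buffer[max_words:]
    if leftover:
        buffer.append(leftover)
    if buffer:
        yield buffer


# Demo on a tiny single-line file; small limits are used to exercise the boundaries.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox jumps over the lazy dog")
    path = tmp.name

chunks = list(iter_chunks(path, max_words=4, read_size=7))
os.remove(path)

for chunk in chunks:
    print(chunk)
```

Reading in fixed-size pieces keeps memory use flat no matter how long the single line is, which is the reason for not simply calling read() on the whole file.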

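Note:
In theory, the larger the corpus the better.
In theory, the larger the corpus the better.
In theory, the larger the corpus the better.
Important things are worth saying three times.
Results from a corpus that is too small carry little meaning.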

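Using word2vec

Python, with the gensim module.
On Windows 7, gensim is not easy to install on top of a standard Python setup, so Anaconda is recommended. For details see: Python development with Anaconda (and installing gensim on Windows 7).

Here is the code: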

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Purpose: try out gensim's word2vec
Date: 2016-05-21 18:07:50
"""

# Written for Python 3 and gensim 4.x. Older gensim releases use size=
# instead of vector_size= and expose the query methods on the model itself.
import logging

from gensim.models import word2vec


def main():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    # Load the corpus
    sentences = word2vec.Text8Corpus("C:\\Users\\lenovo\\Desktop\\word2vec实验\\text8")
    # Train the model. With the defaults (sg=0, window=5) this is CBOW; pass sg=1 for skip-gram.
    model = word2vec.Word2Vec(sentences, vector_size=200)
    wv = model.wv  # the trained word vectors

    # Similarity / relatedness of two words
    y1 = wv.similarity("woman", "man")
    print("Similarity between woman and man:", y1)
    print("--------\n")

    # The words most related to a given word
    y2 = wv.most_similar("good", topn=20)  # the 20 most related
    print("Words most related to good:\n")
    for item in y2:
        print(item[0], item[1])
    print("--------\n")

    # Analogies
    print(' "boy" is to "father" as "girl" is to ...? \n')
    y3 = wv.most_similar(positive=['girl', 'father'], negative=['boy'], topn=3)
    for item in y3:
        print(item[0], item[1])
    print("--------\n")

    more_examples = ["he his she", "big bigger bad", "going went being"]
    for example in more_examples:
        a, b, x = example.split()
        predicted = wv.most_similar(positive=[x, b], negative=[a])[0][0]
        print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
    print("--------\n")

    # Find the word that does not belong with the others
    y4 = wv.doesnt_match("breakfast cereal dinner lunch".split())
    print("Word that does not fit:", y4)
    print("--------\n")

    # Save the model for reuse
    model.save("text8.model")
    # To load it again:
    # model_2 = word2vec.Word2Vec.load("text8.model")

    # Store the word vectors in a format that the C tool can parse
    wv.save_word2vec_format("text8.model.bin", binary=True)
    # To load them again:
    # from gensim.models import KeyedVectors
    # wv_3 = KeyedVectors.load_word2vec_format("text8.model.bin", binary=True)


if __name__ == "__main__":
    main()
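
For readers curious about what "a format that C can parse" looks like, here is a rough sketch of a writer and reader in plain Python. It assumes the layout used by the original C tool: a text header "vocab_size vector_size", then for each word the word itself, a space, and the vector as little-endian 32-bit floats. That layout and the byte order are assumptions on my part, the two function names are made up, and the toy vectors are invented, so treat this as an illustration rather than a substitute for gensim's loader.

```python
import io
import struct


def write_word2vec_binary(stream, vectors):
    """Write {word: [floats]} using the assumed word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vector in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *vector))
        stream.write(b"\n")


def read_word2vec_binary(stream):
    """Read the layout written above back into {word: [floats]}."""
    vocab_size, dim = (int(n) for n in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                word.extend(ch)
        values = struct.unpack(f"<{dim}f", stream.read(4 * dim))
        vectors[word.decode("utf-8")] = list(values)
    return vectors


# Round trip with made-up 2-dimensional vectors (values exact in float32).
toy = {"good": [0.5, -0.25], "bad": [0.75, 1.0]}
buffer = io.BytesIO()
write_word2vec_binary(buffer, toy)
buffer.seek(0)
loaded = read_word2vec_binary(buffer)
print(loaded)
```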

Output

Similarity between woman and man: 0.685955257368
--------

Words most related to good:

bad 0.739628911018
poor 0.563425064087
luck 0.525990724564
fun 0.520761489868
quick 0.518206238747
really 0.491045713425
practical 0.479608744383
helpful 0.478456377983
love 0.477012127638
simple 0.475951403379
useful 0.474674522877
reasonable 0.473541408777
safe 0.473105460405
you 0.47159832716
courage 0.470109701157
dangerous 0.469624102116
happy 0.468672126532
wrong 0.467448621988
easy 0.467320919037
sick 0.466005086899
--------

 "boy" is to "father" as "girl" is to ...? 

mother 0.770967006683
wife 0.718966007233
grandmother 0.700566351414
--------

'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------

Word that does not fit: cereal
--------
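
The analogy and odd-one-out results above can be understood as simple vector arithmetic. The sketch below shows one common way to compute them: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result; for the odd one out, pick the word least similar to the group's mean. This is a plain-Python illustration and not gensim's code, and the 4-dimensional toy vectors (roughly "male, female, adult, food") are invented for the example.

```python
import math


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to the mean of +positive and -negative."""
    signed = [normalize(vectors[w]) for w in positive]
    signed += [[-x for x in normalize(vectors[w])] for w in negative]
    target = normalize(mean(signed))
    exclude = set(positive) | set(negative)
    scores = [
        (word, dot(normalize(vector), target))
        for word, vector in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the group."""
    unit = {w: normalize(vectors[w]) for w in words}
    center = normalize(mean(list(unit.values())))
    return min(words, key=lambda w: dot(unit[w], center))


# Made-up vectors: [male, female, adult, food]
toy = {
    "boy": [1.0, 0.0, 0.1, 0.0],
    "girl": [0.0, 1.0, 0.1, 0.0],
    "father": [1.0, 0.0, 1.0, 0.0],
    "mother": [0.0, 1.0, 1.0, 0.0],
    "cereal": [0.0, 0.0, 0.0, 1.0],
}

best = most_similar(toy, positive=["girl", "father"], negative=["boy"], topn=1)
odd = doesnt_match(toy, ["father", "mother", "boy", "cereal"])
print(best[0][0])  # mother
print(odd)         # cereal
```

The input words are excluded from the ranking because otherwise "girl" or "father" would often be their own best match.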

References

Deep learning: using word2vec and gensim:
http://www.open-open.com/lib/view/open1420687622546.html
