Text similarity analysis with gensim in Python

Posted 2020-09-19

http://blog.csdn.net/chencheng126/article/details/50070021

This post is based on the blog post linked above.

Principle

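1. The need to compute text similarity originated with search engines.

A search engine has to compute the similarity between the user's query and each of the many crawled web pages, so that the most similar pages are returned to the user first.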

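2. The main algorithm used is tf-idf

tf: term frequency

idf: inverse document frequency

The main idea: if a word or phrase occurs frequently in one document but rarely in the other documents, it is considered to discriminate well between categories and is well suited for classification.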

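Step 1: Tokenize the text of each web page into a bag of words.

Step 2: Count the total number of web pages (documents), M.

Step 3: Count the number of words N in the first page, count the number of times n that the first word of that page occurs in it, then find the number of documents m, out of all M, in which that word occurs. The word's tf-idf is then: n/N * 1/(m/M) (other normalized formulas exist; this is the most basic and intuitive one).

Step 4: Repeat step 3 to compute the tf-idf value of every word in the page.

Step 5: Repeat step 4 to compute the tf-idf value of every word in every page.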
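The five steps above can be sketched in plain Python. This is only an illustration of the basic formula n/N * 1/(m/M), assuming m is the number of documents that contain the word; the function name `basic_tfidf` and the toy documents are made up for this example. gensim's `TfidfModel` uses its own weighting and normalization, so its numbers will differ.

```python
from collections import Counter


def basic_tfidf(docs):
    """Return one {word: weight} dict per document, using (n/N) * (M/m)."""
    M = len(docs)  # total number of documents
    # m for each word: number of documents containing it (assumed reading of m)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        N = len(doc)           # number of words in this document
        counts = Counter(doc)  # n for each word in this document
        weights.append({w: (n / N) * (M / doc_freq[w]) for w, n in counts.items()})
    return weights


# Toy corpus, already tokenized (hypothetical data).
docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "cherry cherry date".split(),
]
weights = basic_tfidf(docs)
```

In this toy corpus "apple" gets the highest weight in the first document, because it is frequent there and absent from the other two.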

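3. Processing the user query

Step 1: Tokenize the user query.

Step 2: Using the statistics of the page collection (the documents), compute the tf-idf value of each word in the query.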
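A minimal sketch of this query step, under the same assumptions as the basic formula (n/N) * (M/m): n and N come from the query itself, while M and m come from the document collection. The name `query_tfidf` and the sample data are hypothetical. Query words that never occur in the collection have no m and are skipped here, which is similar in spirit to how `doc2bow` leaves out tokens that are not in the dictionary.

```python
from collections import Counter


def query_tfidf(query_words, docs):
    """Weight query words with (n/N) * (M/m); M and m are taken from docs."""
    M = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    N = len(query_words)
    counts = Counter(query_words)
    # Skip words unseen in the corpus: they have no document frequency m.
    return {
        w: (n / N) * (M / doc_freq[w])
        for w, n in counts.items()
        if w in doc_freq
    }


# Toy corpus and query (hypothetical data).
docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "cherry cherry date".split(),
]
query_vec = query_tfidf(["banana", "zebra"], docs)
```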

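4. Computing the similarity

Cosine similarity is used to measure the angle between the user query vector and each web page vector. The smaller the angle (that is, the closer the cosine is to 1), the more similar the two are.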
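For reference, cosine similarity between two sparse vectors can be sketched as follows. The vectors are represented as {word: weight} dicts, and the function name and sample weights are chosen only for this illustration; returning 0.0 for an empty vector is a convention picked here, not something the original post specifies.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical tf-idf vectors for one query and three pages.
query = {"banana": 0.75}
pages = [
    {"apple": 2.0, "banana": 0.5},
    {"banana": 0.75, "cherry": 0.75},
    {"cherry": 1.0, "date": 1.0},
]
scores = [cosine_similarity(query, page) for page in pages]
```

A page that shares no words with the query scores 0, and a page whose vector points in the same direction as the query scores 1.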

# coding=utf-8

# import warnings
# warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import logging
from gensim import corpora, models, similarities

datapath = 'D:/hellowxc/python/testres0519.txt'
querypath = 'D:/hellowxc/python/queryres0519.txt'
storepath = 'D:/hellowxc/python/store0519.txt'


def similarity(datapath, querypath, storepath):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    class MyCorpus(object):
        # Stream the corpus: one whitespace-tokenized document per line.
        def __iter__(self):
            with open(datapath) as f:
                for line in f:
                    yield line.split()

    Corp = MyCorpus()
    # Map every distinct token to an integer id.
    dictionary = corpora.Dictionary(Corp)
    # Convert each document into a bag-of-words vector.
    corpus = [dictionary.doc2bow(text) for text in Corp]

    # Fit the tf-idf model on the corpus and re-weight the corpus with it.
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    # The query is the first line of the query file.
    with open(querypath, 'r') as q_file:
        query = q_file.readline()
    vec_bow = dictionary.doc2bow(query.split())
    vec_tfidf = tfidf[vec_bow]

    # Build the similarity index; passing num_features avoids an extra scan of the corpus.
    index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
    sims = index[vec_tfidf]

    # Write one similarity score per line, in corpus order.
    with open(storepath, 'w') as sim_file:
        for score in sims:
            sim_file.write(str(score) + '\n')


similarity(datapath, querypath, storepath)
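The script stores the scores in corpus order, one per document. To put the most similar documents first, as a search engine would, the scores can be sorted together with their document indices. The sketch below assumes the scores are available as a plain list; `top_matches` and the example values are hypothetical.

```python
def top_matches(scores, k=3):
    """Return (document_index, score) pairs for the k highest scores."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores, in the same order as the lines of the corpus file.
example_scores = [0.12, 0.87, 0.0, 0.45]
best = top_matches(example_scores, k=2)
```

Each index in the result refers to the corresponding line of the corpus file, counting from 0.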