Map the most similar cosine ranking document back to each respective document in my original list
【Title】: Map the most similar cosine ranking document back to each respective document in my original list
【Posted】: 2019-07-07 23:17:45
【Question】: I can't figure out how to map the most similar (#1) document in my list back to each document item in the original list.
I went through some preprocessing, n-grams, lemmatization, and TF-IDF, and then used scikit-learn's linear kernel. I tried using feature extraction, but I'm not sure how to use it with a CSR matrix...
I tried various approaches (see "Using csr_matrix of items similarities to get most similar items to item X without having to transform csr_matrix to dense matrix").
import re
import string
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
import sparse_dot_topn.sparse_dot_topn as ct

documents = 'the cat in the hat', 'the catty ate the hat', 'the cat wants the cats hat'

def ngrams(string, n=2):
    # Strip a few punctuation characters, then build character n-grams
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

# Note: when analyzer is a callable, scikit-learn ignores tokenizer and stop_words
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, analyzer=ngrams, stop_words='english')
tfidf_matrix = TfidfVec.fit_transform(documents)

# Similarity of the first document only (row 0) against every document
cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()
related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cosine_similarities
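My current example only gives me the first row compared against all documents. How can I get output like the following in a DataFrame? (Note that the original documents come from a DataFrame.)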
original df col                most similar doc           similarity%
'the cat in the hat'           'the catty ate the hat'    80%
'the catty ate the hat'        'the cat in the hat'       80%
'the cat wants the cats hat'   'the catty ate the hat'    20%
【Answer 1】:
import pandas as pd

# Reuses documents, tfidf_matrix and linear_kernel from the question
df = pd.DataFrame(columns=["original df col", "most similar doc", "similarity%"])
for i in range(len(documents)):
    cosine_similarities = linear_kernel(tfidf_matrix[i:i+1], tfidf_matrix).flatten()
    # make pairs of (index, similarity)
    cosine_similarities = list(enumerate(cosine_similarities))
    # delete the cosine similarity with itself
    cosine_similarities.pop(i)
    # get the tuple with max similarity
    most_similar, similarity = max(cosine_similarities, key=lambda t: t[1])
    df.loc[len(df)] = [documents[i], documents[most_similar], similarity]

Result:
   original df col              most similar doc        similarity%
0  the cat in the hat           the catty ate the hat   0.664119
1  the catty ate the hat        the cat in the hat      0.664119
2  the cat wants the cats hat   the cat in the hat      0.577967
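
The loop above calls linear_kernel once per document. As a rough sketch of an alternative, the same mapping can be read off a single full similarity matrix: mask the diagonal so a document cannot match itself, then take the arg-max of each row. The helper name most_similar and the toy count matrix below are assumptions made for illustration only; they are not part of the original answer, and the toy vectors stand in for the real TF-IDF rows.

```python
import numpy as np

def most_similar(documents, sims):
    """Return (document, most similar other document, score) for each document.

    `sims` is assumed to be a dense square similarity matrix whose row/column
    order matches `documents`.
    """
    sims = np.array(sims, dtype=float)      # copy, so the caller's matrix is untouched
    np.fill_diagonal(sims, -np.inf)         # exclude each document's match with itself
    best = sims.argmax(axis=1)              # index of the best match for every row
    scores = sims[np.arange(len(documents)), best]
    return [(documents[i], documents[j], float(s))
            for i, (j, s) in enumerate(zip(best, scores))]

# Toy stand-in for a TF-IDF matrix: hypothetical term counts, L2-normalised per row
docs = ["a b", "a b c", "c d"]
counts = np.array([[1, 1, 0, 0],
                   [1, 1, 1, 0],
                   [0, 0, 1, 1]], dtype=float)
unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)
sims = unit @ unit.T                        # dot product of unit vectors = cosine similarity

rows = most_similar(docs, sims)
for original, match, score in rows:
    print(f"{original!r:10} {match!r:10} {score:.1%}")
```

With the variables from the question, the matrix would presumably be built as linear_kernel(tfidf_matrix, tfidf_matrix), which returns a dense array, and the resulting tuples can be passed to pd.DataFrame with the three column names used above. Note that the full matrix needs memory quadratic in the number of documents, so for a large corpus the row-by-row loop (or a sparse top-n approach) may still be the better choice.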