Show k nearest neighbors for text classification
【Posted】: 2020-05-14 08:54:04 【Question】: I have a corpus stored as a CSV file (corpus.csv) that contains graded abstracts (text) in the following format:
Institute, Score, Abstract
----------------------------------------------------------------------
UoM, 3.0, Hello, this is abstract one
UoM, 3.2, Hello, this is abstract two and yet counting.
UoE, 3.1, Hello, yet another abstract but this is a unique one.
UoE, 2.2, Hello, please no more abstract.
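I am trying to write a KNN classification program in Python that takes a user-supplied abstract, e.g. "This is a new unique abstract", finds the abstract in the corpus (CSV) that is closest to it, and also returns the score/grade of the predicted abstract. I have the following code: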
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from csv import reader, writer
import operator as op
import string
from sklearn import neighbors

# Read data from corpus
abstract_list = []
score_list = []
institute_list = []
row_count = 0
with open('corpus.csv', 'r', newline='') as f:
    r = reader(f)
    for row in list(r)[1:]:  # skip the header row
        institute, score, abstract = row[0], row[1], row[2]
        if len(abstract.split()) > 0:
            institute_list.append(institute)
            score = float(score)
            score_list.append(score)
            # str.translate needs a table; this one deletes all punctuation
            abstract = abstract.translate(str.maketrans('', '', string.punctuation)).lower()
            abstract_list.append(abstract)
            row_count = row_count + 1
print("Total processed data: ", row_count)

# Vectorize (TF-IDF, n-grams 1-4, English stop words removed) using sklearn
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 4),
                             min_df=1, stop_words='english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
classes = score_list
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
feature_names = vectorizer.get_feature_names_out()

clf = neighbors.KNeighborsRegressor(n_neighbors=1)
clf.fit(response, classes)
# This predicts scores for the training abstracts themselves;
# a new abstract must first go through vectorizer.transform([...])
predicted = clf.predict(response)
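Currently, with the code above, "predicted" gives output such as [3.2]. However, I would also like the output to be [3.2, UoM, "Hello, this is abstract two and yet counting."]
I want to show the k nearest neighbors (not only their scores, but also the corresponding institute names and abstracts). How can I do this?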
【Answer 1】: After fitting the model, you need to run the model against a point:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=1)
>>> neigh.fit(samples)
NearestNeighbors(n_neighbors=1)
>>> print(neigh.kneighbors([[1., 1., 1.]]))
(array([[0.5]]), array([[2]]))
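This returns two arrays: the first holds the distances and the second holds the indices of the nearest neighbors. To print the output in the format you want, use the indices in the second array to look up the corresponding abstracts (and their institutes and scores).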
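The index lookup described above can be sketched on the same toy points. This is only an illustration: the labels list is a hypothetical stand-in for per-row metadata such as institute names, and n_neighbors=2 is chosen just to show more than one neighbor.

```python
from sklearn.neighbors import NearestNeighbors

# Same toy points as in the answer
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
# Hypothetical per-row metadata, stored in the same order as samples
labels = ["row A", "row B", "row C"]

neigh = NearestNeighbors(n_neighbors=2).fit(samples)
distances, indices = neigh.kneighbors([[1., 1., 1.]])

# Both arrays have shape (n_queries, n_neighbors); row 0 belongs to the only query.
# Each index points back into samples, so it also points into labels.
nearest = [(labels[i], float(d)) for i, d in zip(indices[0], distances[0])]
print(nearest)
```

Because the metadata is kept in a list parallel to the fitted rows, no extra bookkeeping is needed: the neighbor index is the row number.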
【Comments】:
Thanks for your answer. Could you show me how to use .kneighbors() for my use case?
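One possible way to apply .kneighbors() to the text use case from the question is sketched below. It is not the answerer's code. It assumes the four sample rows are held in parallel lists instead of being read from corpus.csv, it uses k = 2, and the helper name k_nearest is made up for illustration. KNeighborsRegressor exposes the same kneighbors method as NearestNeighbors, so the fitted regressor from the question can be reused.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

# The four sample rows from the question, kept as parallel lists
institutes = ["UoM", "UoM", "UoE", "UoE"]
scores = [3.0, 3.2, 3.1, 2.2]
abstracts = [
    "Hello, this is abstract one",
    "Hello, this is abstract two and yet counting.",
    "Hello, yet another abstract but this is a unique one.",
    "Hello, please no more abstract.",
]

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 4),
                             stop_words="english", sublinear_tf=True)
X = vectorizer.fit_transform(abstracts)

k = 2  # assumed number of neighbors to display
clf = KNeighborsRegressor(n_neighbors=k).fit(X, scores)


def k_nearest(query):
    """Return (score, institute, abstract, distance) for the k closest rows."""
    # The query must be vectorized with the already fitted vectorizer
    q = vectorizer.transform([query])
    distances, indices = clf.kneighbors(q)
    return [(scores[i], institutes[i], abstracts[i], float(d))
            for i, d in zip(indices[0], distances[0])]


results = k_nearest("This is a new unique abstract")
for score, institute, abstract, distance in results:
    print([score, institute, abstract], "distance:", round(distance, 3))
```

With a real corpus, the same lookup would use institute_list, score_list and the original (uncleaned) abstracts from the question's reading loop, as long as all three stay in the same order as the rows passed to fit.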