Cosine similarity kernel for SVM
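So I've been working on this chatbot project, using an SVM as its ML component, and I really want to use cosine similarity as the kernel. I've tried pykernel (as suggested in this post) and other code from various sources, but it still doesn't work and I don't know why...
Say I have a train.py like this:
Code: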
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pickle, csv, json, timeit, random, os, nltk
from nltk.stem.lancaster import LancasterStemmer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import LabelEncoder as LE
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from my_kernel import my_kernel  # SVC needs the callable, not the module

def preprocessing(text):
    # Remove stop words, then stem what is left.
    factory1 = StopWordRemoverFactory()
    StopWord = factory1.create_stop_word_remover()
    text = StopWord.remove(text)
    factory2 = StemmerFactory()
    stemmer = factory2.create_stemmer()
    return (stemmer.stem(text))

le = LE()
tfv = TfidfVectorizer(min_df=1)
file = os.path.join(os.path.dirname(os.path.abspath(__file__)),"scraping","tes.json")
svm_pickle_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),"data","svm_model.pickle")
if os.path.exists(svm_pickle_path):
    os.remove(svm_pickle_path)

tit = [] # Title
cat = [] # Category
post = [] # Post
with open(file, "r") as sentences_file:
    reader = json.load(sentences_file)
    for row in reader:
        tit.append(preprocessing(row["Judul"]))
        cat.append(preprocessing(row["Kategori"]))
        post.append(preprocessing(row["Post"]))

tfv.fit(tit)
le.fit(cat)
features = tfv.transform(tit)  # scipy sparse matrix
labels = le.transform(cat)
trainx, testx, trainy, testy = tts(features, labels, test_size=.30, random_state=42)

model = SVC(kernel=my_kernel, C=1.5)
with open(svm_pickle_path, 'wb') as f:
    pickle.dump(model.fit(trainx, trainy), f)
print("SVC training score:", model.score(testx, testy))

with open(svm_pickle_path, 'rb') as file:
    pickle_model = pickle.load(file)
score = pickle_model.score(testx, testy)
print("Test score: {0:.2f} %".format(100 * score))
Ypredict = pickle_model.predict(testx)
print(Ypredict)
and for my_kernel.py
Code:
import numpy as np
import math
from numpy import linalg as LA

def my_kernel(X, Y):
    norm = LA.norm(X) * LA.norm(Y)
    return np.dot(X, Y.T)/norm
Every time I run the program, it shows this:
Traceback (most recent call last):
  File "F:\env\chatbot\chatbotProj\chatbotProj\train.py", line 84, in <module>
    pickle.dump(model.fit(trainx, trainy), f)
  File "F:\env\lib\site-packages\sklearn\svm\base.py", line 212, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "F:\env\lib\site-packages\sklearn\svm\base.py", line 252, in _dense_fit
    X = self._compute_kernel(X)
  File "F:\env\lib\site-packages\sklearn\svm\base.py", line 380, in _compute_kernel
    kernel = self.kernel(X, self.__Xfit)
  File "F:\env\chatbot\chatbotProj\chatbotProj\ChatbotCode\svm.py", line 31, in my_kernel
    norm = LA.norm(X) * LA.norm(Y)
  File "F:\env\lib\site-packages\numpy\linalg\linalg.py", line 2359, in norm
    sqnorm = dot(x, x)
  File "F:\env\lib\site-packages\scipy\sparse\base.py", line 478, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
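I'm new to Python and to SVMs. Does anyone know what's wrong, or can anyone recommend a better, cleaner way to write a cosine similarity kernel?
Oh, and the shapes from sklearn's train_test_split are: train X (193, 634), train Y (193,), test X (83, 634), test Y (83,).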
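Answer
Update: a friend told me this happens because my features are a sparse matrix rather than a plain array, so I have to convert them to dense and replace the code in my_kernel.py with this: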
def my_kernel(X, Y):
    # TfidfVectorizer produces scipy sparse matrices; convert to dense arrays first
    X = np.array(X.todense())
    Y = np.array(Y.todense())
    norm = LA.norm(X) * LA.norm(Y)
    return np.dot(X, Y.T) / norm
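
One caveat about the snippet above, offered as a sketch rather than a definitive fix: on a 2-D array, `LA.norm(X)` with default arguments is the Frobenius norm of the whole matrix, so `norm` is a single scalar and the result is a rescaled linear kernel, not the cosine similarity between each pair of rows. A per-row version can lean on scikit-learn's `cosine_similarity`, which normalizes every row and accepts sparse input, so no `todense()` call should be needed. The name `cosine_kernel` and the toy data below are made up for illustration and do not come from the original post.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC


def cosine_kernel(X, Y):
    """Return the (n_X, n_Y) Gram matrix of row-wise cosine similarities."""
    # cosine_similarity L2-normalizes each row and works on sparse or dense input.
    return cosine_similarity(X, Y)


# Tiny made-up data standing in for the TF-IDF features (hypothetical values).
X_demo = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
]))
y_demo = np.array([0, 0, 1, 1])

# Rows pointing in the same direction get similarity 1 regardless of length.
gram = cosine_kernel(X_demo, X_demo)

# SVC calls the kernel as kernel(X, X_fit) and expects a dense Gram matrix back.
model = SVC(kernel=cosine_kernel, C=1.5).fit(X_demo, y_demo)
predictions = model.predict(X_demo)
```

In train.py this would be passed as `SVC(kernel=cosine_kernel, C=1.5)` in place of `my_kernel`.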