How to determine the most important/informative features using a Linear Support Vector Machine (SVM) classifier
Posted: 2019-09-11 23:24:10

Question: I am new to Python and working on a text classification problem. I am interested in visualizing the most important features for each class from a linear SVM classifier model, i.e. determining which features drive the classification decision towards Class-1 or Class-2. Here is my code.
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

df = pd.read_csv('projectdatacor.csv')
df = df[pd.notnull(df['types'])]
my_types = ['Requirement','Non-Requirement']

#converting to lower case
df['description'] = df.description.map(lambda x: x.lower())

#removing the punctuation
df['description'] = df.description.str.replace(r'[^\w\s]', '', regex=True)

#splitting the text into tokens
df['description'] = df['description'].apply(nltk.tokenize.word_tokenize)

#this converts the list of words back into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])

#tf-idf weighting
transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)

#splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)

#svc classification
from sklearn import svm
svclassifier = svm.SVC(gamma=0.001, C=100., kernel='linear')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)

#evaluating the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred, target_names=my_types))
I have read all the related questions available on this platform, and I found the following code helpful, so I added it to my own code.
import numpy as np

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

show_most_informative_features(count_vect, svclassifier, 20)
This code works for Naive Bayes and Logistic Regression, where it returns the most informative features, but for SVM it gives me an error.
This is the error I get:
File "C:\Users\fhassan\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "C:\Users\fhassan\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "U:/FAHAD UL HASSAN/Python Code/happycsv.py", line 209, in <module>
show_most_informative_features(count_vect, svclassifier, 20)
File "U:/FAHAD UL HASSAN/Python Code/happycsv.py", line 208, in show_most_informative_features
print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
TypeError: must be real number, not csr_matrix
Any help would be greatly appreciated.
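For context, the TypeError above most likely occurs because the classifier was fitted on a sparse matrix: in that case SVC(kernel='linear') stores coef_ as a scipy sparse matrix, so zip(clf.coef_[0], feature_names) pairs the feature names with csr_matrix objects instead of plain floats. Below is a minimal sketch of one possible fix, keeping the helper function from the question but flattening the coefficients to a dense 1-D array first (everything else is assumed unchanged):

import numpy as np

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    # clf.coef_ can be a scipy sparse matrix when the model was trained on
    # sparse data, so flatten it to a dense 1-D array of floats first
    coef = clf.coef_
    coef = coef.toarray().ravel() if hasattr(coef, 'toarray') else np.ravel(coef)
    coefs_with_fns = sorted(zip(coef, feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

show_most_informative_features(count_vect, svclassifier, 20)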
Answer 1: Maybe this will help you:
from sklearn import svm
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
x = X.toarray()
y = [0, 0, 0, 1]

model = svm.SVC(kernel='linear')
a = model.fit(x, y)
model.score(x, y)

feature_names = vectorizer.get_feature_names()
coefs_with_fns = sorted(zip(model.coef_[0], feature_names))
df = pd.DataFrame(coefs_with_fns)
df.columns = 'coefficient', 'word'
df.sort_values(by='coefficient')
You will get the words sorted by their coefficients:
Comments:
Thanks @Rubens_Zimbres for the comment. I am trying this in my code and it does not give me the desired result. Could you please help me with which part of my code I should change to get the desired result? I am new to Python, so please bear with me if I have made any big mistakes. I modified your code as follows to get the desired result, but it did not help me. Where am I going wrong?

I have modified the code as follows:

from sklearn.feature_extraction.text import CountVectorizer
X_train = counts.toarray()
y_train = [0, 1]
model = svm.SVC(kernel='linear')
a = model.fit(X_train, y_train)
model.score(X_train, y_train)
feature_names = count_vect.get_feature_names()
coefs_with_fns = sorted(zip(model.coef_[0], feature_names))
coefs_with_fns = sorted(zip(model.coef_[1], feature_names))
df = pd.DataFrame(coefs_with_fns)
df.columns = 'coefficient', 'word'
df.sort_values(by='coefficient')

This is the error I get:

File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 235, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1720, 2]
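For what it's worth, the ValueError in the last comment comes from passing only two labels (y_train = [0, 1]) while X_train contains 1720 documents; fit() needs one label per row, and those labels already exist in df['types']. Below is a minimal sketch of how the answer's approach could be combined with the pipeline from the question (counts, count_vect and df are assumed to be the objects defined in the question; the model is fitted on the full matrix only to keep the sketch short, the X_train/y_train split would work the same way):

import numpy as np
import pandas as pd
from sklearn import svm

y = df['types']                      # one label per document (1720 here), not just [0, 1]

model = svm.SVC(kernel='linear')
model.fit(counts, y)                 # counts can stay sparse; no .toarray() needed

feature_names = count_vect.get_feature_names()   # get_feature_names_out() in newer scikit-learn
coef = model.coef_
# densify the (possibly sparse) coefficient row before pairing it with the feature names
coef = coef.toarray().ravel() if hasattr(coef, 'toarray') else np.ravel(coef)

importance = pd.DataFrame(sorted(zip(coef, feature_names)),
                          columns=['coefficient', 'word'])
print(importance.head(20))           # strongest indicators of one class
print(importance.tail(20))           # strongest indicators of the other class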