Extract feature importance of ngrams in TfidfVectorizer in an SVC(kernel='linear') model

Posted: 2020-11-19 07:27:56

[Question]

I would like to know what causes the difference in the output when it should be the same. It is as if my program ignores the sorting function and the feature names. Sorting coef_ matters to me because I need to find out which features actually contribute most to the prediction. I do get the individual words from vectorizer.get_feature_names, but not inside the loop or the function definition. Does anyone know what might be going on, or does anyone have another way to extract n-gram feature weights and their names for an SVC with kernel='linear'?

My code:

## imports assumed by this snippet (not shown in the original post)
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC
# ItemSelector is a small custom transformer; see the sketch after this code listing

## load data features with removed columns based on numeric feature selection
df = pd.read_csv('preprocessed_data_all.csv',
                 usecols=['normalized_fixed', 'TAG', 'DEP', 'level1',
                          'avg_wordlength', 'lexical_variety', 'avg_sentlength',
                          'VBD_rel_cnt', 'VBN_rel_cnt', 'VBG_rel_cnt', 'MD_rel_cnt',
                          'np_rel_cnt', 'clause_rel_cnt', 'clause_rel_word_cnt'])

df = df.sample(n=2000)

# define X and y 
X = df.drop('level1', axis=1)
y = df.level1.values

## create pipeline for word unigrams
bow_pipe = Pipeline([
    ("text", ItemSelector(key="normalized_fixed")),
    ("bow_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

## create pipeline for pos tags
pos_pipe = Pipeline([
    ("pos", ItemSelector(key="TAG")),
    ("pos_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

## create pipeline for dependency tags
dep_pipe = Pipeline([
    ("dep", ItemSelector(key="DEP")),
    ("dep_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

# define classifier
svm = SVC(kernel='linear', class_weight='balanced')

# define pipeline for most important unigram extraction
pipe = Pipeline([
    ("feats", FeatureUnion([
        ("bow", bow_pipe),
        ("tag", pos_pipe),
        ("dep", dep_pipe)
    ])),
    ("clf", svm)
])

pipe.fit(X, y)

# display feature importance of BOW model
levels = df['level1'].unique()

for level in levels:

    featuredf = pd.DataFrame()

    # position of this class in the fitted classifier
    labelid = list(pipe.named_steps['clf'].classes_).index(level)
    # vocabulary of the word-unigram vectorizer inside the FeatureUnion
    feature_names = pipe.named_steps['feats'].transformer_list[0][1].named_steps['bow_vec'].get_feature_names()
    # pair each coefficient with its feature name and keep the ten largest
    topn = sorted(zip(pipe.named_steps['clf'].coef_[labelid], feature_names))[-10:]

    for coef, feat in topn:
        featuredf = featuredf.append(pd.Series([level, feat, coef]), ignore_index=True)

    display(featuredf)
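Note: ItemSelector in the pipelines above is a small custom transformer that is not shown here. A minimal sketch, assuming it simply pulls one column out of the DataFrame (the usual pattern for heterogeneous-data pipelines), would be:

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column from a pandas DataFrame inside a Pipeline."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # return the raw column so the downstream TfidfVectorizer sees an iterable of strings
        return X[self.key]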

My output:

    0   1   2
0   A1  !   (0, 1834)\t-0.07826243560812945\n (0, 4347)\t-0.07826243560812945\n (0, 4760)\t-0.223132736239871\n (0, 5498)\t-0.07140284578344763\n (0, 6756)\t-0.16195282546411804\n (0, 8637)\t-0.06337764014791308\n (0, 8763)\t-0.07826243560812945\n (0, 9044)\t-0.08060172162144445\n (0, 901)\t-0.0026223432774063423\n (0, 5906)\t-0.16675468967573015\n (0, 6796)\t-0.04403627031278603\n (0, 8495)\t-0.2603055807883978\n (0, 8498)\t-0.17305812627971506\n (0, 8735)\t-0.34489400420874144\n (0, 9484)\t-0.11083343873432677\n (0, 2637)\t-0.18040783909656172\n (0, 2737)\t-0.5380874813828527\n (0, 3129)\t-0.013035612996414479\n (0, 3773)\t-0.08449907288128825\n (0, 4437)\t-0.013035612996414479\n (0, 4438)\t-0.026071225992828958\n (0, 5924)\t-0.013035612996414479\n (0, 7269)\t-0.3730438143689519\n (0, 7737)\t-0.705047869548585\n (0, 8722)\t-0.024098248030544236\n :\t:\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.08649110380251428\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t0.022121814035341927\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.3512603080433229\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t-0.005938550730073638\n (0, 8445)\t-0.06546829964399035\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t0.022781224878066095\n (0, 3739)\t0.03267210725715032
0   1   2
0   B1  !   (0, 353)\t-0.00449726057217602\n (0, 802)\t-0.05617787611978642\n (0, 973)\t-0.10543173735834135\n (0, 1847)\t-0.007780155148241354\n (0, 1989)\t-0.003934155442206846\n (0, 2017)\t-0.005086622578660749\n (0, 2204)\t-0.031113872051853505\n (0, 2405)\t-0.09318613349857544\n (0, 3024)\t-0.005086622578660749\n (0, 3283)\t-0.10089509076272042\n (0, 4556)\t-0.00449726057217602\n (0, 5175)\t-0.005086622578660749\n (0, 5454)\t-0.32011264216698354\n (0, 5724)\t-0.003934155442206846\n (0, 6015)\t-0.005086622578660749\n (0, 6330)\t-0.005086622578660749\n (0, 6473)\t-0.004194952284695256\n (0, 6534)\t-0.19221655261459114\n (0, 6582)\t-0.031591903060786936\n (0, 7980)\t-0.32174386411546047\n (0, 7992)\t-0.004825825736172337\n (0, 9514)\t-0.17326784128005032\n (0, 9556)\t-0.08135115057424913\n (0, 9654)\t-0.004194952284695256\n (0, 9746)\t-0.24722791235969363\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.018006244391458114\n (0, 2945)\t-0.025529859622348973\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.018006244391458114\n (0, 4711)\t-0.012262917681600417\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t-0.05074854733491452\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.003648836702785919\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.20173441895833755\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t-0.049447249347229494\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t-0.032859710207201465\n (0, 9630)\t0.022781224878066095
0   1   2
0   A2  !   (0, 1510)\t-0.047241319436499236\n (0, 4554)\t-0.09138323899895806\n (0, 5454)\t-0.0062230357565567634\n (0, 7756)\t-0.061785302573242856\n (0, 281)\t-0.01653184338009155\n (0, 351)\t-0.01653184338009155\n (0, 450)\t-0.3274464832370879\n (0, 809)\t-0.013387638815271769\n (0, 2051)\t-0.014616379782250303\n (0, 2586)\t-0.01653184338009155\n (0, 2741)\t-0.15810190062867993\n (0, 3224)\t-0.06225932224260644\n (0, 3373)\t-0.12280247879038902\n (0, 3421)\t-0.015684237235273946\n (0, 3819)\t-0.01653184338009155\n (0, 3833)\t-0.3359646619748352\n (0, 4068)\t-0.015684237235273946\n (0, 4402)\t-0.07152757844346042\n (0, 4649)\t-0.3279430542171356\n (0, 5524)\t-0.0899771265578215\n (0, 5790)\t-0.3885263430136202\n (0, 7822)\t-0.059872091526754725\n (0, 505)\t-0.0711692477199759\n (0, 5724)\t-0.16023961429736408\n (0, 6286)\t-0.049366239531379814\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.08531697506522545\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.007931536811222977\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t-0.10456248357681736\n (0, 6865)\t-0.03968381644376268\n (0, 6980)\t-0.11581955114710678\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t-0.28868175238220484\n (0, 7606)\t-0.03988088938576907\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t0.008341811597571965\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t-0.0659929953660953\n (0, 9630)\t-0.05573508827608763
0   1   2
0   B2  !   (0, 1604)\t-0.5452053299558446\n (0, 1611)\t-0.14349203210584277\n (0, 1786)\t-0.07926751381540288\n (0, 4402)\t-0.061638227000430465\n (0, 4469)\t-0.18047516283733558\n (0, 4483)\t-0.12632546444958545\n (0, 7467)\t-0.1657501793150448\n (0, 7953)\t-0.2027592110690899\n (0, 7991)\t-0.0705445978132748\n (0, 9157)\t-0.1576966747397613\n (0, 9746)\t-0.13158162095004766\n (0, 776)\t-0.07804759361864515\n (0, 1432)\t-0.04319046246215665\n (0, 1630)\t-0.06742619934269474\n (0, 1903)\t-0.03634857244837165\n (0, 2742)\t-0.04319046246215665\n (0, 2816)\t-0.15050859335152222\n (0, 3562)\t-0.03940488059869191\n (0, 4318)\t-0.04097603902861966\n (0, 4490)\t-0.04319046246215665\n (0, 5187)\t-0.27333877764907855\n (0, 5252)\t-0.04319046246215665\n (0, 5551)\t-0.22302927657831634\n (0, 5790)\t-0.18300512305356684\n (0, 5852)\t-0.029396346557071712\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.05047091757718261\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.09066783596741043\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.08208799126770042\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t-0.011737613363530505\n (0, 8634)\t-0.014588840340588008\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t-0.05074822281767788
0   1   2
0   C1  !   (0, 244)\t-0.09884319674162795\n (0, 690)\t-0.1388650034822139\n (0, 960)\t-0.10605470450461775\n (0, 1373)\t-0.29793485494660743\n (0, 1584)\t-0.15220560572907585\n (0, 1603)\t-0.15220560572907585\n (0, 1604)\t-0.2943386139167361\n (0, 1638)\t-0.15220560572907585\n (0, 2080)\t-0.1444018536776252\n (0, 2680)\t-0.22203398402397742\n (0, 2722)\t-0.1388650034822139\n (0, 2774)\t-0.15220560572907585\n (0, 2822)\t-0.13106125143076322\n (0, 3071)\t-0.1444018536776252\n (0, 3265)\t-0.2691405776324631\n (0, 3627)\t-0.1444018536776252\n (0, 4014)\t-0.15220560572907585\n (0, 4073)\t-0.15220560572907585\n (0, 4247)\t-0.15220560572907585\n (0, 4659)\t-0.3056381346962476\n (0, 4726)\t-0.3044112114581517\n (0, 4868)\t-0.15220560572907585\n (0, 5014)\t-0.1388650034822139\n (0, 5074)\t-0.1444018536776252\n (0, 5450)\t-0.1865505674300888\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t0.020274287275611008\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.09559883514855932\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.13840215323517818\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t0.02183231985676501\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t-0.09844846169130346
0   1   2
0   C2  !   (0, 1510)\t-0.05959414411701482\n (0, 1925)\t-0.07930619936349027\n (0, 4554)\t-0.12239420823695288\n (0, 6337)\t-0.10751794751817559\n (0, 6919)\t-0.11905736382524099\n (0, 7940)\t-0.4509514511406135\n (0, 8674)\t-0.06634760477509609\n (0, 8876)\t-0.09955165597499022\n (0, 281)\t-0.11414308129642091\n (0, 351)\t-0.11414308129642091\n (0, 450)\t-0.6510470682457199\n (0, 2051)\t-0.2941575097638065\n (0, 2586)\t-0.11414308129642091\n (0, 2741)\t-0.2826990863413035\n (0, 3224)\t-0.19484491340690077\n (0, 3421)\t-0.15593785912375202\n (0, 3819)\t-0.11414308129642091\n (0, 3833)\t-0.5959237864651308\n (0, 4068)\t-0.17821843998174858\n (0, 7822)\t-0.18737390753407893\n (0, 505)\t-0.17144684026449575\n (0, 1605)\t-0.1761432003933918\n (0, 2087)\t-0.30641937334729397\n (0, 2435)\t-0.01959392434397809\n (0, 2544)\t-0.04205087201608662\n :\t:\n (0, 1543)\t-0.10270397939145476\n (0, 4091)\t0.03510805887202453\n (0, 4621)\t0.03774871894954156\n (0, 5548)\t0.11591078340050759\n (0, 7216)\t0.10996790758187949\n (0, 8462)\t0.11591078340050759\n (0, 6902)\t0.0953038712770817\n (0, 275)\t0.05201608398317632\n (0, 2309)\t-0.11518506286788033\n (0, 5602)\t0.130229016246211\n (0, 6856)\t0.028341275217697887\n (0, 9697)\t0.130229016246211\n (0, 5898)\t0.11137290054484401\n (0, 5921)\t0.09019079917603834\n (0, 6930)\t-0.41244426698962444\n (0, 7468)\t0.11137290054484401\n (0, 8776)\t0.08870699446777193\n (0, 981)\t0.1483908905002444\n (0, 2084)\t0.1483908905002444\n (0, 3159)\t0.02642985841317767\n (0, 3306)\t-0.02982547695328551\n (0, 3508)\t0.055716356911763666\n (0, 8305)\t0.025416449215215475\n (0, 8431)\t0.02642985841317767\n (0, 8682)\t0.027858178455881833

What the output should look like instead:

bs obećao -4.50534985071
bs pošto -4.50534985071
bs prava -4.50534985071
bs predstavlja -4.50534985071
bs prošlosedmičnom -4.50534985071
bs sjeveru -4.50534985071
bs taj -4.50534985071
bs vladavine -4.50534985071
bs će -4.50534985071
bs da -4.0998847426

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

This relates to the second answer on How to get most informative features for scikit-learn classifier for different class?. Also related is the exact same problem, posted as the last comment under that second answer:

Amazing @alvas, I tried the function above but the output looks like this: POS aaeguno móvil (0, 60) -0.0375375709849 (0, 300) -0.0375375709849 (0, 3279) -0.0375375709849 instead of returning the class, then the word and the float. Any idea why this happens? Thanks! – newWithPython Mar 15 '15 at 0:45

But nobody ever replied to that comment, and since my reputation is too low I cannot ask for more information there either.
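A quick check that would tell whether this is the same issue (my assumption based on the symptom, not something confirmed in that thread): when SVC(kernel='linear') is fitted on sparse input such as a TF-IDF matrix, coef_ comes back as a scipy sparse matrix, so zip(coef_[labelid], feature_names) pairs the whole sparse row with the first feature name instead of iterating over individual weights; that would explain both the '!' in column 1 and the sparse-matrix dump in column 2 of my output:

import scipy.sparse as sp

coef = pipe.named_steps['clf'].coef_
print(type(coef))         # a scipy sparse matrix when the model was fitted on sparse X
print(coef.shape)         # note: for multiclass SVC the rows follow a one-vs-one
                          # ordering, so coef_[labelid] is not "the row for class labelid"
print(sp.issparse(coef))  # True -> zip() yields one (sparse row, first feature name) pair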

This has already run a week over my schedule and I really cannot spend much more time on it. It is the last piece of the puzzle for my thesis; it will not be perfect, but I just need to finish it and graduate, so any help would be greatly appreciated! Also, let me know what I could add to make this question clearer; it is only my second or third question on this platform.

[Question comments]:

[Answer 1]:

It turned out that using sklearn's LinearSVC() produces the correct output, so SVC(kernel='linear') apparently needs a different approach for extracting the n-gram importances. I simply switched to LinearSVC, since it also improved my model overall.
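A minimal, self-contained sketch of that switch (the toy texts and labels here are made up; LinearSVC exposes a dense one-vs-rest coef_ with one row per class, so indexing by class works the way the loop in the question expects):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# hypothetical stand-in for the real corpus and its level labels
texts = ["the cat sat on the mat", "dogs bark loudly at night",
         "grammar is studied at university", "cats purr softly",
         "the dog chased the cat", "linguistics lectures are long"]
labels = ["A1", "B1", "C1", "A1", "B1", "C1"]

pipe = Pipeline([
    ("vec", TfidfVectorizer(analyzer='word', lowercase=True)),
    ("clf", LinearSVC(class_weight='balanced')),
])
pipe.fit(texts, labels)

vec = pipe.named_steps["vec"]
clf = pipe.named_steps["clf"]
# get_feature_names_out() in recent sklearn; older versions (as in the question) use get_feature_names()
feature_names = vec.get_feature_names_out()

# LinearSVC is one-vs-rest: coef_ is a dense (n_classes, n_features) ndarray,
# with rows aligned to clf.classes_
for labelid, level in enumerate(clf.classes_):
    topn = sorted(zip(clf.coef_[labelid], feature_names))[-10:]
    for coef, feat in topn:
        print(level, feat, coef)

This prints plain class / token / weight rows in the format shown earlier in the question.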

[Comments on this answer]:
