Feature importances in linear-model text classification: StandardScaler(with_mean=False), yes or no?

Posted: 2020-02-25 02:04:06

Question:

In binary text classification with scikit-learn and a linear SGDClassifier model, I want to obtain per-class feature importances from the model coefficients. I have heard differing opinions on whether the columns (features) should be scaled with StandardScaler(with_mean=False) in this case.

With sparse data, the data cannot be centered before scaling anyway (hence the with_mean=False part). TfidfVectorizer also L2-normalizes each row (instance) by default. Based on empirical results (e.g. the self-contained example below), the top features per class seem more intuitive when not using StandardScaler: for example, 'nasa' and 'space' are top tokens for sci.space, 'god' and 'christians' are top tokens for talk.religion.misc, and so on.
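As a quick check of the per-instance normalization mentioned above, this minimal sketch (using a hypothetical toy corpus, not the newsgroups data) confirms that TfidfVectorizer's default norm='l2' already gives every document vector unit length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, just to illustrate the row normalization
docs = ["god and religion", "nasa launched into space", "space orbit launch"]
X = TfidfVectorizer().fit_transform(docs)

# with the default norm='l2', each row (document) has unit L2 norm
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)  # every entry is 1.0
```

So any additional scaling acts on top of vectors that are already normalized per instance.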

Am I missing something? Should StandardScaler(with_mean=False) still be used when extracting feature importances from linear-model coefficients in this NLP case?

Without StandardScaler(with_mean=False), are these feature importances still theoretically unreliable in this case?

# load text from web
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), 
                                    categories=['sci.space','talk.religion.misc'])
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), 
                                    categories=['sci.space','talk.religion.misc'])

# setup grid search, optionally use scaling
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

text_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.8)),
    # uncomment the line below to use the scaler
    #('scaler', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier(random_state=0, max_iter=1000))
])

from sklearn.model_selection import GridSearchCV
parameters = {
    'clf__alpha': (0.0001, 0.001, 0.01, 0.1, 1.0, 10.0)
}

# find best model
gs_clf = GridSearchCV(text_clf, parameters, cv=8, n_jobs=-1, verbose=-2)
gs_clf.fit(newsgroups_train.data, newsgroups_train.target)

# model performance, very similar with and without scaling
y_predicted = gs_clf.predict(newsgroups_test.data)
from sklearn import metrics
print(metrics.classification_report(newsgroups_test.target, y_predicted))

# use eli5 to show feature importances (corresponds to the model's coef_);
# only the top 10 highest and lowest are shown for brevity
from eli5 import show_weights
show_weights(gs_clf.best_estimator_.named_steps['clf'], vec=gs_clf.best_estimator_.named_steps['vect'], top=(10, 10))    


# Outputs:

No scaling:
Weight?     Feature
+1.872  god
+1.235  objective
+1.194  christians
+1.164  koresh
+1.149  such
+1.147  jesus
+1.131  christian
+1.111  that
+1.065  religion
+1.060  kent
… 10616 more positive …
… 12664 more negative …
-0.922  on
-0.939  it
-0.976  get
-0.977  launch
-0.994  edu
-1.071  at
-1.098  thanks
-1.117  orbit
-1.210  nasa
-2.627  space 

StandardScaler:
Weight?     Feature
+0.040  such
+0.023  compuserve
+0.021  cockroaches
+0.017  how about
+0.016  com
+0.014  figures
+0.014  inquisition
+0.013  time no
+0.012  long time
+0.010  fellowship
… 11244 more positive …
… 14299 more negative …
-0.011  sherzer
-0.011  sherzer methodology
-0.011  methodology
-0.012  update
-0.012  most of
-0.012  message
-0.013  thanks for
-0.013  thanks
-0.028  ironic
-0.032  <BIAS> 


Answer 1:

I don't have a theoretical basis for this, but scaling features after TfidfVectorizer() makes me a bit nervous, since it seems to corrupt the idf part. My understanding of TfidfVectorizer() is that, in a sense, it already scales across both documents and features. If your penalized estimation method works well without scaling, I can't think of any reason to scale.
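To make the "corrupts the idf part" intuition concrete: StandardScaler(with_mean=False) divides each column by its standard deviation, which equalizes per-feature variance and thereby overrides whatever relative weighting idf produced. A minimal sketch, assuming a hypothetical tf-idf-like sparse matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# hypothetical tf-idf-like matrix: column 0 is a common term, column 1 a rare one
X = sparse.csr_matrix([[0.9, 0.0],
                       [0.8, 0.0],
                       [0.7, 0.1]])

scaler = StandardScaler(with_mean=False).fit(X)
Xs = scaler.transform(X)

# scale_ holds the per-column standard deviation; dividing by it gives every
# column unit variance, regardless of the original idf-driven magnitudes
print(scaler.scale_)
print(np.asarray(Xs.todense()).std(axis=0))  # every column now has std 1
```

After this transform, the rows are also no longer L2-normalized, so both of TfidfVectorizer's built-in scalings are altered.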

