Feature importances in linear model text classification, StandardScaler(with_mean=False) yes or no
Posted 2020-02-25 02:04:06

In binary text classification with scikit-learn and an SGDClassifier linear model, I want to obtain the feature importances for each class from the model coefficients. I have heard differing opinions on whether the columns (features) should be scaled with StandardScaler(with_mean=False) in this case.
With sparse data, the values cannot be centered before scaling in any case (hence the with_mean=False part). TfidfVectorizer also L2-normalizes each row (instance) by default. Based on empirical results (e.g., the self-contained example below), the top features per class seem more intuitive when StandardScaler is not used. For example, 'nasa' and 'space' are top tokens for sci.space, and 'god' and 'christians' for talk.religion.misc, and so on.
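Both constraints are easy to verify; a minimal sketch (the tiny corpus below is made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["nasa launched a rocket", "god and religion", "space orbit launch"]  # toy corpus
X = TfidfVectorizer().fit_transform(docs)  # sparse CSR matrix

# rows have unit L2 norm by default (norm='l2')
print(np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel())  # ~[1. 1. 1.]

# centering sparse input is rejected, hence with_mean=False
try:
    StandardScaler(with_mean=True).fit(X)
except ValueError as err:
    print(err)

# with_mean=False only divides each column by its standard deviation
X_scaled = StandardScaler(with_mean=False).fit_transform(X)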
Am I missing something? Should StandardScaler(with_mean=False) still be used to obtain feature importances from the linear model coefficients in this NLP case?
Or are these feature importances theoretically unreliable in this case without StandardScaler(with_mean=False)?
# load text from web
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                      categories=['sci.space', 'talk.religion.misc'])
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                     categories=['sci.space', 'talk.religion.misc'])
# setup grid search, optionally use scaling
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
text_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.8)),
    # uncomment the line below to use the scaler
    # ('scaler', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier(random_state=0, max_iter=1000))
])
from sklearn.model_selection import GridSearchCV
parameters = {
    'clf__alpha': (0.0001, 0.001, 0.01, 0.1, 1.0, 10.0)
}
# find best model
gs_clf = GridSearchCV(text_clf, parameters, cv=8, n_jobs=-1, verbose=-2)
gs_clf.fit(newsgroups_train.data, newsgroups_train.target)
# model performance, very similar with and without scaling
y_predicted = gs_clf.predict(newsgroups_test.data)
from sklearn import metrics
print(metrics.classification_report(newsgroups_test.target, y_predicted))
# use eli5 to get the feature importances; they correspond to the model's coef_
# (only the 10 highest and 10 lowest are shown for brevity of this posting)
from eli5 import show_weights
show_weights(gs_clf.best_estimator_.named_steps['clf'], vec=gs_clf.best_estimator_.named_steps['vect'], top=(10, 10))
# Outputs:
No scaling:
Weight? Feature
+1.872 god
+1.235 objective
+1.194 christians
+1.164 koresh
+1.149 such
+1.147 jesus
+1.131 christian
+1.111 that
+1.065 religion
+1.060 kent
… 10616 more positive …
… 12664 more negative …
-0.922 on
-0.939 it
-0.976 get
-0.977 launch
-0.994 edu
-1.071 at
-1.098 thanks
-1.117 orbit
-1.210 nasa
-2.627 space
StandardScaler:
Weight? Feature
+0.040 such
+0.023 compuserve
+0.021 cockroaches
+0.017 how about
+0.016 com
+0.014 figures
+0.014 inquisition
+0.013 time no
+0.012 long time
+0.010 fellowship
… 11244 more positive …
… 14299 more negative …
-0.011 sherzer
-0.011 sherzer methodology
-0.011 methodology
-0.012 update
-0.012 most of
-0.012 message
-0.013 thanks for
-0.013 thanks
-0.028 ironic
-0.032 <BIAS>
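For reference, the same ranking can be read directly from coef_ without eli5; a minimal sketch, assuming the fitted gs_clf from above (the feature-name accessor varies by scikit-learn version):

import numpy as np
vect = gs_clf.best_estimator_.named_steps['vect']
clf = gs_clf.best_estimator_.named_steps['clf']
# use get_feature_names_out() on scikit-learn >= 1.0
names = np.asarray(vect.get_feature_names())
coefs = clf.coef_.ravel()  # binary problem: a single row of coefficients
order = np.argsort(coefs)
# positive weights favor talk.religion.misc (class 1), negative favor sci.space (class 0)
print(names[order[-10:]][::-1])  # strongest positive features
print(names[order[:10]])         # strongest negative features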
【参考方案1】:我没有这方面的理论基础,但是TfidfVectorizer()
之后的缩放功能让我有点紧张,因为这似乎会损坏 idf 部分。我对TfidfVectorizer()
的理解是,从某种意义上说,它可以跨文档和功能扩展。如果您的带有惩罚的估计方法在没有缩放的情况下效果很好,我想不出任何缩放的理由。
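To illustrate the "damage the idf part" concern, a small sketch (toy corpus made up for illustration): with_mean=False still divides every column by its standard deviation, so the per-column weighting that idf introduced is flattened to unit variance:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["space nasa orbit", "god jesus religion", "space launch", "god christians"]
X = TfidfVectorizer().fit_transform(docs)
X_scaled = StandardScaler(with_mean=False).fit_transform(X)

print(np.var(X.toarray(), axis=0))         # column variances differ (idf at work)
print(np.var(X_scaled.toarray(), axis=0))  # ~1.0 everywhere: idf weighting washed out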