Sentiment analysis Pipeline, problem getting the correct feature names when feature selection is used

Posted: 2019-11-15 23:09:51

Question:

In the following example I perform sentiment analysis on a Twitter dataset. I use an sklearn Pipeline to run a series of transformations, add features, and fit a classifier. The final step is to visualize the words with the highest predictive power. This works fine when I don't use feature selection. However, when I do use it, the results I get make no sense. I suspect that when feature selection is applied, the order of the text features changes. Is there a way to solve this?

The code below has been updated to include the correct answer:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

features = [c for c in df.columns.values if c not in ['target']]
target = 'target'

#train test split
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, stratify=df[target], random_state=0)

#Create classes that allow selecting specific columns from the dataframe

class NumberSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # double brackets return a 2-D DataFrame, as StandardScaler expects
        return X[[self.key]]

class TextSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # single brackets return a 1-D Series, as TfidfVectorizer expects
        return X[self.key]

class ColumnExtractor(TransformerMixin):

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xcols = X[self.cols]

        return Xcols

class DummyTransformer(TransformerMixin):

    def __init__(self):
        self.dv = None

    def fit(self, X, y=None):
        # assumes all columns of X are strings
        Xdict = X.to_dict('records')
        self.dv = DictVectorizer(sparse=False)
        self.dv.fit(Xdict)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdict = X.to_dict('records')
        Xt = self.dv.transform(Xdict)
        cols = self.dv.get_feature_names()
        Xdum = pd.DataFrame(Xt, index=X.index, columns=cols)

        # drop the columns that indicate NaNs: DictVectorizer names one-hot
        # columns 'col=value', so a plain 'col' name means the value was NaN
        nan_cols = [c for c in cols if '=' not in c]
        Xdum = Xdum.drop(nan_cols, axis=1)
        Xdum.drop(list(Xdum.filter(regex='unknown')), axis=1, inplace=True)

        return Xdum
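As a side note, here is a minimal sketch (the toy DataFrame is mine, not from the original post) of why the '=' check above identifies the NaN columns:

import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# DictVectorizer one-hot encodes string values into columns named 'col=value';
# NaNs are floats, so they land in a plain numeric column named 'col' with no
# '=' in it (use get_feature_names_out in newer scikit-learn versions)
toy = pd.DataFrame({'location': ['NYC', 'LA', np.nan]})
dv = DictVectorizer(sparse=False)
dv.fit(toy.to_dict('records'))
print(dv.get_feature_names())
# ['location', 'location=LA', 'location=NYC']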

def pipelinize(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            return [function(i) for i in list_or_series]
        else: # if it's not active, just pass it right back
            return list_or_series
    return FunctionTransformer(list_comprehend_a_function, validate=False, kw_args={'active': active})
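A minimal usage sketch (the text_preprocessing body below is a hypothetical stand-in, since the original post never shows it):

# Hypothetical stand-in for the post's unshown text_preprocessing function,
# just to illustrate how pipelinize turns a per-document function into a
# Pipeline-compatible transformer
def text_preprocessing(doc):
    return doc.lower().strip()

preprocess_step = pipelinize(text_preprocessing)
print(preprocess_step.fit_transform(['  Hello WORLD  ', 'Foo Bar']))
# ['hello world', 'foo bar']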

#function to plot the coefficients of the words in the text with the highest predictive power
def plot_coefficients(classifier, feature_names, top_features=50):

    if classifier.__class__.__name__ == 'SVC':
        # SVC with a linear kernel stores coef_ as a sparse matrix; keep only
        # the leading entries, which belong to the text features because
        # 'text' is the first transformer in the FeatureUnion below
        coef = classifier.coef_
        coef2 = coef.toarray().ravel()
        coef1 = coef2[:len(feature_names)]

    else:
        coef1 = classifier.coef_.ravel()

    top_positive_coefficients = np.argsort(coef1)[-top_features:]
    top_negative_coefficients = np.argsort(coef1)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])

    # create plot
    plt.figure(figsize=(15, 5))
    colors = ['red' if c < 0 else 'blue' for c in coef1[top_coefficients]]
    plt.bar(np.arange(2 * top_features), coef1[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    # tick positions must match the bar positions, otherwise labels shift by one
    plt.xticks(np.arange(2 * top_features), feature_names[top_coefficients], rotation=90, ha='right')
    plt.show()

#create a custom stopwords list
stop_list = stopwords(remove_stop_word, add_stop_word)

#vectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, stop_words=set(stop_list), ngram_range=(1, 2))

#categorical features
CAT_FEATS = ['location','account']

#dimensionality reduction
pca = TruncatedSVD(n_components=200)

#scaler for numerical features
scaler = StandardScaler()

#classifier
model = SVC(kernel='linear', probability=True, C=1, class_weight='balanced')

#feature selection (defined here so the pipeline below can reference it)
select = SelectKBest(f_classif, k=8000)

text = Pipeline([('selector', TextSelector(key='content')),
                 ('text_preprocess', pipelinize(text_preprocessing)),
                 ('vectorizer', tfidf),
                 ('important_features', select)])
followers = Pipeline([('selector', NumberSelector(key='followers')), ('scaler', scaler)])
location = Pipeline([('selector', ColumnExtractor(CAT_FEATS)), ('scaler', DummyTransformer())])
feats = FeatureUnion([('text', text), ('length', followers), ('location', location)])
pipeline = Pipeline([('features', feats), ('classifier', model)])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

feature_names = text.named_steps['vectorizer'].get_feature_names()
feature_names = np.array(feature_names)[text.named_steps['important_features'].get_support(True)]

classifier = pipeline.named_steps['classifier']

plot_coefficients(classifier, feature_names)
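A quick sanity check (my addition, not from the original post): the number of recovered text feature names should match the k passed to SelectKBest.

# SelectKBest(k=8000) keeps at most 8000 features, so after the mapping we
# expect exactly 8000 text feature names (assuming the tf-idf vocabulary is
# at least that large)
assert len(feature_names) == 8000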

Before feature selection: [plot of the top word coefficients]

After feature selection: [plot of the top word coefficients]

To use feature selection, I changed this code:

text = Pipeline([('selector', TextSelector(key='content')),
                 ('text_preprocess', pipelinize(text_preprocessing)),
                 ('vectorizer',tfidf)])

to:

select = SelectKBest(f_classif, k=8000)
text = Pipeline([('selector', TextSelector(key='content')),
                 ('text_preprocess', pipelinize(text_preprocessing)), 
                 ('vectorizer',tfidf), 
                 ('important_features',select)])

Comments:

I'd like to point out that your use of the term "dimensionality reduction" is arguably incorrect. Dimensionality reduction usually refers to things like PCA and SVD, which reduce the space through a transformation. What you are doing is generally called "feature selection". The feature selector is indeed the source of your problem, because it reassigns indices to the new features. Here is one way to find the mapping: ***.com/questions/39839112/…

@amdex I have updated my question. Unfortunately, the approach you suggested does not work in my case, probably because the vectorizer returns a sparse matrix.

Answer 1:

Why this happens

This happens because feature selection picks out the most important features and discards the rest, so the original indices no longer mean anything.

Suppose you have the following example:

X = np.array(["This is the first document", "This is the second document",
              "This is the first again"])
y = np.array([0,1,0])

Clearly, the two words driving the classification are "first" and "second". Using a pipeline similar to yours, you can do:

tfidf = TfidfVectorizer()
sel = SelectKBest(k = 2)
pipe = Pipeline([('vectorizer',tfidf), ('select',sel)])
pipe.fit(X,y)

feature_names = np.array(pipe['vectorizer'].get_feature_names())
feature_names[pipe['select'].get_support(True)]

>>> array(['first', 'second'], dtype='<U8')

So what you need to do is not just retrieve the feature names from the tf-idf vectorizer, but also select, via pipe['select'].get_support(True), the indices of the features that the feature selection kept.
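For the toy pipeline above, this is what the two forms of get_support return (expected output, assuming the vectorizer sorts the vocabulary alphabetically: again, document, first, is, second, the, this):

# get_support(True) returns integer indices into the tf-idf vocabulary,
# get_support() returns a boolean mask over all features
print(pipe['select'].get_support(True))
# [2 4]   -> 'first' and 'second'
print(pipe['select'].get_support())
# [False False  True False  True False False]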

What to change in your code

So all you should change in your code is to add this line:

feature_names = text.named_steps['vectorizer'].get_feature_names()
## Add this line
feature_names = np.array(feature_names)[text.named_steps['important_features'].get_support(True)]
##
classifier = pipeline.named_steps['classifier']
plot_coefficients(classifier, feature_names)

Comments:

Your original answer produced an error, but at the same time it gave me a good direction for solving the problem. I have updated your answer. If you are happy with the edit, I'll gladly accept it.

If you do that, can you tell me whether you get an error with: feature_names = np.array(feature_names)[text['important_features'].get_support(True)] ?

When I use that, I get a "TypeError: 'Pipeline' object is not subscriptable" error. I think the problem is with text['important_features'], which should be text.named_steps['important_features'].

It works when I use: feature_names = np.array(feature_names)[text.named_steps['select_best_features'].get_support(True)]. I have updated your answer.

I think this may be because of an older scikit-learn version; the two should indeed be equivalent.
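For reference, a small sketch of the two access styles discussed above (the version threshold is my reading of the scikit-learn changelog, not something stated in the thread):

# both lines fetch the same step object; Pipeline.__getitem__ (the subscript
# form) was only added in scikit-learn 0.21, which would explain the
# "'Pipeline' object is not subscriptable" error on older versions
step_a = text.named_steps['important_features']
step_b = text['important_features']   # scikit-learn >= 0.21 only
assert step_a is step_b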
