Document classification with scikit-learn: most efficient way to get the words (tokens) that most influenced the classification

Posted 2023-03-12


Asked: 2018-07-02 05:44:23

Question:

I built a binary document classifier using the tf-idf representation of a training set of documents, and applied logistic regression to it:

lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)

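I saved the model in pickle format and use it to classify new documents, obtaining the probability that a document belongs to class A and the probability that it belongs to class B.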

text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
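
For readers who want to reproduce the setup, here is a minimal sketch of the save, load and predict_proba round trip. The toy corpus, the labels "A" and "B", and the choice of TfidfVectorizer for the 'vect' step are assumptions made for illustration; the pickle round trip uses bytes rather than the text_model.pkl file so the sketch stays self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up for illustration only
X_train = [
    "cheap pills buy now",
    "meeting agenda attached",
    "buy cheap watches",
    "project meeting notes",
]
y_train = ["A", "B", "A", "B"]

model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
model.fit(X_train, y_train)

# Round trip through pickle (bytes instead of a file on disk)
restored = pickle.loads(pickle.dumps(model))

# predict_proba expects an iterable of documents, so one document goes in a list
proba = restored.predict_proba(["buy cheap pills"])[0]

# The columns of predict_proba follow the order of the classifier's classes_
classes = restored.named_steps["clf"].classes_
by_class = {str(c): float(p) for c, p in zip(classes, proba)}
print(by_class)
```

The point of the dictionary at the end is that the probability columns are ordered like classes_, so that attribute is what tells you which column is class A and which is class B.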

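What is the best way to also get the words (or, more generally, the tokens) that had the most impact on the classification? I would like to get:

- the N tokens contained in the document that have the highest coefficients as features in the logistic regression model
- the N tokens contained in the document that have the lowest coefficients as features in the logistic regression model

I am using sklearn v0.19.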

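Comments:

You can access the tfidf inside the pipeline and get the words used as features. Then get coef_ from clf and map the words to those coefficients.

Thanks @VivekKumar. If I am not mistaken, that gives me the features/coefficients of the model. What I want instead is, for a new document, the features (words) that had the most impact on the decision (which of course depends on the model's features/coefficients). Can you provide working code to get this result?

Which features are important is decided during training. At test or prediction time, only what was learned is used.

Sure, but the document to be classified will contain only a subset of the words used as features. My question is simple: what is the most efficient way to get the words that had the most impact on the classification of a given document (one that is not in the training set)? I could take the intersection of the feature set and the set of tokens describing the document, and consider only the top and bottom features, but maybe there is a better way.

Can you provide code?

Answer 1: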

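As I understand it, you just want to look at the parameters and sort them by coefficient value. You can get the coefficients with the .get_params() function, then argsort them and pick the top N and the bottom N.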
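
A minimal sketch of the argsort idea follows. One caveat: in scikit-learn, get_params() returns the constructor hyperparameters, while the fitted coefficients of a linear model are stored in its coef_ attribute, so the sketch reads coef_. The toy corpus, the 0/1 labels and n = 3 are assumptions for illustration; the step names 'vect' and 'clf' follow the pipeline in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration only (1 = spam-like, 0 = work-like)
docs = [
    "cheap pills buy now",
    "meeting agenda attached",
    "buy cheap watches",
    "project meeting notes",
]
labels = [1, 0, 1, 0]

model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
model.fit(docs, labels)

vect = model.named_steps["vect"]
clf = model.named_steps["clf"]

# get_feature_names_out exists in scikit-learn >= 1.0;
# older releases such as 0.19 only have get_feature_names
if hasattr(vect, "get_feature_names_out"):
    names = np.asarray(vect.get_feature_names_out())
else:
    names = np.asarray(vect.get_feature_names())

# Binary problem: coef_ has a single row; positive values favour clf.classes_[1]
coefs = clf.coef_[0]
order = np.argsort(coefs)  # ascending

n = 3
bottom_n = [(str(names[i]), float(coefs[i])) for i in order[:n]]
top_n = [(str(names[i]), float(coefs[i])) for i in order[::-1][:n]]

print("top:", top_n)
print("bottom:", bottom_n)
```

This ranks the features of the whole model, not those of a particular document, which is the distinction raised in the comments.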

Discussion:

Can you provide code? As far as I know, get_params returns the parameters of the model, which is not what I am looking for.

Answer 2:

There is a solution on GitHub that prints the most important features obtained from the classifier in a pipeline:

https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1

You want to use the show_most_informative_features function from their script. I have used it and it works well.

Here is a copy-paste of the GitHub poster's code:

from operator import itemgetter


def show_most_informative_features(model, text=None, n=20):
    """
    Accepts a Pipeline with a classifier and a TfidfVectorizer and computes
    the n most informative features of the model. If text is given, then will
    compute the most informative features for classifying that text.

    Note that this function will only work on linear models with coef_
    """
    # Extract the vectorizer and the classifier from the pipeline
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {} model.".format(
                classifier.__class__.__name__
            )
        )

    if text is not None:
        # Compute the coefficients for the text
        # NOTE: Pipeline.transform needs a final step that implements
        # transform(); see the discussion below for LogisticRegression
        tvec = model.transform([text]).toarray()
    else:
        # Otherwise simply use the coefficients
        tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )

    # Pair the n highest entries with the n lowest entries
    topn = zip(coefs[:n], coefs[:-(n+1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append("Classified as: {}".format(model.predict([text])))
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(cp, fnp, cn, fnn)
        )

    return "\n".join(output)

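Discussion:

Thanks, I tried the code: print(show_most_informative_features(text_model, 'test content test content ....')) but I get AttributeError: 'LogisticRegression' object has no attribute 'transform'

That error occurs because the transform function cannot be used with logistic regression. I made a modified version of the function that works with any classifier model (including logreg). See my next post...

Thanks, I tried your function, but I don't think it works as expected; see my comment.

Answer 3: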

Below is a modified version of the show_most_informative_features function that works with any classifier:

import operator


def show_most_informative_features(model, vectorizer=None, text=None, n=20):
    # Extract the vectorizer and the classifier from the pipeline
    if vectorizer is None:
        vectorizer = model.named_steps['vectorizer']
    else:
        # NOTE: this refits the passed vectorizer on `text` alone,
        # which is the behaviour questioned in the discussion below
        vectorizer.fit_transform([text])

    classifier = model.named_steps['classifier']
    feat_names = vectorizer.get_feature_names()

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {}.".format(
                classifier.__class__.__name__
            )
        )

    # Otherwise simply use the coefficients
    tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], feat_names),
        key=operator.itemgetter(0), reverse=True
    )

    # Get the top n and bottom n coef, name pairs
    topn = zip(coefs[:n], coefs[:-(n+1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append(
            "Classified as: {}".format(model.predict([text]))
        )
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(
                cp, fnp, cn, fnn
            )
        )

    return "\n".join(output)

You can then call the function like this:

vectorizer = TfidfVectorizer()
show_most_informative_features(model,vectorizer, "your text")

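Discussion:

Thanks, but it does not seem to work as expected. If I want to use the vectorizer that is in the model, I guess I have to pass None to the function rather than a vectorizer, right? But then I think a vectorizer.fit_transform([text]) is missing at the beginning of the function... and if I add it, I get ValueError: X has 474 features per sample; expecting 2795333.

I think your code contains a bug: if I pass a vectorizer, feature_names contains only the features of the text, while coefs contains all the coefficients of the model. Can you check? I think you first have to take, from coefs, the coefficients that correspond to the features of the passed text.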
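
One way to do what the last comment describes is sketched below, offered as an illustration and not as the thread's accepted solution. It calls transform (never fit_transform) on the vectorizer already fitted inside the pipeline, so the column indices of the document line up with coef_, and then looks only at the columns that are non-zero for that document. The function name tokens_by_impact, the weighted flag and the toy corpus are hypothetical; the step names 'vect' and 'clf' follow the pipeline in the question. Ranking by tf-idf weight times coefficient is a choice made here; pass weighted=False to rank by coefficient alone, which matches the question more literally.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def tokens_by_impact(model, text, n=5, weighted=True):
    """Return (highest, lowest) lists of (token, score) for one document.

    Hypothetical helper: assumes a binary linear classifier and the
    step names 'vect' and 'clf' used in the question's pipeline.
    """
    vect = model.named_steps["vect"]
    clf = model.named_steps["clf"]

    # get_feature_names_out exists in scikit-learn >= 1.0;
    # older releases such as 0.19 only have get_feature_names
    if hasattr(vect, "get_feature_names_out"):
        names = vect.get_feature_names_out()
    else:
        names = vect.get_feature_names()

    # transform reuses the fitted vocabulary, so columns match clf.coef_
    row = vect.transform([text])             # sparse, shape (1, n_features)
    cols, weights = row.indices, row.data    # only the tokens present in text

    coefs = clf.coef_[0][cols]
    scores = weights * coefs if weighted else coefs

    order = np.argsort(scores)               # ascending
    ranked = [(str(names[cols[i]]), float(scores[i])) for i in order]
    return ranked[::-1][:n], ranked[:n]


# Toy corpus, made up for illustration only (1 = spam-like, 0 = work-like)
docs = [
    "cheap pills buy now",
    "meeting agenda attached",
    "buy cheap watches",
    "project meeting notes",
]
labels = [1, 0, 1, 0]

model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
model.fit(docs, labels)

# "zzz" is not in the vocabulary and is simply ignored by transform
highest, lowest = tokens_by_impact(model, "buy cheap meeting zzz", n=2)
print("pushes towards class 1:", highest)
print("pushes towards class 0:", lowest)
```

Because only the non-zero columns of a single sparse row are touched, the cost depends on the number of distinct known tokens in the document and not on the size of the vocabulary. When the document has fewer than 2n known tokens, the two lists overlap.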
