sklearn GridSearchCV: how to get classification report?

Posted 2023-03-12

Asked: 2017-03-29 18:33:39

Question:

I am using GridSearchCV like this:

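from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import joblib

corpus = load_files('corpus')

# Keep each stop word as written plus its title-cased form
with open('stopwords.txt', 'r') as f:
    stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

x = corpus.data
y = corpus.target

pipeline = Pipeline([
    ('vec', CountVectorizer(stop_words=stop_words)),
    ('classifier', MultinomialNB())])

# Keys use the <step name>__<parameter> convention of Pipeline
parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3],
              'classifier__fit_prior': [True, False]}

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf = gs_clf.fit(x, y)

# Persist only the refit best pipeline, not the whole search object
joblib.dump(gs_clf.best_estimator_, 'MultinomialNB.pkl', compress=1)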

Then, in another file, to classify new documents (not from the corpus), I do this:

  classifier = joblib.load(filepath) # path to .pkl file
  result = classifier.predict(tokenlist)

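My question is: where do I get the values needed for classification_report?

In many other examples, I see people splitting the corpus into a training set and a test set. But since I am using GridSearchCV with k-fold cross-validation, I don't need to do that. So how can I get those values from GridSearchCV?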

Comments:

Just one question: doesn't gs_clf.fit(x, y) return None?

@BallpointBen Why would it? x and y contain data.

Answer 1:

The best model is in clf.best_estimator_. You need to fit it on the training data, then predict on your test data and use ytest and ypreds for the classification report.
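The workflow this answer describes can be sketched end to end as follows. This is a minimal illustration, not the asker's actual setup: the toy `docs` and `labels` lists are hypothetical stand-ins for corpus.data and corpus.target, and the split ratio and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for corpus.data / corpus.target
docs = ["good great fine", "great good nice", "fine nice good", "nice great fine",
        "bad awful poor", "awful bad worse", "poor worse bad", "worse awful poor"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Hold out a test set BEFORE the grid search so the report uses unseen data
x_train, x_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

pipeline = Pipeline([
    ('vec', CountVectorizer()),
    ('classifier', MultinomialNB())])
parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3]}

# Cross-validation happens inside the training portion only
gs_clf = GridSearchCV(pipeline, parameters, cv=5, scoring="f1")
gs_clf.fit(x_train, y_train)

# best_estimator_ is refit on all of x_train with the best parameters
y_pred = gs_clf.best_estimator_.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)
```

Here the k-fold cross-validation only selects hyperparameters; the precision, recall and F1 values in the report come from the held-out test split.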

Comments:

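Thanks for your reply! So just to be clear: for GridSearchCV I use all the data (in my case corpus.data and corpus.target), but to fit the best classifier I use train_test_split to divide the data into x_test, X_train, Y_test, Y_train?

Yes. If you want the scores to be reliable, they need to be measured on a data set different from the one used for fitting. Or, if you have enough data, you could split the data before doing the grid search.

So then, instead of passing corpus.data to GridSearch, I pass only X_train?

Answer 2: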

If you have a GridSearchCV object:

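from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(....)  # estimator and parameter grid elided in the original answer
clf.fit(x_train, y_train)
# best_estimator_ is available because refit=True is the default
print(classification_report(y_test, clf.best_estimator_.predict(x_test)))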

If you have saved and loaded the best estimator, then:

import joblib
classifier = joblib.load(filepath)  # path to the .pkl file
print(classification_report(y_test, classifier.predict(x_test)))
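For the situation raised in the question, where no separate test set is kept, one possible alternative that is not part of either answer is to build the report from out-of-fold predictions with `cross_val_predict`. The sketch below assumes a hypothetical toy corpus and hypothetical "best" settings in place of the real grid search result. Note that if the hyperparameters were tuned on the same data, the resulting numbers tend to be optimistic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for corpus.data / corpus.target
docs = ["good great fine", "great good nice", "fine nice good", "nice great fine",
        "bad awful poor", "awful bad worse", "poor worse bad", "worse awful poor"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Stand-in for the pipeline a grid search would select (assumed settings)
best_pipeline = Pipeline([
    ('vec', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB(alpha=1e-2))])

# Each document is predicted by a model trained on the other folds
y_oof = cross_val_predict(best_pipeline, docs, labels, cv=5)
cv_report = classification_report(labels, y_oof)
print(cv_report)
```

This gives one report over the whole corpus without a manual train/test split, at the cost of refitting the pipeline once per fold.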
