Logistic regression and GridSearchCV using python sklearn

Posted 2023-03-12


Title: logistic regression and GridSearchCV using python sklearn
Asked: 2022-01-12 18:20:09

Question:

I am trying the code from this page. I ran it up to the LR (tf-idf) part and got similar results.

After that I decided to try GridSearchCV. My questions are as follows:

# let's try GridSearchCV
# https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 = lasso, l2 = ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hpyerparameters :(best parameters) ", logreg_cv.best_params_)
print("best score :", logreg_cv.best_score_)

# tuned hpyerparameters :(best parameters)  {'C': 10.0, 'penalty': 'l2'}
# best score : 0.7390325593588823

1) I then computed the f1 score manually. Why doesn't it match?

# https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score

# threshold the positive-class probabilities at 0.5
proba = logreg_cv.predict_proba(X_train_vectors_tfidf)[:, 1]
final_prediction = np.where(proba >= 0.5, 1, 0)

# calculate F1 score (on the training data)
f1_score(y_train, final_prediction)
# 0.9839388145315489

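2) With scoring='precision' I get the warnings below. This is unclear to me, mainly because my classes are fairly balanced (about 55-45) and f1, which requires precision, is computed without problems: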

# let's try GridSearchCV
# https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 = lasso, l2 = ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='precision')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hpyerparameters :(best parameters) ", logreg_cv.best_params_)
print("best score :", logreg_cv.best_score_)



/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
tuned hpyerparameters :(best parameters)  {'C': 0.1, 'penalty': 'l2'}
best score : 0.9474200393672962

3) Do I have to use logreg_cv, as in logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1], to get predictions on the train data, or is there a better way?

############ Update 1

The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimators. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger compared to the results in the grid search

That is why I got the result below: # tuned hpyerparameters :(best parameters) {'C': 10.0, 'penalty': 'l2'} # best score : 0.7390325593588823

But when I do it manually I get f1_score(y_train, final_prediction) 0.9839388145315489
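
The gap between the two numbers can be reproduced on synthetic data. The sketch below is an illustration, not the asker's pipeline: `make_classification` stands in for `X_train_vectors_tfidf` and `y_train` (an assumption), and the names `search`, `cv_mean` and `train_f1` are hypothetical. It shows that `best_score_` is the mean of the held-out fold scores for the best candidate, while scoring the refit model on its own training data is a different, usually more optimistic, quantity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           flip_y=0.2, random_state=0)

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid,
                      cv=3, scoring="f1")
search.fit(X, y)

# best_score_ is the average of the per-fold held-out scores
# for the best parameter combination.
res = search.cv_results_
fold_scores = [res[f"split{i}_test_score"][search.best_index_] for i in range(3)]
cv_mean = float(np.mean(fold_scores))

# Scoring the refit model on the data it was trained on.
train_f1 = f1_score(y, search.predict(X))

print("mean held-out F1:", cv_mean)
print("best_score_     :", search.best_score_)
print("training-set F1 :", train_f1)
```

Only the first two printed numbers are expected to agree; the third is measured on data the model has already seen.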

I tried tuning with f1_micro as suggested in the answer below. There was no error message. I am still not clear why f1_micro does not fail when precision does.

from sklearn.model_selection import GridSearchCV

grid = {
    "C": np.logspace(-3, 3, 7),
    "penalty": ["l2"],  # l1 = lasso, l2 = ridge
    "solver": ['liblinear', 'newton-cg'],
    'class_weight': [{0: 0.95, 1: 0.05}, {0: 0.55, 1: 0.45}, {0: 0.45, 1: 0.55}, {0: 0.05, 1: 0.95}],
}
# logreg = LogisticRegression(solver='liblinear')
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='f1_micro')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

tuned hpyerparameters :(best parameters)  {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'newton-cg'}
best score : 0.7894909688013136
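
One likely reason `f1_micro` runs quietly: for single-label classification, micro-averaged F1 pools true positives, false positives and false negatives over both classes, so it equals plain accuracy and stays defined even when one class is never predicted. A small check with hand-made labels (the arrays below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented labels; the "model" predicts class 0 for every sample.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.zeros_like(y_true)

# Micro-averaging sums TP/FP/FN across classes before computing F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print("micro F1:", micro_f1)
print("accuracy:", accuracy)
```

This also means a search with `scoring='f1_micro'` is optimizing accuracy, so its best score is not directly comparable with the binary `f1` score from the first search.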

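Comments:

The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimator. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, the F1 score is much larger than the results in the grid search.

On number 2, that is a warning, not an error. It is telling you that some labels in y_train are never predicted, so precision is set to 0.

I have roughly a 55%-45% binary classification. Why would it fail to predict one of the labels? Also, the f1 score works fine, and the f1 score requires precision.

The model may not be predicting one of the classes well. You can test this with set(y_train) - set(final_prediction). If the result is not an empty set, the model is not predicting that label. As for the discrepancy, I am not sure without seeing the data, but you can make the model more reproducible by including random_state= when creating the LogisticRegression instance.

Because I have a 55-45% split, both labels are in the predictions. My earlier question still stands: the f1 score works without any problem, and the f1 score requires precision, so precision by itself should work.

Answer 1: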

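You end up with the warning for precision because some of your penalties are too strong for this model. If you check the results, you get an f1 score of 0 when C = 0.001 and C = 0.01: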

import pandas as pd

# per-fold test scores for every parameter combination
res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:,res.columns.str.contains("split[0-9]_test_score|params",regex=True)]
 
                           params  split0_test_score  split1_test_score  split2_test_score
0   {'C': 0.001, 'penalty': 'l2'}           0.000000           0.000000           0.000000
1    {'C': 0.01, 'penalty': 'l2'}           0.000000           0.000000           0.000000
2     {'C': 0.1, 'penalty': 'l2'}           0.973568           0.952607           0.952174
3     {'C': 1.0, 'penalty': 'l2'}           0.863934           0.851064           0.836449
4    {'C': 10.0, 'penalty': 'l2'}           0.811634           0.769547           0.787838
5   {'C': 100.0, 'penalty': 'l2'}           0.789826           0.762162           0.773438
6  {'C': 1000.0, 'penalty': 'l2'}           0.781003           0.750000           0.763871

You can check this:

lr = LogisticRegression(C=0.01).fit(X_train_vectors_tfidf,y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])
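
What the warning reports can be reproduced without fitting anything. The sketch below uses invented labels and an all-zero prediction vector, which is what the over-regularized model returns. `zero_division=0` is the parameter named in the warning text; passing it silences the warning while keeping the value at 0.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Invented labels; class 1 is never predicted.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.zeros_like(y_true)

# precision = TP / (TP + FP) = 0 / 0, which is undefined;
# zero_division=0 makes the fallback value explicit.
precision = precision_score(y_true, y_pred, zero_division=0)

# recall = TP / (TP + FN) = 0 / 3, which is well defined.
recall = recall_score(y_true, y_pred)

# F1 is reported as 0 for this prediction vector.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(precision, recall, f1)
```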

And the predicted probabilities drift towards the intercept:

# expected probability
np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))
array([0.41764462])

lr.predict_proba(X_train_vectors_tfidf)
 
array([[0.58732636, 0.41267364],
       [0.57074279, 0.42925721],
       [0.57219143, 0.42780857],
       ...,
       [0.57215605, 0.42784395],
       [0.56988186, 0.43011814],
       [0.58966184, 0.41033816]])
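
The conversion from intercept to probability is the logistic (sigmoid) function. As a hedged illustration on synthetic data (the dataset, the 0.01 feature scaling meant to imitate small tf-idf values, and the 55/45 class weights are all assumptions), the sketch below fits a heavily regularized model and compares its predicted probabilities with the sigmoid of the intercept, using `scipy.special.expit`.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 55/45 binary problem with small feature values (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           weights=[0.55, 0.45], flip_y=0, random_state=0)
X = X * 0.01

# Small C means a strong L2 penalty, so the coefficients are pushed towards 0.
lr = LogisticRegression(C=0.01).fit(X, y)

# Sigmoid of the intercept; same value as np.exp(b) / (1 + np.exp(b)).
b = lr.intercept_
p_intercept = expit(b)

# With near-zero coefficients every sample gets almost this probability.
proba_pos = lr.predict_proba(X)[:, 1]
max_gap = np.max(np.abs(proba_pos - p_intercept))
predicted_classes = np.unique(lr.predict(X))

print("sigmoid(intercept):", p_intercept)
print("largest deviation :", max_gap)
print("predicted classes :", predicted_classes)
```

Because the intercept-only probability sits below 0.5 here, every sample falls on the same side of the threshold and only one class is predicted.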

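As for the question about "getting predictions on the train data", I think that is the only way. The model is refit on the whole training set using the best parameters, but the predictions or predicted probabilities are not stored. If you are looking for the values obtained during train/test, you can check out cross_val_predict.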
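
A minimal sketch of the `cross_val_predict` route, assuming synthetic data in place of the tf-idf matrix and `C=10.0` taken from the best parameters reported in the question. Each sample is predicted by a model that did not see it during fitting, so a score on these predictions is closer in spirit to `best_score_` than a score on training predictions. F1 on the pooled out-of-fold predictions is not guaranteed to equal the mean of the per-fold F1 scores exactly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the tf-idf features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           flip_y=0.2, random_state=0)

model = LogisticRegression(solver="liblinear", C=10.0)

# Out-of-fold class predictions: each row comes from a fold model
# that was fitted without that row.
oof_pred = cross_val_predict(model, X, y, cv=3)

# Out-of-fold probabilities for the positive class.
oof_proba = cross_val_predict(model, X, y, cv=3, method="predict_proba")[:, 1]

oof_f1 = f1_score(y, oof_pred)
print("out-of-fold F1:", oof_f1)
```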

Comments:

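Thanks! Your answer makes sense. 1) But why don't we get that error when using f1, only when tuning with precision? 2) Why do you compute np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))? Is it calculating the probability when all x coefficients are 0? 3) I tuned the model using the f1 score and got the recommended values below. Do you think that is a very high penalty?

tuned hpyerparameters :(best parameters)  {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'liblinear'}
best score : 0.7445210598782159

Yes, you are converting the intercept from the logit to a probability. No, it is not high. The C parameter is the inverse of the regularization strength: the higher your C, the weaker the regularization or penalty.

Also, you only get the warning for certain values of C. In your example above, you predict with the best parameters from gridsearchcv, so the C it uses is definitely not 0.001 or 0.01. If you rerun the search with score='f1_micro', you will see the error.

Could you respond to my Update 1? I found your answer very helpful! Thanks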
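
To see the inverse relationship between `C` and the penalty, one can watch the size of the fitted coefficients as `C` grows. The data below is synthetic and the names `Cs` and `coef_norms` are hypothetical; the sketch only illustrates the trend and is not tied to the asker's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary problem (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

Cs = np.logspace(-3, 3, 7)  # same grid of C values as in the question
coef_norms = []
for C in Cs:
    # Small C = strong L2 penalty, large C = weak penalty.
    lr = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms.append(float(np.linalg.norm(lr.coef_)))

for C, norm in zip(Cs, coef_norms):
    print(f"C={C:g}  ||coef||={norm:.4f}")
```

The coefficient norm at the smallest `C` is the smallest of the run, which is why the C = 0.001 and C = 0.01 models collapse towards the intercept while C = 10.0 does not.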
