Multiple problems with Logistic Regression (1. all CV values have the same score, 2. classification report and accuracy don't match)


[Title]: Multiple problems with Logistic Regression (1. all CV values have the same score, 2. classification report and accuracy don't match)
[Posted]: 2021-11-23 01:00:35
[Question]:

I have implemented logistic regression on bank loan data. I used GridSearchCV for hyperparameter tuning and ran the search with several fold counts, cv = [3, 5, 6]. Here is my code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from google.colab import files
import io

import warnings
warnings.filterwarnings('ignore')
#uploaded = files.upload()

df = pd.read_csv('CleanedLoanData13Cols.csv')

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

X = df.drop('loan_status', axis=1, inplace=False)
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
parameters = {'penalty': ['l1', 'l2', 'elasticnet'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'solver': ['liblinear', 'newton-cg', 'lbfgs', 'saga', 'sag'],
              'multi_class': ['auto'],
              'max_iter': [5, 15, 25]}

import warnings
warnings.filterwarnings("ignore")

cv_folds = [3, 5, 6]
s_scaler = StandardScaler()
#m_scaler = MinMaxScaler()
#r_scaler = RobustScaler()
s_scaled_X_train = s_scaler.fit_transform(X_train)
s_scaled_X_test = s_scaler.transform(X_test)

for x in cv_folds:
    logmodel = GridSearchCV(LogisticRegression(random_state = 42), parameters, cv = x, scoring = 'accuracy', refit = True)
    logmodel.fit(X_train, y_train)
    
    print('The best score with CV =', x, 'is', logmodel.score(X_test, y_test), 'with parameters =\n\n', logmodel.best_params_, '\n\n')

Output (first problem: this doesn't look right to me; please correct me if I'm wrong):

The best score with CV = 3 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

The best score with CV = 5 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}


The best score with CV = 6 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

Continuing:

results = logmodel.cv_results_

print(results.get('params'))

print(results.get('mean_test_score'))

Output:

[0.9084348         nan        nan 0.8323203         nan 0.83239873
 0.83671225 0.8323203  0.8323203  0.8323203         nan        nan
        nan        nan        nan 0.91647373        nan        nan
 0.8323203         nan 0.902435   0.89474906 0.8520445  0.8323203 and so on

Continuing:

print(results.get('mean_train_score'))

Output: None

print(logmodel.best_params_)

{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

print(logmodel.best_score_)

Output: 0.9226303384209481 (I think something is wrong here too, because this doesn't match the accuracy in the classification report)

final_model = logmodel.best_estimator_

s_predictions = final_model.predict(s_scaled_X_test)

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, s_predictions))
print(confusion_matrix(y_test, s_predictions))

Output: the accuracy here is 0.62, while the score printed above is 0.92

              precision    recall  f1-score   support

           0       0.88      0.64      0.74      9197
           1       0.22      0.53      0.31      1732

    accuracy                           0.62     10929
   macro avg       0.55      0.59      0.53     10929
weighted avg       0.77      0.62      0.67     10929

[[5902 3295]
 [ 812  920]]

I don't know where I'm going wrong. I've spent the past few hours trying to work this out and still can't see the mistake. I'd really appreciate any input.

[Answer 1]:

The problem is that you fit the model on the unscaled data, X_train and y_train:

logmodel.fit(X_train, y_train)

and then predict on the scaled data, s_scaled_X_test, which explains the drop in performance:

s_predictions = final_model.predict(s_scaled_X_test)

To fix this, train the model on the scaled data, like this:

logmodel.fit(s_scaled_X_train, y_train)
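
One way to rule out this kind of mismatch is to put the scaler and the classifier in a single Pipeline, so the same scaling is applied whenever the model is fitted or asked to predict. The sketch below is only an illustration: it uses synthetic data from make_classification as a stand-in for the loan CSV, and the step names "scale" and "clf" and the reduced grid are assumptions, not the original setup. The grid keeps only settings every fit accepts, since with GridSearchCV's default error_score a failed fit (for example an unsupported solver/penalty pair) is recorded as nan in mean_test_score. It also passes return_train_score=True, because mean_train_score is not stored by default, which is why results.get('mean_train_score') returned None in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (assumption: the real CSV is not used here).
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

# The scaler is fitted inside each CV fold and reused automatically at predict time.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Illustrative grid: both solvers work with the default L2 penalty, so no fit fails.
param_grid = {
    "clf__solver": ["liblinear", "lbfgs"],
    "clf__C": [0.01, 1, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy",
                      refit=True, return_train_score=True)
search.fit(X_train, y_train)  # raw features in; scaling happens inside the pipeline

cv_accuracy = search.best_score_              # mean cross-validated accuracy
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit model on the test set
report_accuracy = accuracy_score(y_test, search.predict(X_test))
mean_test = search.cv_results_["mean_test_score"]
mean_train = search.cv_results_["mean_train_score"]

print("best params:", search.best_params_)
print("CV accuracy:", cv_accuracy)
print("test accuracy:", test_accuracy, "| from predictions:", report_accuracy)
```

Because search.score and search.predict both go through the same pipeline, the accuracy in a classification report built from search.predict(X_test) agrees with search.score(X_test, y_test).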

[Comments]:

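Thank you very much. The best score is still the same for CV = [3, 5, 6], although this time I get 0.9385122152072468. The same value keeps coming up. Is there an explanation?

This follows from the model. LogisticRegression solves a convex optimization problem (minimizing the regularized log-loss), so you are reaching the minimum of that function each time.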
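
A further point that may help here: the number printed in the question's loop is logmodel.score(X_test, y_test), which is the test-set accuracy of the refit best estimator, not the cross-validation score. With refit=True, GridSearchCV retrains the winning parameters on the whole training set, so the fold count only influences which parameters win. In the question the same parameters were selected for all three values of cv, so the refit model, and therefore its test score, would be expected to be identical. The sketch below illustrates this on synthetic data (an assumption standing in for the loan CSV, with a deliberately small grid over C): best_score_ can move with the fold count, while the test score is the same whenever best_params_ is the same.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption).
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

grid = {"C": [0.01, 1, 100]}  # illustrative grid, not the original one
runs = {}
for k in [3, 5, 6]:
    search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                          cv=k, scoring="accuracy", refit=True)
    search.fit(X_train_s, y_train)
    # best_score_: mean CV accuracy of the winning parameters (depends on k).
    # score(...): accuracy of the model refit on ALL training data, on the test set.
    runs[k] = (search.best_params_, search.best_score_,
               search.score(X_test_s, y_test))
    print("cv =", k, "| best params:", runs[k][0],
          "| CV score:", runs[k][1], "| test score:", runs[k][2])
```

To compare fold counts directly, look at best_score_ or cv_results_ rather than the score on the held-out test set.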
