Python sklearn logistic regression K-fold cross-validation: how to create a DataFrame for coef_

Posted 2023-03-12

【Asked】2017-07-25 00:47:25 【Question】:

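Python 3.5

I have a dataset stored in the variable file, and I am trying to apply 10-fold cross-validation with logistic regression. What I am looking for is a way to list the average of clf.coef_ across the folds.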

print(file.head())

   Result  Interest  Limit  Service  Convenience  Trust  Speed 
0       0         1      1        1            1      1      1   
1       0         1      1        1            1      1      1   
2       0         1      1        1            1      1      1   
3       0         4      4        3            4      2      3   
4       1         4      4        4            4      4      4

Here is a simple logistic regression I wrote that displays the list of coef_ values.

[In]

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X = file.drop(columns=['Result'])  # features
y = file['Result']                 # binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# the L1 penalty needs a solver that supports it, e.g. liblinear
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)

# one row per feature: its name and its fitted coefficient
coeff_df = pd.DataFrame([X.columns, clf.coef_[0]]).T
print(coeff_df)

[Out]

0.823061630219  

             0          1
0     Interest   0.163577
1        Limit  -0.161104
2      Service   0.323073
3  Convenience   0.121573
4        Trust   0.370012
5        Speed   0.089934
6        Major   0.183002
7          Ads  0.0137151
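A slightly tidier way to build the same table is to name the columns explicitly instead of transposing a two-row frame, which also keeps the coefficient column numeric rather than object-typed. The sketch below uses a small synthetic dataset in place of `file`; the data, the column names `feature` and `coefficient`, and the `solver='liblinear'` setting are assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for `file` (assumption: survey-style integer features, binary target)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(1, 6, size=(200, 3)),
                 columns=['Interest', 'Limit', 'Service'])
y = (X['Interest'] + rng.normal(0, 1, size=200) > 3).astype(int)

clf = LogisticRegression(penalty='l1', solver='liblinear', random_state=0)
clf.fit(X, y)

# One named column for the feature, one numeric column for its coefficient
coeff_df = pd.DataFrame({'feature': X.columns, 'coefficient': clf.coef_[0]})
print(coeff_df)
```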

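Next, I tried to apply 10-fold cross-validation to the same dataset. I have the code below, but I cannot produce a DataFrame of the coef_ values like coeff_df in the analysis above. Can anyone suggest a solution?

[In]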

# cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
print(np.average(scores))

[Out]

[ 0.82178218  0.7970297   0.84158416  0.80693069  0.84158416  0.80693069
  0.825       0.825       0.815       0.76      ]
0.814084158416

【Answer 1】:

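cross_val_score is a helper function that wraps scikit-learn's cross-validation splitters (e.g. KFold, StratifiedKFold). It returns an array of scores, one per fold, computed according to the scoring parameter. When scoring is not given, the estimator's own score method is used, which for classifiers is accuracy.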

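The object returned by cross_val_score does not give you access to the underlying folds or models used during cross-validation, which means you cannot retrieve the coefficients of each model from it.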
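To illustrate the two points above, here is a minimal sketch on synthetic data (generated with `make_classification`, an assumption standing in for the asker's dataset). It shows that the return value is a plain array of per-fold scores, and that the default scoring for a classifier matches `scoring='accuracy'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (assumption, not the asker's data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# random_state fixes liblinear's internal shuffling so both runs are identical
clf = LogisticRegression(penalty='l1', solver='liblinear', random_state=0)

default_scores = cross_val_score(clf, X, y, cv=10)
accuracy_scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')

print(type(default_scores), default_scores.shape)   # a NumPy array with one score per fold
print(np.allclose(default_scores, accuracy_scores))  # default scoring is accuracy here
```

Note that the penalty is configured on the estimator that is passed in, so the L1 setting applies to the model fitted in every fold; only the fitted models themselves are not returned.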

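To get the coefficients for each fold of the cross-validation, you need to use KFold directly (or StratifiedKFold if your classes are imbalanced) and fit the model yourself on each split.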

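import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

df = pd.read_clipboard()  # the sample rows copied from the question
# repeat the rows so the single positive example appears more than once;
# drop=True avoids adding the old index as an extra feature column
file = pd.concat([df, df, df]).reset_index(drop=True)

X = file.drop(columns=['Result'])
y = file['Result']

# random_state only has an effect together with shuffle=True, so it is omitted
skf = StratifiedKFold(n_splits=2)

models, coefs = [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    # the L1 penalty needs a solver that supports it, e.g. liblinear
    clf = LogisticRegression(penalty='l1', solver='liblinear')
    clf.fit(X.iloc[train], y.iloc[train])  # split() yields positional indices
    models.append(clf)
    coefs.append(clf.coef_[0])

# average each feature's coefficient across the folds
pd.DataFrame(coefs, columns=X.columns).mean()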

This gives us:

Interest       0.000000
Limit          0.000000
Service        0.000000
Convenience    0.000000
Trust          0.530811
Speed          0.000000
dtype: float64

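I had to make up the data from your example, which contains only one instance of the positive class. I suspect these numbers will not be 0 in your case.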
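For a dataset with more signal, the same loop yields non-trivial per-fold coefficients that can be summarized in one DataFrame. The following is a sketch under stated assumptions: the data is synthetic (`make_classification`), the column names are borrowed from the question purely as labels, and the `summary` layout (mean and standard deviation per feature) is one possible presentation rather than anything the original answer prescribes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data; the column names are only labels borrowed from the question
cols = ['Interest', 'Limit', 'Service', 'Convenience', 'Trust', 'Speed']
X_arr, y_arr = make_classification(n_samples=500, n_features=6, random_state=0)
X = pd.DataFrame(X_arr, columns=cols)
y = pd.Series(y_arr, name='Result')

skf = StratifiedKFold(n_splits=10)
coefs = []
for train_idx, _ in skf.split(X, y):
    clf = LogisticRegression(penalty='l1', solver='liblinear', random_state=0)
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    coefs.append(clf.coef_[0])

# One row per fold, one column per feature
fold_coefs = pd.DataFrame(coefs, columns=cols)
fold_coefs.index.name = 'fold'

# One row per feature with the mean and spread of its coefficient across folds
summary = fold_coefs.agg(['mean', 'std']).T
print(summary)
```

Looking at the standard deviation next to the mean gives a rough sense of how stable each coefficient is from fold to fold.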

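Edit: since StratifiedKFold (or KFold) gives us the cross-validation splits of the dataset, you can still compute the cross-validation scores with the model's score method.

The version below differs slightly from the one above so that it also captures the cross-validation score of each fold.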

models, scores, coefs = [], [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    # the L1 penalty needs a solver that supports it, e.g. liblinear
    clf = LogisticRegression(penalty='l1', solver='liblinear')
    clf.fit(X.iloc[train], y.iloc[train])             # fit on the training part of the fold
    score = clf.score(X.iloc[test], y.iloc[test])     # accuracy on the held-out part
    models.append(clf)
    scores.append(score)
    coefs.append(clf.coef_[0])
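As an alternative to the manual loop, scikit-learn 0.20 and later provide `cross_validate`, which can return the fitted estimator of each fold when called with `return_estimator=True`. This is not part of the original answer; the sketch below assumes synthetic data and hypothetical feature names `x0` to `x5`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data with hypothetical feature names (assumption)
X_arr, y = make_classification(n_samples=500, n_features=6, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'x{i}' for i in range(6)])

clf = LogisticRegression(penalty='l1', solver='liblinear', random_state=0)
cv_results = cross_validate(clf, X, y,
                            cv=StratifiedKFold(n_splits=10),
                            return_estimator=True)

# Per-fold test scores and the fitted model from each fold
scores = cv_results['test_score']
coefs = pd.DataFrame([est.coef_[0] for est in cv_results['estimator']],
                     columns=X.columns)

print(scores.mean())   # average cross-validation accuracy
print(coefs.mean())    # average coefficient per feature across folds
```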

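【Comments】:

Thanks! Your code works! One more question: is there a way to generate the list of scores from your code? I want to set the L1 penalty, and cross_val_score won't let me do that.

I've updated my answer to address this.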
