Logistic regression and cross-validation in Python (with sklearn)
Posted: 2017-07-07 10:45:04
Problem description:

I'm trying to solve a classification problem on a given dataset via logistic regression (and that is not the problem). To avoid overfitting, I'm trying to implement it through cross-validation (and that's where the problem is): there is something I'm missing to complete the procedure. My goal is to determine the accuracy.
But let me be more specific. This is what I did:
- I split the dataset into a training set and a test set
- I defined the logistic regression model to use
- I made predictions with the cross_val_predict method (in sklearn.cross_validation)
- Finally, I measured the accuracy
Here is the code:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# define method
logreg=LogisticRegression()
# cross validation prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))
My question:
As far as I understand, the test set should not be touched until the very end, and cross-validation should be performed on the training set. That's why I passed X_train and t_train to the cross_val_predict method. However, I get an error message:
ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]
where 6016 is the number of samples in the whole dataset and 4812 is the number of samples in the training set after the split.
After this, I don't know what to do. I mean: when do X_test and t_test come into play? I don't understand how I should use them after the cross-validation, and how to obtain the final accuracy.
Side question: I would also like to perform scaling and dimensionality reduction within the cross-validation. How can I do that? I've seen that defining a pipeline can help with scaling, but I don't know how to apply it to the second problem.
Any help is greatly appreciated :-)
Answer 1:

Here is working code, tested on an example dataframe. The first problem in your code is that the target array is not an np.array. You also should not have the target data among your features. Below I illustrate how to split the training and testing data manually with train_test_split, and how to use the wrapper cross_val_score to split, fit, and score automatically.
import random
import string

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

random.seed(42)
# Create example df with alphabetic col names.
alphabet_cols = list(string.ascii_uppercase)[:26]
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                  columns=alphabet_cols)
df['Target'] = df['A']
df.drop(['A'], axis=1, inplace=True)
print(df.head())
y = df.Target.values # df['Target'] is not an np.array.
feature_cols = [i for i in list(df.columns) if i != 'Target']
X = df.ix[:, feature_cols].as_matrix()
# Illustrated here for manual splitting of training and testing data.
X_train, X_test, y_train, y_test = \
model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
# Initialize model. (Note this is LinearRegression, despite the name
# 'logreg'; the CV scores printed below are R^2 values, which is why
# they can be negative.)
logreg = linear_model.LinearRegression()
# Use cross_val_score to automatically split, fit, and score.
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
print(scores)
print('average score: {}'.format(scores.mean()))
Output:
B C D E F G H I J K ... Target
0 20 33 451 0 420 657 954 156 200 935 ... 253
1 427 533 801 183 894 822 303 623 455 668 ... 421
2 148 681 339 450 376 482 834 90 82 684 ... 903
3 289 612 472 105 515 845 752 389 532 306 ... 639
4 556 103 132 823 149 974 161 632 153 782 ... 347
[5 rows x 26 columns]
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399 0.0328
-0.0409]
average score: -0.04258093018969249
Useful references:
Convert from pandas to numpy
Select all but subset of columns of dataframe
sklearn.model_selection.train_test_split
sklearn.model_selection.cross_val_score

Comments:
Thanks a lot, man! I fixed the code and now it works. The target being among the features wasn't really the problem, because the :-1 in my code takes it away, since it's the last column. So the real problem must have been that the target was not an np.array, as you pointed out (that said, I really don't see its mysterious connection with the size error the machine returned, haha). Do you have any idea how to complete the procedure, i.e. how to do the final test? I'm a bit confused about what I should do now.

I edited my answer to include a complete procedure using model_selection.cross_val_score. As for the size error, working between pd.DataFrames and np.ndarrays can be painful. You can print the dimensions of each with x.shape to troubleshoot. The best way to learn this stuff, IMO, is to dig into the sklearn documentation and tutorials.
Yes. cross_val_score is built in. By default it uses the K-Folds cross-validator. See this article for more details.
I know this is an older post, but if you're reading it, I would double-check how the split X_train, y_train is actually being used here, since only X and y are fed into cross_val_score.
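For what it's worth, a minimal sketch of the pattern that comment is getting at, assuming you want to keep a genuinely held-out test set (variable names follow the answer above; the explicit StratifiedKFold just makes the classifier default visible):

from sklearn import linear_model, model_selection

# Cross-validate on the training split only; X_test/y_test stay unseen.
logreg = linear_model.LogisticRegression()
cv = model_selection.StratifiedKFold(n_splits=10)
cv_scores = model_selection.cross_val_score(logreg, X_train, y_train, cv=cv)
print('mean CV accuracy: {}'.format(cv_scores.mean()))

# One final fit on the full training split, scored once on the holdout.
logreg.fit(X_train, y_train)
print('holdout accuracy: {}'.format(logreg.score(X_test, y_test)))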
Answer 2:

Please see the documentation of cross-validation at scikit-learn for more.
You are also using cross_val_predict incorrectly. Internally, it uses the cv you provide (cv=10) to split the supplied data (i.e. X_train, t_train in your case) into training and test parts again, fits the estimator on the training part, and predicts on the part held out for testing.
Now, to use your X_test and t_test, you should first fit your estimator on the training data (cross_val_predict will not fit it), then use it to predict on the test data, and then compute the accuracy.
A simple code snippet illustrating the above (borrowing from your code; please read the comments and ask if anything is unclear):
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# Until here everything is good
# You keep away 20% of data for testing (test_size=0.2)
# This test data should be unseen by any of the below methods
# define method
logreg=LogisticRegression()
# Ideally what you are doing here should be correct, until you did anything wrong in dataframe operations (which apparently has been solved)
# cross validation prediction
#This cross validation prediction will print the predicted values of 't_train'
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
# internal working of cross_val_predict:
#1. Get the data and estimator (logreg, X_train, t_train)
#2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesn't know that it's our training data) - Doubts??
#3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
#4. Use X_cv_train, t_cv_train for fitting 'logreg'
#5. Predict on X_cv_test (No use of t_cv_test)
#6. Repeat steps 3 to 5 repeatedly for cv=10 iterations, each time using different data for training and different data for testing.
# So here you are correctly comparing 'predicted' and 't_train'
print(metrics.accuracy_score(t_train, predicted))
# The above metrics will show you how our estimator 'logreg' works on 'X_train' data. If the accuracies are very high it may be because of overfitting.
# Now what to do about the X_test and t_test above.
# Actually the correct data for the final metric is this X_test and t_test
# If you are satisfied by the accuracies on the training data then you should fit the entire training data to the estimator and then predict on X_test
logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)
# Here is the final accuracy
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.
If you have little data, or don't want to split the data into training and testing sets, then you should use the approach suggested by @fuzzyhedge:
# Use cross_val_score on your all data
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
# 'cross_val_score' will almost work same from steps 1 to 4
#5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
#6. Repeat steps 1 to 5 for cv_iterations = 10
#7. Return array of accuracies calculated in step 5.
# Find out average of returned accuracies to see the model performance
scores = scores.mean()
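For intuition, here is a rough hand-written sketch of what those steps amount to with an explicit K-fold splitter (this mirrors the logic rather than sklearn's actual internal code; it assumes the question's logreg, X_train and t_train, and uses StratifiedKFold, the default splitter for classifiers):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# Roughly what cross_val_score does internally, written out by hand.
X_cv = np.asarray(X_train)
t_cv = np.asarray(t_train)  # plain arrays avoid pandas index surprises
kf = StratifiedKFold(n_splits=10)
accs = []
for train_idx, test_idx in kf.split(X_cv, t_cv):
    fold_model = clone(logreg)  # fresh, unfitted copy for each fold
    fold_model.fit(X_cv[train_idx], t_cv[train_idx])
    accs.append(fold_model.score(X_cv[test_idx], t_cv[test_idx]))
print(np.mean(accs))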
Note - cross-validation is best used together with grid search to find the estimator parameters that perform best on the given data. For example, LogisticRegression defines many parameters. But if you use
logreg = LogisticRegression()
the model will be initialized with the default parameters only. Maybe different parameter values, such as
logreg = LogisticRegression(penalty='l1', solver='liblinear')
may perform better on your data. This search for better parameters is grid search.
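As an illustrative sketch of that idea (the parameter grid here is arbitrary, not a recommendation for this dataset; X_train and t_train are from the question):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Try a few regularization settings with 10-fold CV on the training data.
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=10)
grid.fit(X_train, t_train)
print(grid.best_params_)
print(grid.best_score_)
# grid.best_estimator_ is refit on all of X_train and can be scored on X_test.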
Now, for the second part about scaling, dimensionality reduction, etc. using pipelines, you can refer to the documentation of Pipeline and the examples below:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#sphx-glr-auto-examples-feature-stacker-py
http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py
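As a starting point, a minimal sketch of such a pipeline, assuming StandardScaler for the scaling and PCA for the dimensionality reduction (both choices, and the component count, are illustrative). The point is that cross_val_score treats the whole pipeline as one estimator, so the scaler and PCA are re-fit on each fold's training part only and nothing leaks from the validation part:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaling and PCA are re-fit inside every CV fold automatically.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce', PCA(n_components=10)),  # illustrative choice
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X_train, t_train, cv=10)
print(scores.mean())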
Feel free to contact me if you need any help.

Comments:
Thanks. A very complete and useful answer! Yes, I was trying to figure things out from the sklearn docs, but I was still confused about how to combine the earlier split with cross-validation. It's much clearer now.