x and y in k-fold cross-validation have different numbers of rows when testing the model
Asked: 2020-09-22 08:59:40
Question: Using the Titanic train and test datasets, I am trying to predict whether a passenger survived based on the passenger's sex. I want to build a classifier, then test and evaluate it to reach that goal.
But I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
from this line:
scores = cross_val_score(Model, cross_val_X, cross_val_Y, cv=5, scoring='accuracy')
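I understand that cross_val_X and cross_val_Y have different numbers of rows, which is why the error occurs. Am I right or wrong? What should I do to fix the error?
I also want to test my model on the test dataset, and I think I need to change the data I pass to the predict method. Is that correct?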
import pandas as pd #data processing, CSV File(I/O)
import numpy as np #linear algebra
from google.colab import files
import io
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier # K-Nearest Neighbours
from sklearn.metrics import classification_report #Build a text report showing the classification metrics.
from sklearn.metrics import accuracy_score #Accuracy classification score.
from sklearn.metrics import confusion_matrix #Compute confusion matrix to evaluate the accuracy of a classification.
#Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
#for cross validation , import k-folder
from sklearn.model_selection import cross_val_score
#upload the file to train the model
uploaded = files.upload() #select the path to train.csv => upload from local drive
df_train = pd.read_excel('train_updated.xlsx')
df_train2 = df_train.copy()
#upload the files to test the model
uploaded = files.upload() #select the path to test_updated.csv => upload from local drive
df_test = pd.read_csv('test_updated.csv', delimiter=';') #reads our data and saves it in a data structure called DataFrame, splits into columns
print('\n Head of the file: train_updated.xlsx')
print(df_train.head()) #print the head(=the first 5 rows) of the csv, to see features and target variable
print('\n Data info of the file: train_updated.xlsx') #to see if there is any NaN value and length of this data
print(df_train.info() )
print('\n Data info of the file: test') #to see if there is any NaN value and length of this data
print(df_test.info() )
#1st pivot
print('How many women and men survived?')
sex_pivot = df_train2.pivot_table(index="Sex",values="Survived")
sex_pivot.plot.bar()
plt.show()
#replace all nan with 0
df_train.replace(np.nan, 0, inplace=True)
df_test.replace(np.nan, 0, inplace=True)
#convert to int
df_test['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2}, inplace=True)
#df_test = df_test.drop(columns=['PassengerId', 'Name', 'Sex', 'Cabin', 'Ticket', 'Fare'])
#print('\n AFTER DROPPING COLUMNS \n FILE: test')
#print(df_test.info)
#Splitting data
#Our input will be every column except ‘Survived’ because ‘Survived’ is what we will be attempting to predict. Therefore, ‘Survived’ will be our target.
#separate target values(Y)
Y = df_train['Survived'].values.reshape(-1, 1)
print('\n Y: target value') #view target values
print(Y.shape)
#convert to int
df_train['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2}, inplace=True)
#separate input values(X)
df_train = df_train.drop(columns=['Survived', 'PassengerId', 'Name', 'Sex', 'Cabin', 'Ticket', 'Fare'])
print('\n AFTER DROPPING COLUMNS \n file: train.csv')
print(df_train.info)
X = df_train['Sex_Boolean'].values.reshape(-1, 1) #create a dataframe with all training data except the target column
print('\n X: input data and shape ')
print(X)
print(X.shape)
#train_test_split: splits data arrays into two subsets: for training data and for testing data
#1st parameter= input data, 2nd parameter= data target
#train_test_split will split our data set and will return 4 values, the train attributes (X_train), test attributes (X_test), train labels (y_train) and the test labels (y_test).
X_train, X_test, y_train, y_test = train_test_split(X, Y , train_size=0.7, test_size=0.3) # 70% training and 30% test .
print('After: Train data split')
print('X_train: ', X_train.shape)
print('X_test: ', X_test.shape)
print('y_train: ', y_train.shape)
print('y_test: ', y_test.shape )
#OPTIMAL K --> PLOT
# try K=1 through K=25 and record testing accuracy
k_range = range(1, 26)
# scores is a plain Python list, created with []
scores = []
# Loop over K = 1..25 (range(1, 26))
# and append each test accuracy to the list
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))
# allow plots to appear within the notebook
%matplotlib inline
# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
# K-Nearest Neighbours Algorithm
Model = KNeighborsClassifier(n_neighbors=3) #initialization
Model.fit(X_train, y_train) #train the model
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
# Accuracy score
print('Accuracy is ',accuracy_score(y_pred,y_test)) # round(knn.score(X_train, Y_train) * 100, 2)
#K-fold cross-validation
cross_val_knn = KNeighborsClassifier(n_neighbors=3)
cross_val_X
try:
    cross_val_X=(df_test['Sex_Boolean'].values.reshape(-1, 1) )# df_test['Pclass','Age','SibSp','Parch','Embarked','Sex_Boolean'] pd.get_dummies(
except KeyError:
    print("column sex boolean cannot be found")
print( "cross val x: ", cross_val_X )
cross_val_Y= Y
print( "cross val y: ", cross_val_Y )
print( "SHAPE X AND Y : ", cross_val_X.shape, cross_val_Y.shape )
# X,y will automatically devided by 5 folder, the scoring I will still use the accuracy
scores = cross_val_score(Model, cross_val_X, cross_val_Y, cv=5, scoring='accuracy')
Result:
Head of the file: train_updated.xlsx
PassengerId Survived Pclass ... Cabin Embarked Sex_Boolean
0 1 0 3 ... NaN S 1
1 2 1 1 ... C85 C 0
2 3 1 3 ... NaN S 0
3 4 1 1 ... C123 S 0
4 5 0 3 ... NaN S 1
[5 rows x 13 columns]
Data info of the file: train_updated.xlsx
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null int64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
12 Sex_Boolean 891 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 90.6+ KB
None
Data info of the file: test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
11 Sex_Boolean 418 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
None
How many women and men survived?
Y: target value
(891, 1)
AFTER DROPPING COLUMNS
file: train.csv
<bound method DataFrame.info of Pclass Age SibSp Parch Embarked Sex_Boolean
0 3 22.0 1 0 0 1
1 1 38.0 1 0 1 0
2 3 26.0 0 0 0 0
3 1 35.0 1 0 0 0
4 3 35.0 0 0 0 1
.. ... ... ... ... ... ...
886 2 27.0 0 0 0 1
887 1 19.0 0 0 0 0
888 3 0.0 1 2 0 0
889 1 26.0 0 0 1 1
890 3 32.0 0 0 2 1
[891 rows x 6 columns]>
X: input data and shape
[[1]
[0]
[0]
[0]
[1]
[1]
[1]
[1]
....
[1]
[1]]
(891, 1)
After: Train data split
X_train: (623, 1)
X_test: (268, 1)
y_train: (623, 1)
y_test: (268, 1)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:136: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
..
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:136: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
precision recall f1-score support
0 0.00 0.00 0.00 174
1 0.35 1.00 0.52 94
accuracy 0.35 268
macro avg 0.18 0.50 0.26 268
weighted avg 0.12 0.35 0.18 268
Accuracy is 0.35074626865671643
cross val x: [[1]
[0]
[1]
[1]
....
[1]
[0]
[1]
[1]
[1]]
cross val y: [[0]
[1]
[1]
[1]...
[0]
[0]
[1]
[0]
[1]
[0]]
SHAPE X AND Y : (418, 1) (891, 1)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:136: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
ValueError Traceback (most recent call last)
<ipython-input-24-7748c3e3a4a7> in <module>()
184
185 # X,y will automatically devided by 5 folder, the scoring I will still use the accuracy
--> 186 scores = cross_val_score(Model, cross_val_X, cross_val_Y, cv=5, scoring='accuracy')
187
188 # print all 3 times scores
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
210 if len(uniques) > 1:
211 raise ValueError("Found input variables with inconsistent numbers of"
--> 212 " samples: %r" % [int(l) for l in lengths])
213
214
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
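Answer 1: The question is a bit hard to follow because there is so much code. It looks like Y is:
Y = df_train['Survived'].values.reshape(-1, 1)
and it is then assigned with cross_val_Y = Y, while cross_val_X comes from df_test:
cross_val_X = df_test['Sex_Boolean'].values.reshape(-1, 1)
So they will indeed have different shapes, which explains the error, because the documentation states that the arrays are expected to have these shapes:
X: array-like of shape (n_samples, n_features). The data to fit. Can be, for example, a list or an array.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None. The target variable to try to predict in the case of supervised learning.
So the number of samples, n_samples, must be the same for both.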
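The shape requirement above can be checked with a small sketch. The arrays here are random stand-ins rather than the Titanic files; only their row counts (891 labelled rows, 418 unlabelled rows) mirror the question, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): same row counts as the question's files.
X_labelled = rng.integers(0, 2, size=(891, 1))    # plays the role of df_train['Sex_Boolean']
y_labelled = rng.integers(0, 2, size=891)         # 1-D target, shape (n_samples,)
X_unlabelled = rng.integers(0, 2, size=(418, 1))  # plays the role of df_test['Sex_Boolean']

model = KNeighborsClassifier(n_neighbors=3)

# Features from one dataset and labels from another: the row counts differ.
try:
    cross_val_score(model, X_unlabelled, y_labelled, cv=5, scoring="accuracy")
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# Features and labels taken from the same rows: one score per fold.
scores = cross_val_score(model, X_labelled, y_labelled, cv=5, scoring="accuracy")
print(scores.shape, scores.mean())
```

Keeping the target 1-D, for example with `.ravel()` instead of `.reshape(-1, 1)`, should also avoid the DataConversionWarning that appears in the question's output.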
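Comments:
So, should I write something like (418, 1) instead of (-1, 1)? I have also posted the results.
No, you should use arrays of the same size for cross-validation @AlwaysLearning, so X and Y should both come from df_train.
@AlwaysLearning Does that make sense? Don't forget that you can upvote and accept the answer if it helped. See What should I do when someone answers my question?, thanks!
Yes, that's exactly right, and it is handled internally by cross_val_score. I suggest you read about it, e.g. machinelearningmastery.com/k-fold-cross-validation @always
Cross-validation uses k-fold to average the score over several sets of unseen samples. Please read about it; the explanation is fairly long. The key point is that it never computes the score on samples the model has already seen @always
Answer 2: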
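As an addition to what @yatu said, cross_val_score takes model, X, y as arguments, and you don't need to fit again on different values (link to cross_val_score).
Have a look at the code snippet they present: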
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))
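As a rough sketch of what the snippet above does internally, the same scores can be reproduced by hand with KFold. This is an illustration under the assumption that cv=3 for a regressor means three unshuffled folds and that the default score is the estimator's own score method (R² for Lasso); it is not the library's actual implementation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]

cv = KFold(n_splits=3)  # three consecutive, unshuffled folds

manual_scores = []
for train_idx, val_idx in cv.split(X):
    model = Lasso()                        # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])  # fit on the other two folds
    # Score only on the fold the model has not seen.
    manual_scores.append(model.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(Lasso(), X, y, cv=cv)
print(manual_scores)
print(auto_scores)
```

Every row is used for validation exactly once, which is why X and y must describe the same samples.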
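If you want to measure performance on a hold-out set (or test set) as you said, you should do the following in your code, and possibly change the scoring parameter of cross_val_score:
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # scoring=None falls back to the estimator's default score (accuracy for classifiers)
    score = cross_val_score(knn, X_train, y_train, cv=3, scoring=None)
    scores.append(score)
cross_val_score already predicts on the held-out fold itself, so you don't need to run preds = knn.predict(X_test); accuracy_score(preds, y_test).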
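One possible way to put this together is sketched below: cross-validate on the training split to choose k, then score the chosen model once on the split that was held back. The data comes from make_classification as a stand-in for the Titanic features, and names such as mean_scores and best_k are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labelled data (assumption): stands in for the labelled training file.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

k_range = range(1, 26)
mean_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # One accuracy per fold; the mean summarises this value of k.
    fold_scores = cross_val_score(knn, X_train, y_train, cv=3, scoring="accuracy")
    mean_scores.append(fold_scores.mean())

best_k = k_range[int(np.argmax(mean_scores))]

# Refit on the whole training split, then evaluate once on the held-out split.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, test_accuracy)
```

Storing the mean of each run, rather than the raw per-fold array, gives one number per k, which is convenient for a plot like the one in the question.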
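Comments:
I want to evaluate the model with data it has not seen before. That is why I used df_test for cross_val_X. But the problem is that the test dataset has no Survived column, hence the error @DaveR
I have edited my answer to explain how and where you should use cross_val_score @AlwaysLearning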
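For the case raised in the last comments, where the test file has no Survived column, that file cannot be scored locally; it can only be used to produce predictions. A minimal sketch with tiny hypothetical frames (the values below are made up and are not the real Titanic rows):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical miniature versions of the two files (assumption).
train = pd.DataFrame({
    "Sex_Boolean": [1, 0, 0, 0, 1, 1, 0, 1],
    "Survived":    [0, 1, 1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex_Boolean": [1, 0, 1],
})  # no Survived column, so there is nothing to compare predictions against

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train[["Sex_Boolean"]], train["Survived"])  # 1-D target, no warning

predictions = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[["Sex_Boolean"]]),
})
print(predictions)
```

Accuracy on unseen data then has to come from the labelled data itself, either through cross_val_score or through a split made with train_test_split.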