Python Sklearn "ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets" error

Posted: 2020-12-16 15:18:31

Question:

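I have already visited this answer but didn't understand it. When I use the train_test_split function to split a single dataset into training and test sets, I don't get this error. But when I use separate CSV files for training and testing, I do. The data comes from the Titanic Kaggle competition. Can someone explain why I am getting this error?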


from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df,survived_df)
predictions=logreg.predict(test)

from sklearn.metrics import  accuracy_score
accuracy=accuracy_score(test_survived,predictions)   # ValueError raised here: Classification metrics can't handle a mix of multiclass-multioutput and binary targets
print(accuracy)

Full error

ValueError                                Traceback (most recent call last)

<ipython-input-243-89c8ae1a928d> in <module>
----> 1 logreg.score(test,test_survived)
      2 

~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    497         """
    498         from .metrics import accuracy_score
--> 499         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    500 
    501     def _more_tags(self):

~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    185 
    186     # Compute accuracy for each possible representation
--> 187     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    188     check_consistent_length(y_true, y_pred, sample_weight)
    189     if y_type.startswith('multilabel'):

~/mldl/kaggle_practice/titanic_pilot/venv/lib64/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     88 
     89     if len(y_type) > 1:
---> 90         raise ValueError("Classification metrics can't handle a mix of {0} "
     91                          "and {1} targets".format(type_true, type_pred))
     92 

ValueError: Classification metrics can't handle a mix of multiclass-multioutput and binary targets

Full code


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df=pd.read_csv('data/train.csv')
test=pd.read_csv('data/test.csv')
test_survived=pd.read_csv('data/gender_submission.csv')
plt.figure(5)
df=df.drop(columns=['Name','SibSp','Ticket','Cabin','Parch','Embarked'])
test=test.drop(columns=['Name','SibSp','Ticket','Cabin','Parch','Embarked'])
sns.heatmap(df.isnull())
plt.figure(2)
sns.boxplot(data=df,y='Age')
# From the boxplot, the 75th percentile seems to be about 38 and the 25th percentile about 20.
# Extending by 1.5 at both ends, Age in (10, 57) seems good; any value outside this range is treated as an outlier.
# The retained ages are also used to compute the means that replace missing Age values.
df=df.loc[df['Age'].between(9,58),]
# test=test.loc[test['Age'].between(9,58),]

df=df.reset_index(drop=True,)
class_3_age=df.loc[df['Pclass']==3].Age.mean()
class_2_age=df.loc[df['Pclass']==2].Age.mean()
class_1_age=df.loc[df['Pclass']==1].Age.mean()
def remove_null_age(data):
    # data is one row holding the "Age" and "Pclass" values; access them by label
    agee=data["Age"]
    pclasss=data["Pclass"]
    if pd.isnull(agee):
        if pclasss==1:
            return class_1_age
        elif pclasss==2:
            return class_2_age
        else:
            return  class_3_age

    return agee
df['Age']=df[["Age","Pclass"]].apply(remove_null_age,axis=1)
test['Age']=test[["Age","Pclass"]].apply(remove_null_age,axis=1)


sex=pd.get_dummies(df['Sex'],drop_first=True)
test_sex=pd.get_dummies(test['Sex'],drop_first=True)
sex=sex.reset_index(drop=True)
test_sex=test_sex.reset_index(drop=True)
df=df.drop(columns=['Sex'])
test=test.drop(columns=['Sex'])
df=pd.concat([df,sex],axis=1)
test=test.reset_index(drop=True)
df=df.reset_index(drop=True)

test=pd.concat([test,test_sex],axis=1)
survived_df=df["Survived"]
df=df.drop(columns='Survived')
test["Age"]=test['Age'].round(1)
test.at[152,'Fare']=30

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df,survived_df)
predictions=logreg.predict(test)

from sklearn.metrics import  accuracy_score
accuracy=accuracy_score(test_survived,predictions)
print(accuracy)

Answer 1:

You probably want the accuracy of predictions against the Survived column of the test_survived DataFrame:

from sklearn.metrics import  accuracy_score
accuracy=accuracy_score(test_survived['Survived'],predictions)
print(accuracy)

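The error occurs because accuracy_score() expects two 1-D arrays: one with the ground-truth labels and one with the predicted labels. You passed a 2-D "array" (the whole DataFrame, which in the Kaggle file holds a PassengerId column as well as Survived) together with 1-D predictions, so scikit-learn treated your first input as a multiclass-multioutput target and the second as a binary one.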
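
A minimal sketch of the mismatch, using a small made-up DataFrame in place of gender_submission.csv (the values are invented for illustration; only the two-column shape matters). type_of_target from sklearn.utils.multiclass reports how scikit-learn classifies each input:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in for gender_submission.csv: two columns, made-up values.
test_survived = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Survived": [0, 1, 0, 1],
})
# Hypothetical model output: one label per row.
predictions = np.array([0, 1, 1, 1])

# How scikit-learn classifies each candidate target.
frame_type = type_of_target(test_survived)               # whole 2-D DataFrame
column_type = type_of_target(test_survived["Survived"])  # single 1-D column
pred_type = type_of_target(predictions)
print(frame_type, column_type, pred_type)

# Passing the whole DataFrame mixes two target types and raises ValueError.
try:
    accuracy_score(test_survived, predictions)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Selecting the label column gives two 1-D inputs of the same type.
accuracy = accuracy_score(test_survived["Survived"], predictions)
print(accuracy)
```

With these made-up values, three of the four predictions match the Survived column, so the last call returns 0.75.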

The documentation is also helpful on this point.
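
As for why the train_test_split version did not fail: the question does not show that code, so the following is only a sketch under the assumption that the label passed to the split was the single Survived column. In that case the test labels come back as a 1-D Series, which accuracy_score accepts. The tiny DataFrame is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of train.csv; values are invented.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 14.0, 54.0, 30.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

X = df.drop(columns="Survived")
y = df["Survived"]  # a single column, i.e. a 1-D Series

# The split keeps y one-dimensional, so y_test is directly usable as y_true.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(type(y_test).__name__, y_test.shape)
```

The separate gender_submission.csv file, by contrast, is read as a whole DataFrame, so the label column has to be selected explicitly before scoring.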
