TypeError: train_test_split() got multiple values for argument 'test_size' only when I write it in a function

Posted 2023-03-12

【Posted】: 2022-01-07 19:59:30

【Question】:

For a dataframe df_content that looks like this:

rated_object     feature_1    feature_2    feature_n    rating
o1               2.02         0            90.40        0
o2               3.70         1            NaN          1
o3               3.45         0            70.50        1
o4               7.90         1            40.30        0
...

I wrote the following function:

import xgboost as xgb
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

def predict_cn(df, rated_object):
    is_target = (df['rated_object'] == rated_object)
    target = df[is_target].iloc[0]
    cols_to_drop = ['rated_object'] 
    df.drop(cols_to_drop, axis=1, inplace=True)
    X = df.drop('rating', axis=1)  
    y = df['rating'] 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
    model = XGBClassifier() 
    model.fit(X_train, y_train)
    prediction=model.predict(target['rated_object'], verbose=False)
    return prediction

But a call such as predict_cn(df_content, 'o3') gives me this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-08dcbb77df37> in <module>
----> 1 predict_cn(df_content, 'o3')

<ipython-input-9-18667675e17b> in predict_cn(df, rated_object)
      6     X = df.drop('rating', axis=1)
      7     y = df['rating']
----> 8     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
      9     model = XGBClassifier()
     10     model.fit(X_train, y_train)

TypeError: train_test_split() got multiple values for argument 'test_size'

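I find this strange, because when I run the same model on the whole dataframe outside of a function it works fine, and this is also the syntax shown in the documentation. I also don't know whether the rest of my code is correct for what I want, which is to pass in a rated_object and get its predicted rating.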

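Edit: Following @Antoine Dubuis's suggestion, I tried creating a copy of the dataframe. Below are the new function and a small mock dataframe, but the error persists: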

from numpy import nan
data_mock = [['q1', 10.93, 20, 1, 0], ['q2', nan, 12, 0, 1], ['q3', 14.34, 30, 0, 1], ['q4', 12.93, 20, 0, 1], ['q5', nan, 62, 1, 0], ['q6', 14.34, 60, 0, 0], ['q7', 16.93, 28, 1, 1], ['q8', nan, 12, 1, 1], ['q9', 10.34, 50, 0, 0], ['q10', 10.93, 20, 0, 0], ['q11', nan, 57, 1, 1], ['q12', 89.34, 30, 0, 0]]
df_mock = pd.DataFrame(data_mock, columns = ['rated_object', 'feature_1', 'feature_2', 'feature_n', 'rating'])
def predict_cn(df, rated_object):
    df_copy=df.copy()
    is_target = (df_copy['rated_object'] == rated_object)
    target = df_copy[is_target].iloc[0]
    cols_to_drop = ['rated_object'] 
    df_copy.drop(cols_to_drop, axis=1, inplace=True)
    X = df_copy.drop('rating', axis=1)  
    y = df_copy['rating']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
    model = XGBClassifier() 
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test) 
    prediction=model.predict(target['rated_object'], verbose=False)
    return prediction

【Answer 1】:

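You should check your variables X and y. There must be something wrong with your y array, because it is being interpreted as a numeric value rather than as a pd.Series.

Since it is a numeric value, it is interpreted as the positional argument train_size instead of as one of the arrays that have to be split.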
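
The positional-argument mechanism described above can be reproduced without any library. The sketch below uses two hypothetical stand-in functions, `split_keyword_only` and `split_positional`, neither of which belongs to scikit-learn. The first copies the parameter layout of scikit-learn's documented `train_test_split(*arrays, test_size=None, ...)`, where everything after `*arrays` is keyword-only. The second assumes a layout in which `test_size` is the second positional parameter. Only the second layout can raise "got multiple values for argument 'test_size'", so when this error appears it may be worth checking whether the name `train_test_split` still refers to the scikit-learn function. That the name was rebound in the asker's session is an assumption; nothing in the thread confirms it.

```python
import inspect


def split_keyword_only(*arrays, test_size=None, random_state=None):
    """Hypothetical stand-in: same parameter layout as sklearn's train_test_split."""
    return len(arrays), test_size


def split_positional(data, test_size=0.2, random_state=None):
    """Hypothetical stand-in: test_size is the second positional parameter."""
    return data, test_size


X = [[1.0], [2.0], [3.0]]
y = [0, 1, 0]

# Keyword-only layout: X and y are both collected into *arrays, so no clash.
ok = split_keyword_only(X, y, test_size=0.20, random_state=5)

# Positional layout: y fills the test_size slot, then test_size=0.20 fills it again.
try:
    split_positional(X, y, test_size=0.20, random_state=5)
    message = None
except TypeError as exc:
    message = str(exc)

print(ok)
print(message)

# Inspecting a callable shows which layout is actually bound to a name.
for func in (split_keyword_only, split_positional):
    print(func.__module__, func.__name__, inspect.signature(func))
```

In a notebook, applying `inspect.signature(...)` and `.__module__` to the actual `train_test_split` name in the same way would show which function the call is reaching.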

【Comments】:

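Indeed, I printed y and got Name: rating, Length: 2895, dtype: int64. I rewrote it as y = pd.DataFrame(df_content['rating']), but I still get the error.

In my opinion, you should make sure your function is idempotent, meaning it does not modify external data. I would start with df.copy() to make sure you are not modifying the original dataframe. That may be the problem.

If your dataframe has more than one row, then from my point of view your code should work.

I did define df2=df.copy(). I tried it at the start of the function and it didn't work, then right before dropping the columns and it didn't work either. Where in the code should I make the copy?

I also tried writing y = df['rating'].values so that y is an array, but that didn't work.

I think my first comment was wrong: if I print y it really is a Series, exactly as when I print it on its own while running the model normally instead of inside a function. I think you are right that it is modifying my data, but I don't know where to put the copy.

You should put it at the start of the function, to make sure nothing changes your original DataFrame.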
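
On where to place the copy: the following is a minimal sketch, assuming a small made-up DataFrame that reuses the question's column names, of how `drop(..., inplace=True)` inside a function changes the caller's DataFrame, while copying at the start of the function does not. The helper names `drop_in_place` and `drop_on_copy` are hypothetical and exist only for this illustration.

```python
import pandas as pd


def drop_in_place(df):
    # Mutates the caller's DataFrame: 'rated_object' is gone after the call,
    # so a second call that looks up df['rated_object'] would fail.
    df.drop(['rated_object'], axis=1, inplace=True)
    return df


def drop_on_copy(df):
    # Copy first, at the top of the function, then modify only the copy.
    df_copy = df.copy()
    df_copy.drop(['rated_object'], axis=1, inplace=True)
    return df_copy


# Made-up rows, only for illustration.
df_a = pd.DataFrame({
    'rated_object': ['o1', 'o2'],
    'feature_1': [2.02, 3.70],
    'rating': [0, 1],
})
df_b = df_a.copy()

drop_in_place(df_a)
drop_on_copy(df_b)

print('rated_object' in df_a.columns)  # column was removed from the original
print('rated_object' in df_b.columns)  # original is untouched
```

This shows why the first version of predict_cn can only be called once per DataFrame, though by itself it does not explain the TypeError reported in the question.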
