ValueError:使用 KNeighborsRegressor 的拟合,输入包含 NaN、无穷大或对于 dtype('float64') 来说太大的值

Posted

技术标签:

【中文标题】ValueError:使用 KNeighborsRegressor 的拟合,输入包含 NaN、无穷大或对于 dtype(\'float64\') 来说太大的值【英文标题】:ValueError: Input contains NaN, infinity or a value too large for dtype('float64') using fit from KNeighborsRegressorValueError:使用 KNeighborsRegressor 的拟合,输入包含 NaN、无穷大或对于 dtype('float64') 来说太大的值 【发布时间】:2018-08-09 02:05:42 【问题描述】:

在尝试拟合之前,我已经彻底清理了我的数据框,并确保整个数据框没有 inf 或 NaN 值,并且由完全非空的 float64 值组成。但是,我仍然使用 np.isinf()、df.isnull().sum() 和 df.info() 方法冗余地验证了这一点。我所有的研究都表明,其他有同样问题的人在他们的数据框中有 NaN、inf 或 object 数据类型。我的情况并非如此。最后,我找到了一个模糊相似的case,它使用以下代码找到了解决方案:

df = df.apply(lambda x: pd.to_numeric(x,errors='ignore'))

这对我的情况没有帮助。如何解决这个 ValueError 异常?

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# Read csv file and assign column names
headers=['symboling','normalized_losses','make','fuel_type','aspiration','num_of_doors',
         'body_style','drive_wheels','engine_location','wheel_base','length','width',
        'height','curb_weight','engine_type','num_of_cylinders','engine_size','fuel_system',
        'bore','stroke','compression_ratio','horsepower','peak_rpm','city_mpg','highway_mpg',
        'price']
cars = pd.read_csv('imports-85.data.txt', names=headers)

# Select only the columns with continuous values from - https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
continuous_values_cols = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 
                          'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
numeric_cars = cars[continuous_values_cols].copy()

# Clean Data Set by Convert missing values (?) with np.NaN then set the type to float
numeric_cars.replace(to_replace='?', value=np.nan, inplace=True)
numeric_cars = numeric_cars.astype('float')

# Because the column we're trying to predict is 'price', any row were price is NaN will be removed."
numeric_cars.dropna(subset=['price'], inplace=True)

# All remaining NaN's will be filled with the mean of its respective column
numeric_cars = numeric_cars.fillna(numeric_cars.mean())

# Create training feature list and k value list
test_features = numeric_cars.columns.tolist()
predictive_feature = 'price'
test_features.remove(predictive_feature)
k_values = [x for x in range(10) if x/2 != round(x/2)]

# Normalize columns
numeric_cars_normalized = numeric_cars[test_features].copy()
numeric_cars_normalized = numeric_cars_normalized/ numeric_cars.max()
numeric_cars_normalized[predictive_feature] = numeric_cars[predictive_feature].copy()


def knn_train_test(df, train_columns, predict_feature, k_value):

    # Randomly resorts the DataFrame to mitigate sampling bias
    np.random.seed(1)
    df = df.loc[np.random.permutation(len(df))]

    # Split the DataFrame into ~75% train / 25% test data sets
    split_integer = round(len(df) * 0.75)
    train_df = df.iloc[0:split_integer]
    test_df = df.iloc[split_integer:]

    train_features = train_df[train_columns]
    train_target = train_df[predict_feature]

    # Trains the model
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_features, train_target)

    # Test the model & return calculate mean square error
    predictions = knn.predict(test_df[train_columns])
    print("predictions")
    mse = mean_squared_error(y_true=test_df[predict_feature], y_pred=predictions)
    return mse


# instantiate mse dict
mse_dict = 

# test each feature and do so with a range of k values
# in an effot to determine the optimal training feature and k value
for feature in test_features:

    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
    mse_dict[feature] = mse

print(mse_dict)

这是完整的错误追溯:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
Traceback (most recent call last):
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 76, in <module>
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 76, in <listcomp>
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 60, in knn_train_test
    knn.fit(train_features, train_target)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\base.py", line 741, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

这是我用来验证我的 DataFrame 中没有 NaN 或 inf 值的代码和输出:

# Verify data for NaN and inf
print(len(numeric_cars_normalized))
# 201

print(numeric_cars_normalized.info())
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 201 entries, 0 to 204
# Data columns (total 14 columns):
# bore                 201 non-null float64
# city_mpg             201 non-null float64
# compression_ratio    201 non-null float64
# curb_weight          201 non-null float64
# height               201 non-null float64
# highway_mpg          201 non-null float64
# horsepower           201 non-null float64
# length               201 non-null float64
# normalized_losses    201 non-null float64
# peak_rpm             201 non-null float64
# price                201 non-null float64
# stroke               201 non-null float64
# wheel_base           201 non-null float64
# width                201 non-null float64
# dtypes: float64(14)
# memory usage: 23.6 KB
# None

print(numeric_cars_normalized.isnull().sum())
# bore                 0
# city_mpg             0
# compression_ratio    0
# curb_weight          0
# height               0
# highway_mpg          0
# horsepower           0
# length               0
# normalized_losses    0
# peak_rpm             0
# price                0
# stroke               0
# wheel_base           0
# width                0
# dtype: int64

# The loop below, essentially does the same as the above
# verification, but using different methods
# the purpose is to prove there's no nan or inf in my data set
index = []
NaN_counter = []
inf_counter = []
for col in numeric_cars_normalized.columns:
    index.append(col)
    # inf counter
    col_isinf = np.isinf(numeric_cars_normalized[col])
    if col_isinf.value_counts().index[0] == False:
        inf_counter.append(col_isinf.value_counts()[0])

    # nan counter    
    col_isnan = np.isnan(numeric_cars_normalized[col])
    if col_isnan.value_counts().index[0] == False:
        NaN_counter.append(col_isnan.value_counts()[0])

data_check = 'NOT_NaN_count': NaN_counter, 'NOT_inf_count': inf_counter
data_verification = pd.DataFrame(data=data_check, index=index)
print(data_verification)

#                    NOT_NaN_count  NOT_inf_count
# bore                         201            201
# city_mpg                     201            201
# compression_ratio            201            201
# curb_weight                  201            201
# height                       201            201
# highway_mpg                  201            201
# horsepower                   201            201
# length                       201            201
# normalized_losses            201            201
# peak_rpm                     201            201
# price                        201            201
# stroke                       201            201
# wheel_base                   201            201
# width                        201            201

我可能发现了问题,但仍然不知道如何解决。

# Here's a another methodology for extra redudnant data checking
index = []
NaN_counter = []
inf_counter = []

for col in numeric_cars_normalized.columns:
    index.append(col)
    inf_counter.append(np.any(np.isfinite(numeric_cars_normalized[col])))
    NaN_counter.append(np.any(np.isnan(numeric_cars_normalized[col])))

data_check = 'Any_NaN': NaN_counter, 'Any_inf': inf_counter
data_verification = pd.DataFrame(data=data_check, index=index)
print(data_verification)

                   Any_NaN  Any_inf
# bore                 False     True
# city_mpg             False     True
# compression_ratio    False     True
# curb_weight          False     True
# height               False     True
# highway_mpg          False     True
# horsepower           False     True
# length               False     True
# normalized_losses    False     True
# peak_rpm             False     True
# price                False     True
# stroke               False     True
# wheel_base           False     True
# width                False     True

很明显,我的数据集中有 inf,但我不确定为什么或如何修复它。

【问题讨论】:

您是否也冗余验证了 np.isnan ? 请查看我上面的编辑,我试图验证我的代码中没有 NaN 或 inf 值。 【参考方案1】:

您似乎遇到的问题来自您正在做的排列,通过评论这两行:

# np.random.seed(1)
# df = df.loc[np.random.permutation(len(df))]

这是因为当您清理数据时,最终只有 204 行中的 201 行。通过调试您提供给 knn 函数的数据框,您可以发现,一旦 numeric_cars_normalized 已被置换,其中三行现在对于所有列都是“nan”。

并重新运行代码,您将获得结果。但是您应该做一个额外的更改,因为 knn 更适用于数组,您应该将数据框(系列)更改为具有正确维度的值,然后对它们进行操作。在您的特定情况下,它们都是系列,您可以通过以下方式更改它们:

series.values.reshape(-1, 1)

这里是所有变化的 knn 函数: def knn_train_test(df, train_columns, predict_feature, k_value):

    #print(train_columns, k_value)
    # Randomly resorts the DataFrame to mitigate sampling bias
    #np.random.seed(1)
    #df = df.loc[np.random.permutation(len(df))]

    # Split the DataFrame into ~75% train / 25% test data sets
    split_integer = round(len(df) * 0.75)
    train_df = df.iloc[0:split_integer]
    test_df = df.iloc[split_integer:]

    train_features = train_df[train_columns].values.reshape(-1, 1)
    train_target = train_df[predict_feature].values.reshape(-1, 1)

    # Trains the model
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_features, train_target)

    # Test the model & return calculate mean square error
    predictions = knn.predict(test_df[train_columns].values.reshape(-1,   1))
    print("predictions")
    mse = mean_squared_error(y_true=test_df[predict_feature], y_pred=predictions)
    return mse

这样,如果我得到正确的输入文件,这就是我得到的:

predictions
'normalized_losses': [100210405.34, 116919980.22444445, 88928383.280000001, 62378305.931836732, 65695537.133086421], 'wheel_base': [10942945.5, 31106845.595555563, 34758670.590399988, 29302177.901632652, 25464306.165925924], 'length': [71007156.219999999, 37635782.111111119, 33676038.287999995, 29868192.295918364, 22553474.111604933], 'width': [42519394.439999998, 25956086.771111108, 15199079.0744, 10443175.389795918, 8440465.6864197534], 'height': [117942530.56, 62910880.079999998, 41771068.588, 33511475.561224483, 31537852.588641971], 'curb_weight': [14514970.42, 6103365.4644444454, 6223489.0728000011, 7282828.3632653067, 6884187.4446913591], 'bore': [57147986.359999999, 88529631.346666679, 68063251.098399997, 58753168.154285707, 42950965.435555562], 'stroke': [145522819.16, 98024560.913333327, 61229681.429599993, 36452809.841224492, 25989788.846172832], 'compression_ratio': [93309449.939999998, 18108906.400000002, 30175663.952, 44964197.869387761, 39926111.747407407], 'horsepower': [25158775.920000002, 17656603.506666664, 13804482.193600001, 15772395.163265305, 14689078.471851852], 'peak_rpm': [169310760.66, 86360741.248888895, 51905953.367999993, 46999120.435102046, 45218343.222716056], 'city_mpg': [15467849.460000001, 12237327.542222224, 10855581.140000001, 11479257.790612245, 11047557.746419754], 'highway_mpg': [17384289.579999998, 15877936.197777782, 7720502.6856000004, 6315372.4963265313, 7118970.4081481481]

【讨论】:

谢谢。您的解决方案有效,但我不知道为什么。如果我对代码的工作原理没有深入的了解,我就不是那种喜欢使用代码的人。我将不得不考虑这一点并做一些额外的研究。这在我的脑海中产生了一系列后续问题。 问题中的代码有两个我看到的问题,其中一个是您直接使用 pandas 数据帧,当您这样做并将它们传递给 knn 时,它将尝试使用数据帧进行操作或直接系列。如果您调试代码并检查 dataframe.shape,您将看到类似 (201,) 而不是您所期望的 (201,1) 的内容。这有时会引起一些麻烦,因此建议您使用 df.values.reshape(-1,1)。 values 会将它们转换为 numpy 数组,并且 reshape 将确保数组的形状有 1 列,而 -1 确保其余的是行 另一个问题与您所做的数据清理有关,它会删除一些行,正如您所指出的,实际上其中有 3 行,但索引仍然存在,当您使用排列时,python仍然找到它们,然后使用索引,即您之前删除的“nan”。另一种解决方案是重新索引您的数据框,以确保不会发生这种情况,但作为额外的清理步骤,我没有尝试过。 感谢您的额外说明!现在我明白了。

以上是关于ValueError:使用 KNeighborsRegressor 的拟合,输入包含 NaN、无穷大或对于 dtype('float64') 来说太大的值的主要内容,如果未能解决你的问题,请参考以下文章

ValueError("无法使用 `eval()` 评估张量:

ValueError: '对象对于所需数组来说太深'

ValueError:不能使用 None 作为查询值 django ajax

使用 python 模块请求上传 Strava GPX 时出现 ValueError - 为啥?

ValueError:不支持连续[重复]

导入 keras 时出现 ValueError «您正在尝试使用旧的 GPU 后端»