How to solve 'Input contains NaN, infinity or a value too large for dtype('float64')' after already preprocessing using Pipeline?

Posted: 2021-10-16 01:10:35

Question:

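There are many posts about this error, but I could not find a solution to my problem. I am using this dataset. Here is what I did, using SimpleImputer to preprocess both the categorical and the numerical features: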

import pandas as pd
import numpy as np

%load_ext nb_black

from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import CatBoostEncoder

from sklearn.model_selection import train_test_split

housing = pd.read_csv("housing.csv")
housing.head()

X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant")),
        ("encoder", CatBoostEncoder()),
    ]
)


numeric_features = [
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
]

categorical_features = ["ocean_proximity"]

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features),
    ]
)

from sklearn.linear_model import LinearRegression

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

lr_model = pipeline.fit(X_train, y_train)

But I get this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Any idea what is going on here?

Answer 1:

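It seems that CatBoostEncoder returns a number of NaN values when it is fitted to the training set, which is why LinearRegression throws the error. Running the preprocessor on its own and counting the NaN values per output column shows that they all come from the encoded categorical column: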

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from category_encoders import CatBoostEncoder

housing = pd.read_csv("housing.csv")

X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant")),
    ("encoder", CatBoostEncoder())
])

numeric_features = ["housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income"]
categorical_features = ["ocean_proximity"]

preprocessor = ColumnTransformer(transformers=[
    ("numeric", numeric_transformer, numeric_features),
    ("categorical", categorical_transformer, categorical_features),
])

X_new = preprocessor.fit_transform(X_train, y_train)

print(np.isnan(X_new).sum(axis=0))
# array([   0,    0,    0,    0,    0,    0, 4315])
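The check above counts only NaN values, while the error message also covers infinities. A minimal sketch of a slightly broader diagnostic follows. The helper name `non_finite_per_column` and the toy array are illustrative assumptions and are not part of the original answer.

```python
import numpy as np


def non_finite_per_column(arr):
    """Count the entries in each column that are NaN, +inf, or -inf."""
    arr = np.asarray(arr, dtype=float)
    # np.isfinite is False for NaN and for both infinities.
    return (~np.isfinite(arr)).sum(axis=0)


# Toy stand-in for the output of preprocessor.fit_transform(...)
demo = np.array([
    [1.0, np.nan, 3.0],
    [np.inf, 5.0, 6.0],
    [7.0, np.nan, 9.0],
])

counts = non_finite_per_column(demo)
print(counts)
```

Applied to `X_new` from the answer, a helper like this would point at the same last column if only NaN values are present.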

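Comments:

Did you test CatBoostEncoder on its own, without SimpleImputer? It seems to work fine by itself when applied to the "ocean_proximity" column, which is a bit strange...

Indeed, CatBoostEncoder works fine on its own. It also works fine in the pipeline if you fit it to the full dataset, but for some reason it breaks when you fit the pipeline to the training set only.

I saw that too. In fact, if you apply SimpleImputer and CatBoostEncoder one after the other, everything seems fine, even for the training data. Only the pipeline (or, more specifically, categorical_transformer) introduces the NaN values.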
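One explanation that is consistent with these observations, though not confirmed in the thread, is an index mismatch. SimpleImputer returns a plain NumPy array, so the encoder receives rows labeled 0..n-1, while y_train still carries the shuffled row labels left by train_test_split. If the encoder aligns the target to the features by label, every row without a matching label ends up as NaN. The sketch below reproduces only that alignment effect with pandas. The toy values are made up, and the code does not call CatBoostEncoder itself.

```python
import pandas as pd

# Target with shuffled row labels, as y_train has after train_test_split.
y_shuffled = pd.Series([10.0, 20.0, 30.0, 40.0], index=[7, 2, 0, 5])

# Features rebuilt from a NumPy array get a fresh 0..n-1 index.
X_rebuilt = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"]}
)

# Label-based alignment: labels 1 and 3 do not exist in y_shuffled,
# so those positions become NaN.
aligned = y_shuffled.reindex(X_rebuilt.index)
n_missing = int(aligned.isna().sum())

# After resetting the target's index, labels match positions again.
y_reset = y_shuffled.reset_index(drop=True)
n_missing_after_reset = int(y_reset.reindex(X_rebuilt.index).isna().sum())

print(n_missing, n_missing_after_reset)
```

If this is indeed the cause, resetting the index before fitting, for example `pipeline.fit(X_train.reset_index(drop=True), y_train.reset_index(drop=True))`, may avoid the NaN values. Assuming the dataset has 20,640 rows, a 70% training split leaves about 30% of the labels 0..14447 unmatched, roughly 4,334 rows, which is close to the 4,315 NaN values reported in the answer. Other versions of category_encoders may behave differently.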
