How to solve 'Input contains NaN, infinity or a value too large for dtype('float64')' after already preprocessing using Pipeline?
Posted: 2021-10-16 01:10:35

Question:

There are a lot of posts about this error, but I couldn't find a solution for this particular case. I am using this dataset. Here is what I did, preprocessing the numeric and categorical features with SimpleImputer:
import pandas as pd
import numpy as np
%load_ext nb_black
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import CatBoostEncoder
from sklearn.model_selection import train_test_split
housing = pd.read_csv("housing.csv")
housing.head()
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
numeric_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="constant")),
("encoder", CatBoostEncoder()),
]
)
numeric_features = [
"housing_median_age",
"total_rooms",
"total_bedrooms",
"population",
"households",
"median_income",
]
categorical_features = ["ocean_proximity"]
preprocessor = ColumnTransformer(
transformers=[
("numeric", numeric_transformer, numeric_features),
("categorical", categorical_transformer, categorical_features),
]
)
from sklearn.linear_model import LinearRegression
pipeline = Pipeline(
steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)
lr_model = pipeline.fit(X_train, y_train)
But I get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Any idea what is going on here?
Answer 1:

It seems that CatBoostEncoder returns several nan values when it is fit on the training set, which is why LinearRegression throws the error.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from category_encoders import CatBoostEncoder
housing = pd.read_csv("housing.csv")
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
numeric_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler())
])
categorical_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="constant")),
("encoder", CatBoostEncoder())
])
numeric_features = ["housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income"]
categorical_features = ["ocean_proximity"]
preprocessor = ColumnTransformer(transformers=[
("numeric", numeric_transformer, numeric_features),
("categorical", categorical_transformer, categorical_features),
])
X_new = preprocessor.fit_transform(X_train, y_train)
print(np.isnan(X_new).sum(axis=0))
# array([ 0, 0, 0, 0, 0, 0, 4315])
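The NaN count is concentrated in the last column, which is the one produced by categorical_transformer. A quick follow-up check (my own diagnostic sketch, reusing the objects defined above rather than anything from the original answer) is to run that sub-pipeline on its own and count NaNs after each step:

# Diagnostic sketch: isolate the categorical sub-pipeline on the training split.
imputed = SimpleImputer(strategy="constant").fit_transform(X_train[categorical_features])
print("NaNs after the imputer:", pd.isna(imputed).sum())

encoded = categorical_transformer.fit_transform(X_train[categorical_features], y_train)
print("NaNs after imputer + encoder:", np.isnan(np.asarray(encoded, dtype=float)).sum())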
Comments:
Did you test CatBoostEncoder on its own, rather than together with SimpleImputer? Because it seems to work fine by itself when applied to the "ocean_proximity" column, which is a bit strange...
Indeed, CatBoostEncoder works fine on its own. It also works fine in the pipeline if you fit it on the full dataset, but for some reason it breaks when you fit the pipeline on the training set only.
I saw that too. In fact, if you apply SimpleImputer and CatBoostEncoder one after the other, everything seems fine, even on the training data. Only the pipeline (or, more specifically, the categorical_transformer) introduces the NaN values.
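One possible explanation for why the full dataset works while the shuffled training split does not (my own assumption, not something stated in the thread): SimpleImputer returns a plain numpy array, so when CatBoostEncoder converts that array back to a DataFrame it gets a fresh RangeIndex, while y_train still carries the shuffled index produced by train_test_split. On the full dataset the two indices coincide, but on the training split they mostly do not, and the misalignment can show up as NaNs in the encoded column. Under that assumption, a minimal sketch of a workaround is to align the indices before fitting:

# Sketch of a workaround, assuming the index mismatch described above is the cause.
X_train_r = X_train.reset_index(drop=True)
y_train_r = y_train.reset_index(drop=True)

# Re-check the NaN count with aligned indices...
print(np.isnan(preprocessor.fit_transform(X_train_r, y_train_r)).sum(axis=0))

# ...and fit the full pipeline the same way.
lr_model = pipeline.fit(X_train_r, y_train_r)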