sklearn: Can't make OneHotEncoder work with Pipeline

Posted 2023-03-12

Asked: 2021-11-04 20:28:10

Question:

I am building a pipeline for a model using ColumnTransformer. This is what my pipeline looks like:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler

imputer_transformer = ColumnTransformer([
    ('knn_imputer',KNNImputer(n_neighbors=5),[0,3,4,6,7])
],remainder='passthrough')

category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler",MinMaxScaler(),[0,6]),
    ("owner_ordinal_enc",OrdinalEncoder(categories=[['fourth','third','second','first']],handle_unknown='ignore',dtype=np.int16),[3]),
    ("brand_location_ohe",OneHotEncoder(sparse=False,handle_unknown='ignore'),[2,5]),
],remainder='passthrough')


def build_pipeline_with_estimator(estimator):
    return Pipeline([
        ('imputer', imputer_transformer),
        ('category_transformer', category_transformer),
        ('estimator', estimator),
    ])

This is what my dataset looks like:

kms_driven  owner  location  mileage  power  brand          engine  age
34000.0     first  other     NaN      12.0   Yamaha         150.0   9
28000.0     first  other     72.0     7.0    Hero           100.0   16
5947.0      first  other     53.0     19.0   Bajaj          NaN     4
11000.0     first  delhi     40.0     19.8   Royal Enfield  350.0   7
13568.0     first  delhi     63.0     14.0   Suzuki         150.0   5

This is how I use LinearRegression in the pipeline:

linear_regressor = build_pipeline_with_estimator(LinearRegression())

linear_regressor.fit(X_train,y_train)

print('Linear Regression Train Performance.\n')
print(model_perf(linear_regressor,X_train,y_train))

print('Linear Regression Test Performance.\n')
print(model_perf(linear_regressor,X_test,y_test))

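Now, whenever I try to fit linear regression through the pipeline, I get this error:

ValueError: could not convert string to float: 'bangalore'

'bangalore' is one of the values in the location feature, which I am trying to one-hot encode, but it fails and I can't figure out what is going wrong here. Any help would be appreciated.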

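Comments:

@MichaelSzczesny, I have updated my question as you suggested. There is no transformed data between the pipeline steps; I do everything inside the pipeline. Please excuse my mistakes, I am still learning this stuff.

@MichaelSzczesny I'm not sure that link is what I'm looking for. The one thing I want to know is whether this is the way to use OneHotEncoding with a pipeline. Sklearn's documentation isn't very good on this.

Answer 1: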

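After passing through the imputer, the columns that were not imputed are shifted to the right, as stated in the note under the documentation:

Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

We can try it with the imputer first: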

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

imputer_transformer = ColumnTransformer([
    ('knn_imputer',KNNImputer(n_neighbors=5),[0,3,4,6,7])
],remainder='passthrough')

Trying it on some example data, you can see that your categorical columns are now shifted to the right:

X_train = pd.DataFrame({'kms':[0,1,2],'owner':['first','first','second'],
'location':['other','other','delhi'],'mileage':[9,8,np.nan],
'power':[3,2,1],'brand':['A','B','C'],'engine':[10,100,1000],'age':[3,4,5]})

imputer_transformer.fit_transform(X_train)
Out[25]: 
array([[0.0, 9.0, 3.0, 10.0, 3.0, 'first', 'other', 'A'],
       [1.0, 8.0, 2.0, 100.0, 4.0, 'first', 'other', 'B'],
       [2.0, 8.5, 1.0, 1000.0, 5.0, 'second', 'delhi', 'C']], dtype=object)
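The shift seen in the output above can also be worked out without running scikit-learn. The sketch below is only an illustration: `shifted_positions` is a hypothetical helper, not a scikit-learn function, and it assumes a single transformer that emits exactly one output column per input column (as KNNImputer does unless a column is entirely missing), followed by the passthrough columns in their original order.

```python
def shifted_positions(n_columns, transformed):
    """Map original column index -> index after the ColumnTransformer.

    Assumes one transformer over `transformed` (one output column per
    input column) and remainder='passthrough' for everything else.
    """
    # Passthrough columns keep their relative order and go to the right.
    remainder = [i for i in range(n_columns) if i not in transformed]
    new_order = list(transformed) + remainder
    return {old: new for new, old in enumerate(new_order)}


# Original layout: kms(0) owner(1) location(2) mileage(3)
#                  power(4) brand(5) engine(6) age(7)
positions = shifted_positions(8, [0, 3, 4, 6, 7])

for name, old in [("kms", 0), ("engine", 6), ("owner", 1),
                  ("location", 2), ("brand", 5)]:
    print(f"{name}: {old} -> {positions[old]}")
```

Under these assumptions, index 6 in the second ColumnTransformer no longer refers to engine but to location, so the question's MinMaxScaler on [0, 6] would be handed the location strings, which is consistent with the "could not convert string to float" error.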

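In your case, you can see that the engine column is now at index 3, the ordinal column (owner) is at index 5, and the one-hot columns (location and brand) are the last two, at indices 6 and 7. So a simple solution might be:

category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 3]),
    # OrdinalEncoder accepts only handle_unknown='error' or 'use_encoded_value';
    # unseen owner values are encoded as -1 here.
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth','third','second','first']],
        handle_unknown='use_encoded_value', unknown_value=-1, dtype=np.int16), [5]),
    # On scikit-learn < 1.2, use sparse=False instead of sparse_output=False.
    ("brand_location_ohe", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [6, 7]),
], remainder='passthrough')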

y_train = [7,3,2]

linear_regressor = build_pipeline_with_estimator(LinearRegression())

linear_regressor.fit(X_train,y_train)
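As an alternative to tracking shifted indices by hand, the columns can be selected by name in a single ColumnTransformer, so that no step depends on the output order of a previous one. This is not the approach from the answer above, just a minimal sketch under some assumptions: the input is a pandas DataFrame with the toy column names used here, and, as a simplification, all five numeric columns are scaled after imputation rather than only kms and engine. The step names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Toy data with the same layout as the example above.
X_train = pd.DataFrame({
    'kms': [0, 1, 2],
    'owner': ['first', 'first', 'second'],
    'location': ['other', 'other', 'delhi'],
    'mileage': [9, 8, np.nan],
    'power': [3, 2, 1],
    'brand': ['A', 'B', 'C'],
    'engine': [10, 100, 1000],
    'age': [3, 4, 5],
})
y_train = [7, 3, 2]

numeric_cols = ['kms', 'mileage', 'power', 'engine', 'age']

# Impute jointly over the numeric columns, then scale them
# (simplification: scales all of them, not just kms and engine).
numeric_pipeline = Pipeline([
    ('knn_imputer', KNNImputer(n_neighbors=5)),
    ('min_max_scaler', MinMaxScaler()),
])

preprocessor = ColumnTransformer(
    [
        ('numeric', numeric_pipeline, numeric_cols),
        ('owner_ordinal_enc',
         OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']]),
         ['owner']),
        ('brand_location_ohe',
         OneHotEncoder(handle_unknown='ignore'),
         ['location', 'brand']),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', LinearRegression()),
])

model.fit(X_train, y_train)
features = model.named_steps['preprocessor'].transform(X_train)
preds = model.predict(X_train)
print(features.shape, preds.shape)
```

Because each transformer receives its columns by name from the original DataFrame, reordering or adding columns upstream does not silently send a string column to a numeric transformer.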

Comments:

Thanks for the detailed explanation. Now I understand where I made the mistake. Thanks again for your effort.