Why does the output of the preprocessing methods in my sklearn.pipeline not align?

Posted 2023-03-12

【Original title】: Why my output from preprocessing methods in sklearn.pipeline does not align? 【Asked】: 2018-10-03 21:20:06 【Question】:

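I am working through the book "Hands On Machine Learning" and writing some transformation pipelines to clean my data. I found that the output of the same pipeline differs depending on the size of the DataFrame I pass in. Here is the code: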

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
# Imputer was removed in scikit-learn 0.22; newer versions use sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer, LabelBinarizer, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

# housing, housing_num and CombinedAttributesAdder come from earlier in the book's chapter 2 code
num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(housing)
data_prepared = full_pipeline.transform(housing.iloc[:5])
data_prepared1 = full_pipeline.transform(housing.iloc[:1000])
data_prepared2 = full_pipeline.transform(housing.iloc[:10000])
print(data_prepared.shape)
print(data_prepared1.shape)
print(data_prepared2.shape)

The three print statements output (5, 14), (1000, 15) and (10000, 16). Can anyone help explain this?

【Answer 1】:

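That is because CustomLabelBinarizer fits a new LabelBinarizer on every call to transform(), so the encoder relearns the labels each time. The number of output columns therefore depends on which categories happen to appear in the rows you pass in, and that changes with the number of rows.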
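A minimal sketch of the effect, using made-up category values rather than the housing data (the `labels` array and the slice sizes are assumptions for illustration): refitting on each slice gives one column per category seen in that slice, while an encoder fitted once keeps the width fixed.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the 'ocean_proximity' column (illustrative values only)
labels = np.array(["INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND", "ISLAND", "<1H OCEAN"])

# Refit on every slice: the width follows the categories present in the slice
refit_widths = [LabelBinarizer().fit_transform(labels[:n]).shape[1] for n in (3, 5, 6)]

# Fit once on all the data, then only transform the slices
enc = LabelBinarizer().fit(labels)
fixed_widths = [enc.transform(labels[:n]).shape[1] for n in (3, 5, 6)]

print(refit_widths)  # one entry per slice, grows as new categories appear
print(fixed_widths)  # same width for every slice
```

The slices were chosen to contain at least three categories each, because LabelBinarizer emits a single column when it sees only two classes.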

Change it to this:

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        # Learn the label set once, from the data passed to fit()
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        # Reuse the fitted encoder so the number of columns stays fixed
        return self.enc.transform(X)

Now I get the correct shapes with your code:

(5, 14)
(1000, 14)
(10000, 14)

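Note: the same problem has been asked here. I assume you are using the code from link here. If you are using any other site, the code there is most likely an older version of the code I linked. Try the code at the link above for an updated, error-free version.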
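The linked code is not reproduced here. As one possible modern equivalent, assuming scikit-learn 0.20 or later, the same kind of pipeline can be sketched with `ColumnTransformer` and `OneHotEncoder`, which learn the categories once in fit(). The toy DataFrame and its `rooms` and `income` columns are hypothetical stand-ins for the housing data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the housing data
df = pd.DataFrame({
    "rooms": [3.0, None, 5.0, 4.0, 6.0, 2.0],
    "income": [2.5, 3.1, None, 4.0, 5.2, 1.8],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND", "ISLAND", "NEAR BAY"],
})

# Numeric columns: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer selects the columns itself, so no DataFrameSelector is needed
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["rooms", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

full_pipeline.fit(df)
widths = [full_pipeline.transform(df.iloc[:n]).shape[1] for n in (2, 4, 6)]
print(widths)  # the same width for every slice
```

With `handle_unknown="ignore"`, a category that was not seen during fit() is encoded as all zeros instead of raising an error, so the output width still does not change.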

【Comments】:

@NingyuanChen Please accept the answer if it helped.
