Unable to get pipeline.fit() to work using Sklearn and Keras Wrappers

Posted: 2018-11-03 23:12:16

Question:

I am getting a ValueError (not enough values to unpack (expected 2, got 1)). I have a network that I want to train:

def build(self):
    numpy.random.seed(self.seed)
    self.estimators.append(('standardize', StandardScaler))
    self.estimators.append(('mlp', KerasClassifier(build_fn=self.build_fn, epochs=50, batch_size=5, verbose=0)))
    self.pipeline = Pipeline(self.estimators)

Now, if I want to fit the pipeline to some data, say self.X and self.Y:

self.model = self.pipeline.fit(self.X, self.Y, verbose=1)

I get:

Traceback (most recent call last):
  File "C:/Users/jaehan/PycharmProjects/cerebro/cerebro.py", line 257, in <module>
    model.run()
  File "C:/Users/jaehan/PycharmProjects/cerebro/cerebro.py", line 138, in run
    self.model = self.pipeline.fit(self.X, self.Y, verbose=1)
  File "C:\Users\jaehan\AppData\Local\Continuum\anaconda3\envs\py36\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jaehan\AppData\Local\Continuum\anaconda3\envs\py36\lib\site-packages\sklearn\pipeline.py", line 197, in _fit
    step, param = pname.split('__', 1)
ValueError: not enough values to unpack (expected 2, got 1)
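The last frame of this traceback can be reproduced without Keras. In the scikit-learn version shown in the traceback, Pipeline.fit() routes extra keyword arguments to individual steps and expects names of the form stepname__parameter. A bare verbose=1 contains no double underscore, so the split yields one piece instead of two. The sketch below mirrors only that split; the helper name route_fit_param is hypothetical, and the step name 'mlp' is taken from the estimator list in the question.

```python
def route_fit_param(pname):
    """Hypothetical helper: split a fit parameter name the same way as the
    `step, param = pname.split('__', 1)` line in the traceback."""
    step, param = pname.split('__', 1)
    return step, param


# A prefixed name splits into (step name, parameter name).
routed = route_fit_param('mlp__verbose')

# A bare name has no '__', so unpacking into two names fails.
try:
    route_fit_param('verbose')
    bare_error = None
except ValueError as exc:
    bare_error = str(exc)

print(routed)
print(bare_error)
```

Under that reading, the call would probably need to be written as self.pipeline.fit(self.X, self.Y, mlp__verbose=1) so that verbose is forwarded to the 'mlp' step.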

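Am I doing something wrong here? I was under the impression that I could run fit once and that it would return a history object which I could save and load whenever I want.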

I even tried...

self.pipeline.fit(self.X, self.Y)

which throws...

AttributeError: 'numpy.ndarray' object has no attribute 'fit'

I have no idea what is going on here.

Full code

class Cerebro:
    def __init__(self):
        self.model = None
        self.build_fn = None
        self.data = None
        self.X = None
        self.Y = None
        #these three are for encoding string values to integer_encodings / one hot encodings
        self.encoder = LabelEncoder()
        self.encodings = {}
        self.one_hot_encodings = {}
        self.seed = numpy.random.seed(7) #this is to ensure we have reproducible results.
        self.estimators = []
        self.pipeline = None
        self.kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=self.seed)
        self.cross_validation_score = 0.0

    def preprocess(self):
        """
        This method will preprocess the dataset we want to train our network on. 

        Example:
            import preprocessing
            ...

            dataset, X, Y = preprocessing.main()


        """

        self.data = pandas.read_csv('src_examples/hwtxn_final_for_influx.txt', sep='\t').values
        self.X = numpy.delete(self.data, 13, axis=1)
        self.Y = self.data[:, 13].astype(numpy.float16)

    def build(self):
        self.build_fn = self.base_model()

        self.preprocess()

        numpy.random.seed(self.seed)
        self.estimators.append(('standardize', StandardScaler()))
        self.estimators.append(('mlp', KerasClassifier(build_fn=self.build_fn, epochs=50, batch_size=5, verbose=0)))
        self.pipeline = Pipeline(self.estimators)

    def run(self):
        """This will actually take the pipeline (preprocessing standardization, model)
        and fit it to our dataset (X, Y) (We don't need test/train since we are using stratified k fold cross val.)

        Args:
            None
        Returns:
            None
        """

        # this is the 'model'
        # self.pipeline
        print(type(self.pipeline))
        print(self.X.shape)
        self.model = self.pipeline.fit(self.X, self.Y)



    def load(self, fn):
        """This will load a saved model (history object)

        Args:
            fn (filename): represents saved model file
        Returns:
            model (pkl object): represents model

        """
        return pickle.load(open(fn, 'rb'))

    def save(self, fn):
        """This will save a model (history object)

        Args:
            fn (filename): represents a filename to save the model as
        Returns:
            None
        """
        pickle.dump(self.model, open(fn, 'wb'))

    def encode(self, vals, key):
        """ This method will encode a list of values and take a key (representing column name, or index) to save
        in the class object (self.encodings)
        This will help us keep track of encodings we have for values we need to translate/decipher.

        Args:
            vals(np.array): array of values to encode
            key(str): str representing the key used to encode this particular set of values
        Returns:
            transformed values (np.array) representing the encoded versions of values
        """
        # int encoding for non int values
        self.encodings[key] = self.encoder.fit_transform(vals)
        return self.encoder.fit_transform(vals)

    def decoder(self, vals, key):
        """This method will decode the integer_encodings for class variables. It will take vals which
        represents a list of values to decode (i.e. [1,2,3] -- [apple, pear, orange])
        It will also take a key (since every decoding has a corresponding encoding) to find which encoding
        scheme to map to

        Args:
            vals(np.array) : array of values to decode
            key(str) : string representing the key used for encoding the values (for decoding it)
        Returns:
            inverse transform of encoded values (np.array)
        """
        # translate int encodings to original values (encoder._classes)
        return self.encodings[key].inverse_transform(vals)

    def cross_validate(self):
        """
        This will perform a cross validation score using a stratified kfold method. (Think traditional Kfold but
        with the values evenly distributed for each subsample)

        Args:
            None
        Returns:
            None
        """
        self.cross_validation_score = cross_val_score(self.pipeline, self.X, self.Y, cv=self.kfold)
        return self.cross_validation_score

    @staticmethod
    def base_model():
        """
        This will return a base model for us to try. The good thing about this implementation is that
        when we decide we want something more complex then all we have to do is define a class function and replace
        the values in the build f(x)

        Args:
            None
        Returns:
            model (keras.models.Sequential): Keras based DNN Model
        """

        # create model
        model = Sequential()
        model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
        model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    @staticmethod
    def one_hot_encoder(int_encoding):
        """
        This will take an integer encoding of string variables (traditional preprocessing step, will probably
        move this to the preprocessing package).
        Essentially it returns a binary 'one hot' encoding of the values we wish to encode

        Example
        #Dataset Values
        [apple, orange, pear]
        #Integer Encoding
        [1, 2, 3]
        #One Hot Encoding
        [[1, 0, 0]
         [0, 1, 0]
         [0, 0, 1]]

        Args:
            None
        Returns:
            Matrix (np.array): matrix representing one hot vectors for a class of values
        """
        # we might not need this... so for now we will keep it static
        return OneHotEncoder(sparse=False).fit_transform(int_encoding.reshape(len(int_encoding), 1))

if __name__ == '__main__':
    # Step 1 is to initialize class (with seed == 7)
    model = Cerebro()
    model.build()
    model.cross_validate()
    print("Here are our estimators:\n {}".format(model.estimators))
    print("Here is our pipeline:\n {}".format(model.pipeline))
    model.run()

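EDIT: The answer is that the build_fn argument of KerasClassifier expects a function (a callable that builds the model), not the model itself.

IMHO an error should be raised in this situation.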

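Comments:

Add some data so that we can reproduce the error.

self should be used exclusively as a keyword inside classes. This is really confusing. Can you provide the attributes of self? Besides, how is this related to sklearn-pandas?

The last error "'numpy.ndarray' object has no attribute 'fit'" indicates that somewhere you changed the pipeline object and assigned a data array to it. Show the full code.

Just added the full code @seralouk

Sorry, I was trying to be concise. I should have mentioned that this is a class implementation @Quickbeam2k1

Answer 1: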

This is caused by the following line:

self.build_fn = self.base_model()

It should actually be:

self.build_fn = self.base_model

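KerasClassifier expects a reference to a function that creates the model. By appending () you call that function and assign the resulting model to build_fn, which is wrong.

Apart from the error above, I suggest checking the following lines in your code. If they are not corrected, they will cause errors when you use that code later.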
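The difference between the two lines can be shown without Keras. The sketch below uses a hypothetical LazyWrapper class that, like a scikit-learn style wrapper, stores build_fn and only calls it when fit() runs. It illustrates the idea and is not the actual KerasClassifier implementation; FakeModel and this base_model are stand-ins for the Keras model and its factory function.

```python
class FakeModel:
    """Stand-in for a compiled Keras model (hypothetical)."""

    def fit(self, X, y):
        return "trained on {} samples".format(len(X))


def base_model():
    """Factory: builds and returns a fresh model each time it is called."""
    return FakeModel()


class LazyWrapper:
    """Hypothetical wrapper that builds its model at fit time."""

    def __init__(self, build_fn):
        self.build_fn = build_fn

    def fit(self, X, y):
        # Unlike the wrapper in the question, this sketch fails loudly.
        if not callable(self.build_fn):
            raise TypeError("build_fn must be a function, got {}".format(
                type(self.build_fn).__name__))
        self.model = self.build_fn()  # the model is created here, once per fit
        return self.model.fit(X, y)


X, y = [[0.0], [1.0], [2.0]], [0, 1, 0]

# Passing the function reference: the wrapper builds the model itself.
ok = LazyWrapper(build_fn=base_model).fit(X, y)

# Passing base_model(): the wrapper receives an already-built model.
try:
    LazyWrapper(build_fn=base_model()).fit(X, y)
    wrong = None
except TypeError as exc:
    wrong = str(exc)

print(ok)
print(wrong)
```

Passing the function rather than its result also lets a wrapper of this kind build a fresh model for every call to fit(), which matters when cross-validation fits one model per fold.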

1)self.encodings[key] = self.encoder.fit_transform(vals)

Here you assign the transformed data to encodings[key], not the fitted encoder. So when you do:

self.encodings[key].inverse_transform(vals)

you are calling inverse_transform() on the transformed data, which makes no sense.

inverse_transform() is a method of a scikit-learn transformer, but self.encodings[key] is an ndarray, because you stored the output array of fit_transform().
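One way to make decoder() work is to keep a fitted encoder per key instead of the encoded array. The following is a minimal sketch, assuming scikit-learn's LabelEncoder as used in the question; the module-level encoders dict and the two function names are hypothetical and only mirror encode() and decoder() from the class.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoders = {}  # hypothetical store: key -> fitted LabelEncoder


def encode(vals, key):
    """Fit a fresh encoder for this key, keep it, and return the integer codes."""
    encoder = LabelEncoder()
    codes = encoder.fit_transform(vals)
    encoders[key] = encoder  # store the fitted encoder, not the ndarray
    return codes


def decode(codes, key):
    """Map integer codes back to the original values using the stored encoder."""
    return encoders[key].inverse_transform(codes)


fruit = np.array(['apple', 'pear', 'orange', 'pear'])
codes = encode(fruit, 'fruit')
restored = decode(codes, 'fruit')

print(codes)
print(restored)
```

Using a separate encoder per key also avoids refitting one shared encoder, which would replace the classes learned for the previous column.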

2) Something similar to 1) happens in one_hot_encoder().

The error "AttributeError: 'numpy.ndarray' object has no attribute 'fit'" seems to be related to 1) and 2).

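Comments:

I would like to upvote this answer, but points 2) and 3) completely miss the scope of the question; I will upvote as soon as the answer is edited. The problem with .fit() was that the build_fn parameter pointed to a model rather than to the function itself (as you said). The other two points you mention are never called in this process and are unrelated to the question presented (I know we need to preprocess before fitting; this was only to make sure the model compiles correctly).

@codebrotherone I rejected your edit and edited the answer myself to make it clear. Have a look.

Cool, sounds good. While making those edits I had already fixed the encoding issue, but I appreciate your generosity! I would say those errors neither conflict with nor relate to the original question about the fit() method, which does not depend on the encodings (until the data starts being processed). The whole point here was to make sure it compiles, but I will go ahead and mark this as the accepted answer nonetheless. Thanks for your help! :)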
