Problem with integrating keras into a sklearn pipeline

Posted: 2021-03-06 05:24:06

Question:

I am using the scikit-learn wrapper to find the best hyperparameters for my Keras model. In short, the model is a convolutional autoencoder that takes input of shape (x,x,x), while the Keras wrapper seems to expect targets of shape (x,x). Since this is an autoencoder, the target data also has shape (x,x,x), and I believe that is why I get the error ValueError: Invalid shape for y: (3744, 288, 1). How can I solve this?

Full code

"""
# Load libraries
"""
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from matplotlib import pyplot as plt

from sklearn.model_selection import GridSearchCV

from keras.wrappers.scikit_learn import KerasClassifier

# Set random seed
np.random.seed(0)

"""
## Load the data
"""

master_url_root = "https://raw.githubusercontent.com/numenta/NAB/master/data/"

df_small_noise_url_suffix = "artificialNoAnomaly/art_daily_small_noise.csv"
df_small_noise_url = master_url_root + df_small_noise_url_suffix
df_small_noise = pd.read_csv(
    df_small_noise_url, parse_dates=True, index_col="timestamp"
)

df_daily_jumpsup_url_suffix = "artificialWithAnomaly/art_daily_jumpsup.csv"
df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix
df_daily_jumpsup = pd.read_csv(
    df_daily_jumpsup_url, parse_dates=True, index_col="timestamp"
)



"""
## Prepare training data
"""


# Normalize and save the mean and std we get,
# for normalizing test data.
training_mean = df_small_noise.mean()
training_std = df_small_noise.std()
df_training_value = (df_small_noise - training_mean) / training_std
print("Number of training samples:", len(df_training_value))

"""
### Create sequences
Create sequences combining `TIME_STEPS` contiguous data values from the
training data.
"""

TIME_STEPS = 288

# Generated training sequences for use in the model.
def create_sequences(values, time_steps=TIME_STEPS):
    output = []
    for i in range(len(values) - time_steps):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)


x_train = create_sequences(df_training_value.values)
print("Training input shape: ", x_train.shape)

"""
## Build a model

We will build a convolutional reconstruction autoencoder model. The model will
take input of shape `(batch_size, sequence_length, num_features)` and return
output of the same shape. In this case, `sequence_length` is 288 and
`num_features` is 1.
"""

# Create function returning a compiled network
def create_network(optimizer='Adam'):
    model = keras.Sequential(
        [
            layers.Input(shape=(x_train.shape[1], x_train.shape[2])),
            layers.Conv1D(
                filters=32, kernel_size=7, padding="same", strides=2, activation="relu"
            ),
            layers.Dropout(rate=0.2),
            layers.Conv1D(
                filters=16, kernel_size=7, padding="same", strides=2, activation="relu"
            ),
            layers.Conv1DTranspose(
                filters=16, kernel_size=7, padding="same", strides=2, activation="relu"
            ),
            layers.Dropout(rate=0.2),
            layers.Conv1DTranspose(
                filters=32, kernel_size=7, padding="same", strides=2, activation="relu"
            ),
            layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),
        ]
    )
    model.compile(optimizer=keras.optimizers.optimizer(learning_rate=0.001), loss="mse", metrics=['mae'])

    return model

# Hyper-parameter tuning

# Wrap Keras model so it can be used by scikit-learn
CAE = KerasClassifier(build_fn=create_network, verbose=0)

# Create hyperparameter space
epochs = [5, 10]
batches = [5, 10, 100]
optimizers = ['rmsprop', 'adam']

# Create hyperparameter options
hyperparameters = dict(optimizer=optimizers, epochs=epochs, batch_size=batches)

# Create grid search
grid = GridSearchCV(estimator=CAE, cv=3, param_grid=hyperparameters)

# Fit grid search (we use the training data as the target here since this is a reconstruction model)
grid_result = grid.fit(x_train, x_train, validation_split=0.1)

# View hyperparameters of best neural network
print(grid_result.best_params_)

Comments:

I changed 1 line to make the example work: return model # not return network; the way you select the optimizer is also incorrect, please fix that as well.

Answer 1:

This is a quirk of KerasClassifier.fit(). If you look at its source code, you will see that it raises an error whenever y has more than 2 dimensions. Perhaps it was not written with autoencoders in mind :)
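For reference, the failing check inside KerasClassifier.fit boils down to something like the helper below. This is a paraphrase for illustration only, not the literal library source (the exact code varies by Keras version), and _check_classifier_target is a made-up name:

import numpy as np

def _check_classifier_target(y):
    # KerasClassifier only accepts 1-D integer labels or 2-D one-hot labels
    y = np.array(y)
    if len(y.shape) == 2 and y.shape[1] > 1:
        return np.arange(y.shape[1])          # one-hot encoded labels
    if (len(y.shape) == 2 and y.shape[1] == 1) or len(y.shape) == 1:
        return np.unique(y)                   # integer class labels
    raise ValueError('Invalid shape for y: ' + str(y.shape))

_check_classifier_target(np.zeros((3744, 288, 1)))  # ValueError: Invalid shape for y: (3744, 288, 1)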

Your options are:

1. Subclass KerasClassifier and fix this limitation in fit().
2. Use another optimization engine (my preference is optuna); a sketch follows the code below.
3. Squeeze the extra dimension out at the end of the model and drop the matching dimension from y_train.

For option 3), use these lines:

layers.Reshape((288,))  # add at the end of the model constructor

y_train = x_train.reshape(x_train.shape[:-1])  # to match the above change 

grid_result = grid.fit(x_train, y_train, validation_split=0.1)  # feed y_train
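For option 2), here is a minimal optuna sketch using the same search space as the grid above. It assumes create_network has been fixed so that its optimizer argument is actually applied (for example via keras.optimizers.get(optimizer)); the other names reuse the variables from the question.

import optuna

def objective(trial):
    # Same candidates as the grid search above
    optimizer = trial.suggest_categorical("optimizer", ["rmsprop", "adam"])
    batch_size = trial.suggest_categorical("batch_size", [5, 10, 100])
    epochs = trial.suggest_categorical("epochs", [5, 10])

    model = create_network(optimizer=optimizer)
    history = model.fit(
        x_train, x_train,
        epochs=epochs, batch_size=batch_size,
        validation_split=0.1, verbose=0,
    )
    # Minimize the validation reconstruction loss of the last epoch
    return history.history["val_loss"][-1]

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=12)
print(study.best_params)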


Answer 2:

There is an even more elegant solution: replace keras.wrappers.scikit_learn.KerasClassifier with keras.wrappers.scikit_learn.KerasRegressor. The latter does not check the dimensions of y.
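A minimal sketch of that swap, reusing the names from the question (only the wrapper class changes):

from keras.wrappers.scikit_learn import KerasRegressor

# Identical setup to the question, but the regressor wrapper accepts a 3-D reconstruction target
CAE = KerasRegressor(build_fn=create_network, verbose=0)
grid = GridSearchCV(estimator=CAE, cv=3, param_grid=hyperparameters)
grid_result = grid.fit(x_train, x_train, validation_split=0.1)
print(grid_result.best_params_)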

