Tensorflow (Keras) U-Net 分割训练失败

Posted

技术标签:

【中文标题】Tensorflow (Keras) U-Net 分割训练失败【英文标题】:Tensorflow (Keras) U-Net Segmentation training fails 【发布时间】:2021-08-26 10:34:03 【问题描述】:

我正在尝试使用 Tensorflow 和 Keras 训练 U-Net。型号如下图。

def get_unet(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS):
    inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
    conv1 = Conv2D(32, 3, activation='relu', padding='same', kernel_initializer='he_normal')(inputs)
    conv1 = Conv2D(32, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(pool1)
    conv2 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)
    conv3 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='he_normal')(pool2)
    conv3 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv3)
    pool3 = MaxPooling2D(pool_size=(2, 2))(conv3)
    conv4 = Conv2D(256, 3, activation='relu', padding='same', kernel_initializer='he_normal')(pool3)
    conv4 = Conv2D(256, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv4)
    drop4 = Dropout(0.5)(conv4)
    pool4 = MaxPooling2D(pool_size=(2, 2))(drop4)

    conv5 = Conv2D(512, 3, activation='relu', padding='same', kernel_initializer='he_normal')(pool4)
    conv5 = Conv2D(512, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv5)
    drop5 = Dropout(0.5)(conv5)

    up6 = Conv2D(256, 2, activation='relu', padding='same', kernel_initializer='he_normal')(UpSampling2D(size=(2, 2))(drop5))
    merge6 = concatenate([drop4, up6], axis=3)
    conv6 = Conv2D(256, 3, activation='relu', padding='same', kernel_initializer='he_normal')(merge6)
    conv6 = Conv2D(256, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv6)

    up7 = Conv2D(128, 2, activation='relu', padding='same', kernel_initializer='he_normal')(UpSampling2D(size=(2, 2))(conv6))
    merge7 = concatenate([conv3, up7], axis=3)
    conv7 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='he_normal')(merge7)
    conv7 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv7)

    up8 = Conv2D(64, 2, activation='relu', padding='same', kernel_initializer='he_normal')(UpSampling2D(size=(2, 2))(conv7))
    merge8 = concatenate([conv2, up8], axis=3)
    conv8 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(merge8)
    conv8 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv8)

    up9 = Conv2D(32, 2, activation='relu', padding='same', kernel_initializer='he_normal')(UpSampling2D(size=(2, 2))(conv8))
    merge9 = concatenate([conv1, up9], axis=3)
    conv9 = Conv2D(32, 3, activation='relu', padding='same', kernel_initializer='he_normal')(merge9)
    conv9 = Conv2D(32, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv9)
    conv9 = Conv2D(2, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv9)
    conv10 = Conv2D(1, 1, activation='sigmoid')(conv9)

    model = Model(inputs=[inputs], outputs=[conv10])

    #model.compile(optimizer = Adam(lr = 1e-3), loss = [dice_coef_loss], metrics = [dice_coef])
    model.compile(optimizer = Adam(lr = 1e-4), loss = [jacard_coef_loss], metrics = [jaccard_distance])
    #model.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['accuracy'])

    model.summary()

    return model

问题在于训练(或者可能是数据)。数据是jpg图像。掩码只包含一个类。由于一些小物体,我介绍了 jaccard_distance。培训完成:

seed = 42
np.random.seed = seed

IMG_WIDTH = 512
IMG_HEIGHT = 512
IMG_CHANNELS = 1

TRAIN_PATH = 'data/train/'
TEST_PATH = 'data/test/'

train_ids = next(os.walk(TRAIN_PATH))[1]
test_ids = next(os.walk(TEST_PATH))[1]

X_train = np.zeros((len(train_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.float32)
Y_train = np.zeros((len(train_ids), IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.float32)

print('Resizing training images and masks')
for n, id_ in tqdm(enumerate(train_ids), total=len(train_ids)):
    path = TRAIN_PATH + id_
    print(path)
    img = imread(path + '/images/' + id_ + '.jpg', as_gray=True)
    img = img/255
    img = img[:, :, newaxis]
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[n] = img
    mask = np.zeros((IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.float32)
    for mask_file in next(os.walk(path + '/masks/'))[2]:
        mask_ = imread(path + '/masks/' + mask_file, as_gray=True)
        mask_ = np.expand_dims(resize(mask_, (IMG_HEIGHT, IMG_WIDTH), mode='constant',preserve_range=True), axis=-1)
        mask = np.maximum(mask, mask_)

    Y_train[n] = mask

print('Done!')

model = get_unet_middle(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS)

################################

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, monitor='val_loss'),
    tf.keras.callbacks.ModelCheckpoint('model_for_leak_middle.h5', verbose=1, monitor='val_loss', save_freq='epoch', save_best_only=True)]
results = model.fit(X_train, Y_train, validation_split=0.1, batch_size=4, epochs=50, callbacks=callbacks)

由于使用 jaccard_distance,我使用 np.float32 作为 X_trainY_train 的数据类型。 在降低损失的几个时期之后,损失和 jaccard_distance 突然改变了它们的值。

Train on 180 samples, validate on 21 samples
...
Epoch 20/50
176/180 [============================>.] - ETA: 0s - loss: 0.6334 - jaccard_distance: 63.7578
Epoch 00020: val_loss improved from 0.14278 to 0.14249, saving model to model_for_leak_middle.h5
180/180 [==============================] - 47s 260ms/sample - loss: 0.6345 - jaccard_distance: 63.6719 - val_loss: 0.1425 - val_jaccard_distance: 82.2346
Epoch 21/50
176/180 [============================>.] - ETA: 0s - loss: 0.6041 - jaccard_distance: 63.4807
Epoch 00021: val_loss improved from 0.14249 to 0.14220, saving model to model_for_leak_middle.h5
180/180 [==============================] - 47s 260ms/sample - loss: 0.6041 - jaccard_distance: 63.7875 - val_loss: 0.1422 - val_jaccard_distance: 82.2784
Epoch 22/50
176/180 [============================>.] - ETA: 0s - loss: 0.5860 - jaccard_distance: 63.4502
Epoch 00022: val_loss did not improve from 0.14220
180/180 [==============================] - 47s 259ms/sample - loss: 0.6741 - jaccard_distance: 54.1866 - val_loss: 0.3247 - val_jaccard_distance: 44.9204

完成训练后,历史看起来像......

我做错了什么?

【问题讨论】:

【参考方案1】:

我认为您可能会进行超参数调整,例如您可以将 dropout 降低到 0.1 或 0.2。 我认为您的数据集的大小也可能会影响结果。 如果您有改进,请与我们分享。我也很想知道事情是如何解决的。 万事如意

【讨论】:

以上是关于Tensorflow (Keras) U-Net 分割训练失败的主要内容,如果未能解决你的问题,请参考以下文章

使用 TensorRT 部署语义分割网络(U-Net)(不支持上采样)

Keras深度学习实战(17)——使用U-Net架构进行图像分割

如何在 Keras 的 FCN(U-Net)上使用加权分类交叉熵?

如何在 keras 中正确使用 U-net 批量标准化?

在构建和训练 3D Keras U-NET 时遇到 ValueError

模型输入必须来自 `tf.keras.Input` ...,它们不能是先前非输入层的输出