How do I mask the padding in a BLSTM in Keras?

Posted 2016-10-15 12:21:47

I'm running a BLSTM based on the IMDB example, but my version is not classification but sequence prediction of labels. For simplicity you can think of it as a POS-tagging model. Inputs are sentences of words, outputs are labels. The syntax used in that example differs slightly from most other Keras examples in that it doesn't use model.add but starts from an Input tensor instead. I can't figure out how to add a Masking layer in this slightly different syntax.
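
For reference, in the functional API a layer is applied by calling it on a tensor rather than via model.add, so a Masking layer would in principle be wired in like this (a generic sketch with toy shapes, not my actual model):

from keras.layers.core import Masking
from keras.layers import Input, LSTM

inp = Input(shape=(100, 50))            # toy shape: (timesteps, features)
masked = Masking(mask_value=0.)(inp)    # timesteps that are all zeros get masked
out = LSTM(output_dim=64, return_sequences=True)(masked)

What I can't work out is where this fits when the input is integer word IDs that go through an Embedding first.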

I've run the model and tested it, and it works fine, but it predicts and evaluates the 0s as well, which are my padding. Here is the code:

from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers.core import Masking
from keras.layers import TimeDistributed, Dense
from keras.layers import Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
from keras.utils import np_utils, generic_utils


np.random.seed(1337)  # for reproducibility
nb_words = 20000  # max. size of vocab
nb_classes = 10  # number of labels
hidden = 500  # 500 gives best results so far
batch_size = 10  # create and update net after 10 lines
val_split = .1
epochs = 15

# input for X is multi-dimensional numpy array with IDs,
# one line per array. input y is multi-dimensional numpy array with
# binary arrays for each value of each label.
# maxlen is length of longest line
print('Loading data...')
(X_train, y_train), (X_test, y_test) = prep_scan(
    nb_words=nb_words, test_len=75)

print(len(X_train), 'train sequences')
print(int(len(X_train)*val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')

# maxlen (length of the longest padded line) isn't unpacked above;
# with pre-padded input it is just the number of columns
maxlen = X_train.shape[1]

# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')

# this embedding layer will transform the sequences of integers
# into vectors
embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen)(sequence)

# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
                 go_backwards=True)(embedded)

# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
after_dp = Dropout(0.15)(merged)

# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
    Dense(output_dim=nb_classes,
          activation='softmax'))(after_dp)

model = Model(input=sequence, output=output)

# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam')

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split)

UPDATE:

I merged in this PR and got it working with mask_zero=True in the embedding layer. But after seeing the model's terrible performance I now realize I would also need masking on the output; others have suggested using sample_weight in the model.fit line instead. How can I do that so the 0s are ignored?
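
For the record, one way to build such a matrix is to put a 0 at every padded timestep and a 1 everywhere else (a sketch assuming 0 is the padding ID in X_train; it also requires sample_weight_mode='temporal' in compile, as in the code further down):

import numpy as np

# 1 at real timesteps, 0 at padded timesteps (assumes pad ID is 0)
weights = (X_train != 0).astype(np.float32)  # shape (nb_samples, maxlen)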

UPDATE 2:

So I read this and figured out that sample_weight is a matrix of 1s and 0s. I thought it might have been working all along, but my accuracy stalls around 50%, and I just found out that the model is still trying to predict the padded parts, only it no longer predicts them as 0, which was the problem before using sample_weight.
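
To see what is actually happening I've been checking accuracy by hand, excluding the padded positions (a sketch assuming y_test is one-hot per timestep and 0 is the pad ID):

import numpy as np

pred = model.predict(X_test).argmax(axis=-1)  # (nb_samples, maxlen)
true = y_test.argmax(axis=-1)
mask = X_test != 0                            # real (non-padded) positions
print('accuracy excluding padding:', (pred == true)[mask].mean())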

Current code:

from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers.core import Masking
from keras.layers import TimeDistributed, Dense
from keras.layers import Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
from keras.utils import np_utils, generic_utils
import itertools
from itertools import chain
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd


np.random.seed(1337)  # for reproducibility
nb_words = 20000  # max. size of vocab
nb_classes = 10  # number of labels
hidden = 500  # 500 gives best results so far
batch_size = 10  # create and update net after 10 lines
val_split = .1
epochs = 10

# input for X is multi-dimensional numpy array with syll IDs,
# one line per array. input y is multi-dimensional numpy array with
# binary arrays for each value of each label.
# maxlen is length of longest line
print('Loading data...')
(X_train, y_train), (X_test, y_test), maxlen, sylls_ids, tags_ids, weights = prep_scan(nb_words=nb_words, test_len=75)

print(len(X_train), 'train sequences')
print(int(len(X_train) * val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')

# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')

# this embedding layer will transform the sequences of integers
# into dense vectors of size hidden (500)
embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
                 go_backwards=True)(embedded)

# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
# after_dp = Dropout(0.)(merged)

# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
    Dense(output_dim=nb_classes,
          activation='softmax'))(merged)

model = Model(input=sequence, output=output)

# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam',
              sample_weight_mode='temporal')

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split,
          sample_weight=weights)

Comments:

This is an old question, but did you ever solve it? I'm at the same stage... I've found that accuracy doesn't take sample_weight into account, and from my tests neither does masking (in fact using masking produces a different accuracy value that I haven't been able to account for yet). I may end up using the functional API to build a second output for an accurate figure.

Would very much appreciate this question being revisited and the code simplified against current Keras.

Answer 1:

Did you ever solve this problem? It isn't clear to me how your code handles the padding value and the word indices. What about letting the word indices start from 1 and defining

embedded = Embedding(nb_words + 1, output_dim=hidden,
                 input_length=maxlen, mask_zero=True)(sequence)

instead of

embedded = Embedding(nb_words, output_dim=hidden,
                 input_length=maxlen, mask_zero=True)(sequence)

as per https://keras.io/layers/embeddings/?
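
In other words, reserve index 0 exclusively for padding when building the vocabulary, something like the following sketch (counts stands in for whatever word-frequency dict your preprocessing produces; the name is made up):

# reserve id 0 for padding; real words get ids 1..nb_words
vocab = sorted(counts, key=counts.get, reverse=True)[:nb_words]
word_to_id = dict((w, i + 1) for i, w in enumerate(vocab))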
