LSTM word prediction model predicts only the most frequent words, or which loss to use for imbalanced data

Posted 2023-03-28


Asked: 2019-11-30 22:26:21

Question:

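I decided to try building a word prediction model with a recurrent neural network. There are many examples online, including in online courses, that make building such a model sound fairly easy. Most of them use an LSTM. Also, most if not all of them use very small datasets. I decided to try a larger one, namely the 20 Newsgroups dataset (from sklearn.datasets import fetch_20newsgroups). I did some minimal preprocessing: removing punctuation, stop words and numbers.

I predict a word from a history of the 10 preceding words. I use only posts that have at least 11 words. For each post, I build the training set by sliding a window of size 11 along the post. At each position, the first 10 words are the predictors and the 11th word is the target. I put together a simple model: an embedding layer, an LSTM layer and an output dense layer. Here is the code: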

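# Imports are not shown in the original post; standalone Keras is assumed here.
import numpy as np
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Embedding, Input, LSTM
from keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

def make_prediction_sequences(input_texts, max_nb_words, sequence_length = 10):
    # input_texts is a list of strings/texts

    # select top vocab_size words based on the word counts
    # word_index is the dictionary used to transform the words into the tokens.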
    tokenizer = Tokenizer(oov_token='UNK',num_words=max_nb_words)
    tokenizer.fit_on_texts(input_texts)
    sequences = tokenizer.texts_to_sequences(input_texts)

    prediction_sequences = []
    for sequence in sequences:
        if len(sequence) > sequence_length: # at least 1 for prediction
            for j in range(0,len(sequence) - sequence_length):
                prediction_sequences.append(sequence[j:sequence_length+j+1])

    word_index = {e: i-1 for e, i in tokenizer.word_index.items() if i <= max_nb_words}  # i-1 because tokenizer is 1 indexed


    return (np.array(prediction_sequences) , word_index)

def batch_sequence_data(prediction_sequences, batch_size, sequence_length, vocab_size):
    number_batches = int(len(prediction_sequences)/batch_size)
    while True:
        for i in range(number_batches):
            X = prediction_sequences[i*batch_size:(i+1)*batch_size, 0:sequence_length]
            Y = to_categorical(prediction_sequences[i*batch_size:(i+1)*batch_size, sequence_length], num_classes=vocab_size)
            yield np.array(X),Y

VOCAB_SIZE = 15000
SEQUENCE_LENGTH = 10
BATCH_SIZE = 128
prediction_sequences, word_index = make_prediction_sequences(data, VOCAB_SIZE, sequence_length=SEQUENCE_LENGTH)

## define the model
EMBEDDING_DIM = 64
rnn_size = 32

sequence_input = Input(shape=(SEQUENCE_LENGTH,), dtype='int32', name='rnn_input')
embedding_layer = Embedding(len(word_index), EMBEDDING_DIM, input_length=SEQUENCE_LENGTH)
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(rnn_size, use_bias=True)(embedded_sequences)
preds = Dense(VOCAB_SIZE, activation='softmax')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['categorical_accuracy'])

#train the model
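# integer division: steps_per_epoch is a step count; with this divisor one epoch
# covers roughly 1/SEQUENCE_LENGTH of the batches the generator can produce
steps_per_epoch = len(prediction_sequences) // (BATCH_SIZE * SEQUENCE_LENGTH)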
earlystop = EarlyStopping(patience=3, restore_best_weights=True,monitor='loss')
history = model.fit_generator(batch_sequence_data(prediction_sequences, BATCH_SIZE, SEQUENCE_LENGTH, VOCAB_SIZE), 
                    steps_per_epoch = steps_per_epoch, epochs=30, callbacks=[earlystop])
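
To make the windowing in make_prediction_sequences concrete, here is a minimal sketch on a toy list of token ids. The helper name sliding_windows and the toy values are illustrative assumptions, not part of the original post; a history of 3 words is used instead of 10 to keep the output short.

```python
def sliding_windows(tokens, sequence_length=10):
    """Return every run of sequence_length + 1 consecutive tokens."""
    return [tokens[j:j + sequence_length + 1]
            for j in range(len(tokens) - sequence_length)]

# Toy post of 6 token ids (hypothetical values).
windows = sliding_windows([5, 8, 2, 9, 4, 7], sequence_length=3)
for w in windows:
    # the last element of each window is the target, the rest are predictors
    print("predictors:", w[:-1], "target:", w[-1])
```

A post shorter than sequence_length + 1 tokens yields no windows, which matches the "at least 11 words" rule described above.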

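Training reaches an accuracy of ~0.1. When I apply the model to predict the next word for 10-word snippets from the training data, the output is overwhelmingly the most frequent word, "one".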
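
One way to put the ~0.1 figure in context is to compare it with the accuracy of always predicting the most frequent target, and to count which words the model actually outputs. The sketch below uses plain NumPy; the function names and toy arrays are assumptions for illustration. In the setup above, the targets would be the last column of prediction_sequences and the probabilities would come from model.predict.

```python
import numpy as np

def majority_baseline(targets):
    """Return (most frequent class id, fraction of samples it covers)."""
    counts = np.bincount(np.asarray(targets))
    top = int(counts.argmax())
    return top, counts[top] / counts.sum()

def prediction_histogram(probabilities, top_k=5):
    """Count how often each class is the argmax of the model output."""
    probabilities = np.asarray(probabilities)
    predicted = probabilities.argmax(axis=1)
    counts = np.bincount(predicted, minlength=probabilities.shape[1])
    order = np.argsort(counts)[::-1][:top_k]
    # keep only classes that were predicted at least once
    return [(int(c), int(counts[c])) for c in order if counts[c] > 0]

# Toy stand-ins for the real targets and the softmax outputs (hypothetical values).
toy_targets = np.array([1, 1, 1, 2, 3, 1, 2, 4])
toy_probs = np.array([[0.0, 0.6, 0.2, 0.1, 0.1]] * 6 +
                     [[0.0, 0.2, 0.6, 0.1, 0.1]] * 2)
print(majority_baseline(toy_targets))
print(prediction_histogram(toy_probs))
```

If the model's accuracy is close to the majority baseline and the histogram is dominated by one class, the model has probably learned little beyond word frequency.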

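I tried a more complex model with 2 LSTM layers and 2 dense layers. I tried pretrained word embeddings from a gensim word2vec model. The accuracy is always ~0.1, and most of the time the prediction is "one".

When I think about it, it kind of makes sense. Predicting the most frequent class for imbalanced data gives high accuracy "for free". It is apparently a local minimum, but one that is hard to escape. The thing is, the algorithm does not optimize accuracy, it minimizes the loss, which is categorical_crossentropy, and that should work fine for imbalanced data. But maybe that is not always true, and there is a different loss that handles imbalanced data better?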
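
A small numeric illustration of why minimizing cross-entropy does not by itself prevent this behavior: if a model ignores its input and outputs the same distribution q for every sample, the average cross-entropy is lowest when q equals the empirical class frequencies, and the argmax of that q is always the most frequent class. The sketch below checks this on toy data; the function name and the data are assumptions for illustration only.

```python
import numpy as np

def mean_cross_entropy(targets, q):
    """Average categorical cross-entropy when every sample gets the same prediction q."""
    return float(-np.mean(np.log(q[targets])))

# Toy imbalanced targets over 4 classes (hypothetical data).
targets = np.array([0] * 6 + [1] * 2 + [2] + [3])
unigram = np.bincount(targets, minlength=4) / len(targets)  # empirical frequencies
uniform = np.full(4, 0.25)

print("unigram loss:", mean_cross_entropy(targets, unigram))
print("uniform loss:", mean_cross_entropy(targets, uniform))
print("argmax of unigram prediction:", int(unigram.argmax()))
```

So a network that has extracted little from the 10-word context can still lower its loss by learning the word frequencies, and reading off the argmax then returns the most frequent word every time.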

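Comments:

You can try using pretrained embeddings like GloVe; training embeddings requires a lot of data (billions of tokens). Your approach seems too simple for this task.

@meowongac, thanks for the suggestion. I tried GoogleNews-vectors-negative300.bin, with the same result. I agree that the approach seems too simple, but since it is advertised in so many places, I decided to try it myself on data of real-world size.

Answer 1: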

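After looking around, I found a research paper that introduces focal loss and, conveniently, a GitHub implementation of it for Keras.

Combined with @meowongac's suggestion (I used the Google word2vec embeddings), this resulted in much better sampling of less frequent words.

I also separately tried class_weight: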
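
For reference, the basic focal loss for a sample whose true class receives probability p_t is -(1 - p_t)^gamma * log(p_t). The NumPy sketch below only illustrates that formula; it is not the GitHub implementation mentioned above, it omits the optional per-class alpha weighting, and the function name and toy arrays are assumptions.

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Per-sample focal loss for one-hot targets and softmax outputs."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)       # avoid log(0)
    p_t = np.sum(y_true * y_pred, axis=-1)         # probability given to the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.9, 0.05, 0.05],   # easy example: true class gets 0.9
                   [0.9, 0.05, 0.05]])  # hard example: true class gets 0.05
print(focal_loss(y_true, y_pred, gamma=0.0))  # gamma = 0 reduces to cross-entropy
print(focal_loss(y_true, y_pred, gamma=5.0))  # gamma = 5 as used in this answer
```

With a larger gamma, samples the model already classifies confidently contribute almost nothing to the loss, so the gradient is dominated by the poorly predicted (here, less frequent) words.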

model.fit_generator(batch_sequence_data(prediction_sequences, 
                    BATCH_SIZE, SEQUENCE_LENGTH, VOCAB_SIZE), 
                    steps_per_epoch = steps_per_epoch, epochs=30, callbacks=[earlystop],
                    class_weight = class_weight)

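which I set inversely proportional to the word frequency. Again, combined with the Google word embeddings, it worked better in the sense that it found less frequent words.

For example, for the 10-word sequence: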
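
The answer does not show how class_weight was built. A minimal sketch of one way to make the weights inversely proportional to frequency follows; the function name, the normalization (total count divided by the number of seen classes times the class count) and the weight of 1.0 for classes that never occur as a target are all assumptions, not the author's code.

```python
import numpy as np

def inverse_frequency_class_weight(targets, vocab_size):
    """Map each class id to a weight proportional to 1 / (its count in targets)."""
    counts = np.bincount(np.asarray(targets), minlength=vocab_size).astype(float)
    seen = counts > 0
    weights = np.ones(vocab_size)  # unseen classes keep 1.0 (arbitrary choice)
    weights[seen] = counts.sum() / (seen.sum() * counts[seen])
    return {int(i): float(w) for i, w in enumerate(weights)}

# Toy targets (hypothetical); class 1 is the most frequent.
toy_targets = np.array([1, 1, 1, 1, 2, 2, 3])
class_weight = inverse_frequency_class_weight(toy_targets, vocab_size=5)
print(class_weight)
```

In the question's setup, the targets would be prediction_sequences[:, SEQUENCE_LENGTH] and vocab_size would be VOCAB_SIZE.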

['two', 'three', 'marines', 'sort', 'charges', 'pending', 'another', 'fight', 'week', 'interesting']

the focal loss approach with gamma = 5 predicts the next word people, and the class_weight approach predicts attorney.
