R: how to combine Word2Vec Embedding and LSTM Network

Posted: 2021-10-13 13:41:36

【Question】

I plan to use Word2Vec (Skip-gram) and an LSTM for text classification. For the code, I referred to Word Embeddings with Keras and How to prepare data for NLP (text classification) with Keras and TensorFlow. However, I am not sure how to combine the two steps.

Currently I have the following code. My understanding is that the first block produces an embedding matrix, which I can then reuse for text classification.

library(keras)
library(purrr)  # transpose(), map()
library(dplyr)  # %>% and tibble()

# Clean the textual data
essay <- tolower(data$corrected) %>%
  text_clean()  # user-defined helper: removes punctuation, stop words, extra spaces, etc.

tokenizer <- text_tokenizer(num_words = max_features)

tokenizer %>%
  fit_text_tokenizer(essay)

# Generator yielding (target, context) pairs with positive/negative labels
skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) {
  
  gen <- texts_to_sequences_generator(tokenizer, sample(text))
  
  function() {
    skip <- generator_next(gen) %>%
      skipgrams(
        vocabulary_size = tokenizer$num_words, 
        window_size = window_size, 
        negative_samples = negative_samples  # use the argument rather than a hard-coded 1
      )
    
    x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
    y <- skip$labels %>% as.matrix(ncol = 1)
    
    list(x, y)
  }
}

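For reference, a quick way to sanity-check the generator before training is to draw a single batch by hand; a hypothetical smoke test using the objects defined above:

# Draw one batch and inspect its structure:
# list(list(target_ids, context_ids), labels)
gen <- skipgrams_generator(essay, tokenizer, window_size = 5, negative_samples = 2)
batch <- gen()
str(batch, max.level = 2)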

# Determine model tuning inputs
embedding_size <- 256  # dimension of the embedding vectors (on choosing this size, see https://stackoverflow.com/questions/48479915/what-is-the-preferred-ratio-between-the-vocabulary-size-and-embedding-dimension)
skip_window <- 5       # skip-gram window size
num_sampled <- 2       # number of negative samples per word (https://stats.stackexchange.com/questions/244616/how-does-negative-sampling-work-in-word2vec)

input_target <- layer_input(shape = 1)
input_context <- layer_input(shape = 1)

embedding <- layer_embedding(
  input_dim = tokenizer$num_words + 1, 
  output_dim = embedding_size, 
  input_length = 1, 
  name = "embedding"
)


target_vector <- input_target %>% 
  embedding() %>% 
  layer_flatten()  # collapse the (1, embedding_size) output to a flat vector

context_vector <- input_context %>%
  embedding() %>%
  layer_flatten()

dot_product <- layer_dot(list(target_vector, context_vector), axes = 1)

output <- layer_dense(dot_product, units = 1, activation = "sigmoid")

model <- keras_model(list(input_target, input_context), output)
model %>% compile(loss = "binary_crossentropy", optimizer = "adam")

# Model training
model %>%
  fit_generator(
    skipgrams_generator(essay, tokenizer, skip_window, num_sampled),  # num_sampled, not the undefined `negative_samples`
    steps_per_epoch = 100, epochs = 30
  )

# Obtain the trained word-embedding weights
embedding_matrix <- get_weights(model)[[1]]

words <- tibble(
  word = names(tokenizer$word_index), 
  id = as.integer(unlist(tokenizer$word_index))
)

words <- words %>%
  filter(id <= tokenizer$num_words) %>%
  arrange(id)

row.names(embedding_matrix) <- c("UNK", words$word)

dim(embedding_matrix)
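As a sanity check that the learned vectors are reasonable, the Word Embeddings with Keras tutorial referenced above looks up nearest neighbours by cosine similarity. A minimal sketch of that check (the query word "school" is just a placeholder; sim2() comes from the text2vec package):

library(text2vec)  # sim2() for cosine similarity

# Return the n words whose embeddings are closest to `word`
find_similar_words <- function(word, embedding_matrix, n = 5) {
  similarities <- embedding_matrix[word, , drop = FALSE] %>%
    sim2(embedding_matrix, y = ., method = "cosine")
  similarities[, 1] %>% sort(decreasing = TRUE) %>% head(n)
}

find_similar_words("school", embedding_matrix)  # placeholder query word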

I then want to use this embedding matrix in the LSTM model below:

text_seqs <- texts_to_sequences(tokenizer, essay)
text_seqs <- pad_sequences(text_seqs, maxlen = 400)

# Hyperparameters (note: embedding_dims, filters, kernel_size and hidden_dims
# are leftovers from a CNN example and are not used by the LSTM model below)
embedding_dims <- 300
filters <- 64 
kernel_size <- 3 
hidden_dims <- 50
epochs <- 10
maxlen <- 400
batch_size <- 500

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128, weights = embedding_matrix) %>%  # I attempted to add the weights here
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 1, activation = 'sigmoid') %>% 
  compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = c('accuracy')
  )

However, the way I combine them must be wrong, because I get the following error message:

Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: You called `set_weights(weights)` on layer "embedding_1" with a weight list of length 1001, but the layer was expecting 1 weights. Provided weights: [[ 0.01752407 -0.03668756 0.00466535 ... 0.03698...

Does anyone know how to use the embedding matrix correctly? Thanks in advance for your help.

【Answer 1】

I am providing a code snippet for the above problem, since it is mainly a shape issue; you can make the corresponding changes in R.

I used a 300-dimensional embedding matrix for my LSTM model:

import numpy as np
from keras.layers import Input, Embedding

embedding_matrix = np.zeros((max_features, 300))
maxlen = 50
inp = Input(shape=(maxlen,))
x = Embedding(max_features, 300, weights=[embedding_matrix])(inp)
.
.
.
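Translated to the R interface, the fix is the same shape bookkeeping: `weights` must be a list containing the matrix (which is exactly what the "weight list of length 1001" error is complaining about), and `input_dim` / `output_dim` must match the matrix dimensions, so the output dimension here has to be 256 rather than 128. A sketch, assuming `embedding_matrix` is the (tokenizer$num_words + 1) x embedding_size matrix built in the question:

model <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = nrow(embedding_matrix),   # tokenizer$num_words + 1 (1001)
    output_dim = ncol(embedding_matrix),  # embedding_size (256), not 128
    input_length = maxlen,
    weights = list(embedding_matrix),     # a list holding the matrix, not the bare matrix
    trainable = FALSE                     # optional: freeze the pre-trained vectors
  ) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(
    loss = "binary_crossentropy",
    optimizer = "adam",
    metrics = c("accuracy")
  )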
