Generating Input for LSTM from universal sentence encoder output
[Posted]: 2019-10-14 18:16:39
[Question]: I am working on a multi-class classification problem using an LSTM and embeddings obtained from the Universal Sentence Encoder.
Previously I used GloVe embeddings, which gave me the input shape the LSTM requires (batch_size, timesteps, input_dim). I planned to switch to the Universal Sentence Encoder, but found that its output is 2D: [batch, feature]. How can I make the required changes?
LSTM + Universal Sentence Encoder
import tensorflow as tf
import tensorflow_hub as hub
from keras.layers import Input, Lambda, LSTM

# MAX_SEQUENCE_LENGTH, NUM_LSTM_UNITS and DROP_RATE_LSTM are defined elsewhere.
EMBED_SIZE = 512
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)),
                 signature="default", as_dict=True)["default"]

seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
print("seq i", seq_input.shape, seq_input)
embedded_seq = Lambda(UniversalEmbedding,
                      output_shape=(EMBED_SIZE,))(seq_input)
print("EMD SEQ", embedded_seq.shape, type(embedded_seq))
# (timesteps, n_features) (,MAX_SEQUENCE_LENGTH, EMBED_SIZE) (,150,512)
x_1 = LSTM(units=NUM_LSTM_UNITS,
           name='blstm_1',
           dropout=DROP_RATE_LSTM)(embedded_seq)
print(x_1)
This produces the following error:
seq i (?, 150) Tensor("input_8:0", shape=(?, 150), dtype=int32)
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0529 07:24:32.504808 140127577749376 saver.py:1483] Saver not created because there are no variables in the graph to restore
EMD SEQ (?, 512) <class 'tensorflow.python.framework.ops.Tensor'>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-ea634319205b> in <module>()
12 x_1 = LSTM(units=NUM_LSTM_UNITS,
13 name='blstm_1',
---> 14 dropout=DROP_RATE_LSTM)(embedded_seq)
15 print(x_1)
16
2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py in assert_input_compatibility(self, inputs)
309 self.name + ': expected ndim=' +
310 str(spec.ndim) + ', found ndim=' +
--> 311 str(K.ndim(x)))
312 if spec.max_ndim is not None:
313 ndim = K.ndim(x)
ValueError: Input 0 is incompatible with layer blstm_1: expected ndim=3, found ndim=2
LSTM + GloVe embeddings
embedding_layer = Embedding(nb_words,
                            EMBED_SIZE,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
print("SEQ INP", seq_input, seq_input.shape)
embedded_seq = embedding_layer(seq_input)
print("EMD SEQ", embedded_seq.shape)
# Bi-directional LSTM # (timesteps, n_features)
x_1 = Bidirectional(LSTM(units=NUM_LSTM_UNITS,
                         name='blstm_1',
                         dropout=DROP_RATE_LSTM,
                         recurrent_dropout=DROP_RATE_LSTM),
                    merge_mode='concat')(embedded_seq)
x_1 = Dropout(DROP_RATE_DENSE)(x_1)
x_1 = Dense(NUM_DENSE_UNITS, activation='relu')(x_1)
x_1 = Dropout(DROP_RATE_DENSE)(x_1)
Output (this works with the LSTM):
SEQ INP Tensor("input_2:0", shape=(?, 150), dtype=int32) (?, 150)
EMD SEQ (?, 150, 300)
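[Answer 1]: The Sentence Encoder differs from word2vec or GloVe in that it is not a word-level embedding:
The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
The example above that uses a "lambda" function is meant for a feed-forward (FF) neural network, where the input to the next layer is 2D, unlike an RNN or CNN, which expects 3D input.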
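The shape difference can be illustrated with NumPy alone. In this minimal sketch the lookup table and the sentence vectors are random stand-ins (assumptions for illustration, not real GloVe or Universal Sentence Encoder outputs), and vocab_size is a made-up value; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, timesteps = 4, 150     # 150 matches MAX_SEQUENCE_LENGTH in the question
vocab_size, word_dim = 1000, 300   # hypothetical vocabulary size; 300-d word vectors
sentence_dim = 512                 # output size of the sentence encoder

# Word-level embedding: one vector per token id, looked up from a table.
embedding_matrix = rng.normal(size=(vocab_size, word_dim))
token_ids = rng.integers(0, vocab_size, size=(batch_size, timesteps))
word_level = embedding_matrix[token_ids]

# Sentence-level embedding: one vector per whole sentence (stand-in values).
sentence_level = rng.normal(size=(batch_size, sentence_dim))

print(word_level.shape)      # (4, 150, 300): 3D, the rank an LSTM layer expects
print(sentence_level.shape)  # (4, 512): 2D, hence "expected ndim=3, found ndim=2"
```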
In short, what you have to do is prepare your text with the embedding before feeding it into your network:
import os
import pickle
from os.path import join
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

def process_text(sentences_list):
    path = './processed_data'
    # One cache file per input size, e.g. embeddings-1000.pickle
    embeddings_file = "embeddings-{}.pickle".format(len(sentences_list))
    if not os.path.isfile(join(path, embeddings_file)):
        module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
        embed = hub.Module(module_url)
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
            sentences_list = sess.run(embed(sentences_list))
        sentences_list = np.array(sentences_list)
        # Reshape each 512-d vector to (512, 1) so the result is 3D: (n, 512, 1)
        sentences_list = np.array([np.reshape(embedding, (len(embedding), 1)) for embedding in sentences_list])
        os.makedirs(path, exist_ok=True)
        with open(join(path, embeddings_file), 'wb') as f:
            pickle.dump(sentences_list, f)
    else:
        with open(join(path, embeddings_file), 'rb') as f:
            sentences_list = pickle.load(f)
    return sentences_list
I recommend saving the generated embeddings, as I did in the example, because retrieving the embeddings takes some time.
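The caching idea can be sketched on its own, without TensorFlow. Here `cached` and `slow_embed` are hypothetical names introduced for illustration; `slow_embed` stands in for the expensive `sess.run(embed(...))` call and returns made-up values.

```python
import os
import pickle
import tempfile


def cached(path, compute_fn):
    """Return the pickled result at `path`, computing and saving it if missing."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_embed():
    # Hypothetical stand-in for the expensive embedding computation.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings-2.pickle")
first = cached(cache_file, slow_embed)   # computes and writes the file
second = cached(cache_file, slow_embed)  # reads the file, no recomputation
print(first == second, len(calls))       # True 1
```

One design point to keep in mind: a file name keyed only on the number of sentences would be reused for a different list of the same length, so a name derived from the content (for example a hash) may be a safer choice.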
Source: Sentiment Analysis on Twitter Data using Universal Sentence Encoder
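The reshape step that turns the 2D encoder output into a 3D array can also be looked at in isolation. The sketch below uses a zero-filled array as a stand-in for real embeddings and shows two possible layouts; the second one is an alternative that does not appear in the answer's code and is included only for comparison.

```python
import numpy as np

# Stand-in for the encoder output: 3 sentences, 512 features each.
embeddings = np.zeros((3, 512), dtype=np.float32)

# Option A (same result as the per-embedding reshape to (512, 1)):
# every feature becomes a timestep holding a single value.
features_as_steps = embeddings[:, :, np.newaxis]

# Option B (alternative): each sentence is a sequence of length 1
# with 512 features per step.
single_step = embeddings[:, np.newaxis, :]

print(features_as_steps.shape)  # (3, 512, 1)
print(single_step.shape)        # (3, 1, 512)
```

With option A the Keras input would be declared with shape (512, 1), with option B as (1, 512). Which layout suits a given task is a judgment call that should be checked empirically.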