How to implement a hierarchical Transformer for document classification in Keras?
Posted: 2022-01-13 05:09:33

Yang et al. proposed a hierarchical attention mechanism for document classification: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
An implementation of it is available at https://github.com/ShawnyXiao/TextClassification-Keras
In addition, an implementation of document classification with a Transformer is available at https://keras.io/examples/nlp/text_classification_with_transformer
However, it is not hierarchical.
I have searched a lot but could not find any implementation of a hierarchical Transformer. Does anyone know how to implement a hierarchical Transformer for document classification in Keras?
My implementation is as follows. Note that it extends Nandan's implementation at https://keras.io/examples/nlp/text_classification_with_transformer to document classification.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # inputs.shape = (batch_size, seq_len, embed_dim)
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)  # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(
            query, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(
            key, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(
            value, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(
            attention, perm=[0, 2, 1, 3]
        )  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        )  # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(
            concat_attention
        )  # (batch_size, seq_len, embed_dim)
        return output

    def compute_output_shape(self, input_shape):
        # Self-attention does not change the shape of its input.
        return input_shape
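# Side note (not from the original post): TensorFlow 2.4+ ships a built-in
# layers.MultiHeadAttention, so the hand-written layer above could in
# principle be replaced by it. A minimal, assumed sketch follows, with
# key_dim taken as embed_dim // num_heads:
builtin_mha_demo_in = keras.Input(shape=(60, 100))                # (batch, seq_len, embed_dim)
builtin_mha = layers.MultiHeadAttention(num_heads=2, key_dim=50)
builtin_mha_demo_out = builtin_mha(query=builtin_mha_demo_in,
                                   value=builtin_mha_demo_in)     # self-attention: (batch, 60, 100)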
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate, name=None):
        super(TransformerBlock, self).__init__(name=name)
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def compute_output_shape(self, input_shape):
        # The block does not change the shape of its input.
        return input_shape
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim, name=None):
        super(TokenAndPositionEmbedding, self).__init__(name=name)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

    def compute_output_shape(self, input_shape):
        # Changes the shape from (batch_size, maxlen) to (batch_size, maxlen, embed_dim).
        return input_shape + (self.pos_emb.output_dim,)
# Lower level (produce a representation of each sentence):
embed_dim = 100 # Embedding size for each token
num_heads = 2 # Number of attention heads
ff_dim = 64 # Hidden layer size in feed forward network inside transformer
L1_dense_units = 100 # Size of the sentence-level representations output by the word-level model
dropout_rate = 0.1
vocab_size = 1000
class_number = 5
max_docs = 10000
max_sentences = 15
max_words = 60
word_input = layers.Input(shape=(max_words,), name='word_input')
word_embedding = TokenAndPositionEmbedding(maxlen=max_words, vocab_size=vocab_size,
                                           embed_dim=embed_dim, name='word_embedding')(word_input)
word_transformer = TransformerBlock(embed_dim=embed_dim, num_heads=num_heads, ff_dim=ff_dim,
                                    dropout_rate=dropout_rate, name='word_transformer')(word_embedding)
word_pool = layers.GlobalAveragePooling1D(name='word_pooling')(word_transformer)
word_drop = layers.Dropout(dropout_rate, name='word_drop')(word_pool)
word_dense = layers.Dense(L1_dense_units, activation="relu", name='word_dense')(word_drop)
word_encoder = keras.Model(word_input, word_dense)
word_encoder.summary()
# =========================================================================
# Upper level (produce a representation of each document):
L2_dense_units = 100
sentence_input = layers.Input(shape=(max_sentences, max_words), name='sentence_input')
sentence_encoder = layers.TimeDistributed(word_encoder, name='sentence_encoder')(sentence_input)
sentence_transformer = TransformerBlock(embed_dim=L1_dense_units, num_heads=num_heads, ff_dim=ff_dim,
                                        dropout_rate=dropout_rate, name='sentence_transformer')(sentence_encoder)
sentence_pool = layers.GlobalAveragePooling1D(name='sentence_pooling')(sentence_transformer)
sentence_out = layers.Dropout(dropout_rate)(sentence_pool)
preds = layers.Dense(class_number, activation='softmax', name='sentence_output')(sentence_out)
model = keras.Model(sentence_input, preds)
model.summary()
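For reference, here is a minimal sketch of compiling and fitting this model on random dummy data; the optimizer, loss, batch size, and the small number of dummy documents are my own assumptions, not part of the original post:

import numpy as np

# Hedged sketch: random dummy data only, to check that the model trains end
# to end. Shapes follow vocab_size, max_sentences, max_words and class_number above.
X_dummy = np.random.randint(0, vocab_size, size=(100, max_sentences, max_words))
y_dummy = to_categorical(np.random.randint(0, class_number, size=(100,)),
                         num_classes=class_number)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_dummy, y_dummy, batch_size=32, epochs=1, validation_split=0.1)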
The model summaries are as follows:
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
word_input (InputLayer) [(None, 60)] 0
word_embedding (TokenAndPos (None, 60, 100) 106000
itionEmbedding)
word_transformer (Transform (None, 60, 100) 53764
erBlock)
word_pooling (GlobalAverage (None, 100) 0
Pooling1D)
word_drop (Dropout) (None, 100) 0
word_dense (Dense) (None, 100) 10100
=================================================================
Total params: 169,864
Trainable params: 169,864
Non-trainable params: 0
_________________________________________________________________
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
sentence_input (InputLayer) [(None, 15, 60)] 0
sentence_encoder (TimeDistr (None, 15, 100) 169864
ibuted)
sentence_transformer (Trans (None, 15, 100) 53764
formerBlock)
sentence_pooling (GlobalAve (None, 100) 0
ragePooling1D)
dropout_9 (Dropout) (None, 100) 0
sentence_output (Dense) (None, 5) 505
=================================================================
Total params: 224,133
Trainable params: 224,133
Non-trainable params: 0
Everything works fine; you can copy and paste this code into Colab to see the model summaries. However, my question is about positional encoding at the sentence level. How can I apply positional encoding at the sentence level?
Answer 1:
The implementation is recursive in the sense that you treat the average of the outputs of Transformer x as the input to Transformer x+1.
Let's say your data is structured as (batch, chapter, paragraph, sentence, token).
After the first transformation you end up with (batch, chapter, paragraph, sentence, embedding), then you average and get (batch, chapter, paragraph, sentence_embedding_in).
Apply another transformation and get (batch, chapter, paragraph, sentence_embedding_out).
Average again and get (batch, chapter, paragraph_embedding). Rinse and repeat.
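To make this transform-then-average recursion concrete, here is a minimal, runnable sketch that reuses the TransformerBlock and TokenAndPositionEmbedding classes from the question; all sizes (2 documents, 15 sentences, 60 words, embedding dimension 100) are assumptions for illustration only:

# Hedged sketch of the recursion: transform, average over the lower axis,
# feed the result into the next level.
batch, sentences, words = 2, 15, 60
token_ids = tf.random.uniform((batch * sentences, words), maxval=1000, dtype=tf.int32)

# Word level: embed and transform every sentence independently, then
# average over the word axis -> one vector per sentence.
word_repr = TransformerBlock(100, 2, 64, 0.1)(
    TokenAndPositionEmbedding(words, 1000, 100)(token_ids), training=False)
sentence_vectors = tf.reshape(tf.reduce_mean(word_repr, axis=1),
                              (batch, sentences, 100))            # (2, 15, 100)

# Sentence level: transform the sequence of sentence vectors, then average
# over the sentence axis -> one vector per document.
doc_vectors = tf.reduce_mean(
    TransformerBlock(100, 2, 64, 0.1)(sentence_vectors, training=False),
    axis=1)                                                       # (2, 100)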
The implementation of the paper is actually in a different repository: https://github.com/ematvey/hierarchical-attention-networks
They actually do something slightly different from what I described and apply a Transformer at the bottom and an RNN at the top. In theory you could do the opposite, or apply an RNN at every level (that would be very slow). Implementation-wise you can abstract away from that - the principle stays the same: you apply a transformation, average the outputs, and feed them into the next, higher-level "layer" (or "module", to use torch terminology).
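To make the "Transformer at the bottom, RNN at the top" variant concrete in terms of the question's code, one hedged sketch (reusing word_encoder, max_sentences, max_words, and class_number defined above; the GRU width of 64 is my own assumption) could be:

# Hypothetical variant: keep the word-level Transformer encoder, replace the
# sentence-level Transformer with a bidirectional GRU over sentence vectors.
rnn_sentence_input = layers.Input(shape=(max_sentences, max_words), name='rnn_sentence_input')
rnn_sentence_seq = layers.TimeDistributed(word_encoder)(rnn_sentence_input)  # (batch, sentences, L1_dense_units)
rnn_doc_vector = layers.Bidirectional(layers.GRU(64))(rnn_sentence_seq)      # (batch, 128)
rnn_preds = layers.Dense(class_number, activation='softmax')(rnn_doc_vector)
rnn_model = keras.Model(rnn_sentence_input, rnn_preds)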
Comments:
Thank you very much for your prompt reply. I edited the post and added my implementation of the model. Could you please look at the code and tell me whether it is implemented correctly? My problem is positional encoding at the sentence level. Based on the implemented model, could you tell me how to do positional encoding at the sentence level?

It should be done in exactly the same way as for words (you simply treat each sentence as a word) - if sentence order matters at all. In some cases it does not, and then you do not add anything.

As you can see in the code, TokenAndPositionEmbedding takes the vocabulary size as one of its inputs. But at the sentence level I do not have a vocabulary, so I do not know how to apply sentence-level positional encoding. Would it be possible for you to look at my model and help me complete it?

Maybe create a dummy token (e.g. 0) for each sentence so that your TokenAndPositionEmbedding contains only the positional component, and then add the resulting embeddings to your actual sentence embeddings.

Could you show me the code? My code runs in Colab without any errors. Thank you.
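Building on the dummy-token suggestion above, a minimal sketch of sentence-level positional encoding could be a small position-only embedding layer added to the TimeDistributed sentence representations before the sentence-level transformer. The layer below and the place it is inserted are assumptions of mine, not code from the original answer:

# Hypothetical position-only embedding for the sentence axis: no vocabulary
# is needed, only the index of each sentence within the document.
class SentencePositionEmbedding(layers.Layer):
    def __init__(self, max_sentences, embed_dim, name=None):
        super(SentencePositionEmbedding, self).__init__(name=name)
        self.pos_emb = layers.Embedding(input_dim=max_sentences, output_dim=embed_dim)

    def call(self, x):
        # x: (batch, sentences, embed_dim) sentence representations
        positions = tf.range(start=0, limit=tf.shape(x)[1], delta=1)
        return x + self.pos_emb(positions)

# Assumed wiring inside the upper-level model from the question:
# sentence_encoder = layers.TimeDistributed(word_encoder, name='sentence_encoder')(sentence_input)
# sentence_encoder = SentencePositionEmbedding(max_sentences, L1_dense_units)(sentence_encoder)
# sentence_transformer = TransformerBlock(...)(sentence_encoder)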