Week 8, Lesson 02: Transformer Code Walkthrough
Posted by oldmao_2000
This post is based on material from 深度之眼's《GNN核心能力培养计划》(GNN Core Competency Training Program) course.
Continuing from the previous lesson, this post walks through the implementation from the Harvard NLP group. For some reason the code on GitHub and on Colab is not identical; the GitHub version is used as the reference here.
Model Architecture
The core of the model is the forward method of EncoderDecoder:
self.decode(self.encode(src, src_mask), src_mask,
            tgt, tgt_mask)
In other words, the Decoder decodes the Encoder's output.
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator  # maps the final output to the vocabulary size; the class is defined below

    # src_mask handles padding for sequences of different lengths
    # tgt_mask ensures that decoding never looks at future tokens
    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask,
                           tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)  # linear projection that changes the feature dimension

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
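As a quick sanity check (my own toy example with made-up sizes, not part of the original notebook), Generator maps decoder hidden states of shape (batch, seq_len, d_model) to log-probabilities over the vocabulary:

import torch

d_model, vocab, batch, seq_len = 512, 1000, 2, 10    # made-up sizes for illustration
gen = Generator(d_model, vocab)
hidden = torch.randn(batch, seq_len, d_model)        # pretend decoder output
log_probs = gen(hidden)
print(log_probs.shape)                               # torch.Size([2, 10, 1000])
print(log_probs.exp().sum(-1))                       # each position sums to ~1 (it is a softmax)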
Encoder and Decoder Stacks
Next, following the architecture figure from the original paper, each unit is stacked N times; here N = 6.
# Helper that clones a unit N times
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
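A quick illustration (my own toy example): the copies produced by clones are independent modules, so each layer in the stack gets its own parameters:

import torch.nn as nn

layers = clones(nn.Linear(4, 4), 3)
print(len(layers))                              # 3
print(layers[0].weight is layers[1].weight)     # False: deep copies do not share parameters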
Encoder
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)  # the Encoder contains N EncoderLayers
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        # The forward pass simply stacks the layers: each layer consumes the previous layer's output
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)  # final layer normalization, implemented by hand below
# Hand-rolled layer normalization: this notebook avoids pulling in extra packages,
# so LayerNorm is implemented from scratch.
# It corresponds to the "norm" part of the yellow blocks in the figure above.
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
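As a quick check of what this does (my own toy example): along the last dimension the output has roughly zero mean and unit standard deviation, since the learnable scale a_2 and shift b_2 start out as ones and zeros:

import torch

ln = LayerNorm(8)
x = torch.randn(2, 5, 8) * 3 + 7    # arbitrary scale and shift
y = ln(x)
print(y.mean(-1))                   # close to 0 at every position
print(y.std(-1))                    # close to 1 at every position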
# The "add" part of the yellow blocks in the figure above: a residual connection.
# According to the figure, the sublayer is one of two things:
# Multi-Head Attention (orange block) or Feed Forward (blue block).
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))
# One EncoderLayer unit, corresponding to the left half of the figure above.
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn          # Multi-Head Attention (orange block)
        self.feed_forward = feed_forward    # Feed Forward (blue block)
        # Clone two SublayerConnections, one for each module above,
        # so that each residual branch has its own LayerNorm parameters
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        # Self-attention sublayer
        # self_attn takes four arguments: Query, Key, Value, Mask
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # Feed-forward (dense) sublayer
        return self.sublayer[1](x, self.feed_forward)
# sublayer[0] wraps the Multi-Head Attention (orange block)
# sublayer[1] wraps the Feed Forward (blue block)
Let us take a closer look at the two statements in this forward function.
First statement: the input x is passed to sublayer[0], which runs SublayerConnection's forward. It first computes
LayerNorm(x)
Since the sublayer handed to sublayer[0] is self_attn, the attention computation is applied next:
self_attn(LayerNorm(x))
and then the residual connection is added (dropout omitted for brevity):
x + self_attn(LayerNorm(x))
Call this result y. The second statement then computes:
y → LayerNorm(y) → Dense(LayerNorm(y)) → y + Dense(LayerNorm(y))
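A minimal sketch of this composition (my own toy example: an identity module stands in for the attention sublayer, and dropout is set to 0 so the check is exact):

import torch
import torch.nn as nn

size = 8
sc = SublayerConnection(size, dropout=0.0)
x = torch.randn(2, 5, size)
y = sc(x, nn.Identity())                    # sublayer(norm(x)) is just norm(x) here
print(torch.allclose(y, x + sc.norm(x)))    # True: output is x + sublayer(LayerNorm(x))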
Decoder
The decoder is likewise a stack of 6 layers.
class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            # x: the target embeddings; memory: the output of the Encoder stack
            # src_mask: the Encoder mask that handles padding
            # tgt_mask: the Decoder mask that hides future tokens
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn          # bottom part of the Decoder in the figure
        self.src_attn = src_attn            # middle part of the Decoder in the figure
        self.feed_forward = feed_forward    # top part of the Decoder in the figure
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        # First sublayer (bottom of the Decoder): self-attention over the predicted target tokens,
        # so tgt_mask is used here
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Second sublayer: attention between the decoder states and the Encoder output
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
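A small shape check for one DecoderLayer (my own toy example; it assumes MultiHeadedAttention and subsequent_mask, defined later in this post, have already been run, and uses a plain nn.Sequential as a stand-in feed-forward):

import copy
import torch
import torch.nn as nn

d_model, h, batch, src_len, tgt_len = 512, 8, 2, 7, 5      # toy sizes
attn = MultiHeadedAttention(h, d_model)
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
layer = DecoderLayer(d_model, copy.deepcopy(attn), copy.deepcopy(attn), ff, dropout=0.1)

x = torch.randn(batch, tgt_len, d_model)        # target embeddings
memory = torch.randn(batch, src_len, d_model)   # pretend Encoder output
src_mask = torch.ones(batch, 1, src_len)        # no source padding in this toy case
tgt_mask = subsequent_mask(tgt_len)             # hide future target positions
print(layer(x, memory, src_mask, tgt_mask).shape)   # torch.Size([2, 5, 512])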
def subsequent_mask(size):
    "Mask out subsequent positions."
    # At decoding step t the Decoder may only use inputs 1...t,
    # never inputs from step t+1 onward (future information)
    attn_shape = (1, size, size)
    # np.triu with k=1 builds an upper-triangular matrix: the part strictly above
    # the diagonal is 1, everything else (including the diagonal) is 0
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    # Swap the 0s and 1s, giving a lower-triangular matrix
    return torch.from_numpy(subsequent_mask) == 0
The notebook plots subsequent_mask(20), a 20×20 lower-triangular matrix. In the first row only the first position is 1 (the first token can only see itself); every later position is 0 and is therefore masked out.
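Instead of the plot, here is a quick look at a small mask (my own example; with a recent PyTorch it prints as a boolean tensor):

print(subsequent_mask(5))
# tensor([[[ True, False, False, False, False],
#          [ True,  True, False, False, False],
#          [ True,  True,  True, False, False],
#          [ True,  True,  True,  True, False],
#          [ True,  True,  True,  True,  True]]])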
Attention
The attention here implements the Q/K/V computation in matrix form over whole sequences, which differs slightly from the vector form for a single token.
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    # The output is a weighted average of Value, with weights computed from Query and Key.
    # Q/K/V tensor size: (batch, #heads, sequence length, feature dimension d_k)
    # Q and K are multiplied, so their sizes must match; V's feature dimension may differ.
    d_k = query.size(-1)  # feature dimension
    # Matrix-multiply Q and K over their last two dimensions, giving shape
    # (batch, #heads, sequence length, sequence length).
    # The sequence length dimension may contain padding positions.
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    # Wherever the mask is 0, fill in a very large negative number,
    # so that after softmax the weight is effectively 0
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # softmax keeps the shape: (batch, #heads, sequence length, sequence length)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    # Two return values: the first is the softmax weights multiplied with V,
    # with shape (batch, #heads, sequence length, feature dimension d_k);
    # the second is the attention weights themselves, whose shape is given above
    return torch.matmul(p_attn, value), p_attn
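A quick shape check (toy sizes of my own):

import torch

batch, heads, seq_len, d_k = 2, 8, 10, 64
q = torch.randn(batch, heads, seq_len, d_k)
k = torch.randn(batch, heads, seq_len, d_k)
v = torch.randn(batch, heads, seq_len, d_k)
out, weights = attention(q, k, v)
print(out.shape)        # torch.Size([2, 8, 10, 64])
print(weights.shape)    # torch.Size([2, 8, 10, 10])
print(weights.sum(-1))  # every row of attention weights sums to 1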
Multi-head attention is built from the single-head attention above: the per-head results are simply concatenated.
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h  # e.g. h = 8 heads
        self.h = h
        # Clone 4 linear layers: the first three transform Q, K and V,
        # and the last one transforms the concatenated output
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        # Inside attention(), query/key/value have shape
        # (batch, #heads, sequence length, feature dimension d_k).
        # For the Encoder, the mask it needs has shape (batch, 1, 1, sequence length):
        # each sentence has one mask that pads it up to sequence length; all heads see
        # the same input, so the head dimension is 1, and since the Encoder never hides
        # later words, every token takes part in the attention computation and the
        # third dimension is also 1.
        # For the Decoder, later words must be hidden, so its mask has shape
        # (batch, 1, sequence length, sequence length).
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)  # first dimension: batch

        # 1) Do all the linear projections in batch from d_model => h x d_k
        # zip pairs up the corresponding elements of its arguments, e.g.
        #   a = [1, 2, 3]; b = [4, 5, 6]; zip(a, b) -> [(1, 4), (2, 5), (3, 6)]
        # Here the first three linear layers are paired with query, key and value:
        #   (self.linears[0], self.linears[1], self.linears[2]) & (query, key, value)
        # Then l(x) applies the projection; take self.linears[0] and query as an example:
        #   self.linears[0] has weight shape (d_model, d_model)
        #   query has shape (batch, sequence length, d_model) -- it enters as a 3-D tensor
        #   and only here is it reshaped into the 4-D form used above, with d_model = #heads * d_k:
        #   after l(x) the shape is unchanged: (batch, sequence length, d_model)
        #   after view:                        (batch, sequence length, #heads, d_k)
        #   after transpose(1, 2):             (batch, #heads, sequence length, d_k)
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        # From the attention function above, the two return values have shapes
        #   (batch, #heads, sequence length, d_k)
        #   (batch, #heads, sequence length, sequence length)
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        # x.transpose(1, 2) gives (batch, sequence length, #heads, d_k),
        # and the view then gives (batch, sequence length, d_model)
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
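To connect the mask shapes discussed in the comments above to actual calls, here is a small usage sketch with toy sizes (my own example, not from the notebook):

import torch

d_model, h, batch, seq_len = 512, 8, 2, 10
mha = MultiHeadedAttention(h, d_model)
mha.eval()                  # turn off dropout so the attention weights are exact
x = torch.randn(batch, seq_len, d_model)

# Encoder-style mask, shape (batch, 1, seq_len); here the second sentence has 3 padding positions
src_mask = torch.ones(batch, 1, seq_len)
src_mask[1, :, 7:] = 0
out = mha(x, x, x, mask=src_mask)
print(out.shape)            # torch.Size([2, 10, 512])
print(mha.attn.shape)       # torch.Size([2, 8, 10, 10])

# Decoder-style mask from subsequent_mask, shape (1, seq_len, seq_len), broadcast over the batch
tgt_mask = subsequent_mask(seq_len)
out = mha(x, x, x, mask=tgt_mask)
print(mha.attn[0, 0, 0])    # the first query position attends only to position 0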