A 30k-Word Share! A Detailed Walkthrough of Attention Implementations (tfa & Keras API source-code analysis & building your own network)
Posted by striving长亮
NLP Series Notes
This series records some commonly used NLP knowledge. My upcoming experiments need some NLP concepts, but I am not an NLP student and am not very familiar with them, so I decided to keep notes on my learning process, partly to lay the groundwork for pursuing a PhD. I hope I can find an NLP advisor 😩! I hope you enjoy the series. If anything here is wrong, criticism and corrections are very welcome. Thank you!
Preface
Over the last few years, the Attention model has been a hot topic in NLP and in deep learning and AI more broadly; researchers in academia and industry alike have added it to their models with good results. On top of that, BERT's strong performance and Transformer's dominance of the leaderboards have made people even more interested in Attention. In my previous post I gave a detailed introduction to and analysis of how the Attention mechanism works; feel free to check it out if you are interested.
What you learn on paper always feels shallow; to truly understand something you have to do it yourself. However well the mechanism and theory are explained, it counts for little without experiments to back it up; practice is the only way to test the theory. After the previous post went out, some readers suggested a code walkthrough with concrete implementations. Since I need to run experiments anyway, this post explains in detail how to implement the most widely used variants, Soft Attention and Self Attention, in three forms: TensorFlow Addons, the Keras built-in layer, and a hand-built network. Other Attention variants are implemented in much the same way, and their code is easy to find online.
Tools
This post mainly uses the TensorFlow, Keras, and tensorflow-addons Python libraries.
TensorFlow
TensorFlow is an end-to-end open-source machine learning (mainly deep learning) platform. It has a comprehensive and flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in machine learning and lets developers easily build and deploy ML-powered applications. As of this writing, the latest version is v2.6.0.
Official API: https://tensorflow.google.cn/api_docs/python/tf
Keras
Keras is an open-source artificial neural network library written in Python. It serves as a high-level API on top of TensorFlow, Microsoft CNTK, and Theano for designing, debugging, evaluating, applying, and visualizing deep learning models. In TF 2.x it acts as TensorFlow's high-level API and is versioned in sync with it.
Official API: https://keras.io/zh/
tensorflow-addons
TensorFlow 2.x introduced Special Interest Groups (SIGs), which among other things maintain implementations of algorithms from newly published papers.
TensorFlow 2.x removed tf.contrib and moved much of its functionality into third-party libraries. TensorFlow Addons is a repository of contributions that conform to well-established API patterns but implement new functionality not available in core TensorFlow. TensorFlow natively supports a large number of operators, layers, metrics, losses, and optimizers; however, in a fast-moving field like machine learning there are many interesting new developments that cannot be integrated into core TensorFlow (because their broad applicability is not yet clear, or because they are used mainly by a smaller subset of the community).
tensorflow-seq2seq has been merged into tensorflow-addons in newer versions of TF, and tensorflow-addons contains many other fairly recent implementations as well.
Installation: pip install tensorflow-addons
Official API: https://tensorflow.google.cn/addons/api_docs/python/tfa
Note that the tfa version must match your tf and Python versions, otherwise you will get errors at runtime; the compatibility matrix in the tensorflow-addons documentation lists the exact mapping.
Library versions used in this post:
tensorflow-gpu==2.2.0
keras==2.4.3
tensorflow-addons==0.11.2
numpy==1.18.1
pandas==1.0.1
matplotlib==3.1.3
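If you want to confirm that your environment matches, a quick check like the following (just a small sketch; adjust the expected versions to your own setup) prints what is actually installed:

import sys
import keras
import tensorflow as tf
import tensorflow_addons as tfa

# print the versions actually installed in the current environment
print("python:", sys.version.split()[0])
print("tensorflow:", tf.__version__)          # this post uses 2.2.0
print("keras:", keras.__version__)            # this post uses 2.4.3
print("tensorflow-addons:", tfa.__version__)  # this post uses 0.11.2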
OK, that wraps up the tools. This parameter-tuning 🦐 is now on duty!
Soft Attention
As discussed in the previous post, the most important part of a classic Attention model is the function F that produces the attention score. The two most widely used choices so far are Luong attention (ref. 3) and Bahdanau attention (ref. 4), whose score functions are shown below (figure from ref. 1):
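Written out in the notation of the previous post (decoder state $h_t$, encoder states $\tilde{h}_s$), the standard score functions from the two papers are:

$$
\begin{aligned}
\text{Luong (general):}\quad & \mathrm{score}(h_t, \tilde{h}_s) = h_t^{\top} W_a \tilde{h}_s \\
\text{Luong (dot):}\quad & \mathrm{score}(h_t, \tilde{h}_s) = h_t^{\top} \tilde{h}_s \\
\text{Bahdanau (additive):}\quad & \mathrm{score}(h_{t-1}, \tilde{h}_s) = v_a^{\top} \tanh\!\left(W_a h_{t-1} + U_a \tilde{h}_s\right)
\end{aligned}
$$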
Once you see the formulas, you know how each one is computed. The two follow essentially the same idea and differ mainly in the details, as shown below (figure from ref. 2):
Briefly, Luong attention differs from Bahdanau attention in the following ways:
- How the attention is computed. In Luong attention, the context vector $c_t$ at step $t$ is computed by weighting the encoder hidden states $\tilde{h}_s$ against the decoder hidden state $h_t$ at step $t$. In Bahdanau attention, $c_t$ is computed from the decoder hidden state $h_{t-1}$ at step $t-1$ and the encoder hidden states $\tilde{h}_s$.
- What the decoder takes in and puts out. In Bahdanau attention, the decoder input at step $t$ is the concatenation (the concat path in the figure above) of the context $c_t$ and the previous hidden state $h_{t-1}$; this yields the hidden state $h_t$, from which $y_{t+1}$ is emitted directly. Luong attention instead adds an extra layer on the decoder side: it concatenates the context $c_t$ with the decoder's step-$t$ hidden state $h_t$, producing $\tilde{h}_t$, from which $y_t$ is emitted.
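To make the pipeline concrete, here is how the alignment weights, the context vector, and (for Luong attention) the extra attentional hidden state are computed; for Bahdanau attention, replace $h_t$ in the score with $h_{t-1}$:

$$
\begin{aligned}
a_t(s) &= \frac{\exp\big(\mathrm{score}(h_t, \tilde{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \tilde{h}_{s'})\big)} \\
c_t &= \sum_{s} a_t(s)\, \tilde{h}_s \\
\tilde{h}_t &= \tanh\!\big(W_c\,[c_t;\, h_t]\big) \quad \text{(Luong's extra output layer)}
\end{aligned}
$$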
The tfa implementation
tfa.seq2seq already wraps both of these common attention score functions nicely. Looking at the source code, we find:
LuongAttention
def _luong_score(query, keys, scale):
    """Implements Luong-style (multiplicative) scoring function.
    This attention has two forms. The first is standard Luong attention,
    as described in:
    Minh-Thang Luong, Hieu Pham, Christopher D. Manning.
    "Effective Approaches to Attention-based Neural Machine Translation."
    EMNLP 2015. https://arxiv.org/abs/1508.04025
    The second is the scaled form inspired partly by the normalized form of
    Bahdanau attention.
    To enable the second form, call this function with `scale=True`.
    Args:
      query: Tensor, shape `[batch_size, num_units]` to compare to keys.
      keys: Processed memory, shape `[batch_size, max_time, num_units]`.
      scale: the optional tensor to scale the attention score.
    Returns:
      A `[batch_size, max_time]` tensor of unnormalized score values.
    Raises:
      ValueError: If `key` and `query` depths do not match.
    """
    depth = query.shape[-1]
    key_units = keys.shape[-1]
    if depth != key_units:
        raise ValueError(
            "Incompatible or unknown inner dimensions between query and keys. "
            "Query (%s) has units: %s. Keys (%s) have units: %s. "
            "Perhaps you need to set num_units to the keys' dimension (%s)?"
            % (query, depth, keys, key_units, key_units)
        )
    # Reshape from [batch_size, depth] to [batch_size, 1, depth]
    # for matmul.
    query = tf.expand_dims(query, 1)
    # Inner product along the query units dimension.
    # matmul shapes: query is [batch_size, 1, depth] and
    # keys is [batch_size, max_time, depth].
    # the inner product is asked to **transpose keys' inner shape** to get a
    # batched matmul on:
    #   [batch_size, 1, depth] . [batch_size, depth, max_time]
    # resulting in an output shape of:
    #   [batch_size, 1, max_time].
    # we then squeeze out the center singleton dimension.
    score = tf.matmul(query, keys, transpose_b=True)
    score = tf.squeeze(score, [1])
    if scale is not None:
        score = scale * score
    return score
class LuongAttention(AttentionMechanism):
    """Implements Luong-style (multiplicative) attention scoring.
    This attention has two forms. The first is standard Luong attention,
    as described in:
    Minh-Thang Luong, Hieu Pham, Christopher D. Manning.
    [Effective Approaches to Attention-based Neural Machine Translation.
    EMNLP 2015.](https://arxiv.org/abs/1508.04025)
    The second is the scaled form inspired partly by the normalized form of
    Bahdanau attention.
    To enable the second form, construct the object with parameter
    `scale=True`.
    """

    @typechecked
    def __init__(
        self,
        units: TensorLike,
        memory: Optional[TensorLike] = None,
        memory_sequence_length: Optional[TensorLike] = None,
        scale: bool = False,
        probability_fn: str = "softmax",
        dtype: AcceptableDTypes = None,
        name: str = "LuongAttention",
        **kwargs,
    ):
        """Construct the AttentionMechanism mechanism.
        Args:
          units: The depth of the attention mechanism.
          memory: The memory to query; usually the output of an RNN encoder.
            This tensor should be shaped `[batch_size, max_time, ...]`.
          memory_sequence_length: (optional): Sequence lengths for the batch
            entries in memory. If provided, the memory tensor rows are masked
            with zeros for values past the respective sequence lengths.
          scale: Python boolean. Whether to scale the energy term.
          probability_fn: (optional) string, the name of function to convert
            the attention score to probabilities. The default is `softmax`
            which is `tf.nn.softmax`. Other options is `hardmax`, which is
            hardmax() within this module. Any other value will result
            into validation error. Default to use `softmax`.
          dtype: The data type for the memory layer of the attention mechanism.
          name: Name to use when creating ops.
          **kwargs: Dictionary that contains other common arguments for layer
            creation.
        """
        # For LuongAttention, we only transform the memory layer; thus
        # num_units **must** match expected the query depth.
        self.probability_fn_name = probability_fn
        probability_fn = self._process_probability_fn(self.probability_fn_name)

        def wrapped_probability_fn(score, _):
            return probability_fn(score)

        memory_layer = kwargs.pop("memory_layer", None)
        if not memory_layer:
            memory_layer = tf.keras.layers.Dense(
                units, name="memory_layer", use_bias=False, dtype=dtype
            )
        self.units = units
        self.scale = scale
        self.scale_weight = None
        super().__init__(
            memory=memory,
            memory_sequence_length=memory_sequence_length,
            query_layer=None,
            memory_layer=memory_layer,
            probability_fn=wrapped_probability_fn,
            name=name,
            dtype=dtype,
            **kwargs,
        )

    def build(self, input_shape):
        super().build(input_shape)
        if self.scale and self.scale_weight is None:
            self.scale_weight = self.add_weight(
                "attention_g", initializer=tf.ones_initializer, shape=()
            )
        self.built = True

    def _calculate_attention(self, query, state):
        """Score the query based on the keys and values.
        Args:
          query: Tensor of dtype matching `self.values` and shape
            `[batch_size, query_depth]`.
          state: Tensor of dtype matching `self.values` and shape
            `[batch_size, alignments_size]`
            (`alignments_size` is memory's `max_time`).
        Returns:
          alignments: Tensor of dtype matching `self.values` and shape
            `[batch_size, alignments_size]` (`alignments_size` is memory's
            `max_time`).
          next_state: Same as the alignments.
        """
        score = _luong_score(query, self.keys, self.scale_weight)
        alignments = self.probability_fn(score, state)
        next_state = alignments
        return alignments, next_state

    def get_config(self):
        config = {
            "units": self.units,
            "scale": self.scale,
            "probability_fn": self.probability_fn_name,
        }
        base_config = super().get_config()
        return {**base_config, **config}

    @classmethod
    def from_config(cls, config, custom_objects=None):
        config = AttentionMechanism.deserialize_inner_layer_from_config(
            config, custom_objects=custom_objects
        )
        return cls(**config)
From the source, the score computation is easy to pick out:
def _luong_score(query, keys, scale):
    '''(earlier code omitted; only the score computation is shown)'''
    score = tf.matmul(query, keys, transpose_b=True)
    score = tf.squeeze(score, [1])
    # scale is an optional learned scalar weight (attention_g)
    if scale is not None:
        score = scale * score
    return score

class LuongAttention(AttentionMechanism):
    '''(code omitted)'''
    def _calculate_attention(self, query, state):
        score = _luong_score(query, self.keys, self.scale_weight)
        alignments = self.probability_fn(score, state)  # softmax by default
        next_state = alignments
        return alignments, next_state
    '''(code omitted)'''
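As a quick sanity check of the shape bookkeeping (a toy example of my own, not from the tfa tests), the same computation can be reproduced in a few lines of TensorFlow:

import tensorflow as tf

batch_size, max_time, num_units = 2, 5, 8
query = tf.random.normal([batch_size, num_units])            # decoder state h_t
keys = tf.random.normal([batch_size, max_time, num_units])   # processed memory

# [batch_size, 1, num_units] x [batch_size, num_units, max_time]
score = tf.matmul(tf.expand_dims(query, 1), keys, transpose_b=True)
score = tf.squeeze(score, [1])       # [batch_size, max_time]
alignments = tf.nn.softmax(score)    # one weight per encoder time step
print(score.shape, alignments.shape)  # (2, 5) (2, 5)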
Usage (constructor signature):
tfa.seq2seq.LuongAttention(
    # number of units, which is also the output depth of the attention mechanism
    units: tfa.types.TensorLike,
    # optional; the memory to query, usually the output of the RNN encoder,
    # shaped [batch_size, max_time, ...]
    memory: Optional[TensorLike] = None,
    # optional; the true sequence lengths of the batch entries, used for masking:
    # positions beyond the true length are zeroed out (see the Masking layer docs)
    memory_sequence_length: Optional[TensorLike] = None,
    scale: bool = False,                       # whether to add the scale weight
    probability_fn: str = 'softmax',           # default
    dtype: tfa.types.AcceptableDTypes = None,  # data type
    name: str = 'LuongAttention',
    **kwargs
)
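A minimal usage sketch (my own example; the encoder outputs and shapes are made up) that plugs the mechanism into a decoder cell via tfa.seq2seq.AttentionWrapper:

import tensorflow as tf
import tensorflow_addons as tfa

batch_size, max_time, units = 4, 10, 128
# stand-in for the encoder outputs and the true sequence lengths
encoder_outputs = tf.random.normal([batch_size, max_time, units])
memory_lengths = tf.constant([10, 8, 6, 10])

attention = tfa.seq2seq.LuongAttention(
    units=units,
    memory=encoder_outputs,
    memory_sequence_length=memory_lengths,
)
# wrap a plain LSTM cell so attention is applied at every decoder step
decoder_cell = tfa.seq2seq.AttentionWrapper(
    tf.keras.layers.LSTMCell(units),
    attention,
    attention_layer_size=units,
)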
A few points are worth noting:
- The units argument:
The number of units. When computing the score, we use the decoder's $h_{t-1}$ or $h_t$ together with the encoder's $\tilde{h}_s$, and the two do not necessarily have the same dimensionality, so they must be projected to a common size; that is what the weights $W_a$ and $U_a$ are for. In the code, units is therefore used to declare a fully connected Dense layer that unifies the two dimensions for the subsequent computation, as we can see in the source:
memory_layer = kwargs.pop("memory_layer", None)
if not memory_layer:
    memory_layer = tf.keras.layers.Dense(
        units, name="memory_layer", use_bias=False, dtype=dtype
    )
- The scale argument:
The source already explains it: "The second is the scaled form inspired partly by the normalized form of Bahdanau attention. To enable the second form, construct the object with parameter `scale=True`."
if self.scale and self.scale_weight is None:
    self.scale_weight = self.add_weight(
        "attention_g", initializer=tf.ones_initializer, shape=()
    )
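Putting the two snippets together (my own summary, not text from the tfa docs): with scale=True, the raw multiplicative score is simply multiplied by a single learned scalar $g$ (initialized to 1), where $W_a$ below stands for the memory_layer projection applied to the encoder states:

$$\mathrm{score}(h_t, \tilde{h}_s) = g \cdot h_t^{\top} W_a \tilde{h}_s$$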
Bahdanau Attention
The differences between Bahdanau attention and Luong attention were covered above, so let's go straight to the source code:
def _bahdanau_score(
    processed_query, keys, attention_v, attention_g=None, attention_b=None
):
    """Implements Bahdanau-style (additive) scoring function.
    This attention has two forms. The first is Bahdanau attention,
    as described in:
    Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
    "Neural Machine Translation by Jointly Learning to Align and Translate."
    ICLR 2015. https://arxiv.org/abs/1409.0473
    The second is the normalized form. This form is inspired by the
    weight normalization article:
    Tim Salimans, Diederik P. Kingma.
    "Weight Normalization: A Simple Reparameterization to Accelerate
    Training of Deep Neural Networks."
    https://arxiv.org/abs/1602.07868
    To enable the second form, set please pass in attention_g and attention_b.
    Args:
      processed_query: Tensor, shape `[batch_size, num_units]` to compare to
        keys.
      keys: Processed memory, shape `[batch_size, max_time, num_units]`.
      attention_v: Tensor, shape `[num_units]`.
      attention_g: Optional scalar tensor for normalization.
      attention_b: Optional tensor with shape `[num_units]` for normalization.
    Returns:
      A `[batch_size, max_time]` tensor of unnormalized score values.
    """
    # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
    processed_query = tf.expand_dims(processed_query, 1)
    if attention_g is not None and attention_b is not None:
        normed_v = (
            attention_g
            * attention_v
            * tf.math.rsqrt(tf.reduce_sum(tf.square(attention_v)))
        )
        return tf.reduce_sum(
            normed_v * tf.tanh(keys + processed_query + attention_b), [2]
        )
    else:
        return tf.reduce_sum(attention_v * tf.tanh(keys + processed_query), [2])
class BahdanauAttention(AttentionMechanism):
    """Implements Bahdanau-style (additive) attention.
    This attention has two forms. The first is Bahdanau attention,
    as described in:
    Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
    "Neural Machine Translation by Jointly Learning to Align and Translate."
    ICLR 2015. https://arxiv.org/abs/1409.0473
    The second is the normalized form. This form is inspired by the
    weight normalization article:
    Tim Salimans, Diederik P. Kingma.
    "Weight Normalization: A Simple Reparameterization to Accelerate
    Training of Deep Neural Networks."
    https://arxiv.org/abs/1602.07868
    To enable the second form, construct the object with parameter
    `normalize=True`.
    """

    @typechecked
    def __init__(
        self,
        units: TensorLike,
        memory: Optional[TensorLike] = None,
        memory_sequence_length: Optional[TensorLike] = None,
        normalize: bool = False,
        probability_fn: str = "softmax",
        kernel_initializer: Initializer = "glorot_uniform",
        dtype: AcceptableDTypes = None,
        name: str = "BahdanauAttention",
        **kwargs,
    ):
        """Construct the Attention mechanism.
        Args:
          units: The depth of the query mechanism.
          memory: The memory to query; usually the output of an RNN encoder.
            This tensor should be shaped `[batch_size, max_time, ...]`.
          memory_sequence_length: (optional): Sequence lengths for the batch
            entries in memory. If provided, the memory tensor rows are masked
            with zeros for values past the respective sequence lengths.
          normalize: Python boolean. Whether to normalize the energy term.
          probability_fn: (optional) string, the name of function to convert
            the attention score to probabilities. The default is `softmax`.
        """
        '''(remaining source of BahdanauAttention omitted)'''
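Although the rest of the class body is cut off above, BahdanauAttention is constructed and used just like LuongAttention. A minimal sketch (my own example with made-up shapes, mirroring the LuongAttention one earlier):

import tensorflow as tf
import tensorflow_addons as tfa

batch_size, max_time, units = 4, 10, 128
encoder_outputs = tf.random.normal([batch_size, max_time, units])
memory_lengths = tf.constant([10, 8, 6, 10])

attention = tfa.seq2seq.BahdanauAttention(
    units=units,
    memory=encoder_outputs,
    memory_sequence_length=memory_lengths,
    normalize=False,   # set True for the weight-normalized form
)
decoder_cell = tfa.seq2seq.AttentionWrapper(
    tf.keras.layers.LSTMCell(units),
    attention,
    attention_layer_size=units,
)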