A 30,000-Character Deep Dive! Attention Implementation Explained in Detail (tfa & Keras API Source Analysis, Plus a Hand-Built Network)

Posted by striving长亮


NLP Lecture Notes Series

This series records some commonly used NLP knowledge. My upcoming experiments need several NLP concepts, but I am not an NLP student and am not very familiar with them, so I decided to take notes on my learning process, partly as groundwork for pursuing a PhD. I hope I can find an NLP advisor 😩! I hope you enjoy the series. If anything here is wrong, corrections and criticism are very welcome. Thank you!

Series links:

Chapter 1: A Close Look at the Mechanism of the Attention Model



Preface

In the last few years, Attention has become one of the hottest terms in NLP, and in deep learning and AI more broadly; researchers in both academia and industry have added it to their models and reported good results. Add the strong showing of BERT and Transformer's dominance of the leaderboards, and interest in Attention has only grown. In the previous article I analyzed the mechanism behind the Attention model in detail; feel free to have a look.

But what we learn on paper always feels shallow; real understanding only comes from doing. However well the mechanism and theory are explained, it means little without experiments to back it up: practice is the only test of truth. After the last article came out, some readers asked for a code walkthrough of the concrete implementations. Since I need to run these experiments anyway, this article explains in detail how to implement the two most widely used variants, Soft Attention and Self Attention, in three forms: TensorFlow Addons, the Keras built-in layers, and a hand-built network. Other Attention variants are implemented in much the same way, and their code is easy to find online.


Tools Overview

This article mainly uses the TensorFlow, Keras, and tensorflow-addons Python libraries.

TensorFlow

TensorFlow is an end-to-end open-source machine learning (mainly deep learning) platform. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that helps researchers push the state of the art and lets developers easily build and deploy ML-powered applications. At the time of writing, the latest release is v2.6.0.

Official API docs: https://tensorflow.google.cn/api_docs/python/tf

Keras

Keras is an open-source artificial neural network library written in Python. It can serve as a high-level API on top of TensorFlow, Microsoft CNTK, and Theano for designing, debugging, evaluating, applying, and visualizing deep learning models. In TF 2.x it ships as TensorFlow's high-level API and its versions are kept in sync with TensorFlow.

Official API docs: https://keras.io/zh/

tensorflow-addons

TensorFlow 2.x introduced Special Interest Groups (SIGs), which mainly implement algorithms from newly published papers.

TensorFlow 2.x removed tf.contrib, and much of its functionality moved to third-party libraries. TensorFlow Addons is a repository of contributions that conform to well-established API patterns but implement new functionality not available in core TensorFlow. TensorFlow natively supports a large number of operators, layers, metrics, losses, and optimizers; however, in a fast-moving field like machine learning there are many interesting new developments that cannot be integrated into core TensorFlow (because their broad applicability is not yet clear, or they are used mostly by a smaller subset of the community).

In newer versions of TF, tensorflow-seq2seq has been merged into tensorflow-addons, which also contains many other fairly recent implementations.

Install: pip install tensorflow-addons

Official API docs: https://tensorflow.google.cn/addons/api_docs/python/tfa

Note that the tfa version must match your TF and Python versions, or you will get errors at runtime; check the Python/TensorFlow compatibility matrix in the tensorflow-addons documentation before installing.

Library versions used for the experiments in this article:

tensorflow-gpu==2.2.0
keras==2.4.3
tensorflow-addons==0.11.2
numpy==1.18.1
pandas==1.0.1
matplotlib==3.1.3
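
Before going further, it may be worth confirming that the installed packages really match these pins, since tfa/TF/Python mismatches tend to fail at import or run time. A quick sanity check using nothing beyond the standard __version__ attributes:

import tensorflow as tf
import tensorflow_addons as tfa
import keras, numpy, pandas, matplotlib

# Print the installed versions to compare against the list above
print("tensorflow:", tf.__version__)
print("tensorflow-addons:", tfa.__version__)
print("keras:", keras.__version__)
print("numpy:", numpy.__version__, "| pandas:", pandas.__version__,
      "| matplotlib:", matplotlib.__version__)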

OK, that wraps up the tools. This parameter-tuning 🦐 is going online!

Soft Attention

As discussed in the previous article, the most important part of a traditional Attention model is the function F that produces the attention score. The two most widely used choices so far are LuongAttention (reference 3) and BahdanauAttention (reference 4). Their score functions, summarized in the figure from reference 1, are in their basic forms $\mathrm{score}(h_t, \tilde{h}_s) = h_t^{\top} W_a \tilde{h}_s$ (Luong, multiplicative) and $\mathrm{score}(h_{t-1}, \tilde{h}_s) = v_a^{\top} \tanh(W_a h_{t-1} + U_a \tilde{h}_s)$ (Bahdanau, additive).

From the formulas you can see how each one is computed. The underlying idea is largely the same in both; the differences are in the implementation details, as shown in the comparison figure (image from reference 2):

In short, Luong Attention differs from Bahdanau Attention in the following main ways (a code sketch follows the list):

  • The attention is computed differently.
    In Luong Attention, the attention (context vector) at step t, $c_t$, is computed by weighting the decoder hidden state at step t, $h_t$, against every encoder hidden state $\tilde{h}_s$. In Bahdanau Attention, $c_t$ is computed by weighting the decoder hidden state at step t-1, $h_{t-1}$, against every encoder hidden state $\tilde{h}_s$.

  • The decoder's inputs and outputs differ.
    In Bahdanau Attention, the decoder input at step t is the concatenation of the attention $c_t$ and the previous hidden state $h_{t-1}$ (the concat approach in the figure above); this yields the step-t hidden state $h_t$ and directly outputs $y_{t+1}$. Luong Attention instead adds an extra layer on top of the decoder: the attention $c_t$ is concatenated with the decoder's step-t hidden state $h_t$ as input, producing $\tilde{h}_t$ and outputting $y_t$.

The tfa implementation

In tfa.seq2seq, both of these common attention score functions are already nicely encapsulated. Looking at the source code, we find:

LuongAttention

def _luong_score(query, keys, scale):
    """Implements Luong-style (multiplicative) scoring function.
    
    This attention has two forms.  The first is standard Luong attention,
    as described in:

    Minh-Thang Luong, Hieu Pham, Christopher D. Manning.
    "Effective Approaches to Attention-based Neural Machine Translation."
    EMNLP 2015.  https://arxiv.org/abs/1508.04025

    The second is the scaled form inspired partly by the normalized form of
    Bahdanau attention.

    To enable the second form, call this function with `scale=True`.

    Args:
      query: Tensor, shape `[batch_size, num_units]` to compare to keys.
      keys: Processed memory, shape `[batch_size, max_time, num_units]`.
      scale: the optional tensor to scale the attention score.

    Returns:
      A `[batch_size, max_time]` tensor of unnormalized score values.

    Raises:
      ValueError: If `key` and `query` depths do not match.
    """
    depth = query.shape[-1]
    key_units = keys.shape[-1]
    if depth != key_units:
        raise ValueError(
            "Incompatible or unknown inner dimensions between query and keys. "
            "Query (%s) has units: %s.  Keys (%s) have units: %s.  "
            "Perhaps you need to set num_units to the keys' dimension (%s)?"
            % (query, depth, keys, key_units, key_units)
        )

    # Reshape from [batch_size, depth] to [batch_size, 1, depth]
    # for matmul.
    query = tf.expand_dims(query, 1)

    # Inner product along the query units dimension.
    # matmul shapes: query is [batch_size, 1, depth] and
    #                keys is [batch_size, max_time, depth].
    # the inner product is asked to **transpose keys' inner shape** to get a
    # batched matmul on:
    #   [batch_size, 1, depth] . [batch_size, depth, max_time]
    # resulting in an output shape of:
    #   [batch_size, 1, max_time].
    # we then squeeze out the center singleton dimension.
    score = tf.matmul(query, keys, transpose_b=True)
    score = tf.squeeze(score, [1])

    if scale is not None:
        score = scale * score
    return score


class LuongAttention(AttentionMechanism):
    """Implements Luong-style (multiplicative) attention scoring.

    This attention has two forms.  The first is standard Luong attention,
    as described in:

    Minh-Thang Luong, Hieu Pham, Christopher D. Manning.
    [Effective Approaches to Attention-based Neural Machine Translation.
    EMNLP 2015.](https://arxiv.org/abs/1508.04025)

    The second is the scaled form inspired partly by the normalized form of
    Bahdanau attention.

    To enable the second form, construct the object with parameter
    `scale=True`.
    """

    @typechecked
    def __init__(
        self,
        units: TensorLike,
        memory: Optional[TensorLike] = None,
        memory_sequence_length: Optional[TensorLike] = None,
        scale: bool = False,
        probability_fn: str = "softmax",
        dtype: AcceptableDTypes = None,
        name: str = "LuongAttention",
        **kwargs,
    ):
        """Construct the AttentionMechanism mechanism.

        Args:
          units: The depth of the attention mechanism.
          memory: The memory to query; usually the output of an RNN encoder.
            This tensor should be shaped `[batch_size, max_time, ...]`.
          memory_sequence_length: (optional): Sequence lengths for the batch
            entries in memory.  If provided, the memory tensor rows are masked
            with zeros for values past the respective sequence lengths.
          scale: Python boolean. Whether to scale the energy term.
          probability_fn: (optional) string, the name of function to convert
            the attention score to probabilities. The default is `softmax`
            which is `tf.nn.softmax`. The other option is `hardmax`, which is
            hardmax() within this module. Any other value will result in a
            validation error. Defaults to `softmax`.
          dtype: The data type for the memory layer of the attention mechanism.
          name: Name to use when creating ops.
          **kwargs: Dictionary that contains other common arguments for layer
            creation.
        """
        # For LuongAttention, we only transform the memory layer; thus
        # num_units **must** match the expected query depth.
        self.probability_fn_name = probability_fn
        probability_fn = self._process_probability_fn(self.probability_fn_name)

        def wrapped_probability_fn(score, _):
            return probability_fn(score)

        memory_layer = kwargs.pop("memory_layer", None)
        if not memory_layer:
            memory_layer = tf.keras.layers.Dense(
                units, name="memory_layer", use_bias=False, dtype=dtype
            )
        self.units = units
        self.scale = scale
        self.scale_weight = None
        super().__init__(
            memory=memory,
            memory_sequence_length=memory_sequence_length,
            query_layer=None,
            memory_layer=memory_layer,
            probability_fn=wrapped_probability_fn,
            name=name,
            dtype=dtype,
            **kwargs,
        )

    def build(self, input_shape):
        super().build(input_shape)
        if self.scale and self.scale_weight is None:
            self.scale_weight = self.add_weight(
                "attention_g", initializer=tf.ones_initializer, shape=()
            )
        self.built = True

    def _calculate_attention(self, query, state):
        """Score the query based on the keys and values.

        Args:
          query: Tensor of dtype matching `self.values` and shape
            `[batch_size, query_depth]`.
          state: Tensor of dtype matching `self.values` and shape
            `[batch_size, alignments_size]`
            (`alignments_size` is memory's `max_time`).

        Returns:
          alignments: Tensor of dtype matching `self.values` and shape
            `[batch_size, alignments_size]` (`alignments_size` is memory's
            `max_time`).
          next_state: Same as the alignments.
        """
        score = _luong_score(query, self.keys, self.scale_weight)
        alignments = self.probability_fn(score, state)
        next_state = alignments
        return alignments, next_state

    def get_config(self):
        config = {
            "units": self.units,
            "scale": self.scale,
            "probability_fn": self.probability_fn_name,
        }
        base_config = super().get_config()
        return {**base_config, **config}

    @classmethod
    def from_config(cls, config, custom_objects=None):
        config = AttentionMechanism.deserialize_inner_layer_from_config(
            config, custom_objects=custom_objects
        )
        return cls(**config)

From the source we can easily find how the score is actually computed:

def _luong_score(query, keys, scale):
    '''Earlier code omitted; only the score computation is shown.'''
    score = tf.matmul(query, keys, transpose_b=True)
    score = tf.squeeze(score, [1])
    # scale is an optional learned scalar weight
    if scale is not None:
        score = scale * score
    return score

class LuongAttention(AttentionMechanism):
    '''...code omitted...'''
    def _calculate_attention(self, query, state):
        score = _luong_score(query, self.keys, self.scale_weight)
        alignments = self.probability_fn(score, state)  # softmax by default
        next_state = alignments
        return alignments, next_state
    '''...rest of the class omitted...'''

Usage:

tfa.seq2seq.LuongAttention(
    # number of units, i.e. the depth of the attention mechanism
    units: tfa.types.TensorLike,
    # optional; the memory to query, usually the RNN encoder output,
    # shaped [batch_size, max_time, ...]
    memory: Optional[TensorLike] = None,
    # optional; sequence lengths for the batch entries in memory, used for
    # masking: positions past each true sequence length are zeroed out
    # (see the documentation of masking layers)
    memory_sequence_length: Optional[TensorLike] = None,
    scale: bool = False,  # whether to add the scale weight
    probability_fn: str = 'softmax',  # default
    dtype: tfa.types.AcceptableDTypes = None,  # data type
    name: str = 'LuongAttention',
    **kwargs
)
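
As a concrete usage sketch (not from the original article: the encoder outputs, decoder cell, and all shapes below are made up for illustration), one way to build a LuongAttention and plug it into a decoder cell is via tfa.seq2seq.AttentionWrapper:

import tensorflow as tf
import tensorflow_addons as tfa

batch_size, max_time, units = 4, 10, 32

# Toy encoder outputs standing in for a real RNN encoder
encoder_outputs = tf.random.normal([batch_size, max_time, units])
memory_lengths = tf.constant([10, 8, 6, 9])

attention_mechanism = tfa.seq2seq.LuongAttention(
    units=units,
    memory=encoder_outputs,
    memory_sequence_length=memory_lengths,
)

# Wrap a decoder cell so attention is recomputed at every decoding step
decoder_cell = tf.keras.layers.LSTMCell(units)
attn_cell = tfa.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=units
)

# Run a single decoding step with a dummy input embedding
state = attn_cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)
step_input = tf.random.normal([batch_size, units])
output, state = attn_cell(step_input, state)
print(output.shape)  # (4, 32)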

A few points worth noting:

  1. The units argument:
    This is the number of units. Computing the score requires the decoder's $h_{t-1}$ (or $h_t$) together with the encoder's $\tilde{h}_s$, and the two may not have the same dimensionality, so they need to be projected into a common space; that is what the weights $W_a$ and $U_a$ are for. In the code, units therefore declares a fully connected Dense layer that unifies the dimensions for the next step of the computation, as we can see:
memory_layer = kwargs.pop("memory_layer", None)
if not memory_layer:
    memory_layer = tf.keras.layers.Dense(
        units, name="memory_layer", use_bias=False, dtype=dtype
    )
  2. The scale argument:
    The source code explains it already: "The second is the scaled form inspired partly by the normalized form of Bahdanau attention. To enable the second form, construct the object with parameter scale=True." A small check of what this adds is shown after this list.
if self.scale and self.scale_weight is None:
	self.scale_weight = self.add_weight(
	     "attention_g", initializer=tf.ones_initializer, shape=()
	 )
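
If my reading of build() above is right, a quick way to see what scale=True adds is to construct the mechanism with some toy memory (the weights are created as soon as memory is passed in) and compare the trainable weights; the scaled variant should carry one extra scalar weight named attention_g:

import tensorflow as tf
import tensorflow_addons as tfa

memory = tf.random.normal([2, 5, 16])   # [batch, max_time, depth], toy shapes

plain = tfa.seq2seq.LuongAttention(units=16, memory=memory)
scaled = tfa.seq2seq.LuongAttention(units=16, memory=memory, scale=True)

print([w.name for w in plain.trainable_weights])   # memory_layer kernel only
print([w.name for w in scaled.trainable_weights])  # memory_layer kernel + attention_g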

Bahdanau Attention

The differences between Bahdanau Attention and LuongAttention were explained above, so let's go straight to the source:

def _bahdanau_score(
    processed_query, keys, attention_v, attention_g=None, attention_b=None
):
    """Implements Bahdanau-style (additive) scoring function.

    This attention has two forms.  The first is Bahdanau attention,
    as described in:

    Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
    "Neural Machine Translation by Jointly Learning to Align and Translate."
    ICLR 2015. https://arxiv.org/abs/1409.0473

    The second is the normalized form.  This form is inspired by the
    weight normalization article:

    Tim Salimans, Diederik P. Kingma.
    "Weight Normalization: A Simple Reparameterization to Accelerate
     Training of Deep Neural Networks."
    https://arxiv.org/abs/1602.07868

    To enable the second form, please pass in attention_g and attention_b.

    Args:
      processed_query: Tensor, shape `[batch_size, num_units]` to compare to
        keys.
      keys: Processed memory, shape `[batch_size, max_time, num_units]`.
      attention_v: Tensor, shape `[num_units]`.
      attention_g: Optional scalar tensor for normalization.
      attention_b: Optional tensor with shape `[num_units]` for normalization.

    Returns:
      A `[batch_size, max_time]` tensor of unnormalized score values.
    """
    # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
    processed_query = tf.expand_dims(processed_query, 1)
    if attention_g is not None and attention_b is not None:
        normed_v = (
            attention_g
            * attention_v
            * tf.math.rsqrt(tf.reduce_sum(tf.square(attention_v)))
        )
        return tf.reduce_sum(
            normed_v * tf.tanh(keys + processed_query + attention_b), [2]
        )
    else:
        return tf.reduce_sum(attention_v * tf.tanh(keys + processed_query), [2])


class BahdanauAttention(AttentionMechanism):
    """Implements Bahdanau-style (additive) attention.

    This attention has two forms.  The first is Bahdanau attention,
    as described in:

    Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
    "Neural Machine Translation by Jointly Learning to Align and Translate."
    ICLR 2015. https://arxiv.org/abs/1409.0473

    The second is the normalized form.  This form is inspired by the
    weight normalization article:

    Tim Salimans, Diederik P. Kingma.
    "Weight Normalization: A Simple Reparameterization to Accelerate
     Training of Deep Neural Networks."
    https://arxiv.org/abs/1602.07868

    To enable the second form, construct the object with parameter
    `normalize=True`.
    """

    @typechecked
    def __init__(
        self,
        units: TensorLike,
        memory: Optional[TensorLike] = None,
        memory_sequence_length: Optional[TensorLike] = None,
        normalize: bool = False,
        probability_fn: str = "softmax",
        kernel_initializer: Initializer = "glorot_uniform",
        dtype: AcceptableDTypes = None,
        name: str = "BahdanauAttention",
        **kwargs,
    ):
        """Construct the Attention mechanism.

        Args:
          units: The depth of the query mechanism.
          memory: The memory to query; usually the output of an RNN encoder.
            This tensor should be shaped `[batch_size, max_time, ...]`.
          memory_sequence_length: (optional): Sequence lengths for the batch
            entries in memory.  If provided, the memory tensor rows are masked
            with zeros for values past the respective sequence lengths.
          normalize: Python boolean.  Whether to normalize the energy term.
          probability_fn: (optional) string, the name of function to convert
            the attention score to probabilities. The default is `softmax`
            which is `tf.nn.softmax`.
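
The remainder of the class mirrors LuongAttention above: _calculate_attention delegates to _bahdanau_score with the attention_v (and, when normalize=True, attention_g and attention_b) weights created in build(). Construction and usage also mirror the earlier Luong example; a minimal sketch with the same made-up toy shapes:

import tensorflow as tf
import tensorflow_addons as tfa

encoder_outputs = tf.random.normal([4, 10, 32])
memory_lengths = tf.constant([10, 8, 6, 9])

attention_mechanism = tfa.seq2seq.BahdanauAttention(
    units=32,
    memory=encoder_outputs,
    memory_sequence_length=memory_lengths,
    normalize=False,   # set True for the weight-normalized form
)

decoder_cell = tf.keras.layers.LSTMCell(32)
attn_cell = tfa.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=32
)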
