The Transformer


The Transformer is a deep network built by stacking self-attention layers. At the time of writing it was regarded as the strongest feature extractor in NLP.
Paper: Attention Is All You Need

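Overall, the model still consists of two parts, the Encoders and the Decoders. Each part is a stack of 6 Encoders or 6 Decoders; the layers within a stack have identical structure but do not share weights.

Each Encoder has two sub-layers: a Multi-head self-attention layer and a Feed Forward NN layer.

Each Decoder has three sub-layers: a Multi-head self-attention layer, an Encoder-Decoder Attention layer, and a Feed Forward NN layer.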

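Motivation: when the model processes each word (each position in the input sequence), self-attention lets it look at the other positions in the input sequence for clues that lead to a better encoding of that word.

Written in matrix form, the computation runs in parallel over all positions.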
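
As a minimal sketch of the matrix form, the NumPy snippet below computes scaled dot-product self-attention for every position at once. The weight names `W_q`, `W_k`, `W_v`, the toy sizes, and the random inputs are assumptions made for illustration, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a whole sequence.

    X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

# Toy example: 4 positions, embedding size 8, head size 3 (all hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 3)) for _ in range(3))
Z, weights = self_attention(X, W_q, W_k, W_v)
```

Row `i` of `weights` says how much position `i` attends to every other position; no loop over positions is needed.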

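Motivation: projecting the information into different subspaces may capture attention patterns at different positions.
The self-attention computation is repeated several times (the paper uses 8 heads), each time with different weight matrices ($W_i^Q$, $W_i^K$, and $W_i^V$). This yields several different output matrices $Z_i$, which are concatenated into one long matrix $Z = [Z_1, \dots, Z_8]$ and then multiplied by a parameter matrix $W^O$ that compresses the result back to a lower dimension (the same as the embedding dimension).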
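
The multi-head computation can be sketched as follows. This is an illustrative NumPy version under assumed toy sizes (`d_model = 16`, 8 heads); the function name and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)                  # one Z_i per head
    concat = np.concatenate(outputs, axis=-1)  # (seq_len, h * d_k)
    return concat @ W_o                        # back to (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, h = 16, 8          # toy sizes, assumed for the example
d_k = d_model // h
X = rng.normal(size=(5, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, heads, W_o)
```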

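Word order is very important information in NLP, so position encoding is added as a way to account for the order of words in the input sequence. The position encoding is added directly to the embedding vector to form the actual input vector for each word.
The paper gives two position encoding formulas:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$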
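
A small sketch of the sinusoidal position encoding from the paper, assuming an even `d_model`; the function name and the toy sizes are chosen here for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position encoding; d_model is assumed to be even."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# The encoding is simply added to the word embeddings:
embeddings = np.zeros((50, 16))    # placeholder embeddings for the example
inputs = embeddings + pe
```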

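The Feed Forward layer is a simple fully connected network with a ReLU activation. In the paper the hidden layer of this network has 2048 dimensions. The formula is:

$$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$$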
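
A minimal NumPy sketch of this position-wise feed-forward layer. The sizes below (8 and 32) are small stand-ins chosen for the example; the paper's hidden size is 2048.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Linear -> ReLU -> Linear, applied to every position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32          # toy sizes, assumed for illustration
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.ones(d_model)
out = feed_forward(x, W1, b1, W2, b2)
```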

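At the end of each sub-layer, whose output matrix is $Z$, we add the sub-layer's input matrix $X$ directly to $Z$ and then normalize: $\mathrm{LayerNorm}(X + Z)$. The Norm function comes from reference 1: Layer Normalization.
There are many normalization methods, but they share one goal: transforming the input into data with mean 0 and variance 1. Normalizing before the data reaches an activation function keeps the inputs out of the activation's saturated region.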
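
The residual connection and normalization step might look like this in NumPy. This is a simplified sketch: it leaves out the learnable gain and bias that Layer Normalization normally includes, and `eps` is an assumed small constant for numerical safety.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to mean 0 and variance 1.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))   # sub-layer input (toy data)
Z = rng.normal(size=(3, 8))   # sub-layer output (toy data)
out = add_and_norm(X, Z)
```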

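The Linear layer is a simple fully connected network that projects the vector output by the last Decoder into a much higher-dimensional space (the size of the vocabulary).
The softmax layer turns the Linear layer's output vector into probabilities, and the word with the highest probability is chosen as the output.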

Finally, the Encoders pass $K$ and $V$ to the Encoder-Decoder Attention layer of every Decoder.

The padding mask is needed in every scaled dot-product attention, whereas the sequence mask is used only in the Decoder's self-attention.

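Sentences in the corpus have different lengths, so they must be aligned. Using a length threshold that we set (typically 255), longer sequences are truncated, keeping the left part, and shorter sequences are padded with 0s at the end.
In scaled dot-product attention, the padded positions must not receive attention, so before the softmax in self-attention the scores at these positions are set to $-\infty$; after the softmax their probabilities become 0.
This is the Mask (opt.) block in the paper's scaled dot-product attention diagram.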
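
A short sketch of padding plus the padding mask. It assumes token id 0 is reserved for padding and uses a large negative number (`-1e9`) in place of negative infinity; the helper names are hypothetical.

```python
import numpy as np

def pad_or_truncate(ids, max_len, pad_id=0):
    # Keep the left part of long sequences; pad short ones at the end.
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def masked_softmax(scores, key_is_pad):
    """scores: (q_len, k_len); key_is_pad: (k_len,) boolean."""
    scores = np.where(key_is_pad[None, :], -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.array(pad_or_truncate([5, 9, 2], max_len=5))
key_is_pad = tokens == 0                 # True at padded positions
weights = masked_softmax(np.zeros((5, 5)), key_is_pad)
```

With uniform scores, each row of `weights` spreads its attention evenly over the three real tokens and gives none to the two padded positions.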

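The sequence mask keeps the Decoder from seeing future information: the decoder's attention may attend only to the output words before the word currently being decoded, and cannot depend on later words that have not been decoded yet.
So, as with the padding mask, the scores at the later word positions are set to $-\infty$, and after the softmax their probabilities become 0.
This step corresponds to the first component of the Decoder: Masked Multi-head Attention.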
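
The sequence mask can be illustrated with an upper-triangular boolean matrix. This is a minimal NumPy sketch with hypothetical function names; the uniform scores are toy data.

```python
import numpy as np

def causal_mask(n):
    # True above the diagonal: position i must not see positions j > i.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def causal_softmax(scores):
    n = scores.shape[-1]
    scores = np.where(causal_mask(n), -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

weights = causal_softmax(np.zeros((4, 4)))
```

Row `i` of `weights` is non-zero only for columns `0..i`, so each position attends to itself and earlier positions only.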

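Training uses cross-entropy or KL divergence to measure the gap between the model's output distribution and the target distribution, and then optimizes all parameters with backpropagation.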
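
As an illustration of the loss, the sketch below computes the average cross-entropy between predicted word probabilities and the target words. The probabilities and targets are made-up toy values. When the target is a one-hot distribution, cross-entropy and KL divergence give the same value, because the entropy of a one-hot distribution is 0.

```python
import numpy as np

def cross_entropy(probs, target_ids, eps=1e-12):
    """probs: (seq_len, vocab) softmax outputs; target_ids: (seq_len,)."""
    picked = probs[np.arange(len(target_ids)), target_ids]
    # eps guards against log(0); it is an assumption of this sketch.
    return -np.mean(np.log(picked + eps))

probs = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
targets = np.array([0, 1])
loss = cross_entropy(probs, targets)
```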

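Outputting the word at the position of the maximum value in the final softmax layer is called greedy decoding.
A more reasonable decoding method is beam search. Suppose that at position #1 the two words with the highest decoded probabilities are I and me. When decoding position #2, the word at position #1 is taken to be I and me in turn, and the model is run twice; from those results the two with the highest scores (or lowest errors) are kept, and so on. In the end we obtain the two highest-scoring sequences.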
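
A toy sketch of beam search. The next-word table `TABLE` and its probabilities are invented for the example and stand in for the decoder's softmax output; a beam width of 1 reduces to greedy decoding.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word.
TABLE = {
    "<s>": {"I": 0.6, "me": 0.4},
    "I": {"am": 0.55, "is": 0.45},
    "me": {"too": 0.9, "am": 0.1},
}

def step_fn(seq):
    return TABLE[seq[-1]]

def beam_search(step_fn, start, beam_width, steps):
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the beam_width highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(step_fn, "<s>", beam_width=1, steps=2)
beam = beam_search(step_fn, "<s>", beam_width=2, steps=2)
```

In this toy table greedy decoding commits to "I" at position #1 and ends with "I am" (probability 0.33), while a beam of width 2 keeps "me" alive and finds "me too" (probability 0.36).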

(Repost) The Evolved Transformer - Enhancing Transformer with Neural Architecture Search

2019-03-26 19:14:33

A new paper by Google Brain presents the first neural architecture search (NAS) to improve the Transformer, one of the leading architectures for many Natural Language Processing tasks. The paper uses an evolution-based algorithm, with a novel approach to speed up the search process, to mutate the Transformer architecture and discover a better one: the Evolved Transformer (ET). The new architecture performs better than the original Transformer, especially when comparing small mobile-friendly models, and requires less training time. The concepts presented in the paper, such as the use of NAS to evolve human-designed models, have the potential to help researchers improve their architectures in many other areas.

Background

Transformers, first suggested in 2017, introduced an attention mechanism that processes the entire text input simultaneously to learn contextual relations between words. A Transformer includes two parts: an encoder that reads the text input and generates a latent representation of it (e.g. a vector for each word), and a decoder that produces the translated text from that representation. The design has proven to be very effective and many of today’s state-of-the-art models (e.g. BERT, GPT-2) are based on Transformers. An in-depth review of Transformers can be found here.

While the Transformer’s architecture was hand-crafted by talented researchers, an alternative is to use search algorithms. Their goal is to find the best architecture in a given search space, that is, a space that defines the constraints of any model in it, such as the number of layers, the maximum number of parameters, etc. A well-known search algorithm is the evolution-based Tournament Selection, in which the fittest architectures survive and mutate while the weakest die. The advantage of this algorithm is that it is simple yet efficient. The paper relies on a version presented in Real et al. (see pseudo-code in Appendix A):

  1. The first pool of models is initialized by randomly sampling the search space or by using a known model as a seed.
  2. These models are trained for the given task and randomly sampled to create a subpopulation.
  3. The best models are mutated by randomly changing a small part of their architecture, such as replacing a layer or changing the connection between two layers.
  4. The mutated models (child models) are added to the pool while the weakest model from the subpopulation is removed from the pool.
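
The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: an "architecture" is just a tuple of integers, and the `fitness` function (matching a made-up `TARGET`) stands in for training and evaluating a real model.

```python
import random

TARGET = (3, 1, 4, 1, 5)  # hypothetical "ideal" architecture encoding

def fitness(arch):
    # Stand-in for "train the model and measure its quality".
    return sum(a == t for a, t in zip(arch, TARGET))

def mutate(arch, rng):
    # Randomly change one small part of the architecture.
    i = rng.randrange(len(arch))
    child = list(arch)
    child[i] = rng.randrange(6)
    return tuple(child)

def tournament_selection(init_pool, sample_size, rounds, rng):
    pool = list(init_pool)
    for _ in range(rounds):
        sub = rng.sample(pool, sample_size)   # random subpopulation
        parent = max(sub, key=fitness)        # the fittest is mutated
        pool.remove(min(sub, key=fitness))    # the weakest is removed
        pool.append(mutate(parent, rng))      # the child joins the pool
    return pool

rng = random.Random(0)
init_pool = [tuple(rng.randrange(6) for _ in TARGET) for _ in range(20)]
final_pool = tournament_selection(init_pool, sample_size=5, rounds=300, rng=rng)
best = max(final_pool, key=fitness)
```

Because only the weakest member of each sample is removed, the pool size stays constant and the best fitness in the pool never decreases.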

Defining the search space is an additional challenge when solving a search problem. If the space is too broad and undefined, the algorithm might not converge and find a better model in a reasonable amount of time. On the other hand, a space that is too narrow reduces the probability of finding an innovative model that outperforms the hand-crafted ones. The NASNet search architecture approaches this challenge by defining “stackable cells”. A cell can contain a set of operations on its input (e.g. convolution) from a predefined vocabulary and the model is built by stacking the same cell architecture several times. The goal of the search algorithm is only to find the best architecture of a cell.

 
Figure: An example of the NASNet search architecture for an image classification task that contains two types of stackable cells (Normal and Reduction Cell). Source: Zoph et al.

How Evolved Transformer (ET) works

As the Transformer architecture has proven itself numerous times, the goal of the authors was to use a search algorithm to evolve it into an even better model. As a result, the model frame and the search space were designed to fit the original Transformer architecture in the following way:

  1. The algorithm searches for two types of cells: one for the encoder, made of six blocks, and another for the decoder, made of eight blocks.
  2. Each block includes two branches of operations, as shown in the following chart. For example, the inputs are any two outputs of the previous layers (blocks), a layer can be a standard convolution, an attention head (see Transformer), etc., and the activation can be ReLU or Leaky ReLU. Some elements can also be an identity operation or a dead-end.
  3. Each cell can be repeated up to six times.

In total, the search space adds up to ~7.3 * 10^115 possible models. A detailed description of the space can be found in the appendix of the paper.

 
Figure: ET Stackable Cell format. Source: ET

Progressive Dynamic Hurdles (PDH)

Searching the entire space might take too long if the training and evaluation of each model are prolonged. It’s possible to overcome this problem in the field of image classification by performing the search on a proxy task, such as training on a smaller dataset (e.g. CIFAR-10) before testing on a bigger dataset such as ImageNet. However, the authors couldn’t find an equivalent solution for translation models and therefore introduced an upgraded version of the tournament selection algorithm.

Instead of training each model in the pool on the entire dataset, a process that takes ~10 hours on a single TPU, the training is done gradually and only for the best models in the pool. The models in the pool are trained on a given number of samples and more models are created according to the original tournament selection algorithm. Once there are enough models in the pool, a “fitness” threshold (hurdle) is calculated and only the models with better results (fitness) continue to the next step. These models are trained on another batch of samples, and the next models are created and mutated based on them. As a result, PDH significantly reduces the training time spent on failing models and increases search efficiency. The downside is that “slow starters”, models that need more samples to achieve good results, might be missed.
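
The staged training idea can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's algorithm: the hurdle values and step increments are fixed by hand rather than computed from the pool, a model is represented by a single hypothetical "quality" number, and `train_fn` is a made-up learning curve.

```python
def train_fn(quality, steps):
    # Hypothetical learning curve: fitness grows with steps and saturates.
    return quality * steps / (steps + 1000)

def train_with_hurdles(model, train_fn, hurdles, step_increments):
    """Train in stages; stop as soon as the model fails a hurdle.

    Returns (fitness, total_steps_spent).
    """
    steps = step_increments[0]
    fitness = train_fn(model, steps)
    for hurdle, extra in zip(hurdles, step_increments[1:]):
        if fitness <= hurdle:
            break               # failing model: no more compute is spent
        steps += extra
        fitness = train_fn(model, steps)
    return fitness, steps

hurdles = [0.3, 0.5]                 # assumed fitness thresholds
increments = [1000, 1000, 2000]      # assumed training steps per stage
weak = train_with_hurdles(0.4, train_fn, hurdles, increments)
strong = train_with_hurdles(0.9, train_fn, hurdles, increments)
```

Here the weak model stops after the first 1000 steps, while the strong model passes both hurdles and receives all 4000 steps, which is where the savings on failing models come from.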

 
Figure: An example of the tournament selection training process. The models above the fitness threshold are trained on more samples and therefore reach better fitness. The fitness threshold increases in steps as new models are created. Source: ET

To “help” the search achieve high-quality results, the authors initialized the search with the Transformer model instead of a completely random model. This step is necessary due to computing resource constraints. The table below compares the performance of the best model (using the perplexity metric, the lower the better) for different search techniques: Transformer vs. random initialization and PDH vs. regular tournament selection (with a given number of training steps per model).

 
Table: Comparison of different search techniques. Source: ET

The authors kept the total training time of each technique fixed, and therefore the number of models differs: more training steps per model -> fewer models can be searched, and vice versa. The PDH technique achieves the best results on average while being more stable (low variance). When reducing the number of training steps (30K), the regular technique performs almost as well on average as PDH. However, it suffers from a higher variance as it’s more prone to mistakes in the search process.

Results

The paper uses the described search space and PDH to find a translation model that performs well on known datasets such as WMT’14 En-De. The search algorithm evaluated 15,000 models using 270 TPUs for a total of almost 1 billion steps, while without PDH the total required steps would have been 3.6 billion. The best model found was named the Evolved Transformer (ET); it achieved better results than the original Transformer (perplexity of 3.95 vs 4.05) and required less training time. Its encoder and decoder block architectures are shown in the following chart (compared to the original ones).

 
Figure: Transformer and ET encoder and decoder architectures. Source: ET

While some of the ET components are similar to the original ones, others are less conventional, such as depth-wise separable convolutions, which are more parameter-efficient but less powerful than normal convolutions. Another interesting example is the use of parallel branches (e.g. two convolution and ReLU layers for the same input) in both the decoder and the encoder. The authors also discovered in an ablation study that the superior performance cannot be attributed to any single mutation of the ET compared to the Transformer.

Both ET and Transformer are heavy models with over 200 million parameters. Their size can be reduced by changing the input embedding (i.e. a word vector) size and the rest of the layers accordingly. Interestingly, the smaller the model the bigger the advantage of ET over Transformer. For example, for the smallest model with only 7 million parameters, ET outperforms Transformer by 1 perplexity point (7.62 vs 8.62).

 
Figure: Comparison of Transformer and ET for different model sizes (according to embedding size). FLOPS represents the training duration of the model. Source: ET

Implementation details

As mentioned, the search algorithm used over 200 Google TPUs in order to train thousands of models in a reasonable time. The training of the final ET model itself is faster than that of the original Transformer but still takes hours with a single TPU on the WMT’14 En-De dataset.

The code is open-source and is available for TensorFlow here.

Conclusion

The Evolved Transformer shows the potential of combining hand-crafted design with neural architecture search to create architectures that are consistently better and faster to train. As computing resources are still limited (even for Google), researchers still need to carefully design the search space and improve the search algorithms to outperform human-designed models. However, this trend is likely to grow stronger over time.


Appendix A - Tournament Selection Algorithm

The paper is based on the tournament selection algorithm from Real et al., except for the aging process of discarding the oldest models from the population.

Figure: Pseudo-code of the tournament selection algorithm.
