LSTM, CNN, Transformer: each has its own strengths

Posted by MrCharles


I’ll list some bullet points covering the main innovations introduced by transformers, followed by bullet points covering the main characteristics of the other architectures you mentioned, so that we can then compare them.

Transformers

Transformers (“Attention Is All You Need”) were introduced in the context of machine translation with the purpose of avoiding recurrence, in order to allow parallel computation (and so reduce training time) and also to reduce drops in performance due to long dependencies. Their main characteristics are:

Non-sequential: sentences are processed as a whole rather than word by word.
Self-attention: this is the newly introduced ‘unit’ used to compute similarity scores between the words in a sentence.
Positional embeddings: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information about the specific position of a token in a sentence.
The first point is the main reason why transformers do not suffer from long-dependency issues. The original transformers do not rely on past hidden states to capture dependencies on previous words; they instead process a sentence as a whole, which is why there is no risk of losing (or “forgetting”) past information. Moreover, multi-head attention and positional embeddings both provide information about the relationship between different words.
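
To make this concrete, here is a minimal PyTorch sketch of single-head scaled dot-product self-attention plus sinusoidal positional encodings. It illustrates the idea only, not the paper’s full multi-head implementation; all names, shapes, and sizes below are arbitrary choices for the example.

```python
# Minimal sketch: single-head scaled dot-product self-attention
# plus sinusoidal positional encodings. Shapes: (batch, seq_len, d_model).
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Score every token against every other token at once,
    so the whole sentence is processed in parallel (no recurrence)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project to queries / keys / values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    weights = F.softmax(scores, dim=-1)           # similarity of each word to every other word
    return weights @ v

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings: inject token order without recurrence."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Toy usage: 2 sentences, 5 tokens each, model width 16.
d_model = 16
x = torch.randn(2, 5, d_model) + sinusoidal_positions(5, d_model)
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
out = self_attention(x, w_q, w_k, w_v)            # -> (2, 5, 16)
```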

RNN / LSTM

Recurrent neural networks and long short-term memory models are, as far as this question is concerned, almost identical in their core properties:

Sequential processing: sentences must be processed word by word.
Past information is retained through past hidden states: sequence-to-sequence models follow the Markov property, so each state is assumed to depend only on the previously seen state.
The first property is the reason why RNNs and LSTMs can’t be trained in parallel. In order to encode the second word in a sentence, I need the previously computed hidden state of the first word, so I have to compute that first. The second property is a bit more subtle, but not hard to grasp conceptually. Information in RNNs and LSTMs is retained thanks to previously computed hidden states. The point is that the encoding of a specific word is retained only for the next time step, which means that the encoding of a word strongly affects only the representation of the next word, so its influence is quickly lost after a few time steps. LSTMs (and also GRUs) can extend the dependency range they can learn somewhat, thanks to a deeper processing of the hidden states through dedicated gating units (which comes with an increased number of parameters to train), but the problem is nevertheless inherent to recurrence. Another way in which people mitigate this problem is to use bidirectional models, which encode the same sentence from start to end and from end to start, allowing words at the end of a sentence to have a stronger influence on the hidden representation. However, this is a workaround rather than a real solution for very long dependencies.
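
The sequential bottleneck is easy to see in code. Below is a minimal sketch, assuming PyTorch’s nn.LSTMCell, where the explicit loop over time steps makes it clear that step t cannot be computed before step t−1; the tensor sizes are arbitrary.

```python
# Minimal sketch of word-by-word LSTM processing with nn.LSTMCell.
import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64
cell = nn.LSTMCell(embed_dim, hidden_dim)

# Toy input: a batch of 2 "sentences", 10 token embeddings each.
tokens = torch.randn(10, 2, embed_dim)            # (seq_len, batch, embed_dim)
h = torch.zeros(2, hidden_dim)
c = torch.zeros(2, hidden_dim)

outputs = []
for t in range(tokens.size(0)):                   # word-by-word loop: step t depends on step t-1
    h, c = cell(tokens[t], (h, c))                # the hidden state carries the past information
    outputs.append(h)
# The representation of word t is influenced mostly by nearby previous words;
# information from distant words has to survive many such updates.
```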

CNN

Convolutional neural networks are also widely used in NLP, since they are quite fast to train and effective with short texts. The way they tackle dependencies is by applying different kernels to the same sentence, and indeed since their first application to text (Convolutional Neural Networks for Sentence Classification) they have been implemented as multichannel CNNs. Why do different kernels allow the model to learn dependencies? Because a kernel of size 2, for example, learns relationships between pairs of words, a kernel of size 3 captures relationships between triplets of words, and so on. The evident problem here is that the number of different kernels required to capture dependencies among all possible combinations of words in a sentence would be enormous and impractical, because of the exponentially growing number of combinations as the maximum length of the input sentences increases.
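
As an illustration, here is a minimal sketch in the spirit of the multichannel text CNN referenced above (Kim, 2014): parallel Conv1d branches with kernel sizes 2, 3, and 4 capture pair-, triplet-, and 4-gram-level patterns, which are then max-pooled over time. All hyperparameters and names are illustrative, not taken from the paper.

```python
# Minimal sketch of a multi-kernel-size text CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, embed_dim=32, n_filters=16, kernel_sizes=(2, 3, 4), n_classes=2):
        super().__init__()
        # One convolution branch per kernel size: size 2 sees word pairs,
        # size 3 sees triplets, and so on.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):                         # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                     # Conv1d expects (batch, channels, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # concatenate the n-gram features

model = TextCNN()
logits = model(torch.randn(4, 20, 32))            # 4 sentences, 20 tokens each
```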

To summarize, Transformers are better than the other architectures because they completely avoid recurrence, processing sentences as a whole and learning relationships between words thanks to multi-head attention mechanisms and positional embeddings. Nevertheless, it must be pointed out that transformers too can only capture dependencies within the fixed input size used to train them: if I use a maximum sentence size of 50, the model will not be able to capture dependencies between the first word of a sentence and words that occur more than 50 words later, such as in another paragraph. Newer transformers like Transformer-XL try to overcome exactly this issue by partially re-introducing recurrence: they store the hidden states of already-encoded segments and leverage them when encoding the subsequent ones.
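
To give a feel for that segment-level recurrence idea, here is a deliberately simplified, hypothetical sketch: hidden states from the previous segment are cached (detached from the gradient) and concatenated as extra keys and values for the current segment. It omits relative positional encodings and everything else Transformer-XL actually does, so it is only an illustration of the caching idea, not the paper’s implementation.

```python
# Hypothetical sketch of segment-level memory: attend over the current
# segment plus cached (detached) hidden states of the previous segment.
import math
import torch
import torch.nn.functional as F

def attend_with_memory(x, memory, w_q, w_k, w_v):
    """x: current segment (batch, seg_len, d); memory: cached previous hidden states."""
    context = torch.cat([memory, x], dim=1)       # extend the attendable context
    q = x @ w_q
    k, v = context @ w_k, context @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    return F.softmax(scores, dim=-1) @ v

d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
memory = torch.zeros(1, 0, d)                     # no memory for the first segment
for segment in torch.randn(3, 1, 8, d):           # three consecutive 8-token segments
    out = attend_with_memory(segment, memory, w_q, w_k, w_v)
    memory = out.detach()                         # cache hidden states, no gradient through them
```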
