Translation with a seq2seq attention model in PyTorch
Posted www-caiyin-com
Below are my annotated notes on the seq2seq + attention model for French-to-English translation, written against PyTorch 1.0 (the code also runs on PyTorch 0.4):
# -*- coding: utf-8 -*-
"""
Translation with a Sequence to Sequence Network and Attention
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro/practical-pytorch>`_

In this project we will be teaching a neural network to translate from
French to English.

::

    [KEY: > input, = target, < output]

    > il est en train de peindre un tableau .
    = he is painting a picture .
    < he is painting a picture .

    > pourquoi ne pas essayer ce vin delicieux ?
    = why not try that delicious wine ?
    < why not try that delicious wine ?

    > elle n est pas poete mais romanciere .
    = she is not a poet but a novelist .
    < she not not a poet but a novelist .

    > vous etes trop maigre .
    = you re too skinny .
    < you re all alone .

... to varying degrees of success.

This is made possible by the simple but powerful idea of the `sequence
to sequence network <http://arxiv.org/abs/1409.3215>`__, in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

To improve upon this model we'll use an `attention
mechanism <https://arxiv.org/abs/1409.0473>`__, which lets the decoder
learn to focus over a specific range of the input sequence.

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ For installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are former Lua Torch user


It would also be useful to know about Sequence to Sequence networks and
how they work:

-  `Learning Phrase Representations using RNN Encoder-Decoder for
   Statistical Machine Translation <http://arxiv.org/abs/1406.1078>`__
-  `Sequence to Sequence Learning with Neural
   Networks <http://arxiv.org/abs/1409.3215>`__
-  `Neural Machine Translation by Jointly Learning to Align and
   Translate <https://arxiv.org/abs/1409.0473>`__
-  `A Neural Conversational Model <http://arxiv.org/abs/1506.05869>`__

You will also find the previous tutorials on
:doc:`/intermediate/char_rnn_classification_tutorial`
and :doc:`/intermediate/char_rnn_generation_tutorial`
helpful as those concepts are very similar to the Encoder and Decoder
models, respectively.

And for more, read the papers that introduced these topics:

-  `Learning Phrase Representations using RNN Encoder-Decoder for
   Statistical Machine Translation <http://arxiv.org/abs/1406.1078>`__
-  `Sequence to Sequence Learning with Neural
   Networks <http://arxiv.org/abs/1409.3215>`__
-  `Neural Machine Translation by Jointly Learning to Align and
   Translate <https://arxiv.org/abs/1409.0473>`__
-  `A Neural Conversational Model <http://arxiv.org/abs/1506.05869>`__


**Requirements**
"""
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

######################################################################
# Loading data files
# ==================
#
# The data for this project is a set of many thousands of English to
# French translation pairs.
#
# `This question on Open Data Stack
# Exchange <http://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages>`__
# pointed me to the open translation site http://tatoeba.org/ which has
# downloads available at http://tatoeba.org/eng/downloads - and better
# yet, someone did the extra work of splitting language pairs into
# individual text files here: http://www.manythings.org/anki/
#
# The English to French pairs are too big to include in the repo, so
# download to ``data/eng-fra.txt`` before continuing. The file is a tab
# separated list of translation pairs:
#
# ::
#
#     I am cold.    J'ai froid.
#
# .. Note::
#    Download the data from
#    `here <https://download.pytorch.org/tutorial/data.zip>`_
#    and extract it to the current directory.

######################################################################
# Similar to the character encoding used in the character-level RNN
# tutorials, we will be representing each word in a language as a one-hot
# vector, or giant vector of zeros except for a single one (at the index
# of the word). Compared to the dozens of characters that might exist in a
# language, there are many many more words, so the encoding vector is much
# larger. We will however cheat a bit and trim the data to only use a few
# thousand words per language.
#
# .. figure:: /_static/img/seq-seq-images/word-encoding.png
#    :alt:
#
#


######################################################################
# We'll need a unique index per word to use as the inputs and targets of
# the networks later. To keep track of all this we will use a helper class
# called ``Lang`` which has word → index (``word2index``) and index → word
# (``index2word``) dictionaries, as well as a count of each word
# ``word2count`` to use to later replace rare words.
#

SOS_token = 0
EOS_token = 1


# One Lang object is created for the source language and one for the
# target language. Each holds three maps: word2index (word -> id),
# index2word (id -> word) and word2count (word -> frequency).
# word2count can later be used to filter out low-frequency words, for
# example by mapping them to an "unknown" token.

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)  # register every word of the sentence

    def addWord(self, word):
        if word not in self.word2index:  # first time we see this word
            # Assign the next free index and start its count at 1
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1  # the next new word gets the next index
        else:
            self.word2count[word] += 1  # known word: bump its frequency


######################################################################
# The files are all in Unicode, to simplify we will turn Unicode
# characters to ASCII, make everything lowercase, and trim most
# punctuation.
#

# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


######################################################################
# To read the data file we will split the file into lines, and then split
# lines into pairs. The files are all English → Other Language, so if we
# want to translate from Other Language → English I added the ``reverse``
# flag to reverse the pairs.
#

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs (two tab-separated columns) and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs


######################################################################
# Since there are a *lot* of example sentences and we want to train
# something quickly, we'll trim the data set to only relatively short and
# simple sentences. Here the maximum length is 10 words (that includes
# ending punctuation) and we're filtering to sentences that translate to
# the form "I am" or "He is" etc. (accounting for apostrophes replaced
# earlier).
#

MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
    # Keep a pair only if both sentences are shorter than MAX_LENGTH words
    # and the English side starts with one of the allowed prefixes
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]


######################################################################
# The full process for preparing the data is:
#
# -  Read text file and split into lines, split lines into pairs
# -  Normalize text, filter by length and content
# -  Make word lists from sentences in pairs
#

def prepareData(lang1, lang2, reverse=False):
    # Read the lang1/lang2 pairs, reversing them if requested
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    # Number of pairs read from the file
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    # Number of pairs that pass the length and prefix filter
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


# Preprocess the data
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))  # show one random pair


######################################################################
# The Seq2Seq Model
# =================
#
# A Recurrent Neural Network, or RNN, is a network that operates on a
# sequence and uses its own output as input for subsequent steps.
#
# A `Sequence to Sequence network <http://arxiv.org/abs/1409.3215>`__, or
# seq2seq network, or `Encoder Decoder
# network <https://arxiv.org/pdf/1406.1078v3.pdf>`__, is a model
# consisting of two RNNs called the encoder and decoder. The encoder reads
# an input sequence and outputs a single vector, and the decoder reads
# that vector to produce an output sequence.
#
# .. figure:: /_static/img/seq-seq-images/seq2seq.png
#    :alt:
#
# Unlike sequence prediction with a single RNN, where every input
# corresponds to an output, the seq2seq model frees us from sequence
# length and order, which makes it ideal for translation between two
# languages.
#
# Consider the sentence "Je ne suis pas le chat noir" → "I am not the
# black cat". Most of the words in the input sentence have a direct
# translation in the output sentence, but are in slightly different
# orders, e.g. "chat noir" and "black cat". Because of the "ne/pas"
# construction there is also one more word in the input sentence. It would
# be difficult to produce a correct translation directly from the sequence
# of input words.
#
# With a seq2seq model the encoder creates a single vector which, in the
# ideal case, encodes the "meaning" of the input sequence into a single
# vector — a single point in some N dimensional space of sentences.
#


######################################################################
# The Encoder
# -----------
#
# The encoder of a seq2seq network is a RNN that outputs some value for
# every word from the input sentence. For every input word the encoder
# outputs a vector and a hidden state, and uses the hidden state for the
# next input word.
#
# .. figure:: /_static/img/seq-seq-images/encoder-network.png
#    :alt:
#
#

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        # Embedding table: one hidden_size-dimensional vector per word.
        # For example, nn.Embedding(2, 4) is a 2x4 matrix (2 words, 4
        # dimensions each); 100 words of 10 dimensions each would be
        # nn.Embedding(100, 10). The vectors are only initial values:
        # they become useful word representations as training updates them.
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)  # single-layer GRU

    def forward(self, input, hidden):
        # view() reshapes to (seq_len=1, batch=1, hidden_size); -1 is inferred
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):  # all-zero initial hidden state
        return torch.zeros(1, 1, self.hidden_size, device=device)


######################################################################
# The Decoder
# -----------
#
# The decoder is another RNN that takes the encoder output vector(s) and
# outputs a sequence of words to create the translation.
#


######################################################################
# Simple Decoder
# ^^^^^^^^^^^^^^
#
# In the simplest seq2seq decoder we use only last output of the encoder.
# This last output is sometimes called the *context vector* as it encodes
# context from the entire sequence. This context vector is used as the
# initial hidden state of the decoder.
#
# At every step of decoding, the decoder is given an input token and
# hidden state. The initial input token is the start-of-string ``<SOS>``
# token, and the first hidden state is the context vector (the encoder's
# last hidden state).
#
# .. figure:: /_static/img/seq-seq-images/decoder-network.png
#    :alt:
#
#

class DecoderRNN(nn.Module):
    # Structured much like EncoderRNN; compare with the figure above
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # view() reshapes to (seq_len=1, batch=1, hidden_size); -1 is inferred
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        # Project to the vocabulary size and take the log-softmax
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


######################################################################
# I encourage you to train and observe the results of this model, but to
# save space we'll be going straight for the gold and introducing the
# Attention Mechanism.
#


######################################################################
# Attention Decoder
# ^^^^^^^^^^^^^^^^^
#
# If only the context vector is passed between the encoder and decoder,
# that single vector carries the burden of encoding the entire sentence.
#
# Attention allows the decoder network to "focus" on a different part of
# the encoder's outputs for every step of the decoder's own outputs. First
# we calculate a set of *attention weights*. These will be multiplied by
# the encoder output vectors to create a weighted combination. The result
# (called ``attn_applied`` in the code) should contain information about
# that specific part of the input sequence, and thus help the decoder
# choose the right output words.
#
# .. figure:: https://i.imgur.com/1152PYf.png
#    :alt:
#
# Calculating the attention weights is done with another feed-forward
# layer ``attn``, using the decoder's input and hidden state as inputs.
# Because there are sentences of all sizes in the training data, to
# actually create and train this layer we have to choose a maximum
# sentence length (input length, for encoder outputs) that it can apply
# to. Sentences of the maximum length will use all the attention weights,
# while shorter sentences will only use the first few.
#
# .. figure:: /_static/img/seq-seq-images/attention-decoder-network.png
#    :alt:
#
#

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # Embed the input token, then apply dropout (which randomly zeroes
        # some elements during training)
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Learn the attention weights from the embedded input and the
        # hidden state. torch.cat concatenates along an existing dimension,
        # whereas torch.stack would create a new dimension to join on.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

        # Apply the attention weights to encoder_outputs.
        # torch.bmm multiplies two batches of matrices: for batch1 of shape
        # (b, n, m) and batch2 of shape (b, m, p) the result has shape
        # (b, n, p). Here: (1, 1, max_length) x (1, max_length, hidden_size)
        # gives (1, 1, hidden_size).
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        # Concatenate embedded and attn_applied
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        # unsqueeze(0) returns a new tensor with a size-1 dimension
        # inserted at position 0
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


######################################################################
# .. note:: There are other forms of attention that work around the length
#   limitation by using a relative position approach. Read about "local
#   attention" in `Effective Approaches to Attention-based Neural Machine
#   Translation <https://arxiv.org/abs/1508.04025>`__.
#
# Training
# ========
#
# Preparing Training Data
# -----------------------
#
# To train, for each pair we will need an input tensor (indexes of the
# words in the input sentence) and target tensor (indexes of the words in
# the target sentence). While creating these vectors we will append the
# EOS token to both sequences.
#

def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    # Look up the index of every word
    indexes = indexesFromSentence(lang, sentence)
    # Append the EOS token to mark the end of the sequence
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    # Input tensor: indexes of the words in the input sentence.
    # Target tensor: indexes of the words in the target sentence.
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)


######################################################################
# Training the Model
# ------------------
#
# To train we run the input sentence through the encoder, and keep track
# of every output and the latest hidden state. Then the decoder is given
# the ``<SOS>`` token as its first input, and the last hidden state of the
# encoder as its first hidden state.
#
# "Teacher forcing" is the concept of using the real target outputs as
# each next input, instead of using the decoder's guess as the next input.
# Using teacher forcing causes it to converge faster but `when the trained
# network is exploited, it may exhibit
# instability <http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf>`__.
#
# You can observe outputs of teacher-forced networks that read with
# coherent grammar but wander far from the correct translation -
# intuitively it has learned to represent the output grammar and can "pick
# up" the meaning once the teacher tells it the first few words, but it
# has not properly learned how to create the sentence from the translation
# in the first place.
#
# Because of the freedom PyTorch's autograd gives us, we can randomly
# choose to use teacher forcing or not with a simple if statement. Turn
# ``teacher_forcing_ratio`` up to use more of it.
#

# Probability of using teacher forcing for a given training example
teacher_forcing_ratio = 0.5


# Train on a single (input, target) pair and return the average loss
# per target token
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
    # encoder is EncoderRNN(input_lang.n_words, hidden_size) and decoder is
    # AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1),
    # with hidden_size = 256
    encoder_hidden = encoder.initHidden()

    # encoder_optimizer is optim.SGD(encoder.parameters(), lr=learning_rate)
    # decoder_optimizer is optim.SGD(decoder.parameters(), lr=learning_rate)
    # nn.Parameter is a Tensor subclass (a Variable subclass before
    # PyTorch 0.4). When a Parameter is assigned as a Module attribute it
    # is automatically added to the module's parameter list and shows up
    # in the parameters() iterator; assigning a plain tensor has no such
    # effect. This lets a module cache temporary state, such as the last
    # hidden state of an RNN, without registering it as a model parameter.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Sequence lengths
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # Buffer for the encoder outputs, zero-padded up to max_length
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Run the encoder over the input sentence one token at a time
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        # encoder_output has shape (1, 1, hidden_size), so [0, 0] selects
        # the hidden_size-dimensional vector for this step
        encoder_outputs[ei] = encoder_output[0, 0]

    # The first decoder input is the SOS token
    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            # One decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            # Accumulate the loss for this step
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            # One decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)

            # topk(1) returns the largest value and its index along the
            # last dimension as a (values, indices) pair. Results are
            # sorted in descending order by default; largest=False would
            # return the smallest elements instead.
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break
    # Backpropagate
    loss.backward()

    # Update the parameters
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length


######################################################################
# This is a helper function to print time elapsed and estimated time
# remaining given the current time and progress %.
#

import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))


######################################################################
# The whole training process looks like this:
#
# -  Start a timer
# -  Initialize optimizers and criterion
# -  Create set of training pairs
# -  Start empty losses array for plotting
#
# Then we call ``train`` many times and occasionally print the progress (%
# of examples, time so far, estimated time) and average loss.
#

def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)

    # Sample n_iters training pairs at random (with replacement)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    # Loss function: negative log-likelihood on the log-softmax outputs
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        # Train on one pair and collect its loss
        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            # Print progress (time so far, estimated time remaining,
            # iteration, % of examples) and the average loss
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    # Plot the collected losses
    showPlot(plot_losses)


######################################################################
# Plotting results
# ----------------
#
# Plotting is done with matplotlib, using the array of loss values
# ``plot_losses`` saved while training.
#

import matplotlib.pyplot as plt

plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)


######################################################################
# Evaluation
# ==========
#
# Evaluation is mostly the same as training, but there are no targets so
# we simply feed the decoder's predictions back to itself for each step.
# Every time it predicts a word we add it to the output string, and if it
# predicts the EOS token we stop there. We also store the decoder's
# attention outputs for display later.
#

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        # Convert the sentence to a tensor of word indexes
        input_tensor = tensorFromSentence(input_lang, sentence)
        # Number of input tokens (including EOS)
        input_length = input_tensor.size()[0]

        # encoder is EncoderRNN(input_lang.n_words, hidden_size) and decoder
        # is AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1),
        # with hidden_size = 256
        encoder_hidden = encoder.initHidden()

        # Buffer for the encoder outputs, zero-padded up to max_length
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        # Run the encoder over the input sentence (no learning happens here)
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        # The first decoder input is the SOS token
        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        # The decoder starts from the encoder's final hidden state
        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            # One decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)

            # Store the attention weights for this step
            decoder_attentions[di] = decoder_attention.data
            # Pick the most likely output word
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                # Map the predicted index back to a word
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]


######################################################################
# We can evaluate random sentences from the training set and print out the
# input, target, and output to make some subjective quality judgements:
#

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')


######################################################################
# Training and Evaluating
# =======================
#
# With all these helper functions in place (it looks like extra work, but
# it makes it easier to run multiple experiments) we can actually
# initialize a network and start training.
#
# Remember that the input sentences were heavily filtered. For this small
# dataset we can use relatively small networks of 256 hidden nodes and a
# single GRU layer. After about 40 minutes on a MacBook CPU we'll get some
# reasonable results.
#
# .. Note::
#    If you run this notebook you can train, interrupt the kernel,
#    evaluate, and continue training later. Comment out the lines where the
#    encoder and decoder are initialized and run ``trainIters`` again.
#

hidden_size = 256
# Encoder
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
# Decoder with the attention mechanism
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)
# Train
trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

######################################################################
# Translate a few random training sentences
evaluateRandomly(encoder1, attn_decoder1)

######################################################################
# Visualizing Attention
# ---------------------
#
# A useful property of the attention mechanism is its highly interpretable
# outputs. Because it is used to weight specific encoder outputs of the
# input sequence, we can imagine looking where the network is focused most
# at each time step.
#
# You could simply run ``plt.matshow(attentions)`` to see attention output
# displayed as a matrix, with the columns being input steps and rows being
# output steps:
#

output_words, attentions = evaluate(encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.numpy())


######################################################################
# For a better viewing experience we will do the extra work of adding axes
# and labels:

def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")
evaluateAndShowAttention("elle est trop petit .")
evaluateAndShowAttention("je ne crains pas de mourir .")
evaluateAndShowAttention("c est un jeune directeur plein de talent .")

######################################################################
# Exercises
# =========
#
# -  Try with a different dataset
#
#    -  Another language pair
#    -  Human → Machine (e.g. IOT commands)
#    -  Chat → Response
#    -  Question → Answer
#
# -  Replace the embeddings with pre-trained word embeddings such as word2vec or
#    GloVe
# -  Try with more layers, more hidden units, and more sentences. Compare
#    the training time and results.
# -  If you use a translation file where pairs have two of the same phrase
#    (``I am test \t I am test``), you can use this as an autoencoder. Try
#    this:
#
#    -  Train as an autoencoder
#    -  Save only the Encoder network
#    -  Train a new Decoder for translation from there
#
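
The data preparation steps in the script (normalize, split on tabs, reverse, filter) can be tried without PyTorch or the data file. The following is a minimal sketch using only the standard library; the function names mirror the script, while `raw_lines` is a small made-up sample standing in for `data/eng-fra.txt`, not real dataset content.

```python
import re
import unicodedata

MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def unicodeToAscii(s):
    # Decompose accented characters, then drop the combining marks (Mn)
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # collapse everything else to one space
    return s


def filterPair(p):
    # p is [french, english] after reversing
    return (len(p[0].split(' ')) < MAX_LENGTH
            and len(p[1].split(' ')) < MAX_LENGTH
            and p[1].startswith(eng_prefixes))


# Hypothetical sample in the "English<TAB>French" layout of the data file
raw_lines = [
    "I am cold.\tJ'ai froid.",
    "Go.\tVa !",
    "You're too skinny.\tVous êtes trop maigre.",
]

pairs = [[normalizeString(s) for s in line.split('\t')] for line in raw_lines]
pairs = [list(reversed(p)) for p in pairs]  # French first, English second
kept = [p for p in pairs if filterPair(p)]

for p in kept:
    print(p)
```

The apostrophe in "You're" becomes a space, which is why the prefix tuple contains entries such as `"you re "`; the "Go." pair is dropped because its English side matches none of the prefixes.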