PyTorch: Predicting the Country of Origin of a Name
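Goal: build an RNN model that, given a name, predicts which country the name belongs to.
Method: one-hot encode each letter, feed the encoded letters into the RNN one at a time, then compare the output with the target to train the network.
Data: https://download.pytorch.org/tutorial/data.zip
The archive contains names from various countries, all written with English letters.
Read the data with Python, collecting all the names of each country into a list and storing the lists in a dictionary: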
{language: [names ...]}
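To make the dictionary layout concrete, here is a minimal sketch that builds the same {language: [names ...]} structure from in-memory text instead of files. The file names and the names inside them are made-up placeholders, not taken from the real data set.

```python
import os

# Hypothetical stand-ins for two data files (path -> file content).
# Neither the language labels nor the names come from the real archive.
sample_files = {
    "data/names/LangA.txt": "Name One\nName Two\n",
    "data/names/LangB.txt": "Name Three\n",
}

category_lines = {}
for path, text in sample_files.items():
    # The language is the file name without directory or extension.
    language = os.path.splitext(os.path.basename(path))[0]
    # One name per line.
    category_lines[language] = text.strip().split("\n")

print(category_lines)
```

Each key is a language and each value is the list of names read for it, which is the shape the later training code relies on.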
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import string
import unicodedata


def findFiles(path):
    return glob.glob(path)


# List every .txt file in the names directory
print(findFiles(r'./data/data/names/*.txt'))

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # vocabulary size, used for one-hot encoding


# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )


print(unicodeToAscii('Ślusàrski'))
out:
['data/names/Russian.txt', 'data/names/Scottish.txt',
'data/names/Spanish.txt', 'data/names/Vietnamese.txt',
'data/names/Arabic.txt', 'data/names/Chinese.txt',
'data/names/Czech.txt', 'data/names/Dutch.txt',
'data/names/English.txt', 'data/names/French.txt',
'data/names/German.txt', 'data/names/Greek.txt',
'data/names/Irish.txt', 'data/names/Italian.txt',
'data/names/Japanese.txt', 'data/names/Korean.txt',
'data/names/Polish.txt', 'data/names/Portuguese.txt']
Slusarski
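The conversion of 'Ślusàrski' to 'Slusarski' works because NFD normalization splits an accented letter into a base letter plus a combining mark (Unicode category 'Mn'), and the filter then drops the mark. A small sketch using only the standard library, with variable names chosen for illustration:

```python
import unicodedata

# NFD splits an accented letter into a base letter and a combining mark.
decomposed = unicodedata.normalize("NFD", "Ś")

# Pair every resulting code point with its Unicode category.
parts = [(c, unicodedata.category(c)) for c in decomposed]
print(parts)

# Dropping the 'Mn' (nonspacing mark) code points leaves the plain letter.
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(stripped)
```

The first element is the uppercase letter S (category 'Lu'); the second is the combining acute accent (category 'Mn'), which is the part that gets removed.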
2: One-hot encode the data at the letter level:
import os

import torch

# Build the category_lines dictionary, a list of names per language
category_lines = {}  # {language: [names ...]}
all_categories = []


# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]


for filename in findFiles(r'./data/data/names/*.txt'):
    # File name without directory or extension, e.g. "Russian";
    # os.path copes with both "/" and "\" separators.
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)
print(n_categories)


# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)


# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)  # n_letters = len(all_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor


# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor  # one one-hot row per letter


print(letterToTensor('J'))
print(lineToTensor('Jones').size())
out:
Columns 0 to 12
 0 0 0 0 0 0 0 0 0 0 0 0 0
Columns 13 to 25
 0 0 0 0 0 0 0 0 0 0 0 0 0
Columns 26 to 38
 0 0 0 0 0 0 0 0 0 1 0 0 0
Columns 39 to 51
 0 0 0 0 0 0 0 0 0 0 0 0 0
Columns 52 to 56
 0 0 0 0 0
[torch.FloatTensor of size 1x57]
torch.Size([5, 1, 57])
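The numbers in this output can be checked without PyTorch: the vocabulary has 57 symbols (52 ASCII letters plus five punctuation characters), and 'J' sits at index 35, which is where the single 1 appears. The sketch below is a torch-free illustration; the one_hot helper is hypothetical and not part of the article's code.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)


def one_hot(letter):
    """Torch-free one-hot vector for a single letter (illustration only)."""
    index = all_letters.find(letter)
    if index < 0:
        # str.find returns -1 for unknown characters; indexing with -1 would
        # silently set the last slot, so this sketch raises instead.
        raise ValueError("letter not in all_letters: %r" % letter)
    vec = [0] * n_letters
    vec[index] = 1
    return vec


vec = one_hot("J")
print(n_letters)     # 57
print(vec.index(1))  # 35
```

A five-letter name such as 'Jones' therefore becomes five such rows, which matches the reported size of 5 x 1 x 57 (the middle dimension is a batch size of 1).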
3: Build the RNN. The overall framework resembles the earlier CNN network, except that the internal structure of the RNN has to be designed explicitly.
RNN structure:
import torch.nn as nn
from torch.autograd import Variable  # since PyTorch 0.4 a Variable is just a Tensor


class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # Both layers read the current input concatenated with the previous hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)  # log-probabilities over the categories
        return output, hidden

    def initHidden(self):
        return Variable(torch.zeros(1, self.hidden_size))


n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

input = Variable(lineToTensor('Albert'))
hidden = Variable(torch.zeros(1, n_hidden))

# Feed only the first letter ('A') through one step of the network
output, next_hidden = rnn(input[0], hidden)
print(output)
# print(next_hidden)
out:
Variable containing:
Columns 0 to 9
-2.9346 -2.9036 -2.9996 -2.8229 -2.9089 -2.7909 -2.8781 -2.8332 -2.8440 -2.8522
Columns 10 to 17
-3.0306 -2.8079 -2.9677 -2.9351 -2.8750 -2.9376 -2.7807 -2.9693
[torch.FloatTensor of size 1x18]
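These 18 values are log-probabilities, one per category, because the last layer is a LogSoftmax. For an untrained network they should sit close to log(1/18), and exponentiating them should give probabilities that sum to roughly 1. A quick check with the printed values, using only the standard library:

```python
import math

# Log-probabilities copied from the printed output above (18 categories).
log_probs = [
    -2.9346, -2.9036, -2.9996, -2.8229, -2.9089, -2.7909,
    -2.8781, -2.8332, -2.8440, -2.8522, -3.0306, -2.8079,
    -2.9677, -2.9351, -2.8750, -2.9376, -2.7807, -2.9693,
]

# exp() turns log-probabilities back into probabilities.
probs = [math.exp(v) for v in log_probs]
total = sum(probs)

# Log-probability each category would get under a perfectly uniform guess.
uniform = math.log(1.0 / len(log_probs))

print(total)    # close to 1.0, up to rounding of the printed values
print(uniform)  # about -2.89
```

All printed values lie within about 0.15 of the uniform value, which is what one would expect before any training has happened.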
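4: Train the network
def categoryFromOutput(output):
    # topk works like max: it returns the largest value and its index
    top_n, top_i = output.data.topk(1)  # Tensor out of Variable with .data
    category_i = int(top_i[0][0])  # plain Python int, usable as a list index
    return all_categories[category_i], category_i


print(categoryFromOutput(output))
The function above finds which entry of output has the highest probability and returns the corresponding category together with its index.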
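A short sketch of what topk(1) returns may help here. The tensor below holds made-up log-probabilities for three categories; it is not output from the trained model.

```python
import torch

# Hypothetical log-probabilities for three categories (one row = one sample).
log_probs = torch.tensor([[-2.0, -0.5, -3.0]])

# topk(1) returns the largest value in each row together with its column index.
top_value, top_index = log_probs.topk(1)

# Both results have shape (1, 1); .item() extracts a plain Python number.
best_index = top_index[0][0].item()
best_value = top_value[0][0].item()

print(best_index, best_value)
```

The returned index is what gets used to look up the language name in all_categories.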
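Steps for training the network:
》Create the input and target tensors
》Create and initialize the hidden state
》Read the letters one by one, passing the hidden state from each RNN step to the next
》Compare the final output with the target
》Back-propagate and update the parameters with the gradients
》Return the output and the loss.
Code: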
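The first step depends on a helper named randomTrainingExample, which the training loop calls but which is not defined anywhere in this post. The following is a minimal sketch of what such a helper could look like, assuming it picks a random language and a random name and returns them with their tensors. The stand-in data and its placeholder names exist only so the block runs on its own; they are not from the real data set.

```python
import random
import string

import torch

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Stand-in data so the block runs on its own; the names are placeholders.
# Drop these two assignments when the real dictionary is already in scope.
category_lines = {"LangA": ["Abc", "Defg"], "LangB": ["Hij"]}
all_categories = list(category_lines)


def lineToTensor(line):
    # Same one-hot layout as in the article: <line_length x 1 x n_letters>
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor


def randomTrainingExample():
    # Pick a random language, then a random name from that language.
    category = random.choice(all_categories)
    line = random.choice(category_lines[category])
    # NLLLoss expects the target as a LongTensor of class indices.
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor


category, line, category_tensor, line_tensor = randomTrainingExample()
print(category, line, category_tensor, line_tensor.size())
```

Sampling the language first and the name second means every language is drawn equally often, regardless of how many names its file contains.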
import math
import time

criterion = nn.NLLLoss()
learning_rate = 0.005  # If you set this too high, it might explode. If too low, it might not learn


def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    rnn.zero_grad()

    # Feed the name one letter at a time, carrying the hidden state forward
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    # Only the output after the last letter is compared with the target
    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(-learning_rate * p.grad.data)

    # loss.item() replaces loss.data[0], which fails on PyTorch 0.5 and later
    return output, loss.item()


n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []


def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = 'yes' if guess == category else 'no (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (
            iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
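The criterion used here, nn.NLLLoss, expects log-probabilities as input. For a single sample it reduces to the negated log-probability at the target index, so a confident correct prediction yields a small loss and a confident wrong one a large loss. A torch-free sketch of that idea, where the nll helper and the probabilities are made up for illustration:

```python
import math


def nll(log_probs, target):
    """Negative log-likelihood of one sample: minus the log-probability of the target."""
    return -log_probs[target]


# Hypothetical predicted probabilities for three categories.
probs = [0.7, 0.2, 0.1]
log_probs = [math.log(p) for p in probs]

loss_if_target_0 = nll(log_probs, 0)  # target is the most likely category
loss_if_target_2 = nll(log_probs, 2)  # target is the least likely category

print(loss_if_target_0, loss_if_target_2)
```

This also explains why the loss starts near 2.89 for 18 categories: an untrained model assigns roughly 1/18 to each one, and the negated logarithm of 1/18 is about 2.89.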
There is also code for testing the model; to be continued.