6.4 Implementing the Skip-gram Model in PyTorch

Posted 2023-02-14 王小小小草


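You are welcome to subscribe to this column: "PyTorch Deep Learning Practice"
Subscription link: https://blog.csdn.net/sinat_33761963/category_9720080.html

Chapter 2: Get to know Tensor types, creation, storage, and APIs. A solid grasp of Tensors is the most important foundation for deep learning practice with PyTorch.
Chapter 3: Learn how PyTorch reads various kinds of external data.
Chapter 4: Build, train, and evaluate a model end to end in PyTorch, to understand each step of implementing a model and the modules and methods involved.
Chapter 5: Learn the three ways PyTorch provides for building model architectures.
Chapter 6: Implement simple and classic models end to end in PyTorch: simple binary classification, handwritten character recognition, word vectors, and autoencoders.
Chapter 7: Implement complex models in PyTorch: machine translation (NLP), generative adversarial networks (GAN), reinforcement learning (RL), and style transfer (CV).
Chapter 8: Other advanced PyTorch usage: migrating models between frameworks, visualization, and multi-GPU parallel computation.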

Skip-gram is a classic word-embedding model: trained on a large corpus, it yields an embedding vector for every word.

Related papers:

“Distributed Representations of Sentences and Documents”
“Efficient Estimation of Word Representations in Vector Space”

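Related blog posts:

Searching the web for "word2vec" or "skipgram" turns up plenty of material.
For a systematic introduction to word vectors, see my blog post: https://blog.csdn.net/sinat_33761963/article/details/53521149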

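Now let's see how to implement Skip-gram in PyTorch, train it, and extract the word vectors.

6.4.1 Preparing the Data

Prepare the corpus: A corpus is usually stored in a text file, one sentence per line. Read it into Python as a list in which each element is one line of the text, then tokenize each line. English can simply be split on spaces, while Chinese requires a word segmenter. The result is corpus_list in the code below.

Build the vocabulary: This step is indispensable. Every word must be converted to a numeric index that the model can work with, and the indices the model outputs must be converted back to words so that people can read them. So two dicts are needed, one mapping index to word and one mapping word to index: ix2word and word2ix in the code.

Build training pairs: Besides being used to build the vocabulary, the corpus must be preprocessed into training pairs (x, y) that the model can consume. Skip-gram takes a center word as input and predicts its context words, so each pair is (center_word, context_word), converted to index form (center_word_ix, context_word_ix).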

import torch
import torch.nn as nn
import torch.nn.functional as F

# Prepare the corpus
corpus = ['he is a king',
          'she is a queen',
          'he is a man',
          'she is a woman',
          'warsaw is poland capital',
          'berlin is germany capital',
          'paris is france capital']
corpus_list = [sentence.split() for sentence in corpus]


# Build the vocabulary
word2ix = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word2ix:
            word2ix[word] = len(word2ix)  # assign each new word the next index
ix2word = {v: k for k, v in word2ix.items()}  # swap the dict's keys and values
voc_size = len(word2ix)


# Build the training pairs
WINDOWS = 2  # words within this many positions on either side count as context words
pairs = []  # holds the (center, context) index pairs

for sentence in corpus_list:
    for center_word_index in range(len(sentence)):
        center_word_ix = word2ix[sentence[center_word_index]]
        for win in range(-WINDOWS, WINDOWS+1):
            context_word_index = center_word_index + win
            # keep positions inside the sentence and skip the center word itself
            if 0 <= context_word_index <= len(sentence)-1 and context_word_index != center_word_index:
                context_word_ix = word2ix[sentence[context_word_index]]
                pairs.append((center_word_ix, context_word_ix))
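
To make the window loop concrete, here is a minimal sketch that wraps the same logic in a helper and prints word-level pairs for a single sentence. The helper name build_pairs and its window argument are hypothetical, introduced only for this illustration; they are not part of the original code.

```python
def build_pairs(tokens, window=2):
    """Return (center, context) word pairs within `window` positions."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        # Clip the window to the sentence boundaries.
        lo = max(0, center_pos - window)
        hi = min(len(tokens), center_pos + window + 1)
        for context_pos in range(lo, hi):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, tokens[context_pos]))
    return pairs


sentence_pairs = build_pairs('he is a king'.split(), window=2)
for center, context in sentence_pairs:
    print(center, '->', context)
```

A four-word sentence with a window of 2 yields 10 pairs, so the seven four-word sentences in the corpus above give 70 training pairs in total.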

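6.4.2 Building the SkipGram Network

x: The input x is a one-hot vector of size voc_dim, [0, 0, …, 1, …, 0].

Embedding matrix: Build an embedding matrix of size (emb_dim, voc_dim), where emb_dim is the dimension of the word vectors and voc_dim is the size of the vocabulary. The values in this matrix are what we ultimately want to learn, so it is a parameter and can be created with nn.Parameter().
Multiplying this matrix by x gives a vector of size emb_dim, which is the variable emb in the forward function.

Linear computation: Multiplying the parameter W by emb is a linear computation that outputs a vector of size voc_dim.

Softmax: The output of the linear computation then passes through a softmax. The size stays the same, but the values become probabilities between 0 and 1, i.e. the probability of each word in the vocabulary being a context word of the input x. The code returns the logarithm of these probabilities (log_softmax), which is what NLLLoss expects.

That completes the forward pass.

Note: torch.nn.init.xavier_normal_ (the in-place successor to the deprecated xavier_normal) is one way to initialize parameters; it keeps them from being too large or too small, which would hinder training.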
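
The forward pass described above can be written compactly as follows. The symbols are notation chosen for this sketch rather than taken from the original: V stands for voc_dim, d for emb_dim, E for the embedding matrix, c for the index of the center word, and o for the index of a context word.

```latex
\begin{aligned}
e &= E x = E_{:,c}, & E &\in \mathbb{R}^{d \times V} \\
h &= W e, & W &\in \mathbb{R}^{V \times d} \\
\log p(o \mid c) &= h_o - \log \sum_{j=1}^{V} \exp(h_j) \\
\mathcal{L}(c, o) &= -\log p(o \mid c)
\end{aligned}
```

The last line is the per-pair loss that results from applying a negative log-likelihood loss to the log-softmax output.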

class SkipGram(nn.Module):
    def __init__(self, voc_dim, emb_dim):
        super(SkipGram, self).__init__()
        # Create the parameters (uninitialized), then fill them in place
        self.embedding_matrix = nn.Parameter(torch.empty(emb_dim, voc_dim))
        self.W = nn.Parameter(torch.empty(voc_dim, emb_dim))
        torch.nn.init.xavier_normal_(self.embedding_matrix)
        torch.nn.init.xavier_normal_(self.W)

    def forward(self, x):
        emb = torch.matmul(self.embedding_matrix, x)  # [emb_dim]
        h = torch.matmul(self.W, emb)  # [voc_dim]
        log_softmax = F.log_softmax(h, dim=0)  # [voc_dim]

        return log_softmax
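
Multiplying the embedding matrix by a one-hot vector simply selects one column, which is the lookup that nn.Embedding performs directly. The sketch below checks this on a random stand-in matrix; the sizes and the index word_ix are arbitrary values chosen for illustration, not values from the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
voc_dim, emb_dim = 15, 5

# Stand-in for the embedding matrix: one column per word.
embedding_matrix = torch.randn(emb_dim, voc_dim)

word_ix = 3  # arbitrary word index for the demonstration
one_hot = torch.zeros(voc_dim)
one_hot[word_ix] = 1.0

via_matmul = torch.matmul(embedding_matrix, one_hot)  # [emb_dim]
via_column = embedding_matrix[:, word_ix]             # [emb_dim]

# nn.Embedding stores one row per word, so load the transpose.
lookup = nn.Embedding.from_pretrained(embedding_matrix.t().contiguous())
via_lookup = lookup(torch.tensor(word_ix))            # [emb_dim]

print(torch.allclose(via_matmul, via_column))
print(torch.allclose(via_matmul, via_lookup))
```

For a toy vocabulary the one-hot product is easy to follow; with a large vocabulary an index lookup avoids building a vector of vocabulary size for every training pair.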

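6.4.3 Training

The usual routine again.

Note that this is a small example for demonstration, so epoch and embedding_dim are both set very small. In real training, set them according to the results you observe; embedding_dim is commonly set to around 100.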

# Set the hyperparameters up front
epoch = 10
lr = 1e-2
embedding_dim = 5

# Model, optimizer, loss
model = SkipGram(voc_size, embedding_dim)
optim = torch.optim.Adam(model.parameters(), lr=lr)
loss_f = torch.nn.NLLLoss()

# Turn an index into a one-hot vector of vocabulary size
def get_onehot_vector(ix):
    one_hot_vec = torch.zeros(voc_size).float()
    one_hot_vec[ix] = 1.0
    return one_hot_vec

# Training loop
for e in range(epoch):
    epoch_loss = 0

    for i, (center_ix, context_ix) in enumerate(pairs):
        optim.zero_grad()

        # Put the data into the shapes the model and loss expect
        one_hot_vec = get_onehot_vector(center_ix)
        y_true = torch.tensor([context_ix], dtype=torch.long)

        # Forward pass
        y_pred = model(one_hot_vec)
        loss = loss_f(y_pred.view(1, -1), y_true)

        # Backward pass
        loss.backward()
        epoch_loss += loss.item()

        # Update the parameters
        optim.step()

    if e % 2 == 0:
        print('epoch: %d, loss: %f' % (e, epoch_loss))

epoch: 0, loss: 190.530474
epoch: 2, loss: 178.796980
epoch: 4, loss: 153.809181
epoch: 6, loss: 134.125972
epoch: 8, loss: 126.447176
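
Two quick checks help in reading these loss values. The sketch below first confirms that NLLLoss applied to log_softmax output matches CrossEntropyLoss applied to raw scores, and then computes the summed loss that a uniform guess over the vocabulary would give. The logits and target are random stand-ins chosen for illustration; the counts of 15 words and 70 pairs are derived from the toy corpus above.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
voc_size = 15   # distinct words in the toy corpus
num_pairs = 70  # 7 sentences x 10 pairs each with a window of 2

logits = torch.randn(1, voc_size)  # stand-in raw scores h for one pair
target = torch.tensor([4])         # stand-in context-word index

# NLLLoss on log-probabilities equals CrossEntropyLoss on raw scores.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
ce = nn.CrossEntropyLoss()(logits, target)
print(torch.allclose(nll, ce))

# Summed loss over one epoch if every word were predicted with probability 1/15.
baseline = num_pairs * math.log(voc_size)
print(round(baseline, 2))
```

The uniform baseline comes to about 189.6, close to the loss printed for epoch 0, which suggests the freshly initialized model starts near a uniform guess and improves from there.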

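6.4.4 Prediction

With training finished, we have the embedding matrix model.embedding_matrix, in which each column is the embedding vector of one word. The vectors can be retrieved as follows.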

# Look up the word vectors and compute their cosine similarity
v1 = torch.matmul(model.embedding_matrix, get_onehot_vector(word2ix['he']))
v2 = torch.matmul(model.embedding_matrix, get_onehot_vector(word2ix['she']))
v3 = torch.matmul(model.embedding_matrix, get_onehot_vector(word2ix['capital']))

print(v1)
print(v2)
print(v3)

s_v1_v2 = F.cosine_similarity(v1, v2, dim=0)
s_v1_v3 = F.cosine_similarity(v1, v3, dim=0)
print(s_v1_v2)
print(s_v1_v3)

tensor([ 0.7496, -1.2529, -1.1052,  0.3301, -0.9289], grad_fn=<MvBackward>)
tensor([ 0.2016, -1.9385, -0.7472,  0.0589, -0.8677], grad_fn=<MvBackward>)
tensor([-1.1139, -0.1073,  0.2193,  1.3546, -0.8456], grad_fn=<MvBackward>)
tensor(0.8998, grad_fn=<DivBackward1>)
tensor(0.0710, grad_fn=<DivBackward1>)
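
Pairwise comparisons like the ones above extend naturally to ranking every word in the vocabulary against a query word. The following is a minimal sketch under stated assumptions: the helper most_similar is hypothetical, and the toy matrix holds hand-picked values rather than trained vectors, so only the mechanics carry over.

```python
import torch
import torch.nn.functional as F


def most_similar(word, embedding_matrix, word2ix, ix2word, topk=3):
    """Rank the other words by cosine similarity to `word`.

    `embedding_matrix` has shape [emb_dim, voc_size], one column per word.
    """
    vectors = embedding_matrix.detach().t()              # [voc_size, emb_dim]
    query = vectors[word2ix[word]].unsqueeze(0)          # [1, emb_dim]
    scores = F.cosine_similarity(query, vectors, dim=1)  # [voc_size]
    scores[word2ix[word]] = float('-inf')                # exclude the query word
    best = torch.topk(scores, k=topk).indices.tolist()
    return [(ix2word[i], scores[i].item()) for i in best]


# Toy stand-in values, NOT trained vectors.
toy_word2ix = {'he': 0, 'she': 1, 'capital': 2}
toy_ix2word = {v: k for k, v in toy_word2ix.items()}
toy_matrix = torch.tensor([[1.0, 0.9, -1.0],
                           [0.5, 0.6, 0.2]])

neighbours = most_similar('he', toy_matrix, toy_word2ix, toy_ix2word, topk=2)
print(neighbours)
```

With the trained model, the same helper could be called as most_similar('he', model.embedding_matrix, word2ix, ix2word).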
