Theano Learning Guide: Recurrent Neural Networks with Word Embeddings (Translation)

Posted 2020-08-09 蓝色荣誉

Feel free to fork my GitHub repository: https://github.com/zhaoyu611/DeepLearningTutorialForChinese

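I have been learning Git recently, so this is a good chance to put it into practice. After reading about the principles of deep learning I had a rough understanding, but working through the Theano code myself makes it stick better, so I rewrote the code from deeplearning.net. The comments are translations of the original text plus my own understanding. Anyone interested is welcome to join this effort. If you have questions, feel free to contact me. Email: [email protected] QQ: 3062984605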

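Overview

In this tutorial, you will learn about:

Word embeddings
Recurrent neural network architectures
Context windows

in order to perform Semantic Parsing / Slot-Filling (Spoken Language Understanding).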

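Code - Citations - Contact

Code

The experiment code is available in the github repository.

Papers

If you use this tutorial, please cite the following papers: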

Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio. Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding. Interspeech, 2013.
Gokhan Tur, Dilek Hakkani-Tur and Larry Heck. What is left to be understood in ATIS?
Christian Raymond and Giuseppe Riccardi. Generative and discriminative algorithms for spoken language understanding. Interspeech, 2007.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

Thank you!

Contact

Please email Grégoire Mesnil (first-add-a-dot-last-add-at-gmail-add-a-dot-com) with any questions. We would be glad to receive your feedback.

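Task

Slot-Filling (Spoken Language Understanding) consists of assigning a label to each word in a given sentence. It is thus a classification task.

Dataset

The dataset is a small one from DARPA: ATIS (Airline Travel Information System). The example sentence below is annotated with the Inside Outside Beginning (IOB) representation.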

input(words)	show	flights	from	Boston	to	New	York	today
Output(labels)	O	O	O	B-dept	O	B-arr	I-arr	B-date

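The official ATIS split contains 4,978 training sentences and 893 test sentences, for a total of 56,590 and 9,198 words respectively (average sentence length is 15). The number of classes (different slots) is 128, including the O label (NULL).
Following the approach of Microsoft Research, words that occur only once in the training set are marked as <UNK>, and the same token represents test-set words that were never seen in training. Following Ronan Collobert and colleagues, each digit in a number is replaced with the string DIGIT, e.g. 1984 becomes DIGITDIGITDIGITDIGIT.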
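The two preprocessing rules above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names `digits_to_token` and `preprocess` and the toy sentences are hypothetical, and the actual ATIS files shipped with the tutorial may already have been preprocessed differently.

```python
import re
from collections import Counter


def digits_to_token(word):
    # Replace every digit character with the string DIGIT,
    # so "1984" becomes "DIGITDIGITDIGITDIGIT".
    return re.sub(r"[0-9]", "DIGIT", word)


def preprocess(train_sentences, test_sentences):
    """Apply the DIGIT rule, then map rare or unseen words to <UNK>."""
    train = [[digits_to_token(w) for w in s] for s in train_sentences]
    test = [[digits_to_token(w) for w in s] for s in test_sentences]

    # Words seen more than once in the training set form the vocabulary.
    counts = Counter(w for s in train for w in s)
    vocab = {w for w, c in counts.items() if c > 1}

    def mark_unknown(sentences):
        return [[w if w in vocab else "<UNK>" for w in s] for s in sentences]

    return mark_unknown(train), mark_unknown(test)


# Toy data (hypothetical, not taken from ATIS).
train_out, test_out = preprocess(
    [["flights", "to", "boston"], ["flights", "to", "denver"]],
    [["flights", "to", "miami"]],
)
```

In this toy run, "boston" and "denver" each occur once in training and "miami" never does, so all three end up as the same `<UNK>` token.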
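We split the official training set into a training set and a validation set containing 80% and 20% of the training sentences respectively. Because the dataset is small, a difference in F1 must exceed 0.6% to be significant at the 95% level. For evaluation, the experiments use three metrics: precision, recall, and F1 score.

Model performance is measured with the conlleval script.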

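Recurrent Neural Network Model

Raw Input Encoding

A token corresponds to a word. Each token in the ATIS vocabulary is associated with an index. Each sentence is an array of indexes (int32), and each set (training, validation, test) is a list of such arrays. A Python dictionary maps indexes back to words.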

>>> sentence
array([383, 189,  13, 193, 208, 307, 195, 502, 260, 539,
        7,  60,  72, 8, 350, 384], dtype=int32)
>>> [index2word[i] for i in sentence]
['please', 'find', 'a', 'flight', 'from', 'miami', 'florida',
        'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT', "o'clock", 'pm']

The same approach is used for the labels:

>>> labels
array([126, 126, 126, 126, 126,  48,  50, 126,  78, 123,  81, 126,  15,
        14,  89,  89], dtype=int32)
>>> [index2label[i] for i in labels]
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'B-fromloc.state_name',
        'O', 'B-toloc.city_name', 'I-toloc.city_name', 'B-toloc.state_name',
        'O', 'B-arrive_time.time_relative', 'B-arrive_time.time',
        'I-arrive_time.time', 'I-arrive_time.time']
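As a rough illustration of this encoding, the sketch below builds the two dictionaries for a toy vocabulary and converts a sentence to an int32 array and back. The vocabulary, the helper names `encode` and `decode`, and the choice of index 0 for `<UNK>` are assumptions made for the example; the real ATIS dictionaries use their own indexes.

```python
import numpy as np

# Toy vocabulary (hypothetical); the real ATIS dictionary has its own indexes.
word2index = {"<UNK>": 0, "flights": 1, "from": 2, "boston": 3}
index2word = {i: w for w, i in word2index.items()}


def encode(words):
    # Unknown words fall back to the <UNK> index.
    unk = word2index["<UNK>"]
    return np.array([word2index.get(w, unk) for w in words], dtype=np.int32)


def decode(indexes):
    return [index2word[int(i)] for i in indexes]


sentence = encode(["flights", "from", "boston", "today"])
words = decode(sentence)
```

Here "today" is not in the toy vocabulary, so it is encoded as the `<UNK>` index and decoded back as `<UNK>`.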

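Context Window

Given a sentence (an array of indexes) and a window size (1, 3, 5, ...), we need to convert each word in the sentence into a context window containing the words around it. This is implemented as follows: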

def contextwin(l, win):
    '''
    win :: int corresponding to the size of the window
    given a list of indexes composing a sentence

    l :: array containing the word indexes

    it will return a list of list of indexes corresponding
    to context windows surrounding each word in the sentence
    '''
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    # pad both ends with the PADDING index -1
    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out

The PADDING index -1 is inserted at the beginning and end of the sentence.
Here is an example:

>>> x
array([0, 1, 2, 3, 4], dtype=int32)
>>> contextwin(x, 3)
[[-1, 0, 1],
 [ 0, 1, 2],
 [ 1, 2, 3],
 [ 2, 3, 4],
 [ 3, 4,-1]]
>>> contextwin(x, 7)
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1,  0, 1, 2, 3, 4],
 [-1,  0,  1, 2, 3, 4,-1],
 [ 0,  1,  2, 3, 4,-1,-1],
 [ 1,  2,  3, 4,-1,-1,-1]]

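To summarize, the input is an array of indexes and the output is a matrix of indexes. Each row is the context window of one word.

Word Embeddings

Now that the sentence has been converted into context windows (a matrix of indexes), the next step is to map these indexes to word embeddings. Using Theano, the code is: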

import theano, numpy
from theano import tensor as T

# nv :: size of our vocabulary
# de :: dimension of the embedding space
# cs :: context window size (7, to match the window used in the example below)
nv, de, cs = 1000, 50, 7

# one extra row at the end for PADDING (index -1 selects the last row)
embeddings = theano.shared(0.2 * numpy.random.uniform(-1.0, 1.0,
    (nv+1, de)).astype(theano.config.floatX))

# as many columns as words in the context window and as many rows as words in the sentence
idxs = T.imatrix()
x = embeddings[idxs].reshape((idxs.shape[0], de*cs))

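The symbolic variable x is a matrix of shape (number of words in the sentence, embedding dimension × context window size).
Let's compile a Theano function to compute it: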

>>> sample
array([0, 1, 2, 3, 4], dtype=int32)
>>> csample = contextwin(sample, 7)
>>> csample
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1,  0, 1, 2, 3, 4],
 [-1,  0,  1, 2, 3, 4,-1],
 [ 0,  1,  2, 3, 4,-1,-1],
 [ 1,  2,  3, 4,-1,-1,-1]]
>>> f = theano.function(inputs=[idxs], outputs=x)
>>> f(csample)
array([[-0.08088442,  0.08458307,  0.05064092, ...,  0.06876887,
        -0.06648078, -0.15192257],
       [-0.08088442,  0.08458307,  0.05064092, ...,  0.11192625,
         0.08745284,  0.04381778],
       [-0.08088442,  0.08458307,  0.05064092, ..., -0.00937143,
         0.10804889,  0.1247109 ],
       [ 0.11038255, -0.10563177, -0.18760249, ..., -0.00937143,
         0.10804889,  0.1247109 ],
       [ 0.18738101,  0.14727569, -0.069544  , ..., -0.00937143,
         0.10804889,  0.1247109 ]], dtype=float32)
>>> f(csample).shape
(5, 350)

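We now have a sequence (of length 5, the length of the sentence) of context window word embeddings, which is easy to feed to a simple recurrent neural network.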
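For readers without a Theano installation, the same lookup-and-reshape step can be sketched in plain NumPy. This is an equivalent illustration, not the tutorial's code: the random generator and seed are arbitrary, and `contextwin` is repeated here only to keep the block self-contained.

```python
import numpy as np


def contextwin(l, win):
    # Same logic as the tutorial's contextwin: pad with -1 and slide a window.
    assert win % 2 == 1 and win >= 1
    l = list(l)
    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    return [lpadded[i:i + win] for i in range(len(l))]


rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
nv, de, cs = 1000, 50, 7                # vocabulary size, embedding dim, window size
embeddings = 0.2 * rng.uniform(-1.0, 1.0, (nv + 1, de))  # last row is PADDING

idxs = np.asarray(contextwin([0, 1, 2, 3, 4], cs), dtype=np.int32)  # shape (5, 7)
# Integer-array indexing gives shape (5, 7, 50); each row is then flattened.
x = embeddings[idxs].reshape((idxs.shape[0], de * cs))
```

Because NumPy (like Theano) interprets index -1 as the last row, the padding positions pick up the extra embedding added at the end of the matrix, which is why the first rows of the output above start with identical values.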

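Elman Recurrent Neural Network

The Elman recurrent neural network (E-RNN) takes as input the current input (time t) and the previous hidden state (time t-1), then repeats this step for every time step.
In the previous section, we arranged the input to fit this sequential structure: in the matrix above, row 0 corresponds to time step t=0, row 1 to time step t=1, and so on.
The parameters of the E-RNN to be learned are:

the word embeddings (real-valued matrix)
the initial hidden state (real-valued vector)
two matrices for the linear projection of the input at time t and of the hidden state at time t-1
(optional) bias. Recommendation: do not use it
a softmax classification layer on top

The hyperparameters of the network are:

dimension of the word embeddings
size of the vocabulary
number of hidden units
number of classes
random seed used to initialize the model

The code is as follows: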

class RNNSLU(object):
    ''' elman neural net model '''
    def __init__(self, nh, nc, ne, de, cs):
        '''
        nh :: dimension of the hidden layer
        nc :: number of classes
        ne :: number of word embeddings in the vocabulary
        de :: dimension of the word embeddings
        cs :: word window context size
        '''
        # parameters of the model
        self.emb = theano.shared(name='embeddings',
                                 value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                 (ne+1, de))
                                 # add one for padding at the end
                                 .astype(theano.config.floatX))
        self.wx = theano.shared(name='wx',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (de * cs, nh))
                                .astype(theano.config.floatX))
        self.wh = theano.shared(name='wh',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (nh, nh))
                                .astype(theano.config.floatX))
        self.w = theano.shared(name='w',
                               value=0.2 * numpy.random.uniform(-1.0, 1.0,
                               (nh, nc))
                               .astype(theano.config.floatX))
        self.bh = theano.shared(name='bh',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))
        self.b = theano.shared(name='b',
                               value=numpy.zeros(nc,
                               dtype=theano.config.floatX))
        self.h0 = theano.shared(name='h0',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))

        # bundle
        self.params = [self.emb, self.wx, self.wh, self.w,
                       self.bh, self.b, self.h0]

The following code then builds the input from the embedding matrix:

        idxs = T.imatrix()
        x = self.emb[idxs].reshape((idxs.shape[0], de*cs))
        y_sentence = T.ivector('y_sentence')  # labels

We then call the scan operator to build the recurrence, which works like a charm:

        def recurrence(x_t, h_tm1):
            h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                                 + T.dot(h_tm1, self.wh) + self.bh)
            s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
            return [h_t, s_t]

        [h, s], _ = theano.scan(fn=recurrence,
                                sequences=x,
                                outputs_info=[self.h0, None],
                                n_steps=x.shape[0])

        p_y_given_x_sentence = s[:, 0, :]
        y_pred = T.argmax(p_y_given_x_sentence, axis=1)
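To make the recurrence concrete, here is a plain NumPy sketch of the same forward pass, with an explicit Python loop in place of scan. It mirrors the sigmoid hidden layer and softmax output of the code above, but the function name `elman_forward`, the toy dimensions, and the random initial values are assumptions chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def elman_forward(x, wx, wh, w, bh, b, h0):
    """Return the class probabilities for every time step (row) of x."""
    h_tm1 = h0
    probs = []
    for x_t in x:  # one row of x per time step
        h_t = sigmoid(x_t @ wx + h_tm1 @ wh + bh)
        probs.append(softmax(h_t @ w + b))
        h_tm1 = h_t  # the hidden state is carried to the next step
    return np.array(probs)


# Toy dimensions (hypothetical): 5 words, 12 inputs, 8 hidden units, 4 classes.
rng = np.random.default_rng(0)
n_words, n_in, nh, nc = 5, 12, 8, 4
x = rng.normal(size=(n_words, n_in))
wx = 0.2 * rng.uniform(-1.0, 1.0, (n_in, nh))
wh = 0.2 * rng.uniform(-1.0, 1.0, (nh, nh))
w = 0.2 * rng.uniform(-1.0, 1.0, (nh, nc))
p = elman_forward(x, wx, wh, w, np.zeros(nh), np.zeros(nc), np.zeros(nh))
y_pred = p.argmax(axis=1)  # one predicted label per word
```

Each row of `p` is a probability distribution over the classes, so every row sums to 1 and `y_pred` contains one label index per word in the sentence.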

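Theano then computes all the gradients automatically, so that we can maximize the log-likelihood (that is, minimize the negative log-likelihood):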

        # requires: from collections import OrderedDict
        lr = T.scalar('lr')

        sentence_nll = -T.mean(T.log(p_y_given_x_sentence)
                               [T.arange(x.shape[0]), y_sentence])
        sentence_gradients = T.grad(sentence_nll, self.params)
        sentence_updates = OrderedDict((p, p - lr*g)
                                       for p, g in
                                       zip(self.params, sentence_gradients))
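The indexing expression in the loss can look cryptic at first. The following NumPy sketch shows what it selects, using a made-up probability matrix for a two-word sentence with three classes; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical predicted probabilities: one row per word, one column per class.
p_y_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])
y = np.array([0, 1])  # true class index of each word

# Pairing row i with column y[i] picks the probability assigned
# to the correct class of each word.
picked = np.log(p_y_given_x)[np.arange(len(y)), y]
nll = -np.mean(picked)  # mean negative log-likelihood over the sentence
```

Here `picked` holds log(0.7) and log(0.8), so the loss is the average of their negatives. The closer the probability of each correct class is to 1, the closer the loss is to 0.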

Next, compile the functions:

        self.classify = theano.function(inputs=[idxs], outputs=y_pred)
        self.sentence_train = theano.function(inputs=[idxs, y_sentence, lr],
                                              outputs=sentence_nll,
                                              updates=sentence_updates)

After each update, the word embeddings need to be normalized:

        self.normalize = theano.function(inputs=[],
                                         updates={self.emb:
                                                  self.emb /
                                                  T.sqrt((self.emb**2)
                                                  .sum(axis=1))
                                                  .dimshuffle(0, 'x')})
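The normalization divides every embedding row by its own L2 norm. A NumPy sketch of the same operation, on a small made-up matrix, is shown below; `norms[:, np.newaxis]` plays the role of `dimshuffle(0, 'x')`, turning the vector of norms into a column so that it broadcasts across each row.

```python
import numpy as np

# Hypothetical 3-word embedding matrix with 2 dimensions.
emb = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [0.5, 0.5]])

norms = np.sqrt((emb ** 2).sum(axis=1))      # one L2 norm per row
emb_normalized = emb / norms[:, np.newaxis]  # column vector broadcasts over rows
```

After this step every row has unit length; for example the first row (3, 4) has norm 5 and becomes (0.6, 0.8).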

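That's all there is to it!

Evaluation

With the functions defined above, you can compare the predicted labels with the true labels and compute the metrics. The github repository wraps the conlleval script. Computing these metrics is not trivial because of the Inside Outside Beginning (IOB) representation: a prediction is considered correct only if the word-beginning, word-inside, and word-outside labels are all predicted correctly. Note that the script is distributed with a txt extension, which must be changed to pl before use.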
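To illustrate why the IOB representation matters for scoring, here is a simplified single-sentence sketch of chunk-level precision, recall, and F1 in Python. It is not the conlleval script and is not guaranteed to agree with it in every edge case; the function names `iob_chunks` and `chunk_f1` are hypothetical, and the labels reuse the example from the Dataset section.

```python
def iob_chunks(labels):
    """Return the set of (start, end, slot) chunks; end is exclusive."""
    chunks, start, slot = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last chunk
        inside = label.startswith("I-") and slot == label[2:]
        if not inside:
            if slot is not None:
                chunks.add((start, i, slot))  # close the chunk in progress
            if label.startswith(("B-", "I-")):
                start, slot = i, label[2:]    # open a new chunk
            else:
                start, slot = None, None
    return chunks


def chunk_f1(true_labels, pred_labels):
    true, pred = iob_chunks(true_labels), iob_chunks(pred_labels)
    correct = len(true & pred)  # a chunk counts only if boundaries and slot match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


true = ["O", "O", "O", "B-dept", "O", "B-arr", "I-arr", "B-date"]
pred = ["O", "O", "O", "B-dept", "O", "B-arr", "O", "B-date"]
precision, recall, f1 = chunk_f1(true, pred)
```

In this example only one word label is wrong, yet the whole "New York" chunk is counted as incorrect, so two of the three chunks match and precision, recall, and F1 are all 2/3. For a full corpus, the chunk counts would be accumulated over all sentences before computing the ratios.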

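Training

Updates

For stochastic gradient descent (SGD), we treat a whole sentence as a mini-batch and perform one update per sentence. Pure SGD (as opposed to mini-batch) is also possible, with one update per word.
After each iteration/update, we normalize the word embeddings so that each one has unit norm.

Stopping Criterion

Early stopping on a validation set is a standard technique: training runs for a given number of epochs, the F1 score is computed on the validation set after each epoch, and the best model is kept.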
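The stopping criterion can be sketched as a small loop. The callables `train_epoch` and `validation_f1` are hypothetical placeholders for one training pass and one validation-set evaluation; the scores in the usage example are made up.

```python
def train_with_early_stopping(train_epoch, validation_f1, n_epochs):
    """Run n_epochs and remember the epoch with the best validation F1."""
    best_f1, best_epoch = float("-inf"), None
    for epoch in range(n_epochs):
        train_epoch(epoch)            # one pass over the training sentences
        f1 = validation_f1(epoch)     # F1 on the validation set
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
            # In a real run, the model parameters would be saved here.
    return best_epoch, best_f1


# Made-up validation scores, one per epoch.
scores = [90.1, 93.4, 95.0, 94.2, 94.8]
best_epoch, best_f1 = train_with_early_stopping(
    lambda epoch: None, lambda epoch: scores[epoch], len(scores))
```

With these scores the best model is the one from epoch 2, even though training continues for two more epochs.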

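Hyperparameter Selection

Although there is research/code on hyperparameter selection, here we use a KISS random search.
The following values are suggested starting points:

learning rate: uniform([0.05, 0.01])
window size: random value from {3,...,19}
number of hidden units: random value from {100, 200}
embedding dimension: random value from {50, 100}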
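A random search over these ranges might look like the sketch below. The dictionary keys and the function name `sample_hyperparameters` are hypothetical, the window size is restricted to odd values because contextwin asserts an odd window, and treating {100, 200} and {50, 100} as two-element sets (rather than intervals) is an assumption.

```python
import random


def sample_hyperparameters(rng):
    """Draw one random configuration from the suggested ranges."""
    return {
        "lr": rng.uniform(0.01, 0.05),
        "win": rng.choice(range(3, 20, 2)),      # odd sizes 3, 5, ..., 19
        "nhidden": rng.choice([100, 200]),
        "emb_dimension": rng.choice([50, 100]),
    }


rng = random.Random(0)  # arbitrary seed, for reproducibility
configs = [sample_hyperparameters(rng) for _ in range(5)]
```

Each configuration would then be trained with early stopping, keeping the one with the best validation F1.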

Running the Code

After downloading the data files with download.sh, run the program with:

python code/rnnslu.py

('NEW BEST: epoch', 25, 'valid F1', 96.84, 'best test F1', 93.79)
[learning] epoch 26 >> 100.00% completed in 28.76 (sec) <<
[learning] epoch 27 >> 100.00% completed in 28.76 (sec) <<
...
('BEST RESULT: epoch', 57, 'valid F1', 97.23, 'best test F1', 94.2, 'with the model', 'rnnslu')

Timing

Running the code from the github repository on ATIS takes less than 40 seconds per epoch on an i7 CPU 950 @ 3.07GHz, using less than 200 Mo of RAM.

[learning] epoch 0 >> 100.00% completed in 34.48 (sec) <<

After a few epochs, the F1 score reaches 94.48%.

NEW BEST: epoch 28 valid F1 96.61 best test F1 94.19
NEW BEST: epoch 29 valid F1 96.63 best test F1 94.42
[learning] epoch 30 >> 100.00% completed in 35.04 (sec) <<
[learning] epoch 31 >> 100.00% completed in 34.80 (sec) <<
[...]
NEW BEST: epoch 40 valid F1 97.25 best test F1 94.34
[learning] epoch 41 >> 100.00% completed in 35.18 (sec) <<
NEW BEST: epoch 42 valid F1 97.33 best test F1 94.48
[learning] epoch 43 >> 100.00% completed in 35.39 (sec) <<
[learning] epoch 44 >> 100.00% completed in 35.31 (sec) <<
[...]

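Word Embedding Nearest Neighbors

We can run a k-nearest-neighbor check on the learned word embeddings. L2 distance and cosine distance return the same results, so we show the neighbors under cosine distance.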

atlanta	back	ap80	but	aircraft	business	a	august	actually	cheap
phoenix	live	ap57	if	plane	coach	people	september	provide	weekday
denver	lives	ap	up	service	first	do	january	prices	weekdays
tacoma	both	connections	a	airplane	fourth	but	june	stop	am
columbus	how	tomorrow	now	seating	thrift	numbers	december	number	early
seattle	me	before	amount	stand	tenth	abbreviation	november	flight	sfo
minneapolis	out	earliest	more	that	second	if	april	there	milwaukee
pittsburgh	other	connect	abbreviation	on	fifth	up	july	serving	jfk
ontario	plane	thrift	restrictions	turboprop	third	serve	jfk	thank	shortest
montreal	service	coach	mean	mean	twelfth	database	october	ticket	bwi
philadelphia	fare	today	interested	amount	sixth	passengers	may	are	lastest

As you can see, the limited vocabulary size (about 500 words) yields mixed results. By human judgment, some neighbors are good and some are bad.
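A nearest-neighbor check of this kind can be sketched in NumPy as follows. The function name `nearest_neighbors`, the four-word vocabulary, and the 2-dimensional vectors are made up for illustration; a real check would use the trained embedding matrix and the ATIS dictionary.

```python
import numpy as np


def nearest_neighbors(emb, index2word, query_index, k):
    """Return the k words whose embeddings are closest in cosine similarity."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    similarity = unit @ unit[query_index]  # cosine similarity to the query
    order = np.argsort(-similarity)        # most similar first
    return [index2word[int(i)] for i in order if i != query_index][:k]


# Hypothetical toy embeddings: two "city" vectors and two "month" vectors.
index2word = {0: "atlanta", 1: "denver", 2: "august", 3: "july"}
emb = np.array([[1.0, 0.1],
                [0.9, 0.2],
                [0.1, 1.0],
                [0.2, 0.9]])
neighbors = nearest_neighbors(emb, index2word, 0, 2)
```

For unit-norm vectors the squared L2 distance equals 2 minus twice the cosine similarity, so both measures rank neighbors identically, which is consistent with the observation above given that the embeddings are normalized after every update.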
