Theano-Deep Learning Tutorials 笔记:Denoising Autoencoders (dA)

Posted slim1017

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Theano-Deep Learning Tutorials 笔记:Denoising Autoencoders (dA)相关的知识,希望对你有一定的参考价值。

教程地址:http://www.deeplearning.net/tutorial/dA.html#autoencoders

降噪的自编码器由[Vincent08]提出,首先先介绍自编码器。

 

Autoencoders

See section 4.6 of [Bengio09] for an overview of auto-encoders.

UFLDL中自编码介绍http://deeplearning.stanford.edu/wiki/index.php/%E8%87%AA%E7%BC%96%E7%A0%81%E7%AE%97%E6%B3%95%E4%B8%8E%E7%A8%80%E7%96%8F%E6%80%A7

 

自编码器输入为:,通过映射到隐层,隐层为 是非线性激活函数,如sigmoid。这步是encoder。

通过 来重构 x,使 z 尽量与输入 x 一样。这是decoder。

自编码器的输入假设为100维,隐层神经元为50维,输出层(重构)100维,50少于100维,相当于用更少的维数表示更高维的数据,这就可以看成一种数据压缩。

注意:这里不是W的转置,而是隐层到第3层的权重

下面介绍下tied weights:当我们设置时,就叫tied weights,这时3层的自编码器参数就只有, , ;如果不设置tied weightsW 就没关系,这是参数就是, ,,  。

有不有 tied weights 对结果影响并不是很大,这方面大家可以自己研究研究,我就觉得 tied weights 比较省内存点。

 

重构误差 的度量方法的选取取决于输入服从的具体分布,这里使用 squared error  

如果输入可以表示为 bit vectors or vectors of bit probabilities,可以使用 cross-entropy:

 

 

下面这段描述了自编码和PCA的联系:(当激活函数为线性且使用MSE度量误差时,自编码的隐层就表示了PCA中数据主要成分;若为非线性时,表达能力强于PCA)

The hope is that the code is a distributed representation that captures the coordinates along the main factors of variation in the data. This is similar to the way the projection on principal components would capture the main factors of variation in the data. Indeed, if there is one linear hidden layer (the code) and the mean squared error criterion is used to train the network, then the hidden units learn to project the input in the span of the first principal components of the data. If the hidden layer is non-linear, the auto-encoder behaves differently from PCA, with the ability to capture multi-modal aspects of the input distribution. The departure(区别?) from PCA becomes even more important when we consider stacking multiple encoders (and their corresponding decoders) when building a deep auto-encoder [Hinton06].

 

可以看作的有损压缩,在训练样本上重构误差特别小并不意味着在任意数据上误差小:当数据分布与样本一致时,误差就会比较小,但当数据是在输入空间中随机采样得到时,误差就会比较大。

 

实现 auto-encoder 类(后面的栈式自编码会用到),注意使用了tied weights,所以

 def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input ), the number of hidden units ( the dimension
        d' of the latent or hidden space ) and the corruption level. The
        constructor also receives symbolic variables for the input, weights and
        bias. Such a symbolic variables are useful when, for example the input
        is the result of some computations, or when weights are shared between
        the dA and an MLP layer. When dealing with SdAs this always happens,
        the dA on layer 2 gets as input the output of the dA on layer 1,
        and the weights of the dA are used in the second stage of training
        to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: number random generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                     generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden:  number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared belong the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of biases values (for
                     hidden units) that should be shared belong dA and another
                     architecture; if dA should be standalone set this to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of biases values (for
                     visible units) that should be shared belong dA and another
                     architecture; if dA should be standalone set this to None


        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W` which is uniformely sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible))the output of uniform if
            # converted using asarray to dtype
            # theano.config.floatX so that the code is runable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden
        self.b = bhid
        # b_prime corresponds to the bias of the visible
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]


隐藏层计算:

 def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)


重构层计算:

def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer

        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)


计算 cost 和 updates:

def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one trainng
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)


迭代更新参数使重构损失最小:

da = dA(
        numpy_rng=rng,
        theano_rng=theano_rng,
        input=x,
        n_visible=28 * 28,
        n_hidden=500
    )

    cost, updates = da.get_cost_updates(
        corruption_level=0.,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens=
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        
    )

    start_time = timeit.default_timer()

    ############
    # TRAINING #
    ############

    # go through training epochs
    for epoch in xrange(training_epochs):
        # go through trainng set
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(train_da(batch_index))

        print 'Training epoch %d, cost ' % epoch, numpy.mean(c)

    end_time = timeit.default_timer()

    training_time = (end_time - start_time)

    print >> sys.stderr, ('The no corruption code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((training_time) / 60.))
    image = Image.fromarray(
        tile_raster_images(X=da.W.get_value(borrow=True).T,
                           img_shape=(28, 28), tile_shape=(10, 10),
                           tile_spacing=(1, 1)))
    image.save('filters_corruption_0.png')

    # start-snippet-3
    #####################################
    # BUILDING THE MODEL CORRUPTION 30% #
    #####################################

    rng = numpy.random.RandomState(123)
    theano_rng = RandomStreams(rng.randint(2 ** 30))

    da = dA(
        numpy_rng=rng,
        theano_rng=theano_rng,
        input=x,
        n_visible=28 * 28,
        n_hidden=500
    )

    cost, updates = da.get_cost_updates(
        corruption_level=0.3,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens=
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        
    )

    start_time = timeit.default_timer()

    ############
    # TRAINING #
    ############

    # go through training epochs
    for epoch in xrange(training_epochs):
        # go through trainng set
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(train_da(batch_index))

        print 'Training epoch %d, cost ' % epoch, numpy.mean(c)

    end_time = timeit.default_timer()

    training_time = (end_time - start_time)

    print >> sys.stderr, ('The 30% corruption code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % (training_time / 60.))
    # end-snippet-3

    # start-snippet-4
    image = Image.fromarray(tile_raster_images(
        X=da.W.get_value(borrow=True).T,
        img_shape=(28, 28), tile_shape=(10, 10),
        tile_spacing=(1, 1)))
    image.save('filters_corruption_30.png')
    # end-snippet-4

    os.chdir('../')


if __name__ == '__main__':
    test_dA()

 

对一段自编码的论述(挺值得体会的)

If there is no constraint besides minimizing the reconstruction error, one might expect an auto-encoder with inputs and an encoding of dimension (or greater) to learn the identity function, merely mapping an input to its copy. Such an autoencoder would not differentiate test examples (from the training distribution) from other input configurations.

如果隐层为n个(及以上)神经元,大家怀疑自编码不就学习了一个恒等函数吗?没有什么区分能力。

Surprisingly, experiments reported in [Bengio07] suggest that, in practice, when trained with stochastic gradient descent, non-linear auto-encoders with more hidden units than inputs (called overcomplete) yield useful representations. (Here, “useful” means that a network taking the encoding as input has low classification error.)

可是,实验表明,使用随机梯度下降及没线性激活函数及比输入更多的隐层神经元时,自编码能提高分类性能。

A simple explanation is that stochastic gradient descent with early stopping is similar to an L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden layer auto-encoder with non-linear hidden units (exactly like in the above code) needs very small weights in the first (encoding) layer, to bring the non-linearity of the hidden units into their linear regime, and very large weights in the second (decoding) layer. With binary inputs, very large weights are also needed to completely minimize the reconstruction error. Since the implicit or explicit regularization makes it difficult to reach large-weight solutions, the optimization algorithm finds encodings which only work well for examples similar to those in the training set, which is what we want. It means that the representation is exploiting statistical regularities present in the training set, rather than merely learning to replicate the input.

直白的解释是:使用early stopping的随机梯度下降和 L2正则化类似。让权重很难变大,这里没太理解明白,有空再看看!

There are other ways by which an auto-encoder with more hidden units than inputs could be prevented from learning the identity function, capturing something useful about the input in its hidden representation. One is the addition of sparsity (forcing many of the hidden units to be zero or near-zero). Sparsity has been exploited very successfully by many [Ranzato07] [Lee08]. Another is to add randomness in the transformation from input to reconstruction. This technique is used in Restricted Boltzmann Machines (discussed later in Restricted Boltzmann Machines (RBM)), as well as in Denoising Auto-Encoders, discussed below.

其他让自编码提高描述数据本质能力的方法有

1.稀疏自编码,UFLDL中用的这个

2.add randomness in the transformation from input to reconstruction(添加随机噪声,用在了降噪自编码里)

 

Denoising Autoencoders

为了强迫自编码器发掘出数据中更鲁棒的特征,而不是学习到一个恒等函数,用被噪声污染的数据来训练自编码器。

降噪自编码做了两件事:

1.对输入数据encode,保存输入数据的信息;

2.由于输入数据被随机噪声污染,需要 try to undo the effect of a corruption process stochastically applied to the input of the auto-encoder。

第2点只有通过capturing the statistical dependencies between the inputs (发掘数据内在的统计特性,挖掘本质特征)才能实现。

下面一段话引出了一大堆东西,就只知道第一个流形学习,学无止境。。。:

The denoising auto-encoder can be understood from different perspectives (the manifold learning perspective, stochastic operator perspective, bottom-up – information theoretic perspective, top-down – generative model perspective), all of which are explained in [Vincent08]. See also section 7.2 of [Bengio09] for an overview of auto-encoders.

In [Vincent08], the stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero.

Hence the denoising auto-encoder is trying to predict the corrupted (i.e. missing) values from the uncorrupted (i.e., non-missing) values, for randomly selected subsets of missing patterns. Note how being able to predict any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution between a set of variables (this is how Gibbs sampling works).

 

对原先代码的修改:add a stochastic corruption step operating on the input,把某些输入随机置0,用了binomial(二项分布)。

def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs the
        same and zero-out randomly selected subset of size ``coruption_level``
        Note : first argument of theano.rng.binomial is the shape(size) of
               random numbers that it should produce
               second argument is the number of trials
               third argument is the probability of success of any trial

                this will produce an array of 0s and 1s where 1 has a
                probability of 1 - ``corruption_level`` and 0 with
                ``corruption_level``

                The binomial function return int64 data type by
                default.  int64 multiplicated by the input
                type(floatX) always return float64.  To keep all data
                in floatX when floatX is float32, we set the dtype of
                the binomial to floatX. As in our case the value of
                the binomial is always 0 or 1, this don't change the
                result. This is needed to allow the gpu to work
                correctly as it only support float32 for now.

        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input


In the stacked autoencoder class (Stacked Autoencoders) the weights of the dA class have to be shared with those of a corresponding sigmoid layer. For this reason, the constructor of the dA also gets Theano variables pointing to the shared parameters. If those parameters are left to None, new ones will be constructed.

 

Putting it All Together

完整代码见教程。

 

可以把训练好的权重W保存为图片出来看看效果:

To plot our filters we will need the help of tile_raster_images (see Plotting Samples and Filters) so we urge the reader to study it. Also using the help of the Python Image Library, the following lines of code will save the filters as an image :

image = Image.fromarray(tile_raster_images(
        X=da.W.get_value(borrow=True).T,
        img_shape=(28, 28), tile_shape=(10, 10),
        tile_spacing=(1, 1)))
    image.save('filters_corruption_30.png')

感觉结果不如UFLDL中的稀疏自编码好。

The resulted filters when we do not use any noise are :

The filters for 30 percent noise :


 

以上是关于Theano-Deep Learning Tutorials 笔记:Denoising Autoencoders (dA)的主要内容,如果未能解决你的问题,请参考以下文章

Theano-Deep Learning Tutorials 笔记:Stacked Denoising Autoencoders (SdA)

Theano-Deep Learning Tutorials 笔记:Getting Started

Theano-Deep Learning Tutorials 笔记:Multilayer Perceptron

Theano-Deep Learning Tutorials 笔记:Classifying MNIST digits using Logistic Regression

Theano-Deep Learning Tutorials 笔记:Convolutional Neural Networks (LeNet)

Theano-Deep Learning Tutorials 笔记:Recurrent Neural Networks with Word Embeddings