Theano-Deep Learning Tutorials 笔记:Getting Started
Posted by slim1017
Tutorial: http://www.deeplearning.net/tutorial/gettingstarted.html
Datasets
(1) The MNIST handwritten digit set: each image is a 784-dimensional vector (28*28) of float pixel values in [0, 1] and represents a digit from 0 to 9. There are 50,000 images in the training set, 10,000 in the validation set (used to choose hyperparameters such as the learning rate and model size), and 10,000 in the test set.
For convenience we pickled the dataset to make it easier to use in python.
import cPickle, gzip, numpy
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
Note: the cPickle module has almost exactly the same functionality and API as pickle, but it is implemented in C and is much faster.
(2) We encourage you to store the dataset in shared variables and access it based on the minibatch index, given a fixed and known batch size (batch_size = 500 in the code below).
The reason: when using a GPU, repeatedly copying data to the GPU is inefficient, so the data should be kept in Theano shared variables to get the best performance. The recommended setup is six shared variables: three for the data of the training, validation and test sets, and three for their labels.
import numpy
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    """Function that loads the dataset into shared variables"""
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # data on the GPU has to be stored as floats, but y is really a vector of
    # integer labels, so we cast it back to int32 when returning it
    return shared_x, T.cast(shared_y, 'int32')
test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)
batch_size = 500 # size of the minibatch
# accessing the third minibatch of the training set
data = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
If the data does not fit in (GPU) memory:
you can store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you have gone through the chunk, update the values it stores.
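A minimal sketch of this chunking pattern (the chunk size, n_train_examples and the load_chunk helper below are hypothetical placeholders, not part of the tutorial):

import numpy
import theano

chunk_size = 10 * 500  # e.g. 10 minibatches of 500 examples (arbitrary choice)
# allocate the shared buffer once; later chunks reuse the same GPU memory
chunk_x = theano.shared(numpy.zeros((chunk_size, 784), dtype=theano.config.floatX))

for chunk_start in range(0, n_train_examples, chunk_size):
    # load_chunk() stands in for however you read the next block of
    # examples from disk (e.g. slicing a numpy memmap)
    chunk_x.set_value(load_chunk(chunk_start, chunk_size), borrow=True)
    # ... train on all minibatches inside this chunk ...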
Learning a Classifier
Zero-One Loss
The loss of a sample is 0 if its prediction is correct and 1 otherwise; the total loss is the sum over all samples.
If f: \mathbb{R}^D \to \{0, \ldots, L\} is the prediction function, then this loss can be written as:

\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}

where \mathcal{D} is either the training set (during training) or a set disjoint from the training set (to avoid biasing the evaluation of validation or test error). I is the indicator function defined as:

I_x = \begin{cases} 1 & \text{if } x \text{ is True} \\ 0 & \text{otherwise} \end{cases}

In this tutorial, f is defined as:

f(x) = \operatorname{argmax}_k P(Y = k \mid x, \theta)
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))
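As the comment says, zero_one_loss is only a symbolic expression; to actually evaluate it you compile it into a Theano function. A minimal sketch, assuming x and y are the symbolic input variables from which p_y_given_x was built:

# compile the symbolic expression into a callable function
compute_zero_one_loss = theano.function(inputs=[x, y], outputs=zero_one_loss)
# n_errors = compute_zero_one_loss(test_images, test_labels)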
Negative Log-Likelihood Loss
The principle is similar to maximum likelihood estimation: we minimize the negative log-likelihood (NLL), defined as:

NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
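The indexing trick described in the comment can be checked with plain NumPy (a small illustrative example, not part of the tutorial):

import numpy

M = numpy.arange(12).reshape(3, 4)   # 3 examples, 4 classes
y = numpy.array([1, 0, 3])           # "correct" class of each row
print(M[numpy.arange(y.shape[0]), y])  # picks M[0,1], M[1,0], M[2,3] -> [ 1  4 11]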
Stochastic Gradient Descent
Stochastic gradient descent is an improvement over ordinary gradient descent:
Ordinary gradient descent averages the loss over all samples, so every iteration touches the whole training set, which is expensive and converges slowly. Instead, we randomly draw a small subset of samples (a minibatch) and use the mean loss over that minibatch to update the parameters at each step.
Choosing the minibatch size: both very large and very small values have their own trade-offs.
An optimal minibatch size B is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to maybe several hundred. In the tutorial we set it to 20, but this choice is almost arbitrary (though harmless).
If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates done to your parameters. Training the same model for 10 epochs using a batch size of 1 yields completely different results compared to training for the same 10 epochs but with a batchsize of 20.
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
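The later tutorials combine this loop with the shared variables from the Datasets section by passing a minibatch index into the compiled function and slicing the shared data via givens; a sketch of that pattern (index, x, y, loss and updates are assumed to be defined as in the corresponding model code):

index = T.lscalar('index')  # minibatch index
# each call trains on minibatch number `index`, sliced out of the
# shared dataset without copying data back and forth to the GPU
train_model = theano.function(
    inputs=[index],
    outputs=loss,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)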
Regularization
Regularization appears everywhere in machine learning; its main purpose is to prevent overfitting.
The intuition: add a norm of the model parameters to the loss function, so the optimization also pushes the parameters towards small values (close to 0). This keeps the model as simple as possible given the data, and in machine learning theory simpler models are less prone to overfitting. (Simplicity should not be pursued blindly, though: a simpler model does not necessarily generalize better.)
L1 and L2 regularization
Concretely, we add the L1 or L2 norm of the parameter vector to the loss function.
Formally, if our loss function is:

NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)

then the regularized loss will be:

E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)

or, in our case,

E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p

where

\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{1/p}

and p is 1 or 2 (hence the names L1 and L2 regularization); \lambda is a hyperparameter that controls the relative importance of the regularization term.
A more detailed discussion of regularization from the tutorial:
In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models). More intuitively, the two terms (NLL and R(\theta)) correspond to modelling the data well (NLL) and having "simple" or "smooth" solutions (R(\theta)). Thus, minimizing the sum of both will, in theory, correspond to finding the right trade-off between the fit to the training data and the "generality" of the solution that is found. To follow Occam's razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.
Note that the fact that a solution is "simple" does not mean that it will generalize well. Empirically, it was found that performing such regularization in the context of neural networks helps with generalization, especially on small datasets. The code block below shows how to compute the loss in Python when it contains both an L1 regularization term weighted by \lambda_1 and an L2 regularization term weighted by \lambda_2:
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))
# symbolic Theano variable that represents the squared L2 term
L2 = T.sum(param ** 2)
# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2
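When the model has several parameter tensors (e.g. a weight matrix and a bias per layer), the same idea is usually applied by summing the norms over all of them; a minimal sketch, assuming params is a list of the model's shared variables (in practice the penalty is often applied only to the weight matrices, not the biases):

# params is assumed to be a list of shared variables, e.g. [W, b]
L1 = sum(T.sum(abs(p)) for p in params)   # L1 term over all parameters
L2 = sum(T.sum(p ** 2) for p in params)   # squared L2 term over all parameters
loss = NLL + lambda_1 * L1 + lambda_2 * L2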
Early-Stopping
Early stopping combats overfitting by monitoring the model's performance on the validation set: when validation performance stops improving significantly, or starts to degrade, the optimization is stopped.
The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use of a strategy based on a geometrically increasing amount of patience (a "patience" counter that determines how much longer to keep training before stopping).
# early-stopping parameters
patience = 5000          # look at this many examples regardless
patience_increase = 2    # wait this much longer when a new best is found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2)
                               # go through this many minibatches before
                               # checking the network on the validation set;
                               # in this case we check every epoch, because
                               # n_train_batches is smaller than patience/2,
                               # so validating every n_train_batches
                               # minibatches means validating once per epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params  # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ...  # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
If we run out of batches of training data before running out of patience, then we just go back to the beginning of the training set and repeat.
The flow of the code is:
(1) Keep updating the parameters; iter keeps increasing.
(2) Every validation_frequency updates, evaluate the model on the validation set.
(3) If the validation loss improves sufficiently and iter * patience_increase > patience, patience grows: patience = max(patience, iter * patience_increase). Note that patience_increase is 2, so the larger iter is, the more patience grows (see the worked example below).
(4) iter and patience both keep growing; training stops once iter >= patience.
Note: validation_frequency = min(n_train_batches, patience/2)
This line guarantees that, no matter what, the model is validated at least twice before patience can run out: even if patience never grows, there is one validation check by iter = patience/2 and another by iter = patience.
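A worked example with illustrative numbers (not from the tutorial): with patience = 5000, n_train_batches = 100 and hence validation_frequency = 100, the first validation happens at iter = 99. A new best found there gives patience = max(5000, 99 * 2) = 5000, i.e. no change; patience only actually grows once a sufficiently good new best appears after iter = 2500. For instance, a new best at iter = 3999 pushes patience to max(5000, 3999 * 2) = 7998, buying roughly 40 more epochs of training.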
Note:This algorithm could possibly be improved by using a test of statistical significance rather than the simple comparison, when deciding whether to increase the patience.
Theano/Python Tips
Loading and Saving Models
After all the training and testing, you need to save the best parameters you found. In MATLAB saving variables is trivial; in Python we use cPickle.
Read more about serialization in Theano, or Python’s pickling.
Pickle the numpy ndarrays from your shared variables. If your parameters are stored in shared variables w, v, u, then your save command should look something like:
import cPickle
save_file = open('path', 'wb') # this will overwrite current contents
cPickle.dump(w.get_value(borrow=True), save_file, -1) # the -1 is for HIGHEST_PROTOCOL
cPickle.dump(v.get_value(borrow=True), save_file, -1) # .. and it triggers much more efficient
cPickle.dump(u.get_value(borrow=True), save_file, -1) # .. storage than numpy's default
save_file.close()
Then later, you can load your data back like this:
save_file = open('path', 'rb')
w.set_value(cPickle.load(save_file), borrow=True)
v.set_value(cPickle.load(save_file), borrow=True)
u.set_value(cPickle.load(save_file), borrow=True)
Do not pickle your training or test functions for long-term storage
Theano functions are compatible with Python's deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano folder and one of its internals changes, then you may not be able to un-pickle your model.
Plotting Intermediate Results
Visualization of intermediate results (e.g. learned filters or training curves) can be done with the PIL and matplotlib libraries.
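A minimal matplotlib sketch of this kind of visualization (plotting one learned weight vector of the logistic regression as a 28x28 image; W is assumed to be the model's shared weight matrix of shape (784, 10)):

import matplotlib.pyplot as plt

# W is assumed to be a Theano shared variable of shape (784, 10);
# each column is the template the model has learned for one digit class
W_value = W.get_value(borrow=True)
plt.imshow(W_value[:, 0].reshape(28, 28), cmap='gray')
plt.title('weights for class 0')
plt.savefig('class0_weights.png')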