.. _SdA:

Stacked Denoising Autoencoders (SdA)
====================================

.. note::
    This section assumes you have already read through :doc:`logreg`
    and :doc:`mlp`. Additionally it uses the following Theano functions
    and concepts: `T.tanh`_, `shared variables`_, `basic arithmetic ops`_,
    `T.grad`_, `Random numbers`_, `floatX`_. If you intend to run the
    code on GPU also read `GPU`_.

.. _T.tanh: http://deeplearning.net/software/theano/tutorial/examples.html?highlight=tanh

.. _shared variables: http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables

.. _basic arithmetic ops: http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars

.. _T.grad: http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients

.. _floatX: http://deeplearning.net/software/theano/library/config.html#config.floatX

.. _GPU: http://deeplearning.net/software/theano/tutorial/using_gpu.html

.. _Random numbers: http://deeplearning.net/software/theano/tutorial/examples.html#using-random-numbers

.. note::
    The code for this section is available for download `here`_.

.. _here: http://deeplearning.net/tutorial/code/SdA.py

The Stacked Denoising Autoencoder (SdA) is an extension of the stacked
autoencoder [Bengio07]_ and was introduced in [Vincent08]_.

This tutorial builds on the previous tutorial :ref:`dA`. Especially if you do
not have experience with autoencoders, we recommend reading it before going
any further.

.. _stacked_autoencoders:

Stacked Autoencoders
++++++++++++++++++++

Denoising autoencoders can be stacked to form a deep network by feeding the
latent representation (output code) of the denoising autoencoder found on the
layer below as input to the current layer. The **unsupervised pre-training**
of such an architecture is done one layer at a time. Each layer is trained as
a denoising autoencoder by minimizing the error in reconstructing its input
(which is the output code of the previous layer). Once the first :math:`k`
layers are trained, we can train the :math:`k+1`-th layer because we can now
compute the code or latent representation from the layer below.

Once all layers are pre-trained, the network goes through a second stage of
training called **fine-tuning**. Here we consider **supervised fine-tuning**,
where we want to minimize prediction error on a supervised task. For this, we
first add a logistic regression layer on top of the network (more precisely,
on the output code of the output layer). We then train the entire network as
we would train a multilayer perceptron. At this point, we only consider the
encoding parts of each autoencoder. This stage is supervised, since now we
use the target class during training. (See :ref:`mlp` for details on the
multilayer perceptron.)

This can be easily implemented in Theano, using the class defined previously
for a denoising autoencoder. We can see the stacked denoising autoencoder as
having two facades: a list of autoencoders, and an MLP. During pre-training
we use the first facade, i.e., we treat our model as a list of autoencoders,
and train each autoencoder separately. In the second stage of training, we
use the second facade. These two facades are linked because:

* the autoencoders and the sigmoid layers of the MLP share parameters, and

* the latent representations computed by intermediate layers of the MLP are
  fed as input to the autoencoders.
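To make the parameter sharing concrete, here is a minimal, standalone sketch.
It is not part of ``SdA.py``; the sizes and initialization are arbitrary and
chosen only for illustration. A single shared weight matrix and hidden bias
serve both facades, so gradient updates applied during pre-training and
during fine-tuning modify the same storage. The actual ``SdA`` class, which
does this for every layer, follows.

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    # Hypothetical sizes, chosen only for this sketch.
    n_visible, n_hidden = 784, 500
    rng = numpy.random.RandomState(123)

    # A single shared weight matrix and hidden bias ...
    W = theano.shared(
        numpy.asarray(rng.uniform(low=-0.1, high=0.1,
                                  size=(n_visible, n_hidden)),
                      dtype=theano.config.floatX),
        name='W', borrow=True)
    b = theano.shared(numpy.zeros(n_hidden, dtype=theano.config.floatX),
                      name='b', borrow=True)

    x = T.matrix('x')

    # ... used by the MLP facade (the sigmoid hidden layer) ...
    mlp_hidden = T.nnet.sigmoid(T.dot(x, W) + b)

    # ... and by the autoencoder facade (the encoder of the dA).
    dA_code = T.nnet.sigmoid(T.dot(x, W) + b)

    # Because both expressions read from the same shared variables,
    # pre-training (through the dA cost) and fine-tuning (through the
    # MLP cost) update the very same parameters.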
.. literalinclude:: ../code/SdA.py
    :start-after: start-snippet-1
    :end-before: end-snippet-1

``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade,
while ``self.dA_layers`` will store the denoising autoencoder associated with
the layers of the MLP.

Next, we construct ``n_layers`` sigmoid layers and ``n_layers`` denoising
autoencoders, where ``n_layers`` is the depth of our model. We use the
``HiddenLayer`` class introduced in :ref:`mlp`, with one modification: we
replace the ``tanh`` non-linearity with the logistic function
:math:`s(x) = \frac{1}{1+e^{-x}}`. We link the sigmoid layers to form an MLP,
and construct the denoising autoencoders such that each shares the weight
matrix and the bias of its encoding part with its corresponding sigmoid
layer.

.. literalinclude:: ../code/SdA.py
    :start-after: start-snippet-2
    :end-before: end-snippet-2

All we need now is to add a logistic layer on top of the sigmoid layers such
that we have an MLP. We will use the ``LogisticRegression`` class introduced
in :ref:`logreg`.

.. literalinclude:: ../code/SdA.py
    :start-after: end-snippet-2
    :end-before: def pretraining_functions

The ``SdA`` class also provides a method that generates training functions
for the denoising autoencoders in its layers. They are returned as a list,
where element :math:`i` is a function that implements one step of training
the ``dA`` corresponding to layer :math:`i`.

.. literalinclude:: ../code/SdA.py
    :start-after: self.errors = self.logLayer.errors(self.y)
    :end-before: corruption_level = T.scalar('corruption')

To be able to change the corruption level or the learning rate during
training, we associate Theano variables with them.

.. literalinclude:: ../code/SdA.py
    :start-after: index = T.lscalar('index')
    :end-before: def build_finetune_functions

Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and,
optionally, ``corruption`` (the corruption level) or ``lr`` (the learning
rate). Note that the names of the parameters are the names given to the
Theano variables when they are constructed, not the names of the Python
variables (``learning_rate`` or ``corruption_level``). Keep this in mind
when working with Theano.

In the same fashion we build a method for constructing the functions required
during fine-tuning (``train_fn``, ``valid_score`` and ``test_score``).

.. literalinclude:: ../code/SdA.py
    :pyobject: SdA.build_finetune_functions

Note that ``valid_score`` and ``test_score`` are not Theano functions, but
rather Python functions that loop over the entire validation set and the
entire test set, respectively, producing a list of the losses over these
sets.

Putting it all together
+++++++++++++++++++++++

The few lines of code below construct the stacked denoising autoencoder:

.. literalinclude:: ../code/SdA.py
    :start-after: start-snippet-3
    :end-before: end-snippet-3

There are two stages of training for this network: layer-wise pre-training
followed by fine-tuning.

For the pre-training stage, we loop over all the layers of the network. For
each layer we use the compiled Theano function that implements an SGD step
towards optimizing the weights for reducing the reconstruction cost of that
layer. This function is applied to the training set for a fixed number of
epochs given by ``pretraining_epochs``.

.. literalinclude:: ../code/SdA.py
    :start-after: start-snippet-4
    :end-before: end-snippet-4

The fine-tuning loop is very similar to the one in the :ref:`mlp` tutorial.
The only difference is that it uses the functions given by
``build_finetune_functions``; a condensed sketch follows.
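The sketch below is an illustration rather than the exact code in ``SdA.py``:
it assumes ``sda``, ``datasets``, ``batch_size``, ``finetune_lr``,
``training_epochs`` and ``n_train_batches`` are defined as in the tutorial
script, the keyword names follow the tutorial's conventions, and the
early-stopping machinery shared with the MLP tutorial is omitted.

.. code-block:: python

    import numpy

    # Compile the fine-tuning training function and the scoring helpers.
    train_fn, validate_model, test_model = sda.build_finetune_functions(
        datasets=datasets,
        batch_size=batch_size,
        learning_rate=finetune_lr,
    )

    best_validation_loss = numpy.inf
    for epoch in range(training_epochs):
        # One supervised SGD pass over the training set.
        for minibatch_index in range(n_train_batches):
            train_fn(minibatch_index)

        # `validate_model` is a plain Python function that loops over the
        # whole validation set; average its per-batch losses.
        this_validation_loss = numpy.mean(validate_model())
        if this_validation_loss < best_validation_loss:
            best_validation_loss = this_validation_loss
            test_losses = test_model()
            print('epoch %i, test error %f %%' %
                  (epoch, numpy.mean(test_losses) * 100.))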
Running the Code
++++++++++++++++

The user can run the code by calling:

.. code-block:: bash

    python code/SdA.py

By default the code runs 15 pre-training epochs for each layer, with a batch
size of 1. The corruption levels are 0.1 for the first layer, 0.2 for the
second, and 0.3 for the third. The pre-training learning rate is 0.001 and
the fine-tuning learning rate is 0.1. Pre-training takes 585.01 minutes, with
an average of 13 minutes per epoch. Fine-tuning is completed after 36 epochs
in 444.2 minutes, with an average of 12.34 minutes per epoch. The final
validation score is 1.39%, with a testing score of 1.3%. These results were
obtained on a machine with an Intel Xeon E5430 @ 2.66GHz CPU, with a
single-threaded GotoBLAS.

Tips and Tricks
+++++++++++++++

One way to improve the running time of your code (assuming you have
sufficient memory available) is to compute how the network, up to layer
:math:`k-1`, transforms your data. Namely, you start by training your first
layer dA. Once it is trained, you can compute the hidden unit values for
every datapoint in your dataset and store this as a new dataset that you will
use to train the dA corresponding to layer 2. Once you have trained the dA
for layer 2, you compute, in a similar fashion, the dataset for layer 3 and
so on. You can see that, at this point, the dAs are trained individually, and
they simply provide (one to the other) a non-linear transformation of the
input. Once all dAs are trained, you can start fine-tuning the model.
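A minimal sketch of this trick is given below. It is not part of ``SdA.py``:
``train_single_dA`` is a hypothetical helper standing in for one layer-wise
pre-training run, and ``train_data`` stands for the raw training set as a
NumPy array.

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    def propagate_up(data, W, b):
        """Push an entire dataset through one trained encoder once,
        so that the next dA can be trained directly on the result."""
        x = T.matrix('x')
        encode = theano.function([x], T.nnet.sigmoid(T.dot(x, W) + b))
        return encode(numpy.asarray(data, dtype=theano.config.floatX))

    current_input = train_data   # hypothetical: raw data as a numpy array
    for n_hidden in [1000, 1000, 1000]:
        # Hypothetical helper: trains one dA on `current_input` and returns
        # its learned weights and hidden bias as numpy arrays.
        W, b = train_single_dA(current_input, n_hidden=n_hidden)
        # Store the hidden representation once; earlier layers never need
        # to be re-evaluated while pre-training the layers above.
        current_input = propagate_up(current_input, W, b)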