.. index:: Logistic Regression

.. _logreg:

Classifying MNIST digits using Logistic Regression
==================================================

.. note::
    This section assumes familiarity with the following Theano
    concepts: `shared variables`_ , `basic arithmetic ops`_ , `T.grad`_ ,
    `floatX`_. If you intend to run the code on GPU also read `GPU`_.

.. note::
    The code for this section is available for download `here`_.

.. _here: http://deeplearning.net/tutorial/code/logistic_sgd.py

.. _shared variables: http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables

.. _basic arithmetic ops: http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars

.. _T.grad: http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients

.. _floatX: http://deeplearning.net/software/theano/library/config.html#config.floatX

.. _GPU: http://deeplearning.net/software/theano/tutorial/using_gpu.html

In this section, we show how Theano can be used to implement the most basic
classifier: the logistic regression. We start off with a quick primer of the
model, which serves both as a refresher and as a way to anchor the notation,
showing how mathematical expressions are mapped onto Theano graphs.

In the deepest of machine learning traditions, this tutorial will tackle the
exciting problem of MNIST digit classification.

The Model
+++++++++

Logistic regression is a probabilistic, linear classifier. It is parametrized
by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is
done by projecting an input vector onto a set of hyperplanes, each of which
corresponds to a class. The distance from the input to a hyperplane reflects
the probability that the input is a member of the corresponding class.

Mathematically, the probability that an input vector :math:`x` is a member of
a class :math:`i`, a value of a stochastic variable :math:`Y`, can be written
as:

.. math::
  P(Y=i|x, W,b) &= softmax_i(W x + b) \\
                &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}

The model's prediction :math:`y_{pred}` is the class whose probability is
maximal, specifically:

.. math::
  y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)

The code to do this in Theano is the following:

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: start-snippet-1
    :end-before: end-snippet-1

Since the parameters of the model must maintain a persistent state throughout
training, we allocate shared variables for :math:`W` and :math:`b`. This both
declares them as symbolic Theano variables and initializes their contents.
The dot and softmax operators are then used to compute the vector
:math:`P(Y|x, W,b)`. The result ``p_y_given_x`` is a symbolic variable of
vector-type.

To get the actual model prediction, we can use the ``T.argmax`` operator,
which will return the index at which ``p_y_given_x`` is maximal (i.e. the
class with maximum probability).
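Concretely, the snippet included above boils down to expressions of the
following shape. This is only a minimal sketch for illustration, not the
tutorial file itself; it assumes ``x`` holds a minibatch of rasterized
images, one example per row, with ``n_in`` input features and ``n_out``
classes:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    n_in, n_out = 28 * 28, 10   # MNIST image size and number of digit classes
    x = T.matrix('x')           # one rasterized image per row

    # W and b are shared variables: symbolic, but with persistent,
    # initialized contents that survive across training updates
    W = theano.shared(
        value=numpy.zeros((n_in, n_out), dtype=theano.config.floatX),
        name='W',
        borrow=True
    )
    b = theano.shared(
        value=numpy.zeros((n_out,), dtype=theano.config.floatX),
        name='b',
        borrow=True
    )

    # class-membership probabilities: one row of softmax outputs per example
    p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)

    # predicted class: index of the largest probability in each row
    y_pred = T.argmax(p_y_given_x, axis=1)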
Now of course, the model we have defined so far does not do anything useful
yet, since its parameters are still in their initial state. The following
section will thus cover how to learn the optimal parameters.

.. note::
    For a complete list of Theano ops, see: `list of ops `_

Defining a Loss Function
++++++++++++++++++++++++

Learning optimal model parameters involves minimizing a loss function. In the
case of multi-class logistic regression, it is very common to use the negative
log-likelihood as the loss. This is equivalent to maximizing the likelihood of
the data set :math:`\cal{D}` under the model parameterized by :math:`\theta`.
Let us first start by defining the likelihood :math:`\cal{L}` and loss
:math:`\ell`:

.. math::

   \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
     \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
   \ell (\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L} (\theta=\{W,b\}, \mathcal{D})

While entire books are dedicated to the topic of minimization, gradient
descent is by far the simplest method for minimizing arbitrary non-linear
functions [#f1]_. This tutorial will use the method of stochastic gradient
descent with mini-batches (MSGD). See :ref:`opt_SGD` for more details.

The following Theano code defines the (symbolic) loss for a given minibatch:

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: start-snippet-2
    :end-before: end-snippet-2

.. note::
    Even though the loss is formally defined as the *sum*, over the data set,
    of individual error terms, in practice we use the *mean* (``T.mean``) in
    the code. This makes the choice of learning rate less dependent on the
    minibatch size.
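For reference, the loss pulled in above is essentially a one-liner of the
following shape. This is again only a sketch; it reuses ``p_y_given_x`` from
the sketch in the previous section and assumes ``y`` is a vector of integer
labels:

.. code-block:: python

    y = T.ivector('y')  # correct labels, one integer per example

    # For each row i of p_y_given_x (defined in the earlier sketch), pick
    # the log-probability assigned to the correct class y[i], then take the
    # mean over the minibatch (the mean rather than the sum, as discussed
    # in the note above).
    loss = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])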
Creating a LogisticRegression class
+++++++++++++++++++++++++++++++++++

We now have all the tools we need to define a ``LogisticRegression`` class,
which encapsulates the basic behaviour of logistic regression. The code is
very similar to what we have covered so far, and should be self-explanatory.

.. literalinclude:: ../code/logistic_sgd.py
    :pyobject: LogisticRegression

We instantiate this class as follows:

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: index = T.lscalar()
    :end-before: # the cost we minimize during

We start by allocating symbolic variables for the training inputs :math:`x`
and their corresponding classes :math:`y`. Note that ``x`` and ``y`` are
defined outside the scope of the ``LogisticRegression`` object. Since the
class requires the input to build its graph, it is passed as a parameter of
the ``__init__`` function. This is useful in case you want to connect
instances of such classes to form a deep network. The output of one layer can
be passed as the input of the layer above. (This tutorial does not build a
multi-layer network, but this code will be reused in future tutorials that
do.)

Finally, we define a (symbolic) ``cost`` variable to minimize, using the
instance method ``classifier.negative_log_likelihood``.

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)
    :end-before: # compiling a Theano function that computes the mistakes

Note that ``x`` is an implicit symbolic input to the definition of ``cost``,
because the symbolic variables of ``classifier`` were defined in terms of
``x`` at initialization.

Learning the Model
++++++++++++++++++

To implement MSGD in most programming languages (C/C++, Matlab, Python), one
would start by manually deriving the expressions for the gradient of the loss
with respect to the parameters: in this case
:math:`\partial{\ell}/\partial{W}` and :math:`\partial{\ell}/\partial{b}`.
This can get pretty tricky for complex models, as expressions for
:math:`\partial{\ell}/\partial{\theta}` can get fairly complex, especially
when taking into account problems of numerical stability.

With Theano, this work is greatly simplified. It performs automatic
differentiation and applies certain math transforms to improve numerical
stability.

To get the gradients :math:`\partial{\ell}/\partial{W}` and
:math:`\partial{\ell}/\partial{b}` in Theano, simply do the following:

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: # compute the gradient of cost
    :end-before: # start-snippet-3

``g_W`` and ``g_b`` are symbolic variables, which can be used as part of a
computation graph. The function ``train_model``, which performs one step of
gradient descent, can then be defined as follows:

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: start-snippet-3
    :end-before: end-snippet-3

``updates`` is a list of pairs. In each pair, the first element is the
symbolic variable to be updated in the step, and the second element is the
symbolic expression for computing its new value. Similarly, ``givens`` is a
dictionary whose keys are symbolic variables and whose values specify their
replacements during the step. The function ``train_model`` is then defined
such that:

* the input is the mini-batch index ``index`` that, together with the batch
  size (which is not an input since it is fixed), defines :math:`x` with
  corresponding labels :math:`y`
* the return value is the cost/loss associated with the :math:`x`, :math:`y`
  defined by the ``index``
* on every function call, it will first replace ``x`` and ``y`` with the
  slices from the training set specified by ``index``. Then, it will evaluate
  the cost associated with that minibatch and apply the operations defined by
  the ``updates`` list.

Each time ``train_model(index)`` is called, it will thus compute and return
the cost of a minibatch, while also performing a step of MSGD. The entire
learning algorithm thus consists in looping over all examples in the dataset,
considering all the examples in one minibatch at a time, and repeatedly
calling the ``train_model`` function.
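Schematically, that loop is no more complicated than the following sketch.
Here ``n_epochs`` and ``n_train_batches`` are assumed to be the number of
passes over the training set and the number of minibatches it contains; the
actual script additionally runs validation and early stopping inside this
loop:

.. code-block:: python

    # n_epochs, n_train_batches and train_model are assumed to be defined
    # as in the snippets above; this is only a schematic outer loop.
    for epoch in range(n_epochs):
        for minibatch_index in range(n_train_batches):
            # one MSGD step: x and y are replaced with the minibatch
            # selected by the index, the cost is computed, and the
            # parameter updates are applied as a side effect
            minibatch_avg_cost = train_model(minibatch_index)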
Testing the model
+++++++++++++++++

As explained in :ref:`opt_learn_classifier`, when testing the model we are
interested in the number of misclassified examples (and not only in the
likelihood). The ``LogisticRegression`` class therefore has an extra instance
method, which builds the symbolic graph for retrieving the number of
misclassified examples in each minibatch.

The code is as follows:

.. literalinclude:: ../code/logistic_sgd.py
    :pyobject: LogisticRegression.errors

We then create a function ``test_model`` and a function ``validate_model``,
which we can call to retrieve this value. As you will see shortly,
``validate_model`` is key to our early-stopping implementation (see
:ref:`opt_early_stopping`). These functions take a minibatch index and
compute, for the examples in that minibatch, the number that were
misclassified by the model. The only difference between them is that
``test_model`` draws its minibatches from the testing set, while
``validate_model`` draws them from the validation set.

.. literalinclude:: ../code/logistic_sgd.py
    :start-after: cost = classifier.negative_log_likelihood(y)
    :end-before: # compute the gradient of cost

Putting it All Together
+++++++++++++++++++++++

The finished product is as follows.

.. literalinclude:: ../code/logistic_sgd.py

The user can learn to classify MNIST digits with SGD logistic regression by
typing, from within the DeepLearningTutorials folder:

.. code-block:: bash

    python code/logistic_sgd.py

The output one should expect is of the form:

.. code-block:: bash

    ...
    epoch 72, minibatch 83/83, validation error 7.510417 %
    epoch 72, minibatch 83/83, test error of best model 7.510417 %
    epoch 73, minibatch 83/83, validation error 7.500000 %
    epoch 73, minibatch 83/83, test error of best model 7.489583 %
    Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 %
    The code run for 74 epochs, with 1.936983 epochs/sec

On an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz, the code runs at
approximately 1.936 epochs/sec and it took 75 epochs to reach a test error of
7.489%. On the GPU the code does almost 10.0 epochs/sec. For this instance we
used a batch size of 600.

Prediction Using a Trained Model
++++++++++++++++++++++++++++++++

``sgd_optimization_mnist`` serializes and pickles the model each time a new
lowest validation error is reached. We can reload this model and predict the
labels of new data. The ``predict`` function shows an example of how this
could be done.

.. literalinclude:: ../code/logistic_sgd.py
    :pyobject: predict

.. rubric:: Footnotes

.. [#f1] For smaller datasets and simpler models, more sophisticated descent
         algorithms can be more effective. The sample code `logistic_cg.py `_
         demonstrates how to use SciPy's conjugate gradient solver with Theano
         on the logistic regression task.