The training strategy class
Roberto Lopez.
Artelnics - Making intelligent use of data

In this tutorial we formulate the learning problem for neural networks and describe some learning tasks that they can solve.

Contents:
  1. Basic theory.
  2. Software model.
  3. TrainingStrategy classes.

1. Basic theory

As we saw in the previous chapter, the performance functional has an associated performance function. The performance function of a neural network can be visualized as a hypersurface, with the parameters as coordinates. The following figure represents the performance function.
Geometrical representation of the performance function.

The minimum or maximum value of the performance functional is achieved for a vector of parameters at which the performance function takes on a minimum or maximum value. Therefore, the learning problem for neural networks, formulated as a variational problem, can be reduced to a function optimization problem.

In this sense, a variational formulation for neural networks provides a direct method for solving variational problems. The universal approximation properties of the multilayer perceptron make neural computation a very appropriate paradigm for solving these problems.

One-dimensional optimization

Although the performance function is multidimensional, one-dimensional optimization methods are of great importance. Indeed, one-dimensional optimization algorithms are very often used inside multidimensional optimization algorithms.

A function is said to have a relative or local minimum at some point if its value at that point is smaller than at all other points within some neighbourhood. Similarly, a point is called a relative or local maximum if the function value there is greater than at all other points within some neighbourhood.

The function is said to have a global or absolute minimum at some point if its value there is smaller than at every other point of the whole domain. Similarly, a point is a global maximum if the function value there is greater than at every other point of the domain. Finding a global optimum is, in general, a very difficult problem.

On the other hand, the tasks of maximization and minimization are trivially related to each other, since maximization of a function is equivalent to minimization of its negative, and vice versa.

In this regard, a one-dimensional optimization problem is one in which the argument which minimizes the performance function is to be found.

The necessary condition states that if the directional performance function has a relative optimum at some point, and the derivative exists there as a finite number, then that derivative must be zero. The condition for the optimum to be a minimum is that the second derivative is greater than zero, and vice versa.
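In symbols, if we denote the directional performance function by f(\eta), where \eta is the training rate (this notation is assumed here for illustration and does not come from the OpenNN code), these conditions read

f'(\eta^{*}) = 0, \qquad f''(\eta^{*}) > 0 \;\text{(minimum)}, \qquad f''(\eta^{*}) < 0 \;\text{(maximum)}.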

The most elementary approach for one-dimensional optimization problems is to use a fixed step size or training rate. More sophisticated algorithms which are widely used are the golden section method and Brent's method. Both of the latter algorithms begin by bracketing a minimum.

Fixed training rate

The most elementary approach for such a problem is to use a fixed step size, or training rate, and move from an initial guess point in a favorable direction (positive or negative). The step size used must be small in relation to the final accuracy desired. Although this method is very simple to implement, it is not efficient in many cases.

The golden section method brackets that minimum until the distance between the two outer points in the bracket is less than a defined tolerance.

Brent's method performs parabolic interpolation until the distance between the two outer points defining the parabola is less than a tolerance.
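As an illustration of the bracketing idea, the following is a minimal sketch of the golden section method written in plain C++. The function f, the initial bracket [a, b] and the tolerance are placeholders chosen for this example; none of the names correspond to actual OpenNN methods.

#include <cmath>
#include <functional>

// Minimal golden section search sketch: shrinks the bracket [a, b]
// around a minimum of f until its width falls below the tolerance.
double golden_section_minimize(const std::function<double(double)>& f,
                               double a, double b, double tolerance)
{
    const double ratio = (std::sqrt(5.0) - 1.0) / 2.0; // golden ratio minus one

    double x1 = b - ratio * (b - a);
    double x2 = a + ratio * (b - a);

    while (b - a > tolerance)
    {
        if (f(x1) < f(x2))
        {
            // Minimum lies in [a, x2]: discard the right part of the bracket.
            b = x2;
            x2 = x1;
            x1 = b - ratio * (b - a);
        }
        else
        {
            // Minimum lies in [x1, b]: discard the left part of the bracket.
            a = x1;
            x1 = x2;
            x2 = a + ratio * (b - a);
        }
    }

    return (a + b) / 2.0; // midpoint of the final bracket
}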

Multi-dimensional optimization

As shown earlier, the learning problem for neural networks reduces to searching for a parameter vector at which the performance function takes a maximum or a minimum value.

The concepts of relative or local and absolute or global optima for the multidimensional case apply in the same way as for the one-dimensional case. The tasks of maximization and minimization are also trivially related here.

The necessary condition states that if the neural network is at a minimum of the performance function, then the gradient is the zero vector.
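If the performance function is written as f(\zeta), where \zeta is the vector of parameters (notation assumed for illustration), this condition reads

\nabla f(\zeta^{*}) = 0.

Note that this is only a necessary condition: maxima and saddle points also have a zero gradient, so points satisfying it are in general called stationary points.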

The performance function is, in general, a nonlinear function of the parameters. As a consequence, it is not possible to find closed-form training algorithms for the minima. Instead, we consider a search through the parameter space consisting of a succession of steps, or epochs.

At each epoch, the performance is improved by adjusting the neural network parameters. The change of parameters between two epochs is called the parameters increment.

In this way, to train a neural network we start with some parameter vector (often chosen at random) and generate a sequence of parameter vectors, so that the performance function is reduced at each iteration of the algorithm. The change of performance between two epochs is called the performance improvement.

The training algorithm stops when a specified condition is satisfied. Some stopping criteria commonly used are listed below; a sketch of a training loop that checks a few of them appears after the list:
  1. The parameters increment norm is less than a minimum value.
  2. The performance improvement in one epoch is less than a set value.
  3. Performance has been minimized to a goal value.
  4. The norm of the performance function gradient falls below a goal.
  5. A maximum number of epochs is reached.
  6. A maximum amount of computing time has been exceeded.
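The sketch announced above is shown next. It is a minimal, self-contained C++ example of such an iterative training loop, using a fixed training rate and a toy quadratic performance function; all names and values are illustrative placeholders and do not correspond to actual OpenNN code.

#include <cmath>
#include <vector>

// Illustrative performance function and gradient: a simple quadratic bowl.
// In a real setting these would be evaluated on the neural network.
double performance(const std::vector<double>& p)
{
    return p[0] * p[0] + 10.0 * p[1] * p[1];
}

std::vector<double> gradient(const std::vector<double>& p)
{
    return { 2.0 * p[0], 20.0 * p[1] };
}

double norm(const std::vector<double>& v)
{
    double sum = 0.0;
    for (double x : v) sum += x * x;
    return std::sqrt(sum);
}

int main()
{
    std::vector<double> parameters = { 1.0, 1.0 };    // initial guess

    const double performance_goal = 1.0e-6;           // criterion 3
    const double gradient_norm_goal = 1.0e-6;         // criterion 4
    const int maximum_epochs_number = 1000;           // criterion 5
    const double training_rate = 0.05;                // fixed step size

    for (int epoch = 0; epoch < maximum_epochs_number; epoch++)
    {
        const std::vector<double> g = gradient(parameters);

        // Stop when the gradient norm or the performance falls below its goal.
        if (norm(g) < gradient_norm_goal) break;
        if (performance(parameters) < performance_goal) break;

        // Gradient descent step: move against the gradient.
        parameters[0] -= training_rate * g[0];
        parameters[1] -= training_rate * g[1];
    }

    return 0;
}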

A stopping criterion of a different nature is early stopping. This method is used in ill-posed problems in order to control the effective complexity of the neural network. Early stopping is a very common practice in neural networks and often produces good solutions to ill-posed problems.

The following figure is a state diagram of the training procedure, showing states and transitions in the training process of a neural network.
Training process.

The training process is determined by the way in which the adjustment of the parameters in the neural network takes place. There are many different training algorithms, which have a variety of different computation and storage requirements. Moreover, no single training algorithm is best suited to all situations.

Training algorithms might require information from the performance function only, the gradient vector of the performance function or the Hessian matrix of the performance function. These methods, in turn, can perform either global or local optimization.

Zero-order training algorithms make use of the performance function only. The most significant zero-order training algorithms are stochastic, which involve randomness in the optimization process. Examples of these are random search, evolutionary algorithms and particle swarm optimization, which are global optimization methods.

First-order training algorithms use the performance function and its gradient vector. Examples of these are gradient descent methods, conjugate gradient methods, scaled conjugate gradient methods or quasi-Newton methods. Gradient descent, conjugate gradient, scaled conjugate gradient and quasi-Newton methods are local optimization methods.
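All of these first-order methods share the same general update rule. Writing the parameter vector as \zeta, the training direction as d and the training rate as \eta (notation assumed for illustration),

\zeta^{(i+1)} = \zeta^{(i)} + \eta^{(i)} d^{(i)},

where gradient descent simply takes d^{(i)} = -\nabla f(\zeta^{(i)}), and the other methods construct the training direction from the gradient in more elaborate ways.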

Second-order training algorithms make use of the performance function, its gradient vector and its Hessian matrix. Examples for second-order methods are Newton's method and the Levenberg-Marquardt algorithm. Both of them are local optimization methods.

Training strategy

Most of the time, application of a single training algorithm is enough to properly train a neural network. The quasi-Newton method is in general a good choice, since it provides good training times and deals successfully with most performance functions. The Levenberg-Marquardt algorithm can also be recommended for small and medium-sized data modelling problems.

However, some applications might need more training effort. In those cases we can combine different algorithms in order to obtain the best results. In problems defined by mathematical models, with constraints and so on, a single training algorithm might fail.

Therefore, for difficult problems, we can try to use two or three different training algorithms. A general strategy consists of applying three different training algorithms:

1. Initialization training algorithm
The initialization training algorithm is used to bring the neural network to a stable region of the performance function. Near the optimum, the performance function usually behaves better than far away from it. Zero-order training algorithms, such as random search or the evolutionary algorithm, might be good for this initialization process, since they are very robust algorithms.
2. Main training algorithm
The main training algorithm does most of the job, and the training strategy relies on it. First-order training algorithms, such as the quasi-Newton method, are a good choice here.
3. Refinement training algorithm
The refinement training algorithm can be used when high accuracy is required. Second-order training algorithms, such as Newton's method, use the most exact information about the performance function, so they can perform better during refinement.

2. Software model

The OpenNN training strategy is composed of three algorithms: an initialization, a main and a refinement training algorithm. In this section we study the software model of the TrainingStrategy class.

Classes

In order to construct a software model for the training strategy, a few things need to be taken into account. As we have seen, a training strategy is composed of three different training algorithms. On the other hand, some training algorithms use one-dimensional optimization for finding the optimal training rate. Therefore, the most important classes in the training strategy are:

Training algorithm:
The class TrainingAlgorithm represents a single training algorithm.
Training rate algorithm:
The class TrainingRateAlgorithm represents the one-dimensional optimization algorithm for the training rate.
Training strategy:
The class TrainingStrategy represents a complete training strategy, and it is composed of initialization, main and refinement training algorithms.

Associations

The associations among the concepts described above are the following:

Training algorithm - Training rate algorithm:
A training algorithm might require a training rate algorithm during the optimization process.
Training strategy - Training algorithm:
A training strategy might be composed of different training algorithms.
The next figure shows the UML class diagram with the associations described above.
Associations of the training strategy classes.

Derived classes

The next task is then to establish which classes are abstract and to derive the necessary concrete classes to be added. Let us then examine the classes we have so far:

Training algorithm
The class TrainingAlgorithm is abstract, because it does not represent any concrete training algorithm for the performance function of a neural network. The concrete training algorithm classes included with OpenNN are RandomSearch, GradientDescent, NewtonMethod, ConjugateGradient, QuasiNewtonMethod and EvolutionaryAlgorithm.
Training rate algorithm
This class is concrete, because it implements several one-dimensional optimization methods.
Training strategy
The class TrainingStrategy is concrete, since it is composed of concrete initialization, main and refinement training algorithms.
The following figure shows the UML class diagram for the training strategy with some of the derived classes included.
Derived classes related to the training strategy.
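The following fragment sketches this design in simplified C++. It is not the actual OpenNN header, only an illustration of how an abstract base class, a concrete derived algorithm and the composite strategy relate to each other.

// Simplified illustration of the class design (not the actual OpenNN header).

class TrainingAlgorithm
{
public:
    virtual ~TrainingAlgorithm() {}

    // Each concrete algorithm implements its own optimization loop.
    virtual void perform_training() = 0;
};

class QuasiNewtonMethod : public TrainingAlgorithm
{
public:
    virtual void perform_training()
    {
        // The quasi-Newton optimization loop would go here.
    }
};

class TrainingStrategy
{
public:
    // The strategy is composed of initialization, main and refinement algorithms.
    TrainingAlgorithm* initialization_training_algorithm;
    TrainingAlgorithm* main_training_algorithm;
    TrainingAlgorithm* refinement_training_algorithm;
};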

Attributes and operations


Training algorithm
A training algorithm has the following attributes:
It performs the following operations:
Training rate algorithm
A training rate algorithm has the following attributes:
It performs the following operations:
Training strategy
A training strategy has the following attributes:
It performs the following operations:

3. TrainingStrategy classes

OpenNN includes the class TrainingStrategy to represent the concept of training strategy.

Members

The training strategy class contains:

All members are private, and must be accessed or modified by means of get and set methods, respectively.

Constructors

To construct a training strategy object associated with a performance functional object, we do the following:

TrainingStrategy ts(&pf);

where pf is some performance functional object.
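For context, a typical construction sequence might look as follows. The NeuralNetwork, DataSet and PerformanceFunctional constructors are sketched here under the assumption that they behave as in the rest of the OpenNN tutorials; check the actual headers before relying on them.

NeuralNetwork neural_network;                            // some neural network
DataSet data_set;                                        // some data set

PerformanceFunctional pf(&neural_network, &data_set);    // functional measured on both (assumed constructor)

TrainingStrategy ts(&pf);                                // training strategy associated to that functional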

Methods

The default training strategy consists of a main training algorithm of the quasi-Newton method type. The following statements set a different training strategy.

RandomSearch* rsp = new RandomSearch(&pf);
rsp->set_epochs_number(10);
ts.set_initialization_training_algorithm(rsp);

GradientDescent* gdp = new GradientDescent(&pf);
gdp->set_epochs_number(25);
ts.set_main_training_algorithm(gdp);
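If a refinement phase is also desired, a third algorithm can be attached in the same way. The constructor and setter below are assumed by analogy with the two calls above; check the NewtonMethod and TrainingStrategy headers for the exact names.

NewtonMethod* nmp = new NewtonMethod(&pf);   // assumed constructor, by analogy with the above
ts.set_refinement_training_algorithm(nmp);   // assumed setter name, by analogy with the above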

The most important method of a training strategy is called perform_training. It is used as follows:

ts.perform_training();

where ts is some training strategy object.

We can save the above object to an XML file:

ts.save("training_strategy.xml");


OpenNN Copyright © 2014 Roberto Lopez (Artelnics)