Numerical
Backpropagation Methods
The aim of training is to minimize the sum of squares error energy function
-

where
-

We need to define iterations andsweeps.
An iteration involves a single training datum to the system. A sweep covers
the presentation of an entire block of training data. In most training
pratices, multiple sweeps are used and the training patterns are repeatedly
presented in a cyclic manner. Assume for convenience that the number of
input units, the number of hidden units, and the number of output units
are equal to N. For the two-layer case, the conventional gradient
descent BP requires
per iteration. For three layer networks, the conventional BP
method requires
operations per iteration.
The approximation formulation can be viewed as a numerical optimization
problem. The numerical method adopted has a direct effect on the performance
of the BP algorithm. The numerical performance
of the BP method depends on the following
key factors.
-
Frequency of update: Data adaptive vs. Block Adaptive:
-
The choice is between block adaptive methods (one update per block)
and
data adaptive methods (one update per datum). The data-adaptive
methods update the weights at each iteration, as opposed to the block
methods, which execute the update upon the completion of each sweep.
The backpropagation learning algorithm can
be executed in either mode. The two approaches have exhibited significantly
different numerical performances. Block adaptive methods are known
to be more robust since the training step averaged over all the training
patterns. In contrast, data adaptive methods can be appealing for
some on-line adaptation applications. They are more sensitive to the noise
effect on individual patterns. Block methods are recommended for
applications where real time learning is not necessary.
-
Direction of Update: First order vs. Second order
-
For both the first order and the second order methods, the BP
algorithm's sole role is as an effective tool for computing the gradients.
Second order gradient are numerically sensitive to compute. Therefore,
data adaptive methods are usually based on first order gradients. Block
adaptive methods, being numerically more robust, often adopt a second
order technique. The second order methods in general deliver superior numerical
performance, although such a comparison is neither conclusive nor universally
valid.
Data adaptive methods
Without loss of generality, let us focus on a single output neural network
with the model function
. The output
response for each pattern
is denoted
by
-

We assume that there are M pairs of input and target training patterns
.
The mth input training pattern is denoted
,
where
denotes the number of input
dimensions. The least squares error criterion
becomes
-

For data adaptive updating, it is natural to adopt the first order gradient
descent direction. The weights change from the present point toward the
direction of the local basin of the energy function.
-

Moreover, by a first order Taylor series expansion around
,
-

which is expected to closely approach the teacher value
,
that is
-

Substituting
into this equation
yields
which, in turn, leads to the following estimate of an (idealistic) learning
rate
-

Practically speaking, however, the rate must be further scaled down by
another factor
:
-

Block Adaptive Methods
The block adaptive methods update the weights after the entire
block of training data is presented. In general, data adaptive
methods are more aggressive and block adaptive methods tend
to be more prudent.
For the block adaptive method, the energy functions becomes
-

where
is the number of output nodes,
and M is the block size, that is, the number of training data in
one block. The first order (block) gradient method is a modified version
of
:
-

For convenience, we use a (long) vector
to denote the set of all weights in all layers.
The effectiveness of any learning algorithm lies in the selection of
an optimal update direction. The direction of update depends on wether
a first order or second order gradient method is adopted. In block
adaptive methods, it could be usful to involve a second order momentum
term in determinig the direction of update. It often involves a direct
or indirect computation of Hessian matrix and its subsequent inversion.
Simulation and other numerical studies suggest that the second order update
direction seems to be more effective than its first order counterpart.
In the following, major second order techniques are introduced and their
relationships highlighted.
Example
Contents
Artificial Neural Networks
About this Tutorial