
approximation to µ(z) is found from the training set D_T, memorization is determined from D_V, and the generalization accuracy is estimated from the test set D_G (more about this later).
Since prior knowledge about Ω(D) is usually not known, a nonparametric regression approach is used by the NN learner to search through its hypothesis space H for a function f_NN(D_T, W) which gives a good estimate of the unknown function µ(z), where f_NN(D_T, W) ∈ H. For multilayer NNs, the hypothesis space consists of all functions realizable from the given network architecture as described by the weight vector W.
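To make the notion of a hypothesis concrete, the following minimal sketch (hypothetical Python/NumPy; the name f_nn, the layer sizes and the tanh activation are illustrative choices, not taken from the text) realizes one element of H for a single-hidden-layer architecture: once the weight vector W is fixed, the function f_NN is fixed.

import numpy as np

def f_nn(z, W):
    # One hypothesis in H: a single-hidden-layer network whose
    # function is fully determined by the weight vector W.
    W1, b1, W2, b2 = W                 # unpack the weight vector
    h = np.tanh(z @ W1 + b1)           # hidden layer activations
    return h @ W2 + b2                 # linear output layer

# Fixing W selects one function f_NN(., W) from the hypothesis space H.
rng = np.random.default_rng(0)
I, J, K = 3, 5, 2                      # input, hidden, output sizes (arbitrary)
W = (rng.normal(size=(I, J)), np.zeros(J),
     rng.normal(size=(J, K)), np.zeros(K))

z = rng.normal(size=I)                 # a single input pattern in R^I
print(f_nn(z, W))                      # the network's output in R^K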
During learning, the function f_NN : R^I → R^K is found which minimizes the empirical error

$$E_T(D_T; W) = \frac{1}{P_T} \sum_{p=1}^{P_T} \left( f_{NN}(z_p, W) - t_p \right)^2 \qquad (3.25)$$
where P_T is the total number of training patterns. The hope is that a small empirical (training) error will also give a small true error, or generalization error, defined as

$$E_G(\Omega; W) = \int \left( f_{NN}(z, W) - t \right)^2 \, d\Omega(z, t) \qquad (3.26)$$
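As a concrete reading of equation (3.25), the sketch below (hypothetical Python/NumPy; empirical_error and f_lin are illustrative names) computes E_T as the mean, over the P_T training patterns, of the squared error summed over the K outputs. The generalization error of equation (3.26) cannot be computed this way, since Ω is unknown; in practice it is estimated on the test set D_G.

import numpy as np

def empirical_error(f_nn, Z_train, T_train, W):
    # E_T(D_T; W) of equation (3.25): average squared error of f_NN
    # over the P_T training patterns, summed over the K outputs.
    P_T = len(Z_train)
    return sum(np.sum((f_nn(z, W) - t) ** 2)
               for z, t in zip(Z_train, T_train)) / P_T

# Tiny usage example with a linear "network" f_NN(z, W) = Wz
f_lin = lambda z, W: W @ z
Z = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
T = [np.array([1.0]), np.array([0.0])]
print(empirical_error(f_lin, Z, T, W=np.array([[0.5, 0.5]])))   # 0.25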
For the purpose of NN learning, the empirical error in equation (3.25) is referred
to as the objective function to be optimized by the optimization method. Several
optimization algorithms for training NNs have been developed [51, 57, 221]. These
algorithms are grouped into two classes:
• Local optimization, where the algorithm may get stuck in a local optimum without finding a global optimum. Gradient descent and scaled conjugate gradient are examples of local optimizers (see the sketch after this list).
• Global optimization, where the algorithm searches for the global optimum
by employing mechanisms to search larger parts of the search space. Global
optimizers include LeapFrog, simulated annealing, evolutionary algorithms and
swarm optimization.
Local and global optimization techniques can be combined to form hybrid training
algorithms.
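The following is a minimal sketch of gradient descent as a local optimizer (hypothetical Python/NumPy; the gradient function grad_E, the step size eta and the fixed number of steps are illustrative assumptions). It simply steps against the gradient of the objective function and may therefore settle in a local optimum of the error surface.

import numpy as np

def gradient_descent(grad_E, W, eta=0.1, n_steps=100):
    # Plain gradient descent: repeatedly step against the gradient of
    # the objective function. May converge to a local optimum only.
    for _ in range(n_steps):
        W = W - eta * grad_E(W)        # move downhill on the error surface
    return W

# Usage on a simple quadratic error surface with minimum at W = (1, -2)
grad = lambda W: 2.0 * (W - np.array([1.0, -2.0]))
print(gradient_descent(grad, W=np.zeros(2)))    # approaches [1, -2]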
Learning consists of adjusting weights until an acceptable empirical error has been reached. Two types of supervised learning algorithms exist, based on when weights are updated (a sketch of both update schemes follows the list below):
• Stochastic/online learning, where weights are adjusted after each pattern presentation. In this case the next input pattern is selected randomly from the training set, to prevent any bias introduced by the order in which patterns occur in the training set.
• Batch/offline learning, where weight changes are accumulated and used to
adjust weights only after all training patterns have been presented.
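The two update schemes can be sketched as follows (hypothetical Python/NumPy; grad_p is an assumed function returning the gradient of the per-pattern error, and eta an assumed learning rate). The stochastic version updates the weights after every randomly chosen pattern; the batch version accumulates the changes and applies them once per epoch.

import numpy as np

def stochastic_epoch(W, Z, T, grad_p, eta=0.01):
    # Stochastic/online learning: adjust W after every pattern,
    # visiting the training patterns in random order.
    for p in np.random.permutation(len(Z)):
        W = W - eta * grad_p(W, Z[p], T[p])
    return W

def batch_epoch(W, Z, T, grad_p, eta=0.01):
    # Batch/offline learning: accumulate the weight changes over all
    # training patterns and apply them once at the end of the epoch.
    total = sum(grad_p(W, z, t) for z, t in zip(Z, T))
    return W - eta * total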