
106 7. Performance Issues (Supervised Learning)
where the complexity-ranked order is maintained, but training is not done on each
complete complexity subset. Instead, each complexity subset is further divided into
smaller random subsets. Training starts on an initial subset of a complexity subset,
which is incrementally grown during training. Finally, delta training strategies were
proposed [138]. With delta training, examples are ordered according to inter-example
distance, e.g. Hamming or Euclidean distance. Different strategies of example
presentation were investigated: smallest-difference examples first, largest-difference
examples first, and alternating differences.
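The delta orderings above can be sketched as follows. The function name and the greedy nearest/farthest-neighbour scheme are illustrative assumptions, using Euclidean distance (Hamming distance would be the analogue for binary patterns); the alternating-difference strategy is omitted for brevity:

```python
import numpy as np

def order_by_delta(X, largest_first=False):
    """Greedy ordering of training examples by inter-example Euclidean
    distance (a sketch of delta ordering). Starting from the first
    example, each next example is the remaining one closest to (or
    farthest from) the one presented last."""
    remaining = list(range(1, len(X)))
    order = [0]
    while remaining:
        dists = [np.linalg.norm(X[i] - X[order[-1]]) for i in remaining]
        pick = int(np.argmax(dists)) if largest_first else int(np.argmin(dists))
        order.append(remaining.pop(pick))
    return order

X = np.array([[0.0], [0.1], [0.9], [1.0]])
print(order_by_delta(X))                      # [0, 1, 2, 3]
print(order_by_delta(X, largest_first=True))  # [0, 3, 1, 2]
```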
When vast quantities of data are available, training on all of it can be prohibitively
slow, and the training set may have to be reduced. The problem is which of the data
should be selected for training. A simple strategy is to sample a smaller data set at
each epoch using a uniform random number generator. Alternatively, a fast clustering
algorithm can be used to group similar patterns together, and a number of patterns
can then be sampled from each cluster.
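Both reduction strategies can be sketched as below; the function names, the minimal k-means, and the sample sizes are illustrative assumptions, not a prescribed method:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_epoch_subset(X, y, n_samples):
    """Draw a uniform random subsample (without replacement) of the
    training set for one epoch."""
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx], y[idx]

def sample_per_cluster(X, y, n_clusters, per_cluster, n_iters=10):
    """Group similar patterns with a minimal k-means, then draw a few
    patterns from each cluster so every region is represented."""
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == k].mean(0) if np.any(labels == k)
                            else centers[k] for k in range(n_clusters)])
    chosen = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if len(members):
            chosen.extend(rng.choice(members, min(per_cluster, len(members)),
                                     replace=False))
    return X[chosen], y[chosen]

X = rng.normal(size=(100, 2))
y = np.arange(100)
X_sub, y_sub = sample_epoch_subset(X, y, n_samples=10)
X_cl, y_cl = sample_per_cluster(X, y, n_clusters=4, per_cluster=3)
```

The cluster-based variant trades extra clustering cost for better coverage: a purely uniform sample can, by chance, under-represent small but important regions of the input space.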
7.3.2 Weight Initialization
Gradient-based optimization methods, for example gradient descent, are very sensitive
to the initial weight vector. If the initial position is close to a local minimum, con-
vergence will be fast. However, if the initial weight vector lies on a flat area of the
error surface, convergence is slow. Furthermore, large initial weight values have been
shown to prematurely saturate units due to extreme output values with associated
zero derivatives [400]. In the case of optimization algorithms such as PSO and GAs,
weights should be initialized uniformly over the entire search space to ensure that all
parts of the search space are covered.
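Uniform initialization over the search space can be sketched as follows; the function name and the bounds are illustrative assumptions (in practice the bounds are problem-specific):

```python
import numpy as np

def init_population(n_individuals, dim, lower, upper, seed=1):
    """Initialize a PSO swarm or GA population uniformly over the
    hypercube [lower, upper]^dim, so that all parts of the search
    space are covered."""
    rng = np.random.default_rng(seed)
    return rng.uniform(lower, upper, size=(n_individuals, dim))

pop = init_population(20, 5, lower=-5.0, upper=5.0)
print(pop.shape)  # (20, 5)
```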
A sensible weight initialization strategy is to choose small random weights centered
around 0. This will cause net input signals to be close to zero. Activation functions
then output midrange values regardless of the values of input units. Hence, there is no
bias toward any solution. Wessels and Barnard [898] showed that random weights in
the range [−1/√fanin, 1/√fanin] are a good choice, where fanin is the number of
connections leading to a unit.
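This fanin-scaled initialization can be sketched as below; the function name is an illustrative assumption:

```python
import numpy as np

def init_layer(fan_in, fan_out, seed=0):
    """Sample a weight matrix uniformly from
    [-1/sqrt(fan_in), 1/sqrt(fan_in)], following the
    Wessels-Barnard heuristic."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W = init_layer(fan_in=25, fan_out=10)  # bound = 1/sqrt(25) = 0.2
print(abs(W).max() <= 0.2)  # True
```

Scaling by fanin keeps the expected magnitude of a unit's net input roughly constant as the number of incoming connections grows, so units start in the sensitive midrange of their activation functions.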
Why are weights not initialized to zero in the case of gradient-based optimization?
This strategy works only if the NN has just one hidden unit. With more than one
hidden unit, all the hidden units produce the same output, and thus make the same
contribution to the approximation error. All the weights are therefore adjusted by
the same amount. The weights remain identical irrespective of training time, and
hence no learning takes place. Initial weight values of zero will also cause PSO to
fail, since no velocity changes are made, and therefore no weight changes occur.
GAs, on the other hand, will work with initial zero weights if mutation is implemented.
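The symmetry problem above can be verified numerically. The sketch below assumes a 2-2-1 sigmoid network without biases, trained on one pattern under a sum-of-squares error; the variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 2-2-1 sigmoid network with all weights zero (illustration only).
W1 = np.zeros((2, 2))            # input -> hidden
W2 = np.zeros((2, 1))            # hidden -> output
x = np.array([[0.5, -1.0]])      # one training pattern
t = np.array([[1.0]])            # its target

h = sigmoid(x @ W1)              # both hidden units output 0.5
o = sigmoid(h @ W2)

# Backpropagated errors for a sum-of-squares error
delta_o = (o - t) * o * (1 - o)
delta_h = (delta_o @ W2.T) * h * (1 - h)

grad_W1 = x.T @ delta_h          # all zero: hidden weights never move
grad_W2 = h.T @ delta_o          # identical rows: units stay clones

print(np.all(grad_W1 == 0))                    # True
print(np.array_equal(grad_W2[0], grad_W2[1]))  # True
```

Because every hidden unit receives exactly the same update, gradient descent can never break the symmetry between them, which is what the small random initialization of the previous paragraphs avoids.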