
to prune a network parameter is based on some measure of parameter relevance
or significance. A relevance value is computed for each parameter, and a pruning
heuristic is used to decide whether that parameter is irrelevant.
A large initial architecture allows the network to converge reasonably
quickly, with less sensitivity to local minima and the initial network size. Larger
networks have more functional flexibility, and are guaranteed to learn the input-
output mapping with the desired degree of accuracy. Due to the larger functional
flexibility, pruning weights and units from a larger network may give rise to a
better fit of the underlying function, hence better generalization [604].
A more elaborate discussion of pruning techniques is given next, with the main objective
of giving a flavor of the techniques available for pruning NN architectures.
For more detailed discussions, the reader is referred to the given references. The first
results in the quest to solve the architecture optimization problem were derivations
of theoretical limits on the number of hidden units needed to solve a particular
problem [53, 158, 436, 751, 759]. However, these results are based on unrealistic as-
sumptions about the network and the problem to be solved. Also, they usually apply
to classification problems only. While these limits do improve our understanding of the
relationship between architecture and training set characteristics, they do not predict
the correct number of hidden units for a general class of problems.
More recent research has concentrated on the development of more efficient pruning techniques
to solve the architecture selection problem. Several different approaches to pruning
have been developed. This chapter groups these approaches into the following
general classes: intuitive methods, evolutionary methods, information matrix methods,
hypothesis testing methods and sensitivity analysis methods.
• Intuitive pruning techniques: Simple intuitive methods based on weight
values and unit activation values have been proposed by Hagiwara [342]. The
goodness factor $G_i^l$ of unit $i$ in layer $l$, $G_i^l = \sum_p \sum_j (w_{ji}^l o_i^l)^2$, where the first
sum is over all patterns, and $o_i^l$ is the output of the unit, assumes that an
important unit is one that excites frequently and has large weights to other
units. The consuming energy, $E_i^l = \sum_p \sum_j w_{ji}^l o_j^{l+1} o_i^l$, additionally assumes that
unit i excites the units in the next layer. Both methods suffer from the flaw that
when a unit’s output is more frequently 0 than 1, that unit might be considered
as being unimportant, while this is not necessarily the case. Magnitude-based
pruning assumes that small weights are irrelevant [342, 526]. However, small
weights may be of importance, especially compared to very large weights that
cause saturation in hidden and output units. Also, large weights (in terms of
their absolute value) may cancel each other out. A small illustrative sketch of
these relevance measures is given below.
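As a minimal illustration (not taken from the cited works), the following NumPy sketch computes the goodness factor, the consuming energy, and a magnitude-based pruning mask for a single layer. The function names, array shapes, and the pruning threshold are assumptions made purely for illustration.

import numpy as np

def goodness_factor(W, O):
    """Goodness factor G_i^l = sum_p sum_j (w_ji^l o_i^l)^2 of each unit i.

    W : (J, I) weights from the I units in layer l to the J units in layer l+1
    O : (P, I) outputs of layer l for P training patterns
    """
    contrib = O[:, None, :] * W[None, :, :]      # (P, J, I): w_ji * o_i per pattern
    return np.sum(contrib ** 2, axis=(0, 1))     # one relevance value per unit i

def consuming_energy(W, O_l, O_next):
    """Consuming energy E_i^l = sum_p sum_j w_ji^l o_j^(l+1) o_i^l of each unit i.

    O_l    : (P, I) outputs of layer l for P patterns
    O_next : (P, J) outputs of layer l+1 for the same patterns
    """
    return np.einsum('ji,pj,pi->i', W, O_next, O_l)

def magnitude_prune_mask(W, threshold=0.05):
    """Magnitude-based pruning: flag weights whose absolute value falls below
    an (arbitrarily chosen) threshold as candidates for removal."""
    return np.abs(W) < threshold

Units or weights flagged by these measures would then be removed and the smaller network retrained, which is the usual prune-and-retrain cycle.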
• Evolutionary pruning techniques: The use of genetic algorithms (GA) to
prune NNs provides a biologically plausible approach to pruning [494, 712, 901,
904]. Using GA terminology, the population consists of several pruned versions
of the original network, each of which needs to be trained. Differently pruned networks
are created by the application of mutation, reproduction and crossover operators.
These pruned networks “compete” for survival, being rewarded for using fewer
parameters and for improving generalization. GA NN pruning is thus a time-consuming
process; a compact sketch of the idea is given below.
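The following is a highly simplified sketch of this idea, not the algorithm of any of the cited works: individuals are binary masks over the weights (1 = keep, 0 = prune), fitness rewards validation performance and penalizes the number of retained weights, and one-point crossover plus bit-flip mutation create new masks. The evaluate_network callback, which would retrain and evaluate the pruned network, as well as all parameter settings, are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, evaluate_network, penalty=0.01):
    # Reward generalization, penalize the number of retained weights.
    # `evaluate_network` is an assumed callback that retrains the pruned
    # network and returns, e.g., its validation accuracy for the given mask.
    return evaluate_network(mask) - penalty * mask.sum()

def evolve_masks(n_weights, evaluate_network, pop_size=20, generations=30, p_mut=0.02):
    # Each individual is a binary pruning mask (1 = keep weight, 0 = prune).
    pop = rng.integers(0, 2, size=(pop_size, n_weights))
    for _ in range(generations):
        scores = np.array([fitness(ind, evaluate_network) for ind in pop])
        # Reproduction: the fitter half of the population survives as parents.
        parents = pop[np.argsort(scores)[::-1][:pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_weights)           # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child[rng.random(n_weights) < p_mut] ^= 1  # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, evaluate_network) for ind in pop])
    return pop[scores.argmax()]                        # best pruning mask found

In use, n_weights would equal the number of weights in the trained network, and the returned mask would zero out the pruned weights before a final retraining pass; the repeated training inside fitness is what makes the approach time-consuming.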