The validity indices above are suitable for hard clustering. Validity indices
have also been developed for fuzzy clustering; the interested reader is referred
to Halkidi et al. [2001] for more information.
3.1.5 Determining the Number of Clusters
Most clustering algorithms require the number of clusters to be specified in advance
[Lee and Antonsson 2000; Hamerly and Elkan 2003]. Finding the "optimum" number
of clusters in a data set is usually a challenge since it requires a priori knowledge
and/or ground truth about the data, which are not always available. The problem of
finding the optimum number of clusters in a data set has been the subject of several
research efforts [Halkidi et al. 2001; Theodoridis and Koutroubas 1999]. However,
despite the amount of research in this area, the outcome is still unsatisfactory
[Rosenberger and Chehdi 2000]. Many approaches to dynamically finding the number
of clusters in a data set have been proposed in the literature. In this section,
several dynamic clustering approaches are presented and discussed.
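
One reason the number of clusters cannot simply be left to the clustering
objective is that the within-cluster sum of squared errors never increases as
more clusters are allowed, and reaches zero when every pattern is its own
cluster. The following is a minimal sketch of this effect, not a method from the
literature cited above. It assumes one-dimensional data, so that the best
partition for each number of clusters can be found by exhaustive search over
contiguous groups of the sorted values. The names sse, best_sse, data and errors
are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def sse(points):
    """Sum of squared distances of the points to their mean."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)


def best_sse(data, k):
    """Smallest total SSE over all ways of cutting the sorted 1-D data
    into k contiguous, non-empty clusters (exhaustive search)."""
    xs = sorted(data)
    best = float("inf")
    # Choose k - 1 cut positions between consecutive sorted values.
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        total = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best


# Illustrative data (an assumption): three well-separated groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9]

# Best achievable error for every possible number of clusters.
errors = {k: best_sse(data, k) for k in range(1, len(data) + 1)}

for k, e in errors.items():
    print(k, round(e, 3))
```

Because the error keeps shrinking as the number of clusters grows, choosing the
number of clusters that minimizes it would always select one cluster per
pattern. Some additional criterion, such as a validity index or a user-specified
threshold, is therefore needed.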
ISODATA (Iterative Self-Organizing Data Analysis Technique), proposed by
Ball and Hall [1967], is an enhancement of the K-means algorithm (K-means is
sometimes referred to as basic ISODATA [Turi 2001]). ISODATA is an iterative
procedure that assigns each pattern to its closest centroid (as in K-means). However,
ISODATA has the ability to merge two clusters if the distance between their centroids
is below a user-specified threshold. Furthermore, ISODATA can split elongated
clusters into two clusters based on another user-specified threshold. Hence, a major
advantage of ISODATA compared to K-means is the ability to determine the number
of clusters in a data set. However, ISODATA requires the user to specify the values of