
The choice of the similarity measure is important since it influences the output of
the clustering algorithm. It should be adapted to the question of interest. Figure 9.6
shows two different clustering results using two different distances.
Two classes of clustering algorithms are commonly distinguished, namely, the
hierarchical and partitioning methods (Jain and Dubes 1988; Mirkin 1996). In con-
trast to partitioning methods that try to find the “best” partition given a fixed num-
ber of clusters, hierarchical methods calculate a full series of partitions starting from
n clusters, each of which contains one single data point, and ending with one cluster
that contains all n data points (or vice versa); in each step of the procedure, two clus-
ters are merged according to a pre-specified rule. In the following we describe the
classical hierarchical algorithm and commonly used partitioning methods (self-orga-
nizing maps and K-means).
Not surprisingly, there are many other cluster algorithms used in the context of
gene expression profiling, e. g., algorithms based on graph theoretic concepts (CAST
[Ben-Dor et al. 1999], CLICK [Sharan and Shamir 2000], gene shaving [Hastie et al.
2000], and algorithms that are used in supervised classification approaches such as
support vector machines [Brown et al. 1999]).
9.3.1
Hierarchical Clustering
Let x
1
,…,x
n
be the p-dimensional data points (expression profiles of n gene repre-
sentatives across the p experiments). The process of hierarchical algorithms requires
a dissimilarity measure, d, between pairs of clusters (related to a dissimilarity mea-
sure,
~
d, between pairs of data points) and an update procedure for recalculation of
the merged clusters. It then has the following scheme:
1. For v = n start with the finest possible partition.
2. Calculate a new partition by joining two clusters that minimize d.
3. Update the distances of the remaining clusters and the joined cluster.
4. Stop if v = 1, i.e., all data points are in one cluster; otherwise, repeat steps 1–3. (9-13)
Several cluster dissimilarity measures are in use:
single linkage d C
n
k
; C
n
l
min
x
i
2C
n
k
;x
j
2C
n
l
~
d x
i
; x
j
; (9-14)
complete linkage d C
n
k
; C
n
l
max
x
i
2C
n
k
;x
j
2C
n
l
~
d x
i
; x
j
; (9-15)
311
9.3 Clustering Algorithms
Fig. 9.6 Influence of similarity measure on
clustering. Two dendrograms of a subgroup of
genes using the microarray expression data of
Ross et al. (2000) were generated using hierarch-
ical clustering with Euclidean distance (a) and
Pearson correlation (b) as pairwise similarity
measure. Although all other parameters are kept
constant, results show differences in both gene
and cancer cell line groupings. Clustering was
performed with the J-Express Pro software pack-
age (Molmine, Bergen, Norway).
3