182 CH 10 MODELING RESPONSE UNCERTAINTY
10.4.2 Earth Model Selection by Clustering
10.4.2.1 Introduction
Clustering techniques are well-known tools in computer science as well as in many other
areas of science, including Earth Sciences. Two forms of clustering are known: supervised
and unsupervised. We will primarily deal with the latter. In unsupervised clustering, the
aim is to divide a set of “objects” into mutually exclusive classes based on information
available about each object. What is unknown is the number of classes and which features or
attributes of the objects should be used to make such a division. For example, 100 bottles
of wine (objects) are on the table, but the label of each bottle is hidden from the expert.
A wine expert can group these bottles for example by grape variety or region of origin
(attributes) simply by tasting the wine. The better the expert, the more refined the group-
ing will be and the more classes may exist. The decision on what attributes to use is
therefore an important aspect of the clustering exercise and the topic of considerable re-
search in computer science known as “pattern recognition.” The combination of (grape
variety, origin) is an example of a pattern. In our application, an Earth model is such an
“object” and the aim is to group these Earth models into various classes with the idea that
each class of models has a similar response, without the need for evaluating responses
on each model. If this can be achieved successfully, then a single model from each cluster
or group can be selected for response evaluation. Similarly, the wine expert can take one
bottle out of each group to represent the variety of wines on the table without needing to
select all 100 bottles of wine. The label revealed for each bottle is the equivalent of our
response function.
The number of classes, clusters or groups can be decided based on either (a) how many
response evaluations are affordable (a CPU issue) or (b) how many response evaluations
are needed to obtain a realistic assessment of uncertainty on the response (an accuracy
issue). This question is addressed later; we first address the question of how to cluster
without evaluating responses on each Earth model, which would defeat the purpose of
clustering itself.
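The idea of grouping models and then evaluating the response on only one representative
per group can be illustrated with a minimal sketch. It is not the method of this chapter:
it assumes that some inexpensive measure of dissimilarity between models is available,
and it uses k-medoids, a technique chosen here purely for illustration because it
returns actual models as cluster centers. The function names, the representation of a
model as a flat list of numbers and the toy data are all hypothetical.

```python
# Sketch only: k-medoids is a swapped-in illustration, not the book's algorithm.
# Hypothetical assumption: each Earth model is flattened to a list of numbers.


def euclidean(a, b):
    """Cheap stand-in dissimilarity between two models (an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def pairwise_distances(models, dist=euclidean):
    """Symmetric matrix of dissimilarities between all pairs of models."""
    n = len(models)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(models[i], models[j])
    return d


def assign(d, medoids):
    """Label each model with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]])
            for i in range(len(d))]


def select_representatives(d, k, max_iter=100):
    """Return (medoids, labels): k representative model indices and cluster labels."""
    n = len(d)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of models")
    # Deterministic start: the most central model, then repeatedly the model
    # farthest from all medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(d[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(d[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(d, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda m: sum(d[m][j] for j in members))
                           if members else old)
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(d, medoids)


# Toy stand-ins for nine "Earth models" forming three obvious groups.
models = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
          [10.0, 0.0], [10.1, 0.0], [10.0, 0.1]]
medoids, labels = select_representatives(pairwise_distances(models), k=3)
print("models to evaluate:", sorted(medoids))
print("cluster labels:", labels)
```

Here k plays the role of the number of affordable response evaluations: only the k
models returned as medoids would be passed to the (expensive) response function, and
each would stand in for the other members of its cluster.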
Most clustering techniques in the computer science literature (e.g., k-means clustering,
tree methods, etc.) rely mathematically on the definition of a distance. Indeed, a
distance defines how
similar each object is to any other object (recall our puzzle-piece analogy of similarity),
allowing objects to be grouped. However, the response function itself cannot be used to
define such a distance, since that would again require evaluating the response on every
model; what is needed is a distance that is relatively easy and rapid to determine. A
wine expert could use the difference in wine color, color intensity, smell and difference
in coating (or formation of “legs”) on the glass as a way to distinguish wines without
even tasting them (or worse, looking at the label). Similarly, for Earth models, the
definition of a meaningful distance will make clustering effective and efficient. The elegance
here lies in the fact that only a single distance definition is required, without necessarily
having to specify attributes or features by which to sort the models; the distance must,
however, be rapid to evaluate. For this purpose, standard distances can be used, such
as the Euclidean distance, Manhattan distance or Hausdorff distance, or surrogate/proxy