
the testing set. Sometimes, the gallery is also called the target set, while the probe set is also called the query set, e.g., in FRGC [3].
When collecting biometric samples for the above-mentioned four datasets, several points should be considered carefully. First, samples in the gallery should never also appear in the probe set, since such overlap trivially produces correct matches. Secondly, whether the samples in the testing set may also be contained in the training set is task-dependent. Thirdly, whether the subjects in the training and testing sets overlap partially or completely is application-dependent. For instance, in the Face Recognition Technology (FERET) evaluation [4], some of the face images (and subjects) in the gallery and probe sets are also in the training set used for algorithm development. However, in FRVT [1], all the images (and subjects) in the testing set are kept confidential from the participants, which means the developers have to consider carefully how their algorithms generalize to unseen subjects. In contrast, the Lausanne Protocol based on the XM2VTS database [5] does not distinguish the training set from the gallery, i.e., the gallery is the same as the training set. Evidently, different protocols result in evaluations of different difficulty.
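To make these rules concrete, the following Python sketch (purely illustrative; the function and variable names are hypothetical and not part of any of the cited protocols) shows one way to build training, gallery, and probe sets so that no sample appears in both the gallery and the probe set, and so that the training subjects can optionally be kept disjoint from the testing subjects, roughly contrasting an FRVT-style setting with a FERET-style one.

    import random
    from collections import defaultdict

    def split_protocol(samples, subject_disjoint=True, seed=0):
        """Split (subject_id, sample_id) pairs into training, gallery, and probe sets.

        Enforces the rules discussed above: a sample never appears in both the
        gallery and the probe set, and (optionally) training subjects are kept
        disjoint from testing subjects.
        """
        rng = random.Random(seed)
        by_subject = defaultdict(list)
        for subject_id, sample_id in samples:
            by_subject[subject_id].append(sample_id)

        subjects = sorted(by_subject)
        rng.shuffle(subjects)

        if subject_disjoint:
            # FRVT-style: testing subjects are unseen during training.
            n_train = len(subjects) // 2
            train_subjects, test_subjects = subjects[:n_train], subjects[n_train:]
        else:
            # FERET-style: training subjects may reappear in the test sets.
            train_subjects, test_subjects = subjects, subjects

        training = [(s, x) for s in train_subjects for x in by_subject[s]]

        gallery, probe = [], []
        for s in test_subjects:
            enrolled, *queries = by_subject[s]   # first sample is enrolled,
            gallery.append((s, enrolled))        # the remaining ones are queried
            probe.extend((s, x) for x in queries)

        # Sanity check: no sample may sit in both gallery and probe.
        assert not set(gallery) & set(probe)
        return training, gallery, probe

With subject_disjoint=True the evaluated system never sees the testing subjects during training; with subject_disjoint=False the split resembles protocols in which training and testing subjects overlap.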
Difficulty Control
The goals of performance evaluation are manifold, such as comparing several algorithms to choose the best one, or determining whether a technology can meet the requirements of a specific application. It is therefore very important to control the difficulty of the evaluation: the evaluation itself should be neither too hard nor too easy. If the evaluation is too easy, all the technologies might achieve similarly near-perfect performance, and no statistically significant difference can be observed among the results. Similarly, if the evaluation is too challenging, all the systems may fail and perform poorly. Therefore, the difficulty of the evaluation must be controlled so that the performance of the participants can be clearly distinguished.
The difficulty of an evaluation protocol is mainly determined by how much the samples in the probe set vary from the registered ones. The greater the variation, the more difficult the evaluation. The sources of variation are manifold. Coarsely, they can be categorized into two classes: intrinsic variations and extrinsic variations. The former refers to changes in the biometric trait itself, while the latter arises from external factors, especially during the sensing procedure. For instance, in face recognition, variations in facial appearance due to expression and aging are intrinsic, while those due to lighting, viewpoint, camera differences, and partial occlusion are extrinsic (a toy illustration of such extrinsic perturbations is sketched after this paragraph). More recently, the Multiple Biometric Grand Challenge (MBGC) [6] has been organized to investigate, test, and improve the performance of face and iris recognition technology on both still and video imagery through a series of challenge problems, such as low resolution, off-angle images, and unconstrained face imaging conditions. In particular, for all biometrics, the time interval between the acquisition of the registered sample and of the unseen samples presented to a system is an important factor, because a different acquisition time implies both intrinsic and extrinsic variations. For an evaluation of academic algorithms, a reasonable distribution of all the possible variations in the testing set is desirable, while for an application-specific system evaluation it is better to include the variations most likely to appear in the practical application.
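As a toy illustration only (not part of any evaluation protocol discussed here), the sketch below perturbs a face image with two extrinsic factors, a global lighting change and a rectangular occlusion, of the kind one might introduce to make a probe set harder; intrinsic variations such as expression or aging cannot be simulated this simply. All parameter values are hypothetical.

    import numpy as np

    def add_extrinsic_variation(image, brightness=1.3, occlusion_frac=0.2, seed=0):
        """Apply simple extrinsic perturbations (lighting change and partial
        occlusion) to a face image given as a float array in [0, 1]."""
        rng = np.random.default_rng(seed)
        out = np.clip(image * brightness, 0.0, 1.0)   # global lighting change

        h, w = out.shape[:2]
        oh, ow = int(h * occlusion_frac), int(w * occlusion_frac)
        top = rng.integers(0, h - oh + 1)
        left = rng.integers(0, w - ow + 1)
        out[top:top + oh, left:left + ow] = 0.0       # rectangular occlusion
        return out

    # Example on a synthetic 64x64 grayscale "face"
    probe = add_extrinsic_variation(np.random.default_rng(1).random((64, 64)))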
The above-mentioned database structure also affects the difficulty of the evaluation. If the samples or the subjects in the testing set have been included in the training set, the evaluation becomes relatively easy. If all the testing samples and subjects are novel to the learned model or system, the overfitting problem might make the task more challenging. In the extreme case where the training set and the testing set are heterogeneous, the task becomes much more difficult, for instance, if the training set contains only biometric samples from Mongolian subjects while the testing samples come from Western subjects. Therefore, the structure of the database for evaluation should be carefully designed to tune the difficulty of the evaluation.
Another factor influencing the evaluation difficulty is the database size, i.e., the number of registered subjects in the database. This is especially important for identification and watch-list applications, since evidently the more subjects there are to recognize, the more challenging the problem becomes. Some observations and conclusions on this have been drawn in FRVT2002 [1].
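The effect of gallery size can be illustrated with a simple Monte-Carlo sketch. The model below is a toy assumption (Gaussian genuine and impostor scores with made-up parameters), not a result from FRVT2002: a probe is identified at rank 1 only if its genuine score exceeds the highest impostor score, so the estimated rank-1 rate drops as more subjects are enrolled.

    import random

    def rank1_rate(gallery_size, trials=1000, genuine_mean=1.0,
                   impostor_mean=0.0, sigma=0.5, seed=0):
        """Monte-Carlo estimate of the closed-set rank-1 identification rate
        under a toy score model: one Gaussian genuine score against the
        maximum of (gallery_size - 1) Gaussian impostor scores."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            genuine = rng.gauss(genuine_mean, sigma)
            impostors = (rng.gauss(impostor_mean, sigma)
                         for _ in range(gallery_size - 1))
            if genuine > max(impostors, default=float("-inf")):
                hits += 1
        return hits / trials

    for n in (10, 100, 1000, 10000):
        print(f"gallery size {n:>6}: estimated rank-1 rate = {rank1_rate(n):.3f}")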