
verification and identification. The former needs to answer "Is he who he says he is?", while the latter answers "Who is he?". According to whether the unidentified end-user is enrolled in the system, identification is further categorized into two types: ▶ closed-set identification and ▶ open-set identification. A typical verification application is access control, while closed-set identification can be applied to mug shot retrieval, for instance. In the surveillance scenario, open-set identification is also named "watch list," which aims at answering "Is he one of the persons of interest?", generally in real time. For example, face recognition, gait recognition, and speaker recognition can be applied for this purpose in a surveillance scenario since they can work in a non-intrusive mode.
Performance evaluations can also be categorized into three different types: algorithm evaluation, scenario evaluation, and operational evaluation, as described in the evaluation protocols part of this essay.
Performance Measures
Evidently, different tasks call for distinct performance measures. For verification, the receiver operating characteristic (ROC) curve is generally used to show the trade-off between two error rates: the false reject rate (FRR) versus the false accept rate (FAR). Sometimes, the equal error rate (EER) point on the ROC, where FRR is equal to FAR, is used as a single measurement. As for identification, the identification rate, rank-k identification rate, or cumulative match characteristic (CMC) curve is often used to compare different techniques. The reader is referred to the ▶ performance measures entry for more details.
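As a concrete illustration, the following sketch (using hypothetical genuine and impostor match-score lists, not the output of any particular system) shows how FAR and FRR can be computed at each decision threshold and how an approximate EER point can be located by sweeping the threshold over the score range.

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """Error rates at one decision threshold (higher score = better match)."""
    frr = np.mean(np.asarray(genuine_scores) < threshold)    # false reject rate
    far = np.mean(np.asarray(impostor_scores) >= threshold)  # false accept rate
    return far, frr

def roc_and_eer(genuine_scores, impostor_scores, n_points=1000):
    """Sweep thresholds to trace the ROC and locate the approximate EER point."""
    all_scores = np.concatenate([genuine_scores, impostor_scores])
    thresholds = np.linspace(all_scores.min(), all_scores.max(), n_points)
    roc = [far_frr(genuine_scores, impostor_scores, t) for t in thresholds]
    eer_idx = int(np.argmin([abs(far - frr) for far, frr in roc]))
    eer = (roc[eer_idx][0] + roc[eer_idx][1]) / 2.0  # where FAR and FRR are closest
    return roc, eer

# Toy usage with synthetic match scores (illustrative only)
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 1000)   # scores from genuine comparisons
impostor = rng.normal(0.4, 0.1, 1000)  # scores from impostor comparisons
_, eer = roc_and_eer(genuine, impostor)
print(f"approximate EER: {eer:.3f}")
```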
For watch list applications, the task is, in some sense, the verification of the rank-1 identification result. Hence, performance can be measured by the identification rate at a certain predefined FAR, say 0.1%. This measure is exploited by the Face Recognition Vendor Test (FRVT) [1].
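A minimal sketch of this measure, under the assumption that each probe is compared against every gallery template and yields a similarity score (the function and argument names below are hypothetical): the acceptance threshold is first set on probes of unenrolled subjects so that the false accept rate matches the target, and the reported rate is then the fraction of enrolled probes whose rank-1 match is both correct and above that threshold.

```python
import numpy as np

def watchlist_rate(enrolled_scores, enrolled_true_idx, impostor_scores,
                   target_far=0.001):
    """Detection-and-identification rate at a fixed FAR (hypothetical helper).

    enrolled_scores:   (n_probes_on_list, n_gallery) similarity matrix for probes
                       whose identity IS on the watch list
    enrolled_true_idx: correct gallery index for each of those probes
    impostor_scores:   (n_probes_off_list, n_gallery) similarities for probes
                       whose identity is NOT on the watch list
    """
    # Set the alarm threshold so that the fraction of off-list probes whose best
    # match still exceeds it equals the target FAR (e.g., 0.1%).
    impostor_best = impostor_scores.max(axis=1)
    threshold = np.quantile(impostor_best, 1.0 - target_far)

    # A hit requires the rank-1 match to be the correct identity AND to pass the
    # threshold -- i.e., "verification of the rank-1 identification".
    best_idx = enrolled_scores.argmax(axis=1)
    best_score = enrolled_scores.max(axis=1)
    hits = (best_idx == np.asarray(enrolled_true_idx)) & (best_score >= threshold)
    return hits.mean()
```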
Besides the accuracy measures mentioned above, there are also some performance criteria measuring the usability of biometric systems, such as the ▶ failure to acquire rate, ▶ enrollment time, ▶ response time, ▶ throughput, and ▶ scalability. The reader is referred to the corresponding definitional entries for their descriptions.
Datasets
The abovementioned performance measures are generally obtained by testing biometric systems on some databases. Evidently, the performance of a system or algorithm depends not only on its capability but also on the characteristics of the database. So, it is worth noting here that a pure recognition accuracy, say 100%, means nothing if the database is not described clearly. The following factors about the validation database must be considered carefully when a performance evaluation is conducted.
First, several distinct datasets used in an evaluation need to be distinguished: the training dataset, the validation dataset, the gallery, and the probe dataset. Among them, the training set is used for learning the biometric models and designing the recognition algorithms, including the feature extractor (e.g., principal component analysis and discriminant analysis) and the classifier. The validation set is used to tune the parameters of the learned models or algorithms, for instance, the dimension of the feature vector or some empirical thresholds. In some literature, the training set and validation set are combined and commonly referred to together as the training set.
For instance, in FVC2006 [2], a subset of fingerprint impressions acquired with various sensors was provided to registered participants to allow them to adjust the parameters of their algorithms. The gallery here means the dataset containing all the registered biometric traits of all the enrolled users in the system; that is, the templates for each enrolled user are extracted from this dataset. Note that the gallery is often taken as part of (or the same as) the training set by many researchers. This is mostly acceptable; however, in many applications, each enrolled subject may register only one biometric sample, which implies that it is impossible to train a feature extractor (e.g., Fisher discriminant analysis) or a classifier. In this case, a separate training set is necessary. The probe dataset contains testing biometric samples that need to be recognized by matching against the templates in the gallery.
Note that, for the identification task, all the subjects in the probe set can be registered subjects (i.e., with at least one template), while for verification and watch list applications, part of the subjects in the probe set should be unregistered subjects, who serve as impostors to estimate the false accept rate. In the literature, the gallery and probe set are together called the test set.