
may be ample opportunity for an attacker, especially if
the system microphone is situated in a public area, to
plan and carry out a surreptitious recording of the pass-
phrase, uttered by the client, and to replay the recorded
client passphrase fraudulently in order to be authorized
by the system.
In contrast, text-independent voice authentication
systems [2] will authenticate a client – and reject an
impostor – irrespective of any particular utterance
used during enrolment. Client enrolment for text-
independent systems invariably takes longer than en-
rolment for a text-dependent system and usually
involves a judiciously designed enrolment text, which
contains all, or at least most, of the speech sounds of
the language. This will ensure that the client models,
which are constructed from the enrolment speech data,
will represent to the largest extent possible the idiosyncrasies
of the client when an arbitrary sentence or
other utterance is provided for authentication later.
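The idea of an enrolment text that contains all, or at least most, of the speech sounds of the language can be pictured as a coverage check. The following is a minimal sketch only: the function name `phoneme_coverage`, the tiny pronunciation lexicon and the eight-symbol inventory are hypothetical and chosen for illustration, not taken from any real system or phone set.

```python
# Toy data (assumed for illustration): a tiny phoneme inventory and lexicon.
TOY_INVENTORY = {"t", "uw", "f", "ao", "r", "s", "ih", "k"}
TOY_LEXICON = {
    "two": ("t", "uw"),
    "four": ("f", "ao", "r"),
    "six": ("s", "ih", "k", "s"),
}


def phoneme_coverage(enrolment_words, lexicon, inventory):
    """Return (fraction of the inventory covered, set of missing phonemes).

    Words that are absent from the lexicon contribute nothing.
    """
    if not inventory:
        raise ValueError("inventory must not be empty")
    covered = set()
    for word in enrolment_words:
        covered.update(lexicon.get(word, ()))
    covered &= inventory  # ignore symbols outside the inventory
    return len(covered) / len(inventory), inventory - covered


if __name__ == "__main__":
    ratio, missing = phoneme_coverage(["two", "four"], TOY_LEXICON, TOY_INVENTORY)
    print(ratio, sorted(missing))
```

With this toy data, an enrolment text consisting of "two" and "four" leaves the sounds of "six" uncovered, whereas adding "six" covers the whole inventory. A real enrolment text would be checked against the full sound inventory of the language.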
Text-independent protocols offer the advantage that
authentication can be carried out without the need
for a particular passphrase, for example, as part of an
ordinary interaction between a client and a customer-
service agent or automated call centre agent, as shown
in this fictitious dialog:
Client phones XYZ Bank.
Agent: Good morning, this is XYZ Bank. How can I help you?
Client: I would like to enquire about my account balance.
Agent: What is your account number?
Client: It’s 123-4567-89
Agent: Good morning, Ms Applegate, the balance of your account number 123-4567-89 is $765.43. Is there anything else...?
The example shows a system that combines speech
recognition with voice authentication. The speech recognizer
understands what the customer wants to know
and recognizes the account number, while the authentication
system uses the text-independent protocol to
ascertain the identity of the client from the first two
responses the client gives over the telephone. These
responses would not normally have been encountered
by the system during enrolment, but the coverage of the
different speech sounds during enrolment would be
sufficient for the authentication system to verify the
client from the new phrases. The text-independent pro-
tocol offers an attacker the opportunity to record any
client utterances either in the context of the client
using the authentication system or elsewhere, and to
replay the recorded client speech in order to fraudulently
achieve authentication by the system.
A more secure variant of the text-independent pro-
tocol is the text-prompted protocol [3]. Enrolment
under this protocol is similar to the text-independent
protocol in that it aims to achieve a comprehensive
coverage of the different possible speech sounds of
a client so that later on any utterance can be used
for client authentication. However, during authenti-
cation the text-prompted protocol asks the user to
say a specific, randomly chosen phrase, for example,
by prompting the user “please say the number sequence
‘two-four-six’”. When the client repeats the
prompted text, the system uses automatic speech rec-
ognition to verify that the client has spoken the correct
phrase. At the same time it verifies the client’s voice by
means of the text-independent voice authentication
paradigm. The text-prompted protocol makes a replay
attack more difficult because an attacker would
be unlikely to have all possible prompted texts from
the client recorded in advance. However, such
an attack would still be feasible for an attacker with
a digital playback device that could construct the
prompted text at the press of a button. For example,
an attacker who has managed surreptitiously to record
the ten digits “zero” to “nine” from a client – either on
a single occasion or on several separate occasions –
could store those recorded digits on a notebook computer
and then combine them into any prompted digit
sequence by simply pressing buttons on the computer.
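The text-prompted protocol and the digit-splicing attack described above can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a working authentication system: audio is replaced by a hypothetical `Utterance` record holding the identity a voice verifier would assign and the transcript a speech recognizer would return, and the names `make_prompt`, `authenticate` and `splice_attack` are invented for illustration.

```python
import secrets
from dataclasses import dataclass

DIGITS = ("zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine")


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for audio (assumption: both engines are error-free)."""
    voice: str    # identity a voice verifier would assign to the audio
    words: tuple  # transcript a speech recognizer would return


def make_prompt(length=3):
    """Draw a random digit sequence for the user to repeat."""
    return tuple(secrets.choice(DIGITS) for _ in range(length))


def authenticate(claimed_client, prompt, utterance):
    """Accept only if the prompted text was spoken AND the voice matches."""
    text_ok = utterance.words == prompt           # speech recognition check
    voice_ok = utterance.voice == claimed_client  # text-independent voice check
    return text_ok and voice_ok


def splice_attack(recorded_digits, prompt):
    """Concatenate per-digit recordings of the client into the prompted text."""
    words = tuple(w for digit in prompt for w in recorded_digits[digit].words)
    return Utterance(voice=recorded_digits[prompt[0]].voice, words=words)


if __name__ == "__main__":
    prompt = ("two", "four", "six")
    # A replayed recording of some other phrase fails the text check.
    old_recording = Utterance("applegate", ("one", "two", "three"))
    print(authenticate("applegate", prompt, old_recording))  # False
    # Spliced single-digit recordings satisfy both checks.
    recorded = {d: Utterance("applegate", (d,)) for d in DIGITS}
    print(authenticate("applegate", prompt, splice_attack(recorded, prompt)))  # True
```

In this simplified model the random prompt defeats the replay of a fixed recording, but it offers no protection once the attacker holds a recording of every digit, because nothing in either check establishes that the speech is live.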
Synthesis Attack
Even a text-prompted authentication system is vulner-
able to an attacker who uses a text-to-speech (TTS)
synthesizer. A TTS system allows a user to input any
desired text, for example, by means of a computer
keyboard, and to have that text rendered automatically
into a spoken utterance and output through a loud-
speaker or another analog or digital output channel.
The basic principle is that an attacker would program a
TTS synthesizer in such a way that it produces speech
patterns similar to those of the target speaker. If that is achieved,
the attacker would only need to type the text that is
required or prompted by the authentication system in
order for the TTS synthesizer to play the equivalent
synthetic utterance to the authentication system in the
Liveness Assurance in Voice Authentication