
statistical models are active shape and active appear-
ance models [13].
The combination of appearance- and shape-based
visual features has also been exploited to improve
the performance of the recognition system, since they
contain low- and high-level information, respectively,
about the person's lip movements. Appearance- and
shape-based features are usually simply concatenated,
or a single joint model of face shape and appearance is
created [13]. The dynamics of the changes of visual
features are usually captured by augmenting the visual
feature vector with its first- and second-order time
derivatives, computed over a short temporal window
centered at the current video frame.
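The derivative augmentation described above can be sketched as follows; this is a minimal numpy sketch in which the window length and the regression-style delta formula (common in speech processing, e.g., HTK-style deltas) are illustrative assumptions, not details prescribed by the text:

```python
import numpy as np

def delta(features, window=2):
    """First-order time derivatives of a (T, D) feature matrix,
    estimated by a linear regression over a short temporal window
    centered at each frame (HTK-style delta computation)."""
    T, D = features.shape
    # Repeat edge frames so every frame has a full temporal context.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    d = np.zeros_like(features, dtype=float)
    for n in range(1, window + 1):
        d += n * (padded[window + n : window + n + T]
                  - padded[window - n : window - n + T])
    return d / denom

# Hypothetical visual features: 100 frames of 10-dimensional vectors.
feats = np.random.randn(100, 10)
d1 = delta(feats)        # first-order derivatives (deltas)
d2 = delta(d1)           # second-order derivatives (delta-deltas)
augmented = np.hstack([feats, d1, d2])  # (100, 30) augmented vector
```

Concatenating the static features with their first- and second-order derivatives triples the feature dimension, which is why the temporal window is kept short in practice.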
Finally, in the classification phase, a number of
classifiers can be used to model prior knowledge of
how the visual features are generated by each speaker.
These classifiers are usually statistical in nature and
utilize ANNs, SVMs, GMMs, HMMs, etc. The parameters
of the prior model are estimated during training.
During testing, the posterior probability is maximized
based on the trained model, and the identification/
verification decision is made.
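For the GMM case mentioned above, the train-then-maximize procedure can be illustrated with a small scikit-learn sketch; the speaker names, feature dimensions, and synthetic data are hypothetical, and equal speaker priors are assumed so that maximizing the posterior reduces to picking the highest likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical training data: one (frames x dims) feature matrix per
# speaker, drawn from speaker-specific distributions for illustration.
train = {
    "alice": rng.normal(loc=0.0, scale=1.0, size=(500, 6)),
    "bob":   rng.normal(loc=3.0, scale=1.0, size=(500, 6)),
}

# Training: estimate the parameters of one GMM prior model per speaker.
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(X)
          for spk, X in train.items()}

def identify(test_feats):
    """Score the test features against every speaker model and return
    the speaker whose model gives the highest average log-likelihood
    (the maximum-posterior decision under equal speaker priors)."""
    scores = {spk: m.score(test_feats) for spk, m in models.items()}
    return max(scores, key=scores.get)

probe = rng.normal(loc=3.0, scale=1.0, size=(200, 6))
claimed = identify(probe)  # identification decision for the probe
```

A verification decision would instead compare the claimed speaker's score against a background model's score and threshold the difference.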
Summary
Lip movement recognition and audio–visual-dynamic
speaker recognition are speaker recognition technologies
that are user-friendly, low-cost, and resilient to
spoofing. There are many biometric applications, such
as sports venue entrance checks, desktop access, and
building access, in which it is very important to use
unobtrusive methods for extracting biometric features,
thus enabling natural person recognition and reducing
inconvenience. The low cost of audio and video
biometric sensors and the ease of acquiring audio and
video signals (even without assistance from the client)
make biometric technology more socially acceptable
and accelerate its integration into everyday life.
Related Entries
▶ Face Recognition
▶ Face Tracking
▶ Multibiometrics
▶ Multibiometrics and Data Fusion
▶ Multimodal Systems
▶ Speaker Matching
▶ Session Effects on Speaker Modeling
▶ Speaker Recognition, Overview
▶ Speech Analysis
▶ Speech Production
▶ Spoofing
References
1. Chen, T., Rao, R.R.: Audio-visual integration in multimodal
communication. Proc. IEEE 86(5), 837–852 (1998)
2. Aleksic, P.S., Potamianos, G., Katsaggelos, A.K.: Exploiting visual
information in automatic speech processing. In: Bovik, A.L.
(ed.) Handbook of Image and Video Processing. Academic,
London (2005)
3. Aleksic, P.S., Katsaggelos, A.K.: Speech-to-video synthesis using
MPEG-4 compliant visual features. IEEE Trans. Circuits Syst.
Video Technol., Special Issue on Audio and Video Analysis for
Multimedia Interactive Services, pp. 682–692 (2004)
4. Summerfield, A.Q.: Some preliminaries to a comprehensive
account of audio-visual speech perception. In: Campbell, R.,
Dodd, B. (eds.) Hearing by Eye: The Psychology of Lip-Reading,
pp. 3–51. Lawrence Erlbaum, London, United Kingdom (1987)
5. Aleksic, P.S., Katsaggelos, A.K.: Audio-visual biometrics. Proc.
IEEE 94(11), 2025–2044 (2006)
6. Chaudhari, U.V., Ramaswamy, G.N., Potamianos, G., Neti, C.:
Audio-visual speaker recognition using time-varying stream re-
liability prediction. IEEE Proc. Int. Conf. Acoustics Speech Sig-
nal Process. (Hong Kong, China) 5, V-712–15 (2003)
7. Hjelmas, E., Low, B.K.: Face detection: A survey. Comput. Vis.
Image Understand. 83(3), 236–274 (2001)
8. Hennecke, M.E., Stork, D.G., Prasad, K.V.: Visionary speech:
Looking ahead to practical speechreading systems. In: Stork,
D.G., Hennecke, M.E. (eds.) Speechreading by Humans and
Machines, pp. 331–349. Springer, Berlin (1996)
9. Aleksic, P.S., Katsaggelos, A.K.: Comparison of low- and high-
level visual features for audio-visual continuous automatic
speech recognition. IEEE Proc. Int. Conf. Acoustics Speech
Signal Process. (Montreal, Canada) 5, 917–920 (2004)
10. Potamianos, G., Graf, H.P., Cosatto, E.: An image transform
approach for HMM based automatic lipreading. Paper presented
at the Proceedings of the International Conference on Image
Processing, vol. 1, pp. 173–177. Chicago, IL, 4–7 Oct. 1998
11. Wark, T., Sridharan, S., Chandran, V.: Robust speaker verifica-
tion via fusion of speech and lip modalities. Proc. Int. Conf.
Acoustics Speech Signal Process. Phoenix 6, 3061–3064 (1999)
12. Aleksic, P.S., Katsaggelos, A.K.: An audio-visual person identifi-
cation and verification system using FAPs as visual features.
Paper presented at the Proceedings of Works. Multimedia User
Authentication, pp. 80–84. Santa Barbara, CA (2003)
13. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance mod-
els. Paper presented at the Proceedings of European Conference
on Computer Vision, pp. 484–498. Freiburg, Germany (1998)