
that the gesture accuracy improved 15-fold by adding tactile feedback when the
virtual knife hit the virtual flesh.
Applying such feedback to accessory-free gesture interfaces such as a virtual
reality CAVE (an environment in which the user is surrounded by back-projected
screens on all four sides, ceiling, and floor) with camera-based tracking of body
parts will be challenging. Tactile (or haptic) feedback is often given through
physical contact with accessory devices via vibrations (Caldwell et al., 1999;
Langdon et al., 2000). However, device-free tactile feedback could also be
provided through subsonic sound waves (Müller-Tomfelde & Steiner, 2001). Using
this approach could enable use in applications where physical and wired devices
would be unnatural.
A common solution is to use visual and audio feedback, but these are not tactile.
Another option is to investigate directed low-frequency audio waves, which can be
perceived as tactile feedback.
Spatial versus Temporal Perceptive Relation and Precision
A popular solution in multimodal interface studies is the complementary usage
of speech and gesture. These modalities complement each other well because
vision relates mainly to spatial perception, while sound relates mainly to tempo-
ral perception. For example, an experiment was conducted where test subjects
saw a dot blinking once on the monitor while hearing two clicks within a certain
time frame (Vroomen & Gelder, 2004). The result was that the test subjects
perceived two blinks and even three blinks when the sound clicked thrice.
This demonstrated a clear complementary merging of senses with a dominant
audio cue for temporal cognition. But humans are much better at establishing
distance and direction from visual cues than from auditory cues (Loomis et al.,
1998).
When you design a system that combines visual and audio detection and synthesis,
synchronization problems can arise because the response times of the two channels
may differ. A physically accurate detection and synthesis model tends to degrade
response time. A long response time can cause ambiguity and errors between the
modalities, while good synchronization resolves ambiguity and minimizes errors.
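As a rough illustration of one way to handle this, a fusion layer can timestamp each recognized event at the moment the sensor input was captured (not when recognition finished) and merge events from different modalities only if their capture times agree within a tolerance window. This is a minimal sketch under assumed names and an assumed 100-ms window; none of these values come from the text.

```python
from dataclasses import dataclass

@dataclass
class ModalEvent:
    modality: str       # e.g., "speech" or "vision" (illustrative labels)
    payload: object     # recognized phrase or tracked position
    captured_at: float  # time the sensor input occurred, not when recognition finished

# Assumed tolerance: events farther apart than this are treated as ambiguous.
SYNC_WINDOW = 0.1  # seconds

def can_fuse(a: ModalEvent, b: ModalEvent) -> bool:
    """Fuse two modal events only if their capture times agree within the window."""
    return abs(a.captured_at - b.captured_at) <= SYNC_WINDOW
```

The key design choice is timestamping at capture rather than at delivery, so that a slow recognizer does not shift the event in time relative to the faster modality.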
Consider a more elaborate example: a virtual-reality application in which you
can pick up items for various tasks. A voice recognition system reacts to the
phrase “Pick up” to trigger this action, and the application uses your hand position
to identify the virtual item. Suppose you move your hand quickly because you
have a lot of work to do. The visual recognition system lags behind, which causes
a 0.5-second discrepancy between your actual hand position and where the
application thinks it is. Furthermore, the speech recognition might have a
1.5-second lag. If you say “Pick up” when your real hand is on top of the virtual
object, by the time the command is processed your hand has moved on from the
item you wanted, and you may pick up another item or none at all. If it were a
menu item, you might have chosen Delete everything instead of Save.
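One way to avoid this failure is to buffer timestamped hand positions and resolve the spoken command against the hand position at the time the phrase was uttered, rather than at the time recognition completed. The sketch below assumes the 1.5-second speech lag from the example; the buffer size, function names, and sampling rate are hypothetical.

```python
import bisect
import time
from collections import deque

SPEECH_LAG = 1.5             # assumed recognizer latency from the example, in seconds
history = deque(maxlen=600)  # (timestamp, (x, y, z)) hand samples, ~10 s at 60 Hz

def record_hand(position):
    """Called by the tracker for every new hand sample."""
    history.append((time.monotonic(), position))

def hand_position_at(t):
    """Return the buffered hand position closest to time t, or None if empty."""
    if not history:
        return None
    times = [ts for ts, _ in history]
    i = bisect.bisect_left(times, t)
    nearest = min(
        (j for j in (i - 1, i) if 0 <= j < len(times)),
        key=lambda j: abs(times[j] - t),
    )
    return history[nearest][1]

def on_speech_recognized(phrase):
    """Resolve 'Pick up' against where the hand was when the phrase was spoken."""
    if phrase == "Pick up":
        spoken_at = time.monotonic() - SPEECH_LAG
        target = hand_position_at(spoken_at)
        # ...look up which virtual item was under `target` at spoken_at...
        return target
```

Rewinding to the utterance time does not remove the lag itself, but it makes the two modalities refer to the same moment, which resolves the ambiguity described above.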