Audio-visual speech recognition systems can be divided into those that integrate audio-visual features before a decision is made (feature fusion) and those that combine the decisions of separate recognisers for each modality (decision fusion).
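To make the distinction concrete, the sketch below contrasts the two fusion strategies on synthetic data. It is an illustrative assumption, not a detail of the work described above: the feature dimensions, the logistic-regression recognisers, and the weighted-product combination rule are all placeholders chosen for brevity.

```python
# Illustrative sketch (not from the abstract above): feature fusion vs. decision
# fusion for a toy two-class audio-visual recognition task. All dimensions,
# classifiers, and weights are assumptions made for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)                      # two hypothetical word classes
audio = labels[:, None] + rng.normal(0, 1.0, (n, 12))    # noisy "audio" features
video = labels[:, None] + rng.normal(0, 2.0, (n, 6))     # noisier "visual" features

# Feature fusion: concatenate modality features, then train a single recogniser.
fused = np.hstack([audio, video])
feature_fusion_clf = LogisticRegression(max_iter=1000).fit(fused, labels)

# Decision fusion: train one recogniser per modality, then combine their
# posterior probabilities (here with an assumed reliability weighting).
audio_clf = LogisticRegression(max_iter=1000).fit(audio, labels)
video_clf = LogisticRegression(max_iter=1000).fit(video, labels)

def decision_fusion(a_feats, v_feats, audio_weight=0.7):
    """Weighted product of per-modality posteriors (weighting is an assumption)."""
    p_a = audio_clf.predict_proba(a_feats)
    p_v = video_clf.predict_proba(v_feats)
    combined = (p_a ** audio_weight) * (p_v ** (1.0 - audio_weight))
    return combined.argmax(axis=1)

print("feature fusion: ", feature_fusion_clf.predict(fused)[:5])
print("decision fusion:", decision_fusion(audio, video)[:5])
```

In practice the weighting in the decision-fusion step is often adapted to the estimated reliability of each modality (for example, down-weighting audio in noise), which is one reason the two architectures can behave quite differently.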
A commonly held view of auditory scene analysis is that complex auditory environments are segregated into separate perceptual streams, using primitive cues, that can be attended to separately. We argue that this view is inconsistent with the majority of perceptual data reported in the literature and propose an alternative model: a primary, low-resolution signal representation is used in a passive pattern-matching stage, augmented by secondary, high-resolution representations that can be used in an active pattern-matching stage to formulate hypotheses about the auditory scene.
The perception of a synthesized nasal /m/ changes to /n/ as the frequency of the second formant of a preceding vowel is increased when there are no transitions between the vowel and nasal, but /m/ is heard consistently when 20-ms transitions are introduced. A possible explanation is provided by auditory scene analysis: formants of the vowel and nasal that are contiguous and close in frequency may be grouped, using principles of similarity and good continuation, into a single perceptual stream. Further experiments found a robust change in percept for the /n/ prototype as well as for /m/. Transitions from the vowel that conflict with the formant structure of the consonant cause a similar change in percept for both nasal prototypes. The proximity of the second formant of the vowel to a formant in the nasal prototype is therefore unnecessary for the change in percept to occur; the presence of the vowel formant near one of two target frequencies at the boundary with the nasal seems to be sufficient to determine the nasal percept, and it takes precedence over the structure of the nasal prototype. Thus, these results do not show strong evidence for auditory scene analysis applied to formants.
Virtual reality (VR) can create safe, cost-effective, and engaging learning environments. It is commonly assumed that improvements in simulation fidelity lead to better learning outcomes. Some aspects of real environments, for example vestibular or haptic cues, are difficult to recreate in VR, but VR offers a wealth of opportunities to provide additional sensory cues in arbitrary modalities that carry task-relevant information. The aim of this study was to investigate whether such cues improve user experience and learning outcomes and, specifically, whether learning with augmented sensory cues translates into performance improvements in real environments. Participants were randomly allocated to three matched groups: Group 1 (control) was asked to perform a real tyre change only. The remaining two groups were trained in VR before performance was evaluated on the same real tyre-change task. Group 2 was trained using a conventional VR system, while Group 3 was trained in VR with augmented, task-relevant, multisensory cues. Objective performance (time to completion and number of errors) and subjective ratings of presence, perceived workload, and discomfort were recorded. The results show that both VR training paradigms improved performance on the real task. Providing additional, task-relevant cues during VR training resulted in higher objective performance on the real task. We propose a novel method to quantify the relative performance gains between training paradigms, expressing the relative gain in terms of training time. Systematic differences in subjective ratings were also observed: workload ratings were comparable, while presence ratings were higher and discomfort ratings lower, mirroring the objective performance measures. These findings further support the use of augmented multisensory cues in VR environments as an efficient method to enhance performance, user experience and, critically, the transfer of training from virtual to real environments.