The frequency difference limen for steady-state formants was estimated many years ago [J. L. Flanagan, J. Acoust. Soc. Am. 27, 613–617 (1955)] and more recently [D. Kewley-Port and C. S. Watson, J. Acoust. Soc. Am. 95, 485–496 (1994)]. It was not generally realized, however, that the perceptual tolerance of transitions is somewhat greater [W. A. Ainsworth, Proc. 13th Int. Congr. Phon. Sci., Stockholm, pp. 837–840 (1995)], at least for vowel–vowel pairs. There appears to be a perceptual continuum from simple sounds (tone glides) to more complex, speechlike sounds [A. van Wieringen and L. Pols, J. Acoust. Soc. Am. 98, 1304–1312 (1995)]. In the present series of experiments the perceptual tolerance of the center of F1 and F2 vowel–vowel transitions was estimated. It was found that this tolerance is much reduced if the transition is not completed. The tolerance is also reduced if the center transition frequencies of both F1 and F2 are varied simultaneously.
A modified form of the Fourier transform, the reassigned Fourier transform (RAFT), uses phase as well as magnitude information. Its properties are shown to be superior for both pure-tone and speech signal analysis. The RAFT technique has the potential to deliver higher resolution spectrograms than the fast Fourier transform (FFT) for a given signal analysis window. The RAFT reassigns energy with respect to both time and frequency, so that the reassignments track the time–frequency fluctuations of the sampled signal; the FFT, by contrast, places all of the energy of an analysis window at the center of its time–frequency cell. In particular, the improved spectral resolution of the RAFT for a given signal analysis window size is compared with that of the FFT for a fundamental speech processing application: pitch tracking. The results indicate that use of the RAFT instead of the FFT allows shorter signal analysis windows to be used. The pitch tracking performance of the RAFT with short analysis windows (e.g., 25 and 12 ms) is better than that of the FFT. [Work supported by the UK’s EPSRC.]
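A minimal sketch of time–frequency reassignment in the style of Auger and Flandrin is given below, only to illustrate the idea of moving each bin's energy to its local centre of gravity. It is not the authors' RAFT implementation; the function names, window choice, and parameters are illustrative, and the signs follow the STFT convention used in the helper function.

```python
# Sketch of a reassigned spectrogram via auxiliary-window STFTs (illustrative only).
import numpy as np

def _stft(x, w, hop):
    """Plain STFT: windowed frames followed by an rFFT (no scaling).
    Assumes len(x) >= len(w)."""
    n = len(w)
    starts = np.arange(0, len(x) - n + 1, hop)
    frames = np.stack([x[s:s + n] * w for s in starts])
    return np.fft.rfft(frames, axis=1), starts

def reassigned_spectrogram(x, fs, n_win=256, hop=64):
    h = np.hanning(n_win)                               # analysis window
    tau = (np.arange(n_win) - (n_win - 1) / 2) / fs     # time axis local to the window centre
    th = tau * h                                        # time-weighted window
    dh = np.gradient(h) * fs                            # approximate derivative window

    X_h, starts = _stft(x, h, hop)
    X_th, _ = _stft(x, th, hop)
    X_dh, _ = _stft(x, dh, hop)

    t = (starts + (n_win - 1) / 2) / fs                 # frame-centre times (s)
    f = np.fft.rfftfreq(n_win, 1.0 / fs)                # bin-centre frequencies (Hz)

    # Ratios of the auxiliary STFTs give the offset from each bin centre to the
    # local centre of gravity; the eps guard only avoids division by zero.
    eps = np.finfo(float).eps
    inv = X_h.conj() / (np.abs(X_h) ** 2 + eps)
    t_hat = t[:, None] + np.real(X_th * inv)                 # reassigned times
    f_hat = f[None, :] - np.imag(X_dh * inv) / (2 * np.pi)   # reassigned frequencies
    S = np.abs(X_h) ** 2                                     # energy to be relocated
    return t_hat, f_hat, S
```

Accumulating S into a two-dimensional histogram over (t_hat, f_hat) then yields a reassigned spectrogram that can be compared directly with the conventional |X_h|² spectrogram computed from the same analysis window.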
The auditory system, in tandem with higher cognitive centers of the brain, performs a remarkable job of converting physical pressure variation into a sequence of meaningful elements comprising spoken language. The spectra of nonvocalic sounds, such as stop consonants, affricates, and fricatives, differ from those of vowels in a number of ways that are potentially important for how these sounds are encoded in the periphery of the auditory pathway. Within the "standard" theory of speech perception, the auditory system's role is viewed primarily as encoding the spectrum, time frame by time frame. Yet far more is involved in decoding the speech signal than merely computing a conventional frequency analysis. Segmentation, for example, is a topic rarely discussed in audition, yet it is of profound importance for speech processing. The speech signal has been viewed more through the eyes of the vocal tract than those of the ear.
A model that is able to predict human performance in a simultaneous glide recognition task is described. The model combines a primitive, F0-guided segregation stage and a schema-driven stage with a heuristic that models whether listeners perceive one or two simultaneous sounds.

Introduction
Previous studies [1,2,3] suggest that human listeners use simple cues, such as signal harmonicity, speaker location, or segmental onset and offset, to aid the segregation of simultaneous sounds. These cues are called ‘primitive’ grouping cues because they can be applied without prior knowledge. The only heuristic is that segments in an ‘auditory scene’ that share the same features are likely to have been produced by the same speaker. In addition to this primitive segregation process, human listeners use high-level knowledge, schemata, to deal with mixtures of sounds [1]. One of the most intensively studied primitive grouping cues is harmonicity. Figure 1 shows human performance for a recognition task involving simultaneous vowels: each panel shows the percentage of vowel pairs that listeners correctly recognise. The stimuli were pairs of the French long vowels /#,G,K,Q,W,[/. One of the vowels always had a fundamental frequency (F0) of 100 Hz; the fundamental frequency of the second vowel is plotted along the x-axis. The only primitive segregation cue is therefore the vowel fundamental frequency. The three panels show subject performance for vowels of 200 ms, 100 ms, and 50 ms duration. For signals of at least 100 ms duration, subject performance improves significantly as the frequency difference between the vowels increases; if the signals are only 50 ms long, no improvement in performance is seen. The perceptual data are surprising given the dynamic nature of speech, in which stationary segments of more than 100 ms duration are very rare. Another important feature of the data is that listeners are able to recognise both constituents of a pair in around 65% of all cases, independent of signal duration and without any grouping cues.

[Figure 1. Human performance: percentage of vowel pairs correctly recognised (40–90%) as a function of the second vowel's F0 (100, 106, 112, and 126 Hz), plotted in separate panels per vowel duration.]
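Where the F0 cue enters such a model can be shown with a toy sketch of a primitive, F0-guided stage: estimate the dominant fundamental of the mixture, cancel its harmonics, and pass both the mixture and the residual to a recogniser. This is only an illustration in the spirit of harmonic cancellation, not the model described above (there is no schema stage and no one-versus-two-sounds heuristic), and the helper names, template format, and distance measure are invented for the example.

```python
# Toy F0-guided segregation of a double-vowel mixture (illustrative only).
import numpy as np

def estimate_f0(x, fs, fmin=80.0, fmax=300.0):
    """Autocorrelation estimate of the dominant F0 in the mixture."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

def cancel_harmonics(x, fs, f0):
    """Comb filter y[n] = x[n] - x[n - T]; suppresses harmonics of f0."""
    T = int(round(fs / f0))
    y = np.copy(x)
    y[T:] -= x[:-T]
    return y

def recognise(x, fs, templates):
    """Nearest-template match on a coarse log-magnitude spectrum (placeholder).
    `templates` is assumed to map vowel labels to 200-bin log spectra."""
    spec = np.log(np.abs(np.fft.rfft(x * np.hanning(len(x)), 2048)) + 1e-6)
    spec = spec[:200]                         # keep the low-frequency envelope region
    scores = {v: -np.linalg.norm(spec - t) for v, t in templates.items()}
    return max(scores, key=scores.get)

def segregate_pair(mixture, fs, templates):
    f0_a = estimate_f0(mixture, fs)                    # dominant F0 (primitive cue)
    residual = cancel_harmonics(mixture, fs, f0_a)     # remove the dominant voice
    vowel_b = recognise(residual, fs, templates)       # the "other" vowel
    vowel_a = recognise(mixture, fs, templates)        # dominant vowel, crudely
    return vowel_a, vowel_b
```

Here the templates would be log-magnitude spectral envelopes computed from clean vowel tokens; in the full model this crude nearest-template step is replaced by the schema-driven stage and the single-versus-two-sounds heuristic.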
High-resolution spectral estimation is important for speech analysis. The conventional Fourier transform technique cannot offer sufficient resolution for speech signals of short duration. The reassignment method has recently been developed to improve the resolution of the Fourier spectrum. The basic idea of the reassignment method is to assign the value of the Fourier spectrum to the center of gravity of the analysis region rather than to its geometric center. However, the characteristics of the reassigned Fourier spectrum have not been fully understood. This paper presents a detailed investigation of the resolution capability of the reassigned Fourier spectrum. The factors affecting resolution, such as frequency separation and relative amplitude, are studied, and the minimal frequency separation that can be resolved in the reassigned Fourier spectrum is determined. The results show that the reassigned Fourier spectrum has better resolution capability than the Fourier spectrum. Furthermore, the study shows that the frequency estimator for closely spaced sinusoids is biased; the bias is due to the interaction of the two sinusoids under the reassignment operation. An expression for the bias is derived, and it shows that the bias can be ignored for most speech signals.
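As a point of reference, the center-of-gravity reassignment can be written in the form standard in the reassignment literature (following Auger and Flandrin); the notation and sign conventions below follow the short-time Fourier transform definition in the first line and are not necessarily those used in the paper.

```latex
% Standard reassignment operators (notation assumed, not taken from the paper).
\[
  X_h(t,\omega) \;=\; \int x(\tau)\, h(\tau - t)\, e^{-j\omega\tau}\, d\tau
  \;=\; |X_h(t,\omega)|\, e^{\,j\Phi(t,\omega)} ,
\]
\[
  \hat{t}(t,\omega) \;=\; -\,\frac{\partial \Phi(t,\omega)}{\partial \omega},
  \qquad
  \hat{\omega}(t,\omega) \;=\; \omega + \frac{\partial \Phi(t,\omega)}{\partial t},
\]
% or, equivalently, via STFTs taken with the auxiliary windows
% (Th)(tau) = tau * h(tau) and (Dh)(tau) = dh/dtau:
\[
  \hat{t} \;=\; t + \Re\!\left\{ \frac{X_{\mathcal{T}h}(t,\omega)}{X_h(t,\omega)} \right\},
  \qquad
  \hat{\omega} \;=\; \omega - \Im\!\left\{ \frac{X_{\mathcal{D}h}(t,\omega)}{X_h(t,\omega)} \right\}.
\]
```

The spectral value |X_h(t, omega)|^2 computed at the geometric center (t, omega) of the analysis cell is thus relocated to the center of gravity (t-hat, omega-hat) of the signal energy within that cell.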
The fundamental process of auditory scene analysis is the organization of the elementary acoustic features in a complex auditory scene into meaningful auditory streams. Two important issues need to be addressed in modeling auditory scene analysis: the first concerns the representation of elementary acoustic features, while the second concerns the binding mechanism. This paper presents a neural model for auditory scene analysis in which a two-dimensional amplitude modulation (AM) map is used to represent elementary acoustic features and the synchronization of neural oscillators is adopted as the binding mechanism. The AM map captures the modulation frequencies of sound signals filtered by an auditory filterbank. Since these modulation frequencies are F0-related features for voiced speech, F0-based segregation can be used to group the auditory streams. The grouping of F0-related features is realized through the synchronization of nonlinear neural oscillators: each oscillator is associated with a certain modulation frequency, and a set of oscillators synchronizes only when their associated frequencies are harmonically related. The proposed model is tested on synthetic double-vowel identification, and the results are in accordance with psychophysical data.
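As a rough illustration of the feature side of such a model, the sketch below computes a simplified two-dimensional AM map: a bank of band-pass filters stands in for the auditory filterbank (a gammatone bank would be closer to the model), each channel's envelope is extracted, and the modulation spectrum of the envelope supplies the F0-related feature. The filter order, channel spacing, and parameter names are assumptions for the example, and the oscillator network itself is omitted.

```python
# Simplified 2-D AM map: channels x modulation frequencies (illustrative only).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def am_map(x, fs, n_channels=16, f_lo=100.0, f_hi=4000.0, mod_max=400.0):
    """Assumes fs is well above 2 * f_hi and x is a short voiced segment."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)     # log-spaced channel edges
    mod_freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    keep = mod_freqs <= mod_max                          # modulation range of interest
    amap = np.zeros((n_channels, keep.sum()))
    for c in range(n_channels):
        sos = butter(4, [edges[c], edges[c + 1]], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)                       # channel signal
        env = np.abs(hilbert(band))                      # channel envelope
        env -= env.mean()                                # remove DC before the modulation FFT
        amap[c] = np.abs(np.fft.rfft(env))[keep]         # modulation spectrum of the envelope
    return amap, mod_freqs[keep], edges
```

In the full model, channels whose AM-map peaks are harmonically related would drive their associated oscillators into synchrony and so be grouped into the same auditory stream.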