Existing objective speech-intelligibility measures are suitable for several types of degradation; however, they turn out to be less appropriate in cases where noisy speech is processed by a time-frequency weighting. To this end, an extensive evaluation is presented of objective measures for intelligibility prediction of noisy speech processed with a technique called ideal time-frequency (TF) segregation. In total, 17 measures are evaluated, including four advanced speech-intelligibility measures (CSII, CSTI, NSEC, DAU), the advanced speech-quality measure PESQ, and several frame-based measures (e.g., SSNR). Furthermore, several additional measures are proposed. The study comprises a total of 168 different TF weightings, including unprocessed noisy speech. Of all measures, the proposed frame-based measure MCC gives the best results (ρ = 0.93). An additional experiment shows that the measures that perform well in this study also correlate strongly with the intelligibility of single-channel noise-reduced speech.
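A concrete instance of ideal TF segregation is the ideal binary mask, which keeps only those TF units whose local SNR exceeds a criterion. The following minimal sketch assumes per-unit clean-speech and noise powers are known (which is what makes the mask "ideal"); the function name and the 0 dB criterion are illustrative choices, not taken from the paper.

```python
import math

def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """Return a binary mask keeping TF units whose local SNR exceeds lc_db.

    speech_pow, noise_pow: 2-D lists [frame][freq] of per-unit powers.
    lc_db: local SNR criterion in dB (0 dB is a common illustrative choice).
    """
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10(s / n) if n > 0 else float("inf")
            row.append(1.0 if snr_db > lc_db else 0.0)
        mask.append(row)
    return mask
```

Applying the mask elementwise to the noisy spectrogram is exactly the kind of TF weighting the evaluated measures must cope with.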
The sound field in a room can be represented by a weighted sum of room modes. To estimate the room modes, current literature uses on-the-grid, sparse reconstruction methods. However, these on-the-grid methods are known to suffer from basis mismatch. In this work, we investigate a gridless framework for estimating room modes based on atomic norm minimization. The advantage of this approach is that it does not suffer from the basis-mismatch problem. We derive a bound for the sound field reconstruction problem in a one-dimensional room with rigid walls and relate this to the frequency separation that is required by the atomic norm. We conclude that for perfect reconstruction based on the investigated gridless approach, additional prior knowledge about the signal model is required. We show how recovery is possible in a one-dimensional setting by exploiting both the structure of the sound field and the acquisition method.
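For the one-dimensional rigid-wall room discussed above, the axial mode frequencies are f_n = n·c/(2L), so successive modes are uniformly spaced by c/(2L); this spacing is the frequency separation a gridless (atomic-norm) recovery must be able to resolve. A minimal sketch (the function name is mine):

```python
def modal_frequencies(length_m, c=343.0, n_modes=5):
    """Axial mode frequencies f_n = n*c/(2L) of a 1-D room with rigid walls.

    length_m: room length L in metres; c: speed of sound in m/s.
    """
    return [n * c / (2.0 * length_m) for n in range(1, n_modes + 1)]
```

For example, a 3.43 m room gives modes at 50, 100, 150, ... Hz, uniformly separated by c/(2L) = 50 Hz.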
During the last decade there has been an increasing interest in mobile speech processing applications such as voice-controlled devices, mobile telephony, smart-phone applications, hearing aids, etc. As these applications have gained popularity, users expect them to work anywhere and at any time. This imposes heavy demands on the robustness of these devices. Specifically, the user's environment may present acoustical disturbances like passing cars, trains, competing speakers, office noises, etc., in addition to the target (speech) source of interest. The impact of these disturbances can be severe, for example in mobile telephony and voice-controlled devices, but also for hearing aids, where limitations of an impaired auditory system prevent the user from separating target speech from disturbance.
The recently proposed relaxed binaural beamforming (RBB) optimization problem provides a flexible trade-off between noise suppression and binaural-cue preservation of the sound sources in the acoustic scene. It minimizes the output noise power, subject to constraints guaranteeing that the target remains unchanged after processing and that the binaural-cue distortions of the acoustic sources stay below a user-defined threshold. However, the RBB problem is a computationally demanding non-convex optimization problem. The only existing suboptimal method which approximately solves the RBB is a successive convex optimization (SCO) method which typically requires solving multiple convex optimization problems per frequency bin in order to converge. Convergence is achieved when all constraints of the RBB optimization problem are satisfied. In this paper, we propose a semi-definite convex relaxation (SDCR) of the RBB optimization problem. The proposed suboptimal SDCR method solves a single convex optimization problem per frequency bin, resulting in a much lower computational complexity than the SCO method. Unlike the SCO method, the SDCR method does not guarantee user-controlled upper-bounded binaural-cue distortions. To tackle this problem we also propose a suboptimal hybrid method which combines the SDCR and SCO methods. Instrumental measures combined with a listening test show that the SDCR and hybrid methods achieve significantly lower computational complexity than the SCO method, and in most cases a better trade-off between predicted intelligibility and binaural-cue preservation than the SCO method.
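Stripped of the binaural-cue constraints, the noise-power minimization under a distortionless-target constraint described above is the classic minimum variance distortionless response (MVDR) core. A minimal sketch of that core for a single frequency bin (the function name and toy dimensions are illustrative; this is the well-known baseline, not the SDCR method itself):

```python
import numpy as np

def mvdr_weights(noise_cov, atf):
    """MVDR weights w = R^{-1} a / (a^H R^{-1} a) for one frequency bin.

    noise_cov: (M, M) noise covariance matrix R across the M microphones.
    atf:       (M,) acoustic transfer function a of the target source.
    The constraint w^H a = 1 keeps the target undistorted while the
    output noise power w^H R w is minimized.
    """
    r_inv_a = np.linalg.solve(noise_cov, atf)
    return r_inv_a / (atf.conj() @ r_inv_a)
```

The RBB problem augments exactly this formulation with the (non-convex) binaural-cue distortion constraints, which is what makes it hard to solve.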
In this paper, we perceptually evaluate two recently proposed binaural multi-microphone speech enhancement methods in terms of intelligibility improvement and binaural-cue preservation. We compare these two methods with the well-known binaural minimum variance distortionless response (BMVDR) method. More specifically, we measure the 50% speech reception threshold and the localization error of all dominant point sources in three different acoustic scenes. The listening tests are divided into a parameter selection phase and a testing phase. The parameter selection phase is used to select the algorithms' parameters based on one acoustic scene. In the testing phase, the two methods are evaluated in two other acoustic scenes in order to examine their robustness. Both methods achieve significantly better intelligibility than the unprocessed scene, and slightly worse intelligibility than the BMVDR method. However, unlike the BMVDR method, which severely distorts the binaural cues of all interferers, the new methods achieve localization errors which are not significantly different from those of the unprocessed scene.
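The abstract does not state which adaptive procedure is used to measure the 50% speech reception threshold; a common choice is a 1-up-1-down staircase, whose presentation SNR track converges around the 50%-correct point. A hypothetical sketch of such a track (function name and step size are illustrative assumptions):

```python
def srt_track(responses, start_snr_db=0.0, step_db=2.0):
    """1-up-1-down adaptive SNR track converging toward the 50% point.

    responses: iterable of bools (True = sentence repeated correctly).
    After a correct response the SNR is lowered (harder); after an
    incorrect one it is raised (easier). Returns the presented SNRs;
    the SRT is typically estimated as the mean SNR over later trials.
    """
    snr = start_snr_db
    track = []
    for correct in responses:
        track.append(snr)
        snr += -step_db if correct else step_db
    return track
```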
Sinusoidal coding of an audio signal subject to a bit-rate constraint generally results in a noise-like residual signal. This residual signal is of high perceptual importance; reconstruction of audio using the sinusoidal representation only typically results in an artificial-sounding reconstruction. We present a new method, called perceptual linear predictive coding (PLPC), where the residual is encoded by applying LPC in the perceptual domain. This method minimizes a perceptual modelling error and therefore represents only residual components that are perceptually relevant, while automatically discarding components masked by the sinusoidally coded part. Subjective listening tests show that PLPC performs significantly better than ordinary LPC as a sinusoidal residual coding technique. Furthermore, PLPC combined with a flexible segmentation and model-order allocation algorithm leads to a significant gain in terms of rate-distortion (R/D) performance for fragments with fast-changing characteristics.
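The LPC core that PLPC applies in the perceptual domain is the standard autocorrelation method; its normal equations are usually solved with the Levinson-Durbin recursion. A minimal sketch of that recursion (this shows plain LPC only, not the perceptual weighting that distinguishes PLPC):

```python
def levinson_durbin(autocorr, order):
    """Solve the LPC normal equations via the Levinson-Durbin recursion.

    autocorr: autocorrelation sequence r[0..order] of the signal.
    Returns prediction coefficients a[1..order] such that
    x_hat[n] = sum_j a[j] * x[n-j].
    """
    a = [0.0] * (order + 1)
    err = autocorr[0]                       # prediction error power
    for i in range(1, order + 1):
        acc = autocorr[i] - sum(a[j] * autocorr[i - j] for j in range(1, i))
        k = acc / err                       # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]  # update lower-order coefficients
        a = new_a
        err *= (1.0 - k * k)                # error power shrinks each order
    return a[1:]
```

For an AR(1) process with r = [1, 0.9, 0.81], the recursion recovers the single coefficient 0.9 and sets the second-order coefficient to zero.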
Training blind children to use audio-based navigation is a demanding and risky task, as children can walk into objects and hurt themselves. Furthermore, training outdoors is dangerous due to traffic, noise and weather conditions. Having a controlled indoor environment is safer but not always available. To tackle this problem, we developed an audio-based computer game, Legend of Iris (LOI), specifically designed to train navigation skills. The game is a 3D exploration game which uses the head-tracking capabilities of the Oculus Rift to create an immersive experience, and the new sound libraries AstoundSound and Phonon3D to generate an accurate and realistic soundscape. These libraries use a head-related transfer function, allowing the player to localize the audio source in 3D space. The design of LOI involved selecting sounds that are easily recognizable to provide cues to blind people playing the game. A subset of these cues was incorporated into the game. To verify the effectiveness of the game in developing audio orientation and navigation skills, we performed a preliminary qualitative experiment with blind children in a dedicated school. LOI scored high in terms of accuracy and immersion, but a larger test is required to draw statistically significant conclusions.
Most DFT-domain based speech enhancement methods depend on an estimate of the noise power spectral density (PSD). For non-stationary noise sources it is desirable to estimate the noise PSD also in spectral regions where speech is present. In this paper, a new method for noise tracking is presented, based on eigenvalue decompositions of correlation matrices that are constructed from time series of noisy DFT coefficients. The presented method can estimate the noise PSD at time-frequency points where both speech and noise are present. In comparison to state-of-the-art noise tracking algorithms, the proposed algorithm reduces the estimation error between the estimated and the true noise PSD and, when combined with an enhancement system, improves segmental SNR by several dB. Index Terms: speech enhancement, noise tracking, DFT-domain subspace decompositions.
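A heavily simplified sketch of the subspace idea described above (the published estimator differs in its details; the function name and the `dim`/`n_noise` parameters are illustrative assumptions): lagged vectors of one bin's noisy DFT time series form a sample correlation matrix, speech is assumed to occupy the dominant eigen-subspace, and the smallest eigenvalues are attributed to the noise and averaged.

```python
import numpy as np

def noise_psd_estimate(dft_series, dim=4, n_noise=2):
    """Sketch: estimate the noise power in one DFT bin from its time series.

    dft_series: complex DFT coefficients of one frequency bin over time.
    dim:        length of the lagged vectors (correlation-matrix size).
    n_noise:    number of smallest eigenvalues attributed to noise.
    """
    frames = [dft_series[i:i + dim] for i in range(len(dft_series) - dim + 1)]
    x = np.array(frames).T                       # (dim, n_frames)
    corr = x @ x.conj().T / x.shape[1]           # sample correlation matrix
    eigvals = np.linalg.eigvalsh(corr)           # real, ascending order
    return eigvals[:n_noise].mean()              # noise-subspace average
```

For a purely deterministic (rank-one) bin trajectory, the noise-subspace eigenvalues are zero, so the estimate correctly reports no noise power.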