In a collaborative research project, several monaural and binaural noise reduction algorithms have been comprehensively evaluated. In this article, eight selected noise reduction algorithms were assessed using instrumental measures, with a focus on the instrumental evaluation of speech intelligibility. Four distinct, reverberant scenarios were created to reflect everyday listening situations: a stationary speech-shaped noise, a multitalker babble noise, a single interfering talker, and a realistic cafeteria noise. Three instrumental measures were employed to assess predicted speech intelligibility and predicted sound quality: the intelligibility-weighted signal-to-noise ratio, the short-time objective intelligibility measure, and the perceptual evaluation of speech quality. The results show substantial improvements in predicted speech intelligibility as well as sound quality for the evaluated algorithms. The coherence-based noise reduction algorithm was able to provide improvements in predicted audio signal quality. For the tested single-channel noise reduction algorithm, improvements in intelligibility-weighted signal-to-noise ratio were observed in all but the nonstationary cafeteria ambient noise scenario. Binaural minimum variance distortionless response beamforming algorithms performed particularly well in all noise scenarios.
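As a rough sketch of the first of these measures, the snippet below computes an intelligibility-weighted signal-to-noise ratio by averaging per-band SNRs with band-importance weights. It assumes that the speech and noise components at the algorithm output are separately available; the octave-band edges and the importance weights used here are illustrative placeholders, not the exact values used in the evaluation.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_snr_db(speech, noise, fs, f_lo, f_hi):
    """SNR (dB) of speech vs. noise within one band, via band-pass filtering."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    s_b, n_b = sosfilt(sos, speech), sosfilt(sos, noise)
    return 10.0 * np.log10(np.sum(s_b**2) / (np.sum(n_b**2) + 1e-12))

def intelligibility_weighted_snr(speech, noise, fs, bands, weights):
    """Weighted sum of per-band SNRs; the weights should sum to one."""
    snrs = np.array([band_snr_db(speech, noise, fs, lo, hi) for lo, hi in bands])
    return float(np.dot(weights, snrs))

# Illustrative band edges and band-importance weights (placeholders).
bands = [(125, 250), (250, 500), (500, 1000), (1000, 2000), (2000, 4000), (4000, 7500)]
weights = np.array([0.05, 0.13, 0.20, 0.25, 0.22, 0.15])
weights = weights / weights.sum()

fs = 16000
rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)        # stand-ins for the separated speech
noise = 0.5 * rng.standard_normal(fs)   # and noise components at the output
print(intelligibility_weighted_snr(speech, noise, fs, bands, weights))
```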
With the advancement of technology, both assisted listening devices and speech communication devices are becoming more portable and more frequently used. As a consequence, users of devices such as hearing aids, cochlear implants, and mobile telephones expect their devices to work robustly anywhere and at any time. This holds in particular for challenging noisy environments like a cafeteria, a restaurant, a subway, a factory, or traffic. One way of making assisted listening devices robust to noise is to apply speech enhancement algorithms. The corrupted speech can be improved either by exploiting spatial diversity through a constructive combination of microphone signals (so-called beamforming), or by exploiting the different spectro-temporal properties of speech and noise. Here, we focus on single-channel speech enhancement algorithms, which rely on spectro-temporal properties. On the one hand, these algorithms can be employed when the miniaturization of devices only allows for using a single microphone. On the other hand, when multiple microphones are available, single-channel algorithms can be employed as a postprocessor at the output of a beamformer. To exploit the short-term stationarity of natural sounds, many of these approaches process the signal in a time-frequency representation, most frequently the short-time discrete Fourier transform (STFT) domain. In this domain, the coefficients of the signal are complex-valued and can therefore be represented by their absolute value (referred to in the literature both as STFT magnitude and STFT amplitude) and their phase. While the modeling and processing of the STFT magnitude have been the center of interest in the past three decades, the phase has been largely ignored.
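To make the magnitude/phase decomposition concrete, the following sketch uses SciPy's STFT to apply a spectral gain to the magnitude while simply reusing the noisy phase for reconstruction, which is exactly the phase-blind processing pattern described above. The gain rule and the random input are placeholders, not one of the estimators discussed here.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(1)
noisy = rng.standard_normal(fs)  # stand-in for a noisy speech signal

# Analysis: complex STFT coefficients Y = |Y| * exp(j * phase)
f, t, Y = stft(noisy, fs=fs, nperseg=512, noverlap=384)
magnitude = np.abs(Y)
phase = np.angle(Y)

# Phase-blind processing: modify only the magnitude (placeholder gain),
# then recombine with the unmodified noisy phase.
gain = np.clip(1.0 - 0.1 / (magnitude + 1e-8), 0.1, 1.0)  # arbitrary example gain
enhanced_magnitude = gain * magnitude
S_hat = enhanced_magnitude * np.exp(1j * phase)

# Synthesis: back to the time domain.
_, enhanced = istft(S_hat, fs=fs, nperseg=512, noverlap=384)
```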
This report presents our audio event detection system submitted for Task 2, "Detection of rare sound events", of the DCASE 2017 challenge. The proposed system is based on convolutional neural networks (CNNs) and deep neural networks (DNNs) coupled with novel weighted and multi-task loss functions and state-of-the-art phase-aware signal enhancement. The loss functions are tailored for audio event detection in audio streams. The weighted loss is designed to tackle the common issue of imbalanced data in background/foreground classification, while the multi-task loss enables the networks to simultaneously model the class distribution and the temporal structures of the target events for recognition. Our proposed systems significantly outperform the challenge baseline, improving the F1-score from 72.7% to 90.0% and reducing the detection error rate from 0.53 to 0.18 on average on the development data. On the evaluation data, our submission obtains an average F1-score of 88.3% and an error rate of 0.22, which are significantly better than those obtained by the DCASE baseline (i.e., an F1-score of 64.1% and an error rate of 0.64).
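As a rough sketch of the weighted-loss idea (not the exact loss of the submitted system), the snippet below implements a binary cross-entropy in which the rare foreground frames receive a larger weight than the dominant background frames, counteracting the class imbalance; the weight value and the toy labels are illustrative assumptions.

```python
import numpy as np

def weighted_binary_cross_entropy(probs, labels, foreground_weight=5.0):
    """Binary cross-entropy with an extra weight on the (rare) foreground frames.

    probs  : predicted foreground probabilities per frame, shape (T,)
    labels : ground-truth frame labels, 1 = event (foreground), 0 = background
    """
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)
    weights = np.where(labels == 1, foreground_weight, 1.0)
    loss = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(np.mean(weights * loss))

# Toy example: 100 frames, only 5 of which belong to the rare target event.
rng = np.random.default_rng(2)
labels = np.zeros(100)
labels[40:45] = 1.0
probs = np.clip(labels + 0.1 * rng.standard_normal(100), 0.0, 1.0)
print(weighted_binary_cross_entropy(probs, labels))
```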
Among the most commonly used single-channel approaches for the enhancement of noise-corrupted speech are Bayesian estimators of clean speech coefficients in the short-time Fourier transform domain. However, the vast majority of these approaches effectively modify only the spectral amplitude and do not consider any information about the clean speech spectral phase. More recently, clean speech estimators that can utilize prior phase information have been proposed and shown to lead to improvements over the traditional phase-blind approaches. In this work, we revisit phase-aware estimators of clean speech amplitudes and complex coefficients. To complete the existing set of estimators, we first derive a novel amplitude estimator given uncertain prior phase information. Second, we derive a closed-form solution for complex coefficients when the prior phase information is completely uncertain or not available. We put the novel estimators into the context of existing estimators and discuss their advantages and disadvantages.
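In general terms, and as a sketch of the setup rather than the specific closed-form solutions derived in the work, phase-blind and phase-aware MMSE estimation differ only in what the conditional expectation is conditioned on; the notation below is assumed for illustration.

```latex
% Noisy STFT coefficient Y = S + N, clean coefficient S = A e^{j\phi_S}.
\begin{align}
  \hat{A}_{\text{phase-blind}} &= \operatorname{E}\!\left[ A \mid Y \right], \\
  \hat{A}_{\text{phase-aware}} &= \operatorname{E}\!\left[ A \mid Y, \tilde{\phi}_S \right], \\
  \hat{S}_{\text{phase-aware}} &= \operatorname{E}\!\left[ S \mid Y, \tilde{\phi}_S \right],
\end{align}
% where $\tilde{\phi}_S$ denotes a (possibly uncertain) prior estimate of the
% clean spectral phase.
```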
For the enhancement of single-channel speech corrupted by acoustic noise, short-time Fourier transform domain clean speech estimators have recently been proposed that incorporate prior information about the clean speech spectral phase. Instrumental measures predict quality improvements for the phase-aware estimators over their conventional phase-blind counterparts. In this letter, these predictions are verified by means of listening experiments. The phase-aware amplitude estimator on average achieves a stronger noise reduction and is significantly preferred over its phase-blind counterpart in a pairwise comparison, even when the clean spectral phase is estimated blindly from the noisy signal.
Many well-known and frequently employed Bayesian clean speech estimators have been derived under the assumption that the true power spectral densities (PSDs) of speech and noise are exactly known. In practice, however, only PSD estimates are available. Simply neglecting PSD estimation errors and treating the estimates as true values leads to speech estimation errors that cause musical noise and an undesired suppression of speech. In this paper, the uncertainty of the available speech PSD estimates is addressed. The main contributions are the following. First, we summarize and examine ways to model and incorporate the uncertainty of PSD estimates for a more robust speech enhancement performance. Second, a novel nonlinear clean speech estimator is derived that takes into account prior knowledge about the absolute value of typical speech PSDs. Third, we show that the derived statistical framework provides uncertainty-aware counterparts to a number of well-known conventional clean speech estimators, such as the Wiener filter and Ephraim and Malah's amplitude estimators. Fourth, we show how modern PSD estimators can be incorporated into the theoretical framework and propose to employ frequency-dependent priors. Finally, the effects and benefits of considering the uncertainty of speech PSD estimates are analyzed, discussed, and evaluated via instrumental measures and a listening experiment.
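As context for why PSD uncertainty matters, the sketch below shows the conventional, uncertainty-blind pipeline: estimated noise and speech PSDs are plugged into the a priori SNR and a Wiener gain as if they were exact. The decision-directed smoothing constant and the toy signals are illustrative choices, not those of the paper.

```python
import numpy as np

def wiener_gain_uncertainty_blind(noisy_power, noise_psd_est, prev_clean_power, alpha=0.98):
    """Conventional Wiener gain that treats PSD estimates as exact values.

    noisy_power      : |Y(k, l)|^2 for the current frame, shape (K,)
    noise_psd_est    : estimated noise PSD, shape (K,)
    prev_clean_power : |S_hat(k, l-1)|^2 from the previous frame, shape (K,)
    """
    gamma = noisy_power / (noise_psd_est + 1e-12)            # a posteriori SNR
    xi = (alpha * prev_clean_power / (noise_psd_est + 1e-12)
          + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))    # decision-directed a priori SNR
    return xi / (1.0 + xi)                                   # Wiener gain

# Toy noise-only frame: the estimated noise PSD deviates from the true one,
# and the resulting gain errors are what cause musical noise and an
# undesired suppression of speech.
K = 257
rng = np.random.default_rng(3)
true_noise_psd = np.ones(K)
noise_psd_est = true_noise_psd * rng.uniform(0.5, 2.0, K)    # imperfect PSD estimate
noisy_power = true_noise_psd * rng.exponential(1.0, K)
gain = wiener_gain_uncertainty_blind(noisy_power, noise_psd_est, prev_clean_power=np.zeros(K))
```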
For the reduction of additive acoustic noise, various methods and clean speech estimators are available, each with specific strengths and weaknesses. In order to combine the strengths of two such approaches, we derive a minimum mean squared error (MMSE)-optimal estimator of the clean speech given two independent initial clean speech estimates. As an example, we present a specific combination that results in a weighted mixture of the Wiener filter and a simple, low-cost harmonic speech model. The proposed estimator benefits from the additional information provided by the harmonic model, leading to a better protection of the harmonic components of voiced speech as compared to the traditional Wiener filter. Instrumental measures predict improvements in speech quality and speech intelligibility for the proposed combination over each individual estimator.
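The combination principle can be illustrated with a simple special case: for two independent, unbiased estimates with Gaussian errors of known variance, the MMSE combination reduces to inverse-error-variance weighting. The sketch below shows this toy case only; the variances and estimates are stand-ins, not the quantities derived in the paper.

```python
import numpy as np

def combine_mmse(est1, var1, est2, var2):
    """MMSE combination of two independent, unbiased estimates with known
    error variances: inverse-variance weighting (simple Gaussian special case)."""
    w1 = var2 / (var1 + var2)
    w2 = var1 / (var1 + var2)
    combined = w1 * est1 + w2 * est2
    combined_var = (var1 * var2) / (var1 + var2)
    return combined, combined_var

# Toy example: one "Wiener-like" and one "harmonic-model-like" estimate of the
# same clean STFT coefficient, with different error variances.
s_true = 1.0 + 0.5j
est_wiener = s_true + 0.3 * (1 + 1j)    # less reliable in this time-frequency bin
est_harmonic = s_true + 0.1 * (1 - 1j)  # more reliable in this time-frequency bin
s_hat, v_hat = combine_mmse(est_wiener, 0.2, est_harmonic, 0.05)
```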
Future industrial control systems face the need to be highly adaptive, productive, and efficient, while providing a high level of safety towards operating staff, environment, and machinery. These demands call for the joint consideration of resilience and mixed criticality in order to exploit previously untapped redundancy potentials. Here, resilience combines the detection of, decision-making about, adaptation to, and recovery from unforeseeable or malicious events in an autonomous manner. By enabling the consideration of functionalities with different criticalities, mixed criticality allows safety-relevant functions to be prioritized over uncritical ones. While each concept on its own constitutes a large research branch across various engineering disciplines, the synergies between the two paradigms in a multi-disciplinary context are commonly overlooked. In industrial control, consolidating these mechanisms while preserving functional safety requirements under limited resources is a significant challenge. In this contribution, we provide a multi-disciplinary perspective on the concepts and mechanisms that enable criticality-aware resilience, in particular with respect to system design, communication, control, and security. Thereby, we envision a highly flexible, autonomous, and scalable paradigm for industrial control systems, identify potentials across the different domains, and outline future research directions. Our results indicate that jointly employing mixed criticality and resilience has the potential to increase the overall system's efficiency, reliability, and flexibility, even under unanticipated or malicious events. Thus, for future industrial systems, mixed-criticality-aware resilience is a crucial factor towards autonomy and increased overall system performance.
Conventional statistical clean speech estimators, like the Wiener filter, are frequently used for the spectro-temporal enhancement of noise-corrupted speech. Most of these approaches estimate the clean speech independently for each time-frequency point, neglecting the structure of the underlying speech sound. In this work, we derive a statistical estimator that explicitly takes into account information about the characteristic structure of voiced speech by means of a harmonic signal model. To this end, we also present a way to estimate a harmonic model-based clean speech representation and the corresponding error variance directly in the short-time Fourier transform domain. The resulting estimator is optimal in the minimum mean squared error sense and can conveniently be formulated in terms of a multichannel Wiener filter. The proposed estimator outperforms several reference algorithms in terms of speech quality and intelligibility as predicted by instrumental measures.
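As a sketch of the multichannel Wiener filter formulation in its generic form, the MMSE-optimal linear weights solve w = Phi_yy^{-1} phi_ys, where Phi_yy is the covariance matrix of the stacked observations and phi_ys their cross-correlation with the clean coefficient. The snippet below applies this to a toy two-"channel" case (noisy STFT coefficient plus a harmonic-model-based representation); all covariance values are stand-ins, not the quantities estimated in the paper.

```python
import numpy as np

def multichannel_wiener_filter(Phi_yy, phi_ys):
    """MMSE-optimal linear weights: w = Phi_yy^{-1} phi_ys."""
    return np.linalg.solve(Phi_yy, phi_ys)

# Toy two-"channel" setup per time-frequency point: observation 1 is the noisy
# STFT coefficient (clean speech plus noise), observation 2 a harmonic
# model-based representation of the clean speech (clean speech plus model error).
sigma_s2 = 1.0   # clean speech power (assumed known here)
sigma_n2 = 0.5   # noise power in the noisy coefficient
sigma_e2 = 0.2   # error variance of the harmonic-model representation

Phi_yy = np.array([[sigma_s2 + sigma_n2, sigma_s2],
                   [sigma_s2, sigma_s2 + sigma_e2]])  # observation covariance
phi_ys = np.array([sigma_s2, sigma_s2])               # cross-correlation with clean speech

w = multichannel_wiener_filter(Phi_yy, phi_ys)
# Estimate: s_hat = w[0] * noisy_coefficient + w[1] * harmonic_model_coefficient
```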