Acoustic echo cancellation can be used to remove talker feedback in hands-free systems. Classical transform-domain adaptive filtering algorithms cannot achieve fast convergence and good tracking when the reference signal has an autocorrelation matrix of variable rank. During the low-rank phases of the speech signal, some of the transform-domain tap coefficients become irrelevant to the adaptation process and stop adapting. When the autocorrelation matrix regains full rank, there will no longer be any “frozen” weights. In this paper, we focus on the DCT-LMS algorithm and present a new method that uses a DCT-based delay estimate derived from the other coefficients to move the frozen weights closer to the optimal point and, consequently, reduce the overall re-convergence time.
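The frozen-weight behavior described above can be sketched with a small power-normalized DCT-domain LMS loop. This is an illustrative sketch only, not the delay-estimate method proposed in the paper; the function names, step sizes, and smoothing constants are all hypothetical:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix: rows are the DCT basis vectors.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    T = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    T[0] *= np.sqrt(0.5)
    return T

def dct_lms(x, d, n_taps, mu=0.1, beta=0.99, eps=1e-6):
    """Transform-domain LMS: each DCT coefficient's step size is
    normalized by a running estimate of that coefficient's power.
    Bins whose power estimate stays near zero barely adapt -- these
    are the "frozen" weights during low-rank input phases."""
    T = dct_matrix(n_taps)
    w = np.zeros(n_taps)      # adaptive weights in the DCT domain
    p = np.ones(n_taps)       # per-bin power estimates
    buf = np.zeros(n_taps)    # tap-delay line
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        u = T @ buf                        # DCT of the input vector
        p = beta * p + (1.0 - beta) * u * u
        e = d[n] - w @ u                   # common error signal
        w += mu * e * u / (p + eps)        # power-normalized update
    return w, T
```

For a full-rank (white) input the time-domain equivalent weights `T.T @ w` converge to the unknown impulse response; when some DCT bins carry no energy, the corresponding entries of `w` stop moving.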
This paper examines schemes that modify linear prediction (LP) analysis for speech signals. First, techniques which improve the conditioning of the LP equations are examined. White noise compensation for the correlations is justified from the point of view of reducing the range of values which the predictor coefficients take on. The efficacy of the procedure is measured over a large speech database. Various techniques for bandwidth expansion of the LP spectral peaks are also examined. These include lag windowing of the correlation, windowing of the predictor coefficients, and modification of the line spectral frequencies. New formulas for the bandwidth expansion factor are given.
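The two conditioning steps named above have compact standard forms: white noise compensation scales the zero-lag correlation, and bandwidth expansion scales the predictor coefficients geometrically. A minimal sketch using the textbook Levinson-Durbin recursion; the constants shown (`eps`, `gamma`) are illustrative defaults, not the values derived in the paper:

```python
import numpy as np

def white_noise_compensate(r, eps=1e-4):
    """Scale the zero-lag autocorrelation up -- equivalent to adding
    a small amount of white noise (eps is an assumed level)."""
    r = r.copy()
    r[0] *= (1.0 + eps)
    return r

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelations r[0..order] ->
    coefficients a = [1, a_1, ..., a_p] of A(z) = 1 + sum a_k z^-k."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + a[1:m] @ r[m-1:0:-1]
        k = -acc / err                  # reflection coefficient
        a[1:m] = a[1:m] + k * a[m-1:0:-1]
        a[m] = k
        err *= (1.0 - k * k)            # prediction error update
    return a, err

def bandwidth_expand(a, gamma=0.994):
    """Replace a_k by gamma^k a_k, moving each pole radius from
    r to gamma*r and thus widening the spectral peaks."""
    return a * gamma ** np.arange(len(a))
```

Scaling `a[k]` by `gamma**k` is the coefficient-windowing variant; lag windowing instead applies a window to the correlations before the recursion.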
In this paper, we evaluate and compare the robustness of several adaptive bit loading algorithms for multicarrier transmission systems, when imperfect subcarrier signal-to-noise ratio (SNR) information is used. In particular, we investigate the impact of the uncertainty of data-aided channel estimation techniques on system performance. We also examine an implementation issue associated with adaptive bit loading algorithms that use metrics related to the SNR. Although such metrics can be derived via closed form expressions, look-up tables are used instead to reduce system complexity, resulting in the SNR values being quantized. Thus, we examine the effects of SNR quantization on system performance. Finally, we present a technique for choosing SNR values in a fixed length look-up table in order to minimize quantization error.
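One generic way to choose the entries of a fixed-length SNR look-up table so as to reduce quantization error is a Lloyd iteration over training SNR samples. This is an assumed illustration of the general idea, not necessarily the selection technique presented in the paper:

```python
import numpy as np

def quantize_snr(snr_db, table):
    """Map each measured SNR (in dB) to its nearest table entry."""
    snr_db = np.asarray(snr_db, dtype=float)
    table = np.asarray(table, dtype=float)
    idx = np.argmin(np.abs(snr_db[:, None] - table[None, :]), axis=1)
    return table[idx]

def lloyd_table(samples, n_levels, iters=50):
    """Lloyd iteration: alternate nearest-entry assignment and
    centroid update; the distortion is non-increasing each step."""
    samples = np.asarray(samples, dtype=float)
    table = np.linspace(samples.min(), samples.max(), n_levels)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - table[None, :]), axis=1)
        for i in range(n_levels):
            members = samples[idx == i]
            if members.size:                 # keep empty cells unchanged
                table[i] = members.mean()
        table = np.sort(table)
    return table
```

Starting from a uniformly spaced table, the iteration can only lower the mean squared quantization error over the training set.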
A stream coding framework is presented for solving the distortion-constrained time-frequency dependent quantization problem that naturally arises when overlapped time-frequency decompositions are used. The main contributions of this paper are: (1) an efficient rate-distortion allocation algorithm for dependent quantization when the neighborhood of dependency is large; and (2) demonstration that a perceptual excitation distortion measure produces better coded audio quality than the conventional noise-to-mask ratio measure.
A general estimation model is defined in which two observations are available; one is a noisy version of the transmitted signal, while the other is a noisy filtered and delayed version of the same transmitted signal. The time-varying delay and the filter are unknown quantities that must be estimated. A joint estimator is proposed. It is composed of an adaptive delay element in conjunction with a transversal adaptive filter. The same error signal is used to adjust the delay element and the filter such that the minimum mean squared error is attained. Two joint gradient-based adaptation algorithms are studied. The joint steepest-descent (SD) algorithm is first investigated. The possibility of a multitude of stable solutions is established and a condition of convergence is presented. A stochastic implementation of the joint SD algorithm, under the form of a joint least-mean-square (LMS) algorithm, is then presented. It is analysed in terms of convergence in the mean and in the mean square of both the delay estimate and the adaptive filter weight vector estimate. The conditions of convergence of the joint LMS algorithm are established as a function of the power spectral densities of the observed signals and the minimum mean squared error.
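A minimal sketch of the joint structure, reduced here to a single-tap gain plus a linearly interpolated fractional delay so that one error signal drives both updates. This is a deliberate simplification for illustration; the paper's transversal filter has many taps, and the convergence conditions it establishes are not reproduced by this toy loop:

```python
import numpy as np

def frac_delay_tap(x, n, D):
    """x[n - D] by linear interpolation for fractional D >= 0,
    together with its derivative with respect to D."""
    i = int(np.floor(D))
    f = D - i
    if n - i - 1 < 0:                  # not enough history yet
        return 0.0, 0.0
    s = (1.0 - f) * x[n - i] + f * x[n - i - 1]
    ds = x[n - i - 1] - x[n - i]       # d s / d D
    return s, ds

def joint_lms(x, d, mu_w=0.02, mu_d=0.05, D0=0.0):
    """Joint LMS: the same error e drives both a gain w (the
    'filter', reduced to one tap) and a delay estimate D."""
    w, D = 0.0, D0
    for n in range(len(x)):
        s, ds = frac_delay_tap(x, n, D)
        e = d[n] - w * s
        w += mu_w * e * s              # LMS update of the gain
        D += mu_d * e * w * ds         # gradient step on the delay
        D = max(D, 0.0)
    return w, D
```

As the abstract notes, the delay error surface can have multiple stable points, so the sketch only converges to the global solution when initialized within the main lobe of the cross-correlation.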
There is a considerable performance gap between current scalable audio coding schemes and a nonscalable coder operating at the same bitrate. This suboptimality results from the independent coding of the layers in these systems. One of the aspects that plays a role in this suboptimality is the entropy coding. In practical audio coding systems, including MPEG advanced audio coding (AAC), the transform domain coefficients are quantized using an entropy-constrained quantizer. In MPEG-4 scalable AAC (S-AAC), the quantization and coding are performed separately at each layer. In the case of Huffman coding, the redundancy introduced by the entropy coding at each layer is larger at lower quantization resolutions. Also, the redundancy for the overall coder becomes larger as the number of layers increases. In fact, there is a tradeoff between the overall redundancy and the fineness of the scalability, since finer-grain scalability means a smaller bitrate per layer and more layers. In this paper, a fine-grain scalable coder for audio signals is proposed in which the entropy coding of a quantizer is made scalable via joint design of the entropy coding and the quantization. By constructing a Huffman-like coding tree whose internal nodes can be mapped to reconstruction points, the tree can be pruned at any internal node to control the rate-distortion (RD) performance of the encoder in a fine-grain manner. A set of metrics and a trellis-based approach are proposed to create a coding tree so that an appropriate path is generated on the RD plane. The results show that the proposed method outperforms scalable audio coding based on reconstruction-error quantization as used in practical systems, e.g., S-AAC.
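The per-layer redundancy claim is easy to check numerically: the redundancy of a Huffman code (average length minus entropy) is largest for the skewed, few-symbol distributions that coarse quantization produces, and vanishes for the near-uniform distributions of a fine quantizer. A small sketch:

```python
import heapq
import numpy as np

def huffman_lengths(probs):
    """Code lengths of a binary Huffman code for the given pmf."""
    n = len(probs)
    if n == 1:
        return [1]
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # every symbol under the merge
            lengths[i] += 1        # moves one level deeper
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

def redundancy(probs):
    """Average code length minus source entropy, in bits/symbol."""
    p = np.asarray(probs, dtype=float)
    avg_len = p @ np.array(huffman_lengths(p), dtype=float)
    entropy = -np.sum(p * np.log2(p))
    return avg_len - entropy
```

For example, a skewed 3-symbol pmf such as `[0.8, 0.1, 0.1]` incurs about 0.28 bits/symbol of redundancy, while a uniform 16-symbol pmf incurs none; summed over many layers, this per-layer loss is the overhead the joint design above is meant to remove.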
This article surveys approaches to teleconferencing in voice over IP networks. The considerations for conferencing include perceived quality, scalability, control, and compatibility. Architectures used for conferencing range from centralized bridges to full mesh. Centralized conference bridges used with compressed speech degrade speech quality when multiple talkers are mixed and subjected to tandem coding operations. Full mesh and multicast solutions (mixing at the end-points) are inappropriate when the number of conferees is large. This article discusses a hybrid solution that incorporates tandem-free bridging (the bridge selects and forwards packets) and endpoint mixing.
A stochastic tree coder based on the (M,L) search algorithm suggested by V. Iyengar and P. Kabal (1988) and a low-delay CELP (code-excited linear prediction) coder proposed by J.H. Chen (1989) are considered. The individual components (predictors, gain adaptation, excitation coding) of the two coders are analyzed, and the performances of the two types of coders are compared. The two coders have comparable performance at 16 kb/s under clean channel conditions. Methods to improve the performance of the coders, particularly with a view to bringing the bit rate below 16 kb/s, are studied. Suggestions for improving the performance include an improved high-order predictor (applicable to both coders), as well as training of the excitation dictionary and a better gain adaptation strategy for the tree coder.
The authors try to identify the primary sources of distortion in a non-recursive time-scale modification (TSM) algorithm which is based on the short-time Fourier transform (STFT). A simpler version of this TSM algorithm is then proposed for processing speech, where incremental estimators eliminate the need for explicit linear time-scaling operations. Also featured in the design is a waveform structure compensation stage to prevent excessive deterioration of the rate-changed output. A polar (i.e., magnitude-phase) synthesis equation is used for increased efficiency. The TSM method is capable of generating high-quality rate-changed speech at a reasonable computational cost.
The bandwidth for telephony is generally defined to be 300–3400 Hz. This bandwidth restriction has a noticeable effect on speech quality. We present an algorithm which recovers the missing highband components from telephone speech. We describe an MMSE estimator that uses hard/soft classification to create the missing highband spectrum envelope. The classification is motivated by acoustic phonetics: voiced vowels and consonants, and unvoiced phonemes, exhibit different characteristic spectra. The classification also captures gender differences. A hard classification on phoneme characteristic parameters, such as the voicing degree and the pitch lag, reduces the mean squared error of the highband spectrum envelope estimates. An estimator using HMM-based soft classification further reduces the estimated highband spectrum distortion by taking the time evolution of the spectra into consideration. Objective measures (mean log-spectrum distortion) and spectrograms confirm the improvement noted in informal subjective tests.
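The hard/soft distinction can be sketched abstractly: given class posteriors and class-conditional mean envelopes, hard classification returns the envelope of the single most likely class, while soft classification returns the posterior-weighted average, which minimizes the expected squared error given the posteriors. This is a schematic illustration only; the paper's features, class definitions, and HMM machinery are not modeled here:

```python
import numpy as np

def hard_estimate(posteriors, class_means):
    """Use the envelope of the single most likely class."""
    return class_means[np.argmax(posteriors)]

def soft_estimate(posteriors, class_means):
    """Posterior-weighted average of the class envelopes -- the MMSE
    estimate when only the class posteriors are known."""
    return np.asarray(posteriors) @ np.asarray(class_means)

def expected_se(estimate, posteriors, class_means):
    """Expected squared error of an estimate under the posteriors."""
    return sum(p * np.sum((m - estimate) ** 2)
               for p, m in zip(posteriors, class_means))
```

Whenever the posterior mass is split across classes, the soft estimate has an expected squared error no larger than the hard estimate's, which is the abstract's motivation for moving from hard to soft classification.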