This paper introduces a novel speech coder structure for storage applications operating at low bit rates. The coder exploits the inherent segmental nature of speech signals by dividing the input into segments of variable length; the segment length often coincides with the length of a phoneme. The individual segments are coded using adaptive techniques that take into account the relative perceptual importance of different types of speech, e.g. voiced and unvoiced speech. These main features of the proposed approach are enabled by the fact that many of the design constraints related to real-time conversational speech can be relaxed in storage applications. A practical implementation containing the speech-adaptive segmentation is described and its performance is verified in a listening test at average bit rates of about 1.0 kbps and 2.4 kbps. The results show that the segmental model significantly improves the coding efficiency.
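To make the variable-length segmentation idea more concrete, the sketch below shows one generic way to obtain such segments by dynamic programming over frame-wise features; the within-segment variance cost, the per-segment penalty, and the maximum segment length are illustrative assumptions, not the coder described in the paper.

```python
# Hedged sketch: variable-length segmentation of a frame-wise feature matrix
# by dynamic programming. Cost function and penalty are illustrative only.
import numpy as np

def segment(features, max_len=30, penalty=1.0):
    """Split a (T, D) feature matrix into variable-length segments.

    Each candidate segment is scored by the summed squared deviation of its
    frames from the segment mean; 'penalty' discourages overly short segments.
    Returns a list of (start, end) frame index pairs (end exclusive).
    """
    T = features.shape[0]
    best = np.full(T + 1, np.inf)       # best[t] = cost of segmenting frames [0, t)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)   # back-pointer to the segment start
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            seg = features[s:t]
            cost = np.sum((seg - seg.mean(axis=0)) ** 2) + penalty
            if best[s] + cost < best[t]:
                best[t] = best[s] + cost
                back[t] = s
    # Trace back the segment boundaries.
    bounds, t = [], T
    while t > 0:
        bounds.append((back[t], t))
        t = back[t]
    return bounds[::-1]

# Toy usage: segment 200 frames of random 12-dimensional features.
segments = segment(np.random.randn(200, 12))
```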
Current speech synthesis efforts, both in research and in applications, are dominated by methods based on the concatenation of spoken units. New progress in concatenative text-to-speech (TTS) technology can be made mainly in two directions: either by reducing the memory footprint so that the system can be integrated into embedded devices, or by improving the synthesized speech quality in terms of intelligibility and naturalness. In this paper, we focus on memory footprint reduction in a Mandarin TTS system. We show that significant memory reductions can be achieved through duration modeling and memory optimization of the lexicon data. The results obtained in the experiments indicate that the memory requirements of the duration data and the lexicon can be significantly reduced while keeping the speech quality unaffected. For practical embedded implementations, this is a significant step towards an efficient TTS engine implementation. The applicability of the approach is verified in the speech synthesis system.
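As a rough illustration of how duration data can be stored more compactly, the sketch below quantizes a table of floating-point durations to 8-bit codes; the uniform scalar quantizer and the assumed float32 duration table are hypothetical and do not describe the actual duration model or lexicon optimization used in the paper.

```python
# Hedged sketch: shrinking a duration table with uniform 8-bit quantization.
# All values and the table layout are illustrative assumptions.
import numpy as np

def quantize_durations(durations_ms, levels=256):
    """Map float durations (ms) to 8-bit codes plus a two-number codebook."""
    lo, hi = durations_ms.min(), durations_ms.max()
    step = (hi - lo) / (levels - 1)
    codes = np.round((durations_ms - lo) / step).astype(np.uint8)
    return codes, lo, step                        # roughly 4x smaller than float32

def dequantize(codes, lo, step):
    return lo + codes.astype(np.float32) * step

# Toy usage: 10000 hypothetical unit durations between 30 and 250 ms.
durations = np.random.uniform(30.0, 250.0, size=10000).astype(np.float32)
codes, lo, step = quantize_durations(durations)
max_err_ms = np.max(np.abs(dequantize(codes, lo, step) - durations))
```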
Voice conversion aims at converting speech from one speaker so that it sounds as if it were spoken by another specific speaker. The most popular voice conversion approach, based on Gaussian mixture modeling, tends to suffer either from model overfitting or from oversmoothing. To overcome the shortcomings of the traditional approach, we recently proposed to use dynamic kernel partial least squares (DKPLS) regression in the framework of parallel-data voice conversion. However, the availability of parallel training data from both the source and the target speaker is not always guaranteed. In this paper, we extend the DKPLS-based conversion approach to non-parallel data by combining it with the well-known INCA alignment algorithm. The listening test results indicate that high-quality conversion can be achieved with the proposed combination. Furthermore, the performance of two variations of INCA is evaluated with both intra-lingual and cross-lingual data.
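To illustrate the INCA-style alignment of non-parallel data, the following sketch alternates nearest-neighbour pairing of converted source frames with target frames and re-estimation of a conversion function from the current pairs; the simple affine least-squares conversion stands in for the DKPLS regression, and all parameter values are illustrative assumptions.

```python
# Hedged sketch of an INCA-style iterative alignment loop for non-parallel data.
import numpy as np

def inca_align(src, tgt, iterations=5):
    """src: (N, D) source frames, tgt: (M, D) target frames (non-parallel)."""
    conv = src.copy()                        # start from the unconverted source
    for _ in range(iterations):
        # Pair each converted source frame with its nearest target frame.
        dists = np.linalg.norm(conv[:, None, :] - tgt[None, :, :], axis=2)
        nn = dists.argmin(axis=1)
        # Re-estimate an affine conversion from the current frame pairs
        # (least squares stands in for the actual DKPLS regression).
        X = np.hstack([src, np.ones((src.shape[0], 1))])
        sol, *_ = np.linalg.lstsq(X, tgt[nn], rcond=None)
        conv = X @ sol                       # converted source for the next round
    return np.arange(src.shape[0]), nn       # aligned (source, target) index pairs

# Toy usage with random 24-dimensional frames.
src_idx, tgt_idx = inca_align(np.random.randn(300, 24), np.random.randn(280, 24))
```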
This paper takes phonetic information into account for data alignment in text-independent voice conversion. Hidden Markov models are used for representing the phonetic structure of the training speech. States belonging to the same phoneme are grouped together to form a phoneme cluster. A state-mapped, codebook-based transformation is established using information on the corresponding phoneme clusters from the source and target speech together with a weighted linear transform. For each source vector, several nearest clusters are considered simultaneously during mapping in order to generate a continuous and stable transform. Experimental results indicate that the proposed use of phonetic information increases the similarity between the converted speech and the target speech. The proposed technique is applicable to both intra-lingual and cross-lingual voice conversion.
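A minimal sketch of the soft multi-cluster mapping idea is given below: each phoneme cluster is assumed to carry its own affine transform, and a source vector is converted with a distance-weighted combination over its nearest clusters so that the output stays continuous; the transform estimation is omitted and the proximity weighting is an assumption, not necessarily the scheme used in the paper.

```python
# Hedged sketch: distance-weighted combination of per-cluster linear transforms.
import numpy as np

def convert_frame(x, centroids, transforms, biases, k=3):
    """x: (D,) source vector; centroids: (C, D) phoneme-cluster centroids;
    transforms: (C, D, D) per-cluster matrices; biases: (C, D)."""
    d = np.linalg.norm(centroids - x, axis=1)
    nearest = np.argsort(d)[:k]                   # indices of the k nearest clusters
    w = 1.0 / (d[nearest] + 1e-8)
    w /= w.sum()                                  # normalized proximity weights
    out = np.zeros_like(x)
    for wi, c in zip(w, nearest):
        out += wi * (transforms[c] @ x + biases[c])
    return out

# Toy usage with 10 hypothetical clusters of 24-dimensional features.
C, D = 10, 24
y = convert_frame(np.random.randn(D), np.random.randn(C, D),
                  np.stack([np.eye(D)] * C), np.zeros((C, D)))
```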
A masking model originally designed for audio signals is applied to narrowband speech. The model is used to detect and remove the perceptually irrelevant, simultaneously masked frequency components of a speech signal. Objective measurements have shown that the modified speech signal can be coded more efficiently than the original signal. Furthermore, it has been confirmed through perceptual evaluation that the removal of these frequency components does not cause significant degradation of the speech quality; rather, it has consistently improved the output quality of two standardized speech codecs. Thus, the proposed irrelevancy removal technique can be used at the front end of a speech coder to achieve enhanced coding efficiency.
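The following sketch illustrates the general idea of simultaneous-masking-based irrelevancy removal on a single frame: spectral components falling below a masking threshold derived from a smoothed version of the power spectrum are zeroed out. The crude local-maximum threshold and the offset value are illustrative assumptions, not the audio masking model applied in the paper.

```python
# Hedged sketch: removing spectral bins below a crude simultaneous-masking threshold.
import numpy as np

def remove_masked(frame, offset_db=12.0, spread_bins=5):
    """Zero out spectral bins lying below a simple masking threshold."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    power_db = 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-12)
    # Crude threshold: local maximum over neighbouring bins minus a fixed offset.
    local_max = np.array([power_db[max(0, i - spread_bins):i + spread_bins + 1].max()
                          for i in range(len(power_db))])
    masked = power_db < (local_max - offset_db)
    spectrum[masked] = 0.0                        # drop perceptually irrelevant bins
    return np.fft.irfft(spectrum, n=len(frame)), int(masked.sum())

# Toy usage on a 256-sample frame.
cleaned, n_removed = remove_masked(np.random.randn(256))
```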
In voice conversion, speech and signal processing techniques are used for the modification of speaker identity, i.e. for modifying the speech of a source speaker so that it sounds as if it were spoken by a target speaker. In this paper, we describe a parametric framework for voice conversion. The parametric representation separates the speech signal into a vocal tract contribution, estimated using linear prediction, and an excitation signal, modeled using a scheme based on sinusoidal modeling. This parametric framework is in line with the theory of human speech production and also lends itself to very efficient compression. An initial version of the proposed voice conversion scheme has been implemented and evaluated in listening tests. The results show that the proposed approach offers a promising framework for voice conversion, but further development work is still needed to reach its full potential.
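The source-filter split underlying the parametric representation can be sketched as follows: a vocal tract filter is estimated with linear prediction and the excitation is obtained by inverse filtering. In this sketch, the sinusoidal modeling of the excitation is reduced to picking the strongest residual spectral peaks, which is a simplification of the actual scheme.

```python
# Hedged sketch: LPC vocal tract estimate plus a crudely "sinusoidal" residual.
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc(frame, order=10):
    """Autocorrelation-method LPC polynomial [1, -a1, ..., -a_order]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))

def analyze(frame, order=10, n_sines=8):
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)           # excitation after inverse filtering
    spec = np.abs(np.fft.rfft(residual))
    peaks = np.sort(np.argsort(spec)[-n_sines:])  # crude sinusoidal components
    return a, residual, peaks

# Toy usage on a noisy sinusoid standing in for a voiced frame.
frame = np.sin(2 * np.pi * 0.05 * np.arange(240)) + 0.1 * np.random.randn(240)
a, residual, sine_bins = analyze(frame)
```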
Most current voice conversion systems model the joint density of source and target features using a Gaussian mixture model. An inherent property of this approach is that the source and target features have to be properly aligned for training. It is intuitively clear that the accuracy of the alignment has some effect on the conversion quality, but this issue has not been thoroughly studied in the literature. Examples of alignment techniques include the use of a speech recognizer with forced alignment or dynamic time warping (DTW). In this paper, we study the effect of alignment on voice conversion quality through extensive experiments and discuss issues that should be considered. The main outcome of the study is that the alignment clearly matters, but with simple voice activity detection, DTW, and some constraints, we can achieve the same quality as with hand-marked labels.
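A minimal sketch of this kind of automatic alignment is shown below: an energy-based voice activity detector discards silent frames and a plain DTW pass pairs the remaining source and target frames. The feature extraction, the VAD threshold, and the path constraints of the actual study are simplified away here.

```python
# Hedged sketch: energy-based VAD followed by a basic DTW frame alignment.
import numpy as np

def vad(frames, db_threshold=-40.0):
    """Keep frames whose log energy is within db_threshold of the peak frame."""
    energy = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return frames[energy > energy.max() + db_threshold]

def dtw_align(src, tgt):
    """Return (source_index, target_index) pairs on the minimal-cost DTW path."""
    N, M = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((N + 1, M + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
                                                  cost[i, j - 1],
                                                  cost[i - 1, j - 1])
    # Backtrack the warping path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy usage with random 13-dimensional feature frames.
pairs = dtw_align(vad(np.random.randn(120, 13)), vad(np.random.randn(150, 13)))
```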