Review of different robust x-vector extractors for speaker verification

2021 
Recently, the x-vector framework, in which embeddings are extracted with deep neural network architectures, has become the state of the art in speaker verification. Although this approach has reached a new level of performance, fine-tuning and optimizing the hyper-parameters of a deep neural network to obtain a robust x-vector extractor is costly and time-consuming. Several approaches have been proposed to train robust x-vector extractors. In this paper, we review and analyse the impact of the most significant x-vector-related approaches, including variations in data augmentation, number of epochs, mini-batch size, acoustic features, and frames per iteration. By applying these approaches to the default recipe provided in the Kaldi toolkit, we observed a significant relative gain of more than 50% in terms of equal error rate (EER) on the Speakers in the Wild and VoxCeleb1-E datasets.
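To make the reported metric concrete, the following minimal sketch shows how an EER and a relative gain could be computed from verification scores. It is an illustration only, not the evaluation code used in the paper: the function names `equal_error_rate` and `relative_gain`, the simple threshold sweep, and all scores and EER values are assumptions chosen for the example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Hypothetical helper for illustration, not the paper's scoring tool.
    """
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance rate: non-target trials wrongly accepted.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: target trials wrongly rejected.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_gain(baseline_eer, improved_eer):
    """Relative EER reduction with respect to the baseline."""
    return (baseline_eer - improved_eer) / baseline_eer


# Made-up scores, for illustration only.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)

# Made-up EER values (in percent): halving the EER is a 50% relative gain.
gain = relative_gain(4.0, 2.0)
```

With these toy inputs, one target score falls below the threshold at which one non-target score is still accepted, so the two error rates meet at 25%. A relative gain above 50% means the improved system's EER is less than half the baseline's.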