Speaker-discriminative Embedding Learning via Affinity Matrix for Short Utterance Speaker Verification

2019 
The text-independent short-utterance speaker verification (TI-SUSV) task remains more challenging than full-length-utterance SV due to inaccurately estimated feature statistics and insufficiently discriminative speaker embeddings. Recently developed end-to-end SV (E2E-SV) systems, which directly learn a mapping from speech features to compact fixed-length speaker embeddings, achieve state-of-the-art results on several datasets. In this study, following the E2E-SV pipeline, we strive to further improve the accuracy of the TI-SUSV task. Our work rests on two intuitive ideas: a better speech feature representation for short utterances and a better training loss function that yields more discriminative embeddings. Specifically, a bidirectional gated recurrent unit network with residual connections (Res-BGRU) is first designed to improve feature representation capability. Second, a novel affinity loss is proposed in which the mini-batch data are manipulated to obtain more supervision information. In detail, a speaker identity affinity matrix formed from one-hot speaker identity vectors supervises the speaker embedding affinity matrix, yielding better inter-speaker separability and intra-speaker compactness. Experimental results on the VoxCeleb1 dataset show that our system outperforms conventional i-vector and x-vector systems on TI-SUSV.
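The abstract does not give the exact formulation of the affinity loss, but the description suggests a construction along the following lines: within a mini-batch, a binary identity affinity matrix (the Gram matrix of the one-hot speaker labels) supervises the pairwise affinity matrix of the learned embeddings. The sketch below is a minimal PyTorch illustration under that assumption, using cosine similarity for the embedding affinities and a mean-squared-error penalty; the paper's actual affinity measure and penalty may differ.

```python
import torch
import torch.nn.functional as F

def affinity_loss(embeddings: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of an affinity-matrix loss for a mini-batch of speaker embeddings.

    embeddings:  (batch, dim) speaker embeddings, e.g. from a Res-BGRU encoder
    speaker_ids: (batch,) integer speaker labels for the mini-batch
    """
    # Embedding affinity matrix: pairwise cosine similarities within the batch.
    emb = F.normalize(embeddings, dim=1)
    emb_affinity = emb @ emb.t()                      # (batch, batch), values in [-1, 1]

    # Identity affinity matrix: 1 for same-speaker pairs, 0 otherwise.
    # This equals the Gram matrix of the one-hot speaker identity vectors.
    id_affinity = (speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)).float()

    # Penalizing the gap between the two matrices pulls same-speaker pairs
    # together (intra-speaker compactness) and pushes different-speaker pairs
    # apart (inter-speaker separability). MSE here is an assumed choice.
    return F.mse_loss(emb_affinity, id_affinity)

# Example usage with random data (batch of 8 embeddings, 4 speakers):
if __name__ == "__main__":
    x = torch.randn(8, 256)
    y = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    print(affinity_loss(x, y).item())
```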