Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification

2021 
Attention-based models have recently shown powerful representation learning ability in speaker recognition. However, most of the attention mechanism based models primarily focus on pooling layers. In this work, we present an end-to-end speaker verification system which leverage time-frequency and channel features hierarchically. To further improve system performance, we employ Large Margin Cosine Loss to optimize the model to determine the optimal loss function. We carry out experiments on the VoxCeleb1 datasets to evaluate the effectiveness of our methods. The results suggest that our best system outperforms the i-vector + PLDA and x-vector system by 53.3% and 7.6%, respectively.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    22
    References
    0
    Citations
    NaN
    KQI
    []