Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning
2021
The pooling function plays a vital role in the segment-level deep speaker embedding learning framework. One common approach is to compute statistics of the temporal features, with mean-based temporal average pooling (TAP) and temporal statistics pooling (TSTP), which combines mean and standard deviation, being two typical examples. Empirically, researchers observe a large performance degradation in x-vector systems when the standard deviation is removed. Based on this observation, in this paper we design a set of experiments to quantitatively analyze the effectiveness of different statistics, including an investigation and comparison of pooling functions based on standard deviation, covariance, and the $\ell_p$-norm. Experiments are carried out on VoxCeleb and SRE16, and the results show that second-order statistics based pooling functions yield better performance than TAP, and that the simple standard deviation alone achieves the best performance under all evaluation conditions.
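For concreteness, below is a minimal sketch of the pooling functions discussed in the abstract, written in PyTorch under the assumption that frame-level features have shape (batch, channels, frames); the function names and the $\ell_p$-norm formulation shown are illustrative, not the authors' exact implementation.

```python
import torch


def temporal_average_pooling(x: torch.Tensor) -> torch.Tensor:
    """TAP: mean of frame-level features over time.

    x: frame-level features of shape (batch, channels, frames).
    Returns a segment-level vector of shape (batch, channels).
    """
    return x.mean(dim=-1)


def temporal_statistics_pooling(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """TSTP: concatenate the temporal mean and standard deviation.

    Returns a segment-level vector of shape (batch, 2 * channels).
    """
    mean = x.mean(dim=-1)
    # eps guards against zero variance on constant frames
    std = torch.sqrt(x.var(dim=-1, unbiased=False) + eps)
    return torch.cat([mean, std], dim=-1)


def lp_norm_pooling(x: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    """One common formulation of lp-norm pooling: a power mean over time."""
    return x.abs().pow(p).mean(dim=-1).pow(1.0 / p)


# Toy usage: 4 utterances, 512-dim frame features, 200 frames each.
x = torch.randn(4, 512, 200)
print(temporal_average_pooling(x).shape)     # torch.Size([4, 512])
print(temporal_statistics_pooling(x).shape)  # torch.Size([4, 1024])
print(lp_norm_pooling(x, p=3.0).shape)       # torch.Size([4, 512])
```

The shape difference highlights the paper's comparison: TSTP doubles the pooled dimension by appending second-order information (standard deviation), which is what the removal experiments in the paper probe.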