TimeScaleNet: A Multiresolution Approach for Raw Audio Recognition
2019
In recent years, the use of deep learning techniques in audio signal processing has led the scientific community to develop machine learning strategies for building efficient representations from raw waveforms for machine hearing tasks. In the present paper, we show the benefit of a multiresolution approach: TimeScaleNet aims at learning an efficient representation of a sound by learning time dependencies both at the sample level and at the frame level. At the sample level, TimeScaleNet’s architecture introduces a new form of recurrent neural layer that acts as a learnable passband biquadratic digital IIR filterbank and self-adapts to the specific recognition task and dataset, with a large receptive field and very few learnable parameters. The resulting frame-level feature map is then processed by a residual network of depthwise separable atrous convolutions. This second scale of analysis encodes the time fluctuations at the frame timescale, in different learnt pooled frequency bands. In the present paper, TimeScaleNet is tested on the Speech Commands Dataset. We report a mean accuracy of 94.87 ± 0.24% (macro-averaged F1-score: 94.9 ± 0.24%) for this task.
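The sample-level layer described above behaves like a bank of passband biquadratic IIR filters whose parameters are learned. The following is a minimal NumPy sketch of that idea, not the paper's implementation: it uses the standard RBJ band-pass biquad parameterization, where the center frequency `f0` and quality factor `q` stand in for the learnable parameters, and applies each filter recurrently over the raw waveform to produce a band-wise feature map.

```python
import numpy as np

def bandpass_biquad_coeffs(f0, q, fs):
    """Standard (RBJ-cookbook) band-pass biquad coefficients.
    f0 and q play the role of the learnable parameters (sketch only)."""
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([alpha, 0.0, -alpha])
    a = np.array([1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha])
    return b / a[0], a / a[0]  # normalize so a[0] == 1

def biquad_filterbank(x, centers, q, fs):
    """Run each band-pass biquad recurrently over the raw waveform x.
    Returns an array of shape (n_bands, len(x))."""
    out = np.zeros((len(centers), len(x)))
    for k, f0 in enumerate(centers):
        b, a = bandpass_biquad_coeffs(f0, q, fs)
        y = np.zeros(len(x))
        for n in range(len(x)):
            # Direct Form I recurrence: feedforward on x, feedback on y
            y[n] = (b[0] * x[n]
                    + (b[1] * x[n - 1] if n >= 1 else 0.0)
                    + (b[2] * x[n - 2] if n >= 2 else 0.0)
                    - (a[1] * y[n - 1] if n >= 1 else 0.0)
                    - (a[2] * y[n - 2] if n >= 2 else 0.0))
        out[k] = y
    return out

fs = 16000                                   # sample rate (Hz)
t = np.arange(fs // 10) / fs                 # 100 ms of signal
x = np.sin(2 * np.pi * 1000 * t)             # 1 kHz test tone
feats = biquad_filterbank(x, centers=[250.0, 1000.0, 4000.0], q=2.0, fs=fs)
```

With this toy input, the band centered at 1 kHz retains far more energy than the 250 Hz and 4 kHz bands, illustrating the passband behavior; in the learned version, gradient descent would move the band parameters to suit the recognition task.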