Temporal context improves automatic recognition of call sequences in soundscape data

2020 
Convolutional neural networks (CNNs) are commonly employed for detecting animal vocalizations. We explored whether exploiting the temporal patterns of song notes can improve recognition. Fin whales (Balaenoptera physalus) produce sequences of low-frequency, down-swept calls (20 Hz pulses) over many minutes, and the timing between calls can be exploited to improve detection. We trained a base CNN model to detect 20 Hz pulses in 4 s audio segments. We then trained three variants of long short-term memory (LSTM) networks to process sequences produced by the CNN. In the first, the inputs to the LSTM were the scalar prediction scores from the CNN. The second processed sequences of the feature vectors produced by the CNN before classification. The third combined the CNN's feature vectors and scores. We conducted cross-validation experiments on recordings from the Southern California Bight collected between 2008 and 2014. All three variants outperformed the CNN alone: the precision-recall (PR) curves of the hybrid models dominated that of the base model, with improvements of 8%–13% in both peak F1-score and area under the PR curve. The second and third hybrid variants performed better than the first. CNN-LSTM hybrid models thus efficiently improve recognition of call sequences by incorporating temporal context.
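The three hybrid variants differ only in what the LSTM sees for each 4 s segment: the CNN's scalar score, its pre-classification feature vector, or the two concatenated. A minimal sketch of that input construction (function name, array shapes, and variant labels are illustrative assumptions, not the authors' code):

```python
import numpy as np

def build_lstm_inputs(scores, features, variant):
    """Assemble the per-segment LSTM input sequence for one recording.

    scores   : (T,)   scalar CNN prediction scores, one per 4 s segment
    features : (T, D) CNN feature vectors taken before the classifier head
    variant  : which hybrid variant's input to build (hypothetical labels)
    """
    if variant == "scores":      # variant 1: score sequence only
        return scores[:, None]                                   # (T, 1)
    if variant == "features":    # variant 2: feature-vector sequence
        return features                                          # (T, D)
    if variant == "combined":    # variant 3: features and score together
        return np.concatenate([features, scores[:, None]], axis=1)  # (T, D+1)
    raise ValueError(f"unknown variant: {variant!r}")
```

An LSTM then consumes these sequences to predict, per segment, whether a 20 Hz pulse is present given its temporal context; the richer inputs of variants 2 and 3 are consistent with their better reported performance.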