Parallel multi-head dot product attention for video summarization

2020 
The dominant video summarization deep learning models are based on recurrent or convolutional neural networks with complex architectures, and the best performing models also use attention mechanisms. We propose a novel method for supervised, keyframe-based video summarization by applying the well-known Transformer architecture. The current state-of-the-art method is based on a self-attention network; this network is simple and robust but still performs below human level. We propose a video summarization model based on the powerful Transformer language architecture, which performs a single forward and backward pass during training for the entire video-sequence-to-frame-score transformation. The approach is adapted for sequential regression followed by binarized evaluation, which allows comparing the predicted summary with human-level performance. Experiments show that the model is competitive in performance and also trains well in parallel. Our model achieves F1 scores of 48.1 and 50.19 on SumMe in the canonical and augmented settings, and 60.12 and 61.5 on TVSum.
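As a rough illustration only (not the authors' released code), the sketch below shows how a Transformer encoder with parallel multi-head dot-product attention can map a sequence of frame features to per-frame importance scores in a single forward pass, trained as regression and later binarized for evaluation; all module names, dimensions, and hyperparameters here are assumptions.

```python
# Minimal sketch (not the authors' implementation): a Transformer encoder
# that scores every frame of a video in one parallel forward pass.
# Module names and hyperparameters are illustrative assumptions;
# positional encoding is omitted for brevity.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # project CNN frame features
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                   # per-frame importance score

    def forward(self, frames):                              # frames: (batch, T, feat_dim)
        x = self.encoder(self.proj(frames))                 # multi-head self-attention over all frames
        return torch.sigmoid(self.head(x)).squeeze(-1)      # scores in [0, 1], shape (batch, T)

# Training as sequential regression against human importance scores;
# a binarization step (e.g. thresholding) would follow for summary evaluation.
model = FrameScorer()
frames = torch.randn(2, 300, 1024)                          # two videos, 300 frames each
target = torch.rand(2, 300)                                 # ground-truth frame scores
loss = nn.functional.mse_loss(model(frames), target)        # one backward pass per video batch
loss.backward()
```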