Video Saliency Prediction based on Spatial-temporal Two-stream Network

2018 
In this paper, we propose a novel two-stream neural network for video saliency prediction. Unlike traditional methods built on hand-crafted feature extraction and integration, our method automatically learns saliency-related spatiotemporal features from human fixations without any pre-processing, post-processing, or manual tuning. Video frames are routed through the spatial stream network, which computes a static (color) saliency map for each frame. For the temporal (dynamic) saliency maps, we propose a new two-stage temporal stream network composed of a pre-trained 2D-CNN model (SF-Net) that extracts saliency-related features and a shallow 3D-CNN model (Te-Net) that processes these features. This design reduces the amount of video gaze data required, improves training efficiency, and achieves high performance. A fusion network combines the outputs of both streams to generate the final saliency maps. In addition, a Convolutional Gaussian Priors (CGP) layer is proposed to learn the bias in human viewing behavior and further improve the performance of video saliency prediction. The proposed method is compared with state-of-the-art saliency models on two public video saliency benchmark datasets, and the results demonstrate that our model achieves advanced performance on video saliency prediction.
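To make the described pipeline concrete, below is a minimal PyTorch sketch of the overall structure: a spatial stream for static maps, a two-stage temporal stream (a per-frame 2D feature extractor standing in for SF-Net, followed by a shallow 3D-CNN standing in for Te-Net), a fusion network, and a learned Gaussian-prior layer. All layer sizes, the backbone, the fusion scheme, and the class names are assumptions for illustration; the paper defines the actual SF-Net, Te-Net, and CGP architectures.

```python
# Hypothetical sketch of the two-stream video saliency model described in the abstract.
# Shapes and layer choices are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn


class ConvGaussianPrior(nn.Module):
    """Learnable 2D Gaussian prior maps (viewing bias) mixed into the saliency map by a 1x1 conv."""
    def __init__(self, num_priors=8, size=(56, 56)):
        super().__init__()
        h, w = size
        ys = torch.linspace(-1, 1, h).view(h, 1).expand(h, w)
        xs = torch.linspace(-1, 1, w).view(1, w).expand(h, w)
        self.register_buffer("grid", torch.stack([xs, ys]))       # (2, H, W)
        self.mu = nn.Parameter(torch.zeros(num_priors, 2))        # prior centers
        self.log_sigma = nn.Parameter(torch.zeros(num_priors, 2)) # prior spreads
        self.mix = nn.Conv2d(num_priors + 1, 1, kernel_size=1)    # fuse priors with the input map

    def forward(self, sal):                                       # sal: (B, 1, H, W)
        sigma = self.log_sigma.exp().view(-1, 2, 1, 1)
        diff = self.grid.unsqueeze(0) - self.mu.view(-1, 2, 1, 1) # (P, 2, H, W)
        priors = torch.exp(-0.5 * (diff / sigma).pow(2).sum(1))   # (P, H, W)
        priors = priors.unsqueeze(0).expand(sal.size(0), -1, -1, -1)
        return self.mix(torch.cat([sal, priors], dim=1))


class SpatialStream(nn.Module):
    """2D-CNN that maps a single RGB frame to a static saliency map."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, frame):               # (B, 3, H, W) -> (B, 1, H, W)
        return self.features(frame)


class TemporalStream(nn.Module):
    """Two-stage temporal stream: per-frame 2D features (SF-Net stand-in) + shallow 3D-CNN (Te-Net stand-in)."""
    def __init__(self, sf_net, feat_channels=64):
        super().__init__()
        self.sf_net = sf_net                # pre-trained 2D-CNN feature extractor
        self.te_net = nn.Sequential(        # shallow 3D-CNN over the time axis
            nn.Conv3d(feat_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 1, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
        )

    def forward(self, clip):                # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.sf_net(clip.flatten(0, 1))        # (B*T, C, H, W)
        feats = feats.view(b, t, *feats.shape[1:])     # (B, T, C, H, W)
        feats = feats.permute(0, 2, 1, 3, 4)           # (B, C, T, H, W)
        return self.te_net(feats).mean(dim=2)          # (B, 1, H, W) dynamic saliency map


class TwoStreamSaliency(nn.Module):
    """Fuses static and dynamic maps, then applies the learned Gaussian priors."""
    def __init__(self, spatial, temporal, size=(56, 56)):
        super().__init__()
        self.spatial, self.temporal = spatial, temporal
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)
        self.cgp = ConvGaussianPrior(size=size)

    def forward(self, clip):                # clip: (B, T, 3, H, W)
        static = self.spatial(clip[:, -1])  # static map for the current (last) frame
        dynamic = self.temporal(clip)
        fused = self.fuse(torch.cat([static, dynamic], dim=1))
        return torch.sigmoid(self.cgp(fused))


# Usage sketch: a tiny stand-in for the pre-trained SF-Net and a 5-frame clip.
sf_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
model = TwoStreamSaliency(SpatialStream(), TemporalStream(sf_net), size=(56, 56))
maps = model(torch.randn(2, 5, 3, 56, 56))  # -> (2, 1, 56, 56) predicted saliency maps
```

The split mirrors the abstract's motivation: the 2D stage can be pre-trained on image saliency data, so only the shallow 3D stage needs video gaze data, which is what reduces the data requirement and training cost.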