Semi-supervised Video Object Segmentation via Learning Object-aware Global-local Correspondence

2021 
In the semi-supervised video object segmentation (VOS) task, temporally coherent object-level cues play a key role yet are hard to model accurately. To this end, this paper presents an object-aware global-local correspondence architecture that extracts inter-frame temporally coherent object-level features for accurate VOS. Specifically, we first generate a set of object masks from the ground-truth segmentation, and then squeeze the current-frame representation inside each object mask into a global object embedding. Second, we compute the similarity between each embedding and the feature map, producing an object-aware weight for each pixel. The object-aware feature at each pixel is then constructed by summing the object embeddings weighted by their corresponding object-aware weights, which captures rich object category information. Third, to establish accurate correspondences between inter-frame temporally coherent cues, we further design a novel global-local correspondence module to refine the temporal feature representations. Finally, we augment the object-aware features with the global-local aligned information to produce a strong spatio-temporal representation, which is essential for more reliable pixel-wise segmentation prediction. Extensive evaluations on three popular VOS benchmarks, YouTube-VOS, DAVIS 2017 and DAVIS 2016, demonstrate that the proposed method achieves favourable performance compared to state-of-the-art methods.
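The object-aware feature construction described above (mask-pooled object embeddings, per-pixel similarity weights, weighted summation) can be sketched as follows. This is a minimal NumPy illustration under my own assumptions about shapes and the similarity function (a dot product with a softmax over objects); the paper's actual implementation details are not given in the abstract, and all names here are hypothetical.

```python
import numpy as np

def object_aware_features(feat, masks):
    """Hypothetical sketch of the object-aware feature construction.

    feat : (C, H, W) current-frame feature map
    masks: (K, H, W) binary object masks from the ground-truth segmentation
    returns: (C, H, W) object-aware feature map
    """
    C, H, W = feat.shape
    K = masks.shape[0]
    f = feat.reshape(C, H * W)             # flatten spatial dims: (C, HW)
    m = masks.reshape(K, H * W)            # (K, HW)

    # 1) squeeze the features inside each mask into a global object
    #    embedding via masked average pooling
    area = m.sum(axis=1, keepdims=True) + 1e-6
    emb = (m @ f.T) / area                 # (K, C)

    # 2) similarity between each embedding and every pixel feature;
    #    softmax over objects gives a per-pixel object-aware weight
    sim = emb @ f                          # (K, HW)
    sim -= sim.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(sim)
    w /= w.sum(axis=0, keepdims=True)      # weights sum to 1 over K

    # 3) object-aware feature at each pixel = weighted sum of embeddings
    out = emb.T @ w                        # (C, HW)
    return out.reshape(C, H, W)
```

The output lies in the span of the K object embeddings at every pixel, which is one plausible way the representation could "capture rich object category information" as the abstract claims.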