Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation.

Chen Liang, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, Yi Yang

2021

Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues and easily leads to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranked first in the CVPR 2021 Referring Youtube-VOS challenge.
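
The second stage described above can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes each tracklet and each word has already been encoded as a fixed-size feature vector, uses a single attention head with random weights, and all names (`joint_attention`, `ground_tracklets`, `params`) are hypothetical. It only shows how attending over the concatenated tracklet and word tokens lets one layer capture tracklet-to-tracklet relations and tracklet-to-language interactions at the same time, before scoring each tracklet against the expression.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention over ALL tokens (tracklets and words),
    # so every tracklet attends to other tracklets and to the words.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v  # residual connection


def ground_tracklets(tracklet_feats, word_feats, params):
    # Returns one probability per tracklet of being the referred object.
    n = tracklet_feats.shape[0]
    tokens = np.concatenate([tracklet_feats, word_feats], axis=0)
    fused = joint_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    scores = fused[:n] @ params["w_score"]  # score tracklet tokens only
    return softmax(scores)


# Toy inputs (assumed): 5 candidate tracklets, 7 words, 16-dim features.
rng = np.random.default_rng(0)
dim = 16
params = {
    "w_q": rng.normal(size=(dim, dim)) * 0.1,
    "w_k": rng.normal(size=(dim, dim)) * 0.1,
    "w_v": rng.normal(size=(dim, dim)) * 0.1,
    "w_score": rng.normal(size=(dim,)),
}
tracklet_feats = rng.normal(size=(5, dim))
word_feats = rng.normal(size=(7, dim))

probs = ground_tracklets(tracklet_feats, word_feats, params)
best = int(np.argmax(probs))  # index of the tracklet selected as the referent
```

In a top-down pipeline of this kind, the masks of the selected tracklet would serve as the segmentation output for the whole video; with untrained random weights, as here, the selection itself is arbitrary.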

Keywords:

Perspective (graphical)
Computer science
Transformer
Natural language
Modal
Set (abstract data type)
Image (mathematics)
Artificial intelligence
Computer vision
Object (computer science)
Segmentation
