    MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video
    Abstract:
    Self-attention has become an integral component of recent network architectures, e.g., the Transformer, which dominate major image and video benchmarks, because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and have proposed a few MLP-like architectures, showing great potential. However, current MLP-like architectures are not good at capturing local details and lack a progressive understanding of core details in images and/or videos. To overcome this issue, we propose a novel MorphMLP architecture that focuses on capturing local details at the low-level layers while gradually shifting toward long-term modeling at the high-level layers. Specifically, we design a Fully-Connected-like layer, dubbed MorphFC, consisting of two morphable filters that gradually grow their receptive fields along the height and width dimensions. More interestingly, we propose to flexibly adapt our MorphFC layer to the video domain. To the best of our knowledge, we are the first to create an MLP-like backbone for learning video representations. Finally, we conduct extensive experiments on image classification, semantic segmentation, and video classification. Our MorphMLP, a self-attention-free backbone, can be as powerful as, and even outperform, self-attention-based models.
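    The abstract only describes the MorphFC design at a high level, so the snippet below is an illustrative PyTorch sketch of the general idea rather than the authors' implementation: along each spatial axis, neighbouring positions are grouped into chunks whose channels are flattened and mixed by a fully connected layer, with a small chunk length in early blocks (local detail) and a larger one in deeper blocks (long-range modeling). The class name `MorphFCSketch`, the `chunk_len` parameter, and the way the three branches are merged are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class MorphFCSketch(nn.Module):
    """Illustrative sketch of a MorphFC-style layer (not the authors' code).

    Idea from the abstract: split each row (or column) of the feature map into
    chunks of length `chunk_len`, flatten the channels inside a chunk, and mix
    them with a fully connected layer. A small `chunk_len` in early blocks
    captures local detail; a larger one in deeper blocks approaches
    long-range mixing.
    """

    def __init__(self, dim: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        # FC that mixes information within one chunk along the width axis.
        self.fc_w = nn.Linear(dim * chunk_len, dim * chunk_len)
        # FC that mixes information within one chunk along the height axis.
        self.fc_h = nn.Linear(dim * chunk_len, dim * chunk_len)
        # Per-pixel channel mixing.
        self.fc_c = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W are assumed divisible by chunk_len.
        b, h, w, c = x.shape
        L = self.chunk_len

        # Horizontal branch: group L neighbouring columns into one token.
        xw = x.reshape(b, h, w // L, L * c)
        xw = self.fc_w(xw).reshape(b, h, w, c)

        # Vertical branch: group L neighbouring rows into one token.
        xh = x.permute(0, 2, 1, 3).reshape(b, w, h // L, L * c)
        xh = self.fc_h(xh).reshape(b, w, h, c).permute(0, 2, 1, 3)

        # Merge the height, width, and channel paths (merge rule is assumed).
        return xw + xh + self.fc_c(x)


if __name__ == "__main__":
    layer = MorphFCSketch(dim=64, chunk_len=4)  # chunk_len would grow with depth
    out = layer(torch.randn(2, 56, 56, 64))
    print(out.shape)  # torch.Size([2, 56, 56, 64])
```

    For video, the same chunking scheme could plausibly be applied along the temporal axis as well, but the exact adaptation is not specified in the abstract.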
    Keywords: Representation, Perceptron