Multimodal Representation Learning via Maximization of Local Mutual Information

Ruizhi Liao, Daniel Moyer, Miriam Cha, Keegan Quigley, Seth A. Berkowitz, Steven Horng, Polina Golland, William M. Wells

2021

We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal is to learn useful image representations by exploiting the rich information in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information, using recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
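To make the idea of a local mutual information objective concrete, the following is a minimal sketch and not the implementation behind the results above. It assumes an InfoNCE-style contrastive lower bound with a plain dot-product critic standing in for a learned neural network discriminator; the abstract does not say which estimator is used. The function names (`info_nce_lower_bound`, `local_mi_estimate`), the tensor shapes, and the choice to average the bound over image regions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def info_nce_lower_bound(scores):
    """InfoNCE-style lower bound on mutual information.

    scores: (B, B) array where scores[i, j] is the critic score for
    image feature i paired with text feature j. Diagonal entries are the
    matched (positive) pairs; off-diagonal entries act as negatives.
    The returned value can never exceed log(B).
    """
    batch = scores.shape[0]
    # Numerically stable log-sum-exp over each row.
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    # Mean log-softmax of the positive pair, shifted by log(B).
    return float(np.mean(np.diag(scores) - log_norm) + np.log(batch))


def local_mi_estimate(local_img, text):
    """Average the bound over local image regions (an assumption of this sketch).

    local_img: (B, R, D) array, R local region features per image.
    text:      (B, D) array, one text feature per paired report.
    """
    num_regions = local_img.shape[1]
    bounds = []
    for k in range(num_regions):
        # Dot-product critic: a stand-in for a learned discriminator.
        scores = local_img[:, k, :] @ text.T  # shape (B, B)
        bounds.append(info_nce_lower_bound(scores))
    return float(np.mean(bounds))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 4, 32))      # toy "local image features"
    txt_matched = img.mean(axis=1)         # text features correlated with images
    txt_random = rng.normal(size=(8, 32))  # text features unrelated to images
    print("matched :", local_mi_estimate(img, txt_matched))
    print("random  :", local_mi_estimate(img, txt_random))
```

In a training setup, the encoders that produce `local_img` and `text` would be updated to increase this quantity; here the features are fixed toy arrays, so the script only shows that paired features score higher than unrelated ones.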

Keywords:

Encoder
Artificial intelligence
Machine learning
Maximization
Upper and lower bounds
Feature learning
Artificial neural network
Contextual image classification
Computer science
Mutual information
Image (mathematics)
