Fusing video and text data by integrating appearance and behavior similarity
2013
In this paper, we describe an algorithm for multi-modal entity co-reference resolution and present
experimental results using text and motion imagery data sources. Our model generates probabilistic
associations between entities mentioned in text and entities detected in video data by jointly optimizing
measures of appearance and behavior similarity. Appearance similarity is calculated as a match between
proposition-derived entity attributes mentioned in text and the object appearance classification from video
sources. Behavior similarity is calculated from semantic information about entity movements, actions, and
interactions with other entities mentioned in text and detected in video sources. Our model achieved a 79%
F-score for text-to-video entity co-reference resolution; we show that entity interactions provide unique
features for resolving the variability present in text data and the ambiguity in the visual appearance of entities.
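The joint optimization described above can be illustrated with a minimal sketch: each text-mentioned entity is linked to the video track that maximizes a weighted combination of appearance and behavior similarity. All names, weights, and scores below are illustrative assumptions, not the paper's actual model or data.

```python
# Hypothetical sketch of joint appearance/behavior matching. The similarity
# tables, the linear combination, and the greedy argmax are all assumptions
# for illustration; the paper's probabilistic model is not reproduced here.

def joint_similarity(appearance_sim, behavior_sim, alpha=0.5):
    """Combine the two cues; alpha weights appearance against behavior."""
    return alpha * appearance_sim + (1 - alpha) * behavior_sim

def coreference(text_entities, video_tracks, app_sim, beh_sim, alpha=0.5):
    """Associate each text entity with its best-scoring video track."""
    links = {}
    for t in text_entities:
        scores = {v: joint_similarity(app_sim[t][v], beh_sim[t][v], alpha)
                  for v in video_tracks}
        links[t] = max(scores, key=scores.get)  # greedy best match
    return links

# Toy example: two text mentions, two detected video tracks.
app = {"man": {"trk1": 0.9, "trk2": 0.2}, "car": {"trk1": 0.3, "trk2": 0.8}}
beh = {"man": {"trk1": 0.7, "trk2": 0.4}, "car": {"trk1": 0.1, "trk2": 0.9}}
print(coreference(["man", "car"], ["trk1", "trk2"], app, beh))
# → {'man': 'trk1', 'car': 'trk2'}
```

A weighted sum is the simplest way to fuse the two cues; when appearance is ambiguous (similar scores across tracks), the behavior term dominates the decision, which is the intuition behind using interactions to resolve visual ambiguity.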