OBJECT COUNTS! BRINGING EXPLICIT DETECTIONS BACK INTO IMAGE CAPTIONING
2018
The use of explicit object detectors as an intermediate
step to image captioning – which
used to constitute an essential stage in early
work – is often bypassed in the currently dominant
end-to-end approaches, where the language
model is conditioned directly on a midlevel
image embedding. We argue that explicit
detections provide rich semantic information,
and can thus be used as an interpretable representation
to better understand why end-to-end
image captioning systems work well. We provide
an in-depth analysis of end-to-end image
captioning by exploring a variety of cues that
can be derived from such object detections.
Our study reveals that end-to-end image captioning
systems rely on matching image representations
to generate captions, and that encodings of
the frequency, size and position of objects
are complementary and all play a role in
forming a good image representation. It also
reveals that different object categories contribute
in different ways towards image captioning.
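
To make the cues named above concrete, here is a minimal sketch of how a list of object detections could be turned into a fixed-length image representation that encodes frequency, size and position per category. It is an illustration under stated assumptions, not the authors' implementation: the function name `encode_detections`, the `(x, y, width, height)` pixel box layout, and the choice to take size and position from the largest instance of each category are all hypothetical.

```python
def encode_detections(detections, categories, image_width, image_height):
    """Encode detections as a flat vector with four values per category:
    [count, largest normalized area, center x, center y].

    Assumptions (not taken from the source): boxes are (x, y, width, height)
    in pixels, and size/position come from the largest instance of a category.
    """
    index = {name: i for i, name in enumerate(categories)}
    features = [[0.0, 0.0, 0.0, 0.0] for _ in categories]
    image_area = float(image_width * image_height)

    for name, (x, y, w, h) in detections:
        if name not in index:
            continue  # ignore categories outside the fixed vocabulary
        slot = features[index[name]]
        slot[0] += 1.0  # frequency cue: number of instances
        area = (w * h) / image_area
        if area > slot[1]:
            slot[1] = area                         # size cue
            slot[2] = (x + w / 2.0) / image_width   # position cue (x)
            slot[3] = (y + h / 2.0) / image_height  # position cue (y)

    # Flatten the per-category slots into a single vector.
    return [value for slot in features for value in slot]


# Hypothetical detections for a 200x200 image.
example = encode_detections(
    [("dog", (0, 0, 50, 50)), ("dog", (50, 50, 100, 100)), ("ball", (10, 10, 20, 20))],
    ["dog", "ball", "cat"],
    200,
    200,
)
```

Dropping or zeroing one group of values (counts, areas, or centers) gives a simple way to probe how much each cue contributes to a representation like this one.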