Deep People Detection: A Comparative Study of SSD and LSTM-decoder

2018 
In this paper, we present a comparative study of two state-of-the-art object detection architectures: an end-to-end CNN-based framework called SSD [1] and an LSTM-based framework [2], which we refer to as LSTM-decoder. Specifically, we study the two architectures in the context of people head detection on a few benchmark datasets containing small to moderately large numbers of head instances that appear at varying scales and occlusion levels. To better capture the pros and cons of the two architectures, we apply them with several deep feature extractors (e.g., Inception-V2, Inception-ResNet-V2 and MobileNet-V1) and report the accuracy, speed and generalization ability of each approach. Our experimental results show that while the LSTM-decoder can be more accurate at detecting smaller head instances, especially in the presence of occlusion, SSD's much higher detection speed and superior ability to generalize over multiple scales make it an ideal choice for real-time people detection.
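SSD handles objects of different sizes by predicting from several feature maps, each assigned default boxes of a different scale [1]. The sketch below illustrates that mechanism. It assumes the scheme described in the original SSD paper (scales spaced linearly between s_min = 0.2 and s_max = 0.9, aspect ratios {1, 2, 3, 1/2, 1/3}, plus one extra square box of scale sqrt(s_k * s_{k+1})) rather than the exact configuration used in this study. The function names are hypothetical, and treating 1.0 as the "next" scale of the last map is an assumed implementation convention.

```python
import math
from typing import List, Sequence, Tuple


def ssd_scales(num_maps: int, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Scale s_k for each of the m feature maps, spaced linearly from s_min to s_max."""
    if num_maps < 2:
        raise ValueError("linear spacing needs at least two feature maps")
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shapes(
    scale: float,
    next_scale: float,
    aspect_ratios: Sequence[float] = (1.0, 2.0, 3.0, 1 / 2, 1 / 3),
) -> List[Tuple[float, float]]:
    """(width, height) of each default box at one feature-map location, relative to image size."""
    # w = s * sqrt(a), h = s / sqrt(a): every box keeps area s^2 while its shape changes.
    shapes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box at the intermediate scale sqrt(s_k * s_{k+1}).
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes


if __name__ == "__main__":
    scales = ssd_scales(6)
    print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
    # Assumption: the last feature map uses 1.0 as its "next" scale.
    next_scales = scales[1:] + [1.0]
    for k, (s, s_next) in enumerate(zip(scales, next_scales), start=1):
        shapes = default_box_shapes(s, s_next)
        print(f"map {k}: scale={s:.2f}, {len(shapes)} boxes per location")
```

Because coarser feature maps receive larger default boxes, each map only has to match heads within a narrow size band. The settings actually used in the study (for example, via a detection framework's anchor generator) may differ from these defaults.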