Precise knowledge of a crowd's size, density and flow can provide valuable information for safety and security applications, event planning, architectural design and the analysis of consumer behavior. Building a powerful machine learning model for such applications requires a large, accurate and reliable dataset. Unfortunately, the existing crowd counting and density estimation benchmark datasets are not only limited in size, but their annotations are also, in general, too time-consuming to produce. This paper addresses this issue through a content-aware technique that combines the Chan-Vese segmentation algorithm, a two-dimensional Gaussian filter and brute-force nearest neighbor search. The results show that by simply replacing the commonly used density map generators with the proposed method, a higher level of accuracy can be achieved with existing state-of-the-art models.
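The abstract does not spell out how the three components interact, but the geometry-adaptive idea they suggest can be sketched as follows: each head annotation is smoothed with a 2-D Gaussian whose bandwidth follows the brute-force k-nearest-neighbour distance. This is a minimal Python sketch; the Chan-Vese step (e.g. `skimage.segmentation.chan_vese`), which would additionally mask the kernels to segmented crowd regions, is omitted, and the function name and the parameters `k` and `beta` are illustrative rather than values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_density_map(shape, points, k=3, beta=0.3):
    """Illustrative geometry-adaptive density map: one 2-D Gaussian per
    head annotation, with bandwidth set by brute-force k-NN distance."""
    h, w = shape
    density = np.zeros((h, w), dtype=np.float32)
    pts = np.asarray(points, dtype=np.float32)     # (N, 2) as (x, y)
    for i, (x, y) in enumerate(pts):
        # Brute-force nearest-neighbour search over all other annotations.
        d = np.sqrt(((pts - pts[i]) ** 2).sum(axis=1))
        d = np.sort(d)[1:k + 1]                    # drop the self-distance
        sigma = beta * d.mean() if d.size else 4.0  # fallback for lone heads
        impulse = np.zeros((h, w), dtype=np.float32)
        impulse[min(int(y), h - 1), min(int(x), w - 1)] = 1.0
        density += gaussian_filter(impulse, sigma)  # 2-D Gaussian kernel
    return density  # integrates (approximately) to the head count
```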
In this paper we present the design and evaluation of an end-to-end trainable deep neural network with a visual attention mechanism for memorability estimation in still images. We analyze the suitability of transferring deep models from image classification to the memorability task. Furthermore, we study the impact of the attention mechanism on memorability estimation and evaluate our network on the SUN Memorability and LaMem datasets. Our network outperforms existing state-of-the-art models on both datasets in terms of Spearman's rank correlation as well as mean squared error, closely matching human consistency.
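As a rough illustration of such an architecture, the following PyTorch sketch combines a backbone transferred from image classification with a soft spatial attention map and a regression head. The backbone choice, the form of the attention, and the sigmoid output range are assumptions for the sketch, not the paper's specification.

```python
import torch
import torch.nn as nn
from torchvision import models

class AttentionMemNet(nn.Module):
    """Sketch: soft spatial attention over transferred CNN features,
    followed by a regressor predicting a memorability score."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V2")  # transfer learning
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attn = nn.Conv2d(2048, 1, kernel_size=1)  # per-location score
        self.head = nn.Linear(2048, 1)                 # memorability in [0, 1]

    def forward(self, x):
        f = self.features(x)                                # (B, 2048, H, W)
        a = torch.softmax(self.attn(f).flatten(2), dim=-1)  # attention weights
        f = (f.flatten(2) * a).sum(dim=-1)                  # weighted pooling
        return torch.sigmoid(self.head(f)).squeeze(-1)
```

Training such a model against ground-truth memorability scores with a mean-squared-error loss would match the evaluation metrics named in the abstract.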
Multi-camera fusion is rapidly becoming an emerging research area, especially for visual surveillance applications. Data fusion can be achieved with calibrated cameras, either by calibrating prior to use following standard techniques [1], or through a learning mechanism in a 3D Cartesian frame [2], typically the scene ground plane. In this paper we describe a method for merging video data acquired by two overlapping views, learning the camera registration from the observed dynamics. Scene dynamics in each independent view can be modeled as a mixture of Gaussian components, and the dynamics of the two views can be coupled by assuming stochastic correlation between the underlying processes.
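One way to read the proposed coupling, sketched here under strong assumptions: fit a Gaussian mixture to the motion observations of each view, then pair the mixture components whose posterior activations are most correlated over time. The function below and its correlation-based matching are illustrative stand-ins for the paper's stochastic-correlation formulation, not its actual method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def couple_views(traj_a, traj_b, n_components=5):
    """Sketch: model each view's dynamics as a Gaussian mixture, then
    pair components whose activations correlate most strongly over time.
    traj_a, traj_b: (T, 2) image coordinates observed simultaneously."""
    gmm_a = GaussianMixture(n_components).fit(traj_a)
    gmm_b = GaussianMixture(n_components).fit(traj_b)
    # Posterior responsibilities act as per-frame activation signals.
    ra = gmm_a.predict_proba(traj_a)            # (T, K)
    rb = gmm_b.predict_proba(traj_b)            # (T, K)
    # Correlate every component pair across time; high correlation
    # suggests the two components observe the same ground-plane region.
    corr = np.corrcoef(ra.T, rb.T)[:n_components, n_components:]
    return corr.argmax(axis=1)                  # component mapping A -> B
```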
A real-time method that automatically creates a visual memory of a scene using the growing neural gas (GNG) algorithm is described. The memory consists of a graph where nodes encode the visual information of a video stream as a limited set of representative images. GNG nodes are automatically generated and dynamically clustered. This method could be employed by robotic platforms in exploratory and rescue missions.
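For concreteness, here is a compact sketch of the standard growing neural gas update (after Fritzke, 1995) operating on frame-descriptor vectors; the parameter values are illustrative, and the removal of isolated nodes is omitted for brevity. The clustering of nodes and the selection of representative images described in the abstract would sit on top of such a graph.

```python
import numpy as np

class GrowingNeuralGas:
    """Minimal GNG sketch: nodes summarize a stream of frame descriptors."""
    def __init__(self, dim, eps_b=0.2, eps_n=0.006, age_max=50,
                 step_new=100, alpha=0.5, decay=0.995):
        self.w = [np.random.rand(dim), np.random.rand(dim)]  # node weights
        self.err = [0.0, 0.0]                                # accumulated error
        self.edges = {}                                      # {(i, j): age}
        self.params = (eps_b, eps_n, age_max, step_new, alpha, decay)
        self.t = 0

    def fit_one(self, x):
        eps_b, eps_n, age_max, step_new, alpha, decay = self.params
        d = [np.linalg.norm(x - w) for w in self.w]
        s1, s2 = np.argsort(d)[:2]                 # winner and runner-up
        self.err[s1] += d[s1] ** 2
        self.w[s1] += eps_b * (x - self.w[s1])     # move winner toward input
        for (i, j) in list(self.edges):
            if s1 in (i, j):
                self.edges[(i, j)] += 1            # age edges at the winner
                other = j if i == s1 else i
                self.w[other] += eps_n * (x - self.w[other])
        self.edges[tuple(sorted((s1, s2)))] = 0    # refresh winner edge
        self.edges = {e: a for e, a in self.edges.items() if a <= age_max}
        self.t += 1
        if self.t % step_new == 0:                 # grow: split worst node
            q = int(np.argmax(self.err))
            nbrs = [j if i == q else i for (i, j) in self.edges if q in (i, j)]
            if nbrs:
                f = max(nbrs, key=lambda n: self.err[n])
                self.w.append((self.w[q] + self.w[f]) / 2)
                self.err[q] *= alpha; self.err[f] *= alpha
                self.err.append(self.err[q])
                r = len(self.w) - 1
                self.edges.pop(tuple(sorted((q, f))), None)
                self.edges[tuple(sorted((q, r)))] = 0
                self.edges[tuple(sorted((f, r)))] = 0
        self.err = [e * decay for e in self.err]   # global error decay
```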
A swarm robotics approach is adopted to design a fully autonomous multi-robot solution to the problem of locating generic targets within a given search space. A proof-of-concept system is developed and tested in a 3D simulation environment. A series of laboratory experiments is carried out to assess the performance of the system on the task of localising and monitoring generic targets, with reference to a counter-IED scenario. Further experiments evaluate the robustness of the system to robot failure and its scalability.
Video context analysis is an active and vibrant research area that provides the means for extracting, analyzing and understanding the behavior of both single and multiple targets.
It is possible to model avatars that learn to simulate object manipulations and other complex actions. A number of applications may benefit from this technique, including safety, ergonomics, film animation and many others. Current techniques control avatars manually, scripting what they can do by imposing constraints on their physical and cognitive model. In this paper we show how avatars in a controlled environment can learn behaviors as compositions of simple actions. The avatar learning process is described in detail for a generic behavior and tested in simple experiments. Local and global metrics are introduced to optimize the selection of a set of actions from the learnt pool. The performance on the learnt tasks is qualitatively compared with human performance.
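The metrics themselves are not specified in the abstract; purely as an illustration, selecting actions from a learnt pool could look like the following greedy composition, where `local_score` and `global_score` are hypothetical placeholders for the local and global metrics and the integer state is a toy stand-in for the avatar's environment.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    """Hypothetical learnt action with placeholder scoring hooks."""
    name: str
    apply: Callable[[int], int]                 # state transition (toy: ints)
    local_score: Callable[[int], float]         # fit to the current state
    global_score: Callable[[int, int], float]   # progress toward the goal

def compose_behaviour(pool, state, goal, horizon=5, w_local=0.5):
    """Greedily compose a behaviour from the learnt pool, trading off
    the local metric against the global one via w_local."""
    plan = []
    for _ in range(horizon):
        best = max(pool, key=lambda a: w_local * a.local_score(state)
                   + (1 - w_local) * a.global_score(state, goal))
        plan.append(best.name)
        state = best.apply(state)               # simulate the action's effect
        if state == goal:
            break
    return plan
```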
This paper addresses the problem of detecting and localizing abnormal activities in crowded scenes. A spatiotemporal Laplacian eigenmap method is proposed to extract different crowd activities from videos. This is achieved by learning the spatial and temporal variations of local motions in an embedded space. We employ representatives of different activities to construct a model which characterizes the regular behavior of a crowd. This model of regular crowd behavior allows the detection of abnormal crowd activities in both local and global contexts and the localization of regions which show abnormal behavior. Experiments on recently published datasets show that the proposed method achieves results comparable to state-of-the-art methods without sacrificing computational simplicity.
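A minimal sketch of the embedding step, assuming local motion descriptors (e.g. per-cell optical-flow statistics) as input: build heat-kernel affinities, form the graph Laplacian, and keep the leading non-trivial generalized eigenvectors. The spatiotemporal construction of the affinities and the abnormality scoring against regular-behavior representatives are simplified away here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(features, n_dims=3, sigma=1.0):
    """Embed local motion descriptors with a Laplacian eigenmap.
    Abnormality can then be scored as the distance to representatives
    of regular behavior in the embedded space."""
    d2 = cdist(features, features, "sqeuclidean")
    w = np.exp(-d2 / (2 * sigma ** 2))      # heat-kernel affinities
    deg = w.sum(axis=1)
    lap = np.diag(deg) - w                  # unnormalized graph Laplacian
    # Generalized eigenproblem L v = lambda D v; eigenvalues ascend, so
    # skip the trivial constant eigenvector and keep the next n_dims.
    vals, vecs = eigh(lap, np.diag(deg))
    return vecs[:, 1:n_dims + 1]
```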