The human face is the most well-researched object in computer vision, mainly because (1) it is a highly deformable object whose appearance changes dramatically under different poses, expressions, illuminations, etc., (2) the applications of face recognition are numerous and span several fields, and (3) humans are known to perform facial analysis, and identity recognition in particular, extremely efficiently and accurately. Although a lot of research has been conducted over the past years, the problem of face recognition from images captured in uncontrolled environments, with varying illumination and/or pose, remains open. This is also attributed to the presence of outliers (such as partial occlusion, cosmetics, eyeglasses, etc.) and to changes due to age. In this chapter, the authors provide an overview of the existing fully automatic face recognition technologies for uncontrolled scenarios. They present the existing databases, summarize the challenges that arise in such scenarios, and conclude by presenting the opportunities that exist in the field.
Hidden conditional random fields (HCRFs) are discriminative latent variable models that have been shown to successfully learn the hidden structure of a given classification problem (provided an appropriate validation of the number of hidden states). In this brief, we present the infinite HCRF (iHCRF), a nonparametric model based on hierarchical Dirichlet processes that is capable of automatically learning the optimal number of hidden states for a classification task. We show how the model hyperparameters are learned with an effective Markov chain Monte Carlo sampling technique, and we explain the process that underlies our iHCRF model with the Restaurant Franchise Rating Agencies analogy. We show that the iHCRF converges to a correct number of represented hidden states and outperforms the best finite HCRFs, chosen via cross-validation, on the difficult tasks of recognizing instances of agreement, disagreement, and pain. Moreover, the iHCRF achieves this performance in significantly less total training, validation, and testing time.
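The following is a minimal, illustrative sketch of the core nonparametric idea, not the iHCRF sampler itself (which relies on a hierarchical Dirichlet process and the Restaurant Franchise Rating Agencies construction): hidden-state assignments are drawn from a Chinese Restaurant Process, so the number of occupied states is inferred from the data rather than fixed by cross-validation. The concentration parameter `alpha` and the data size are placeholder values.

```python
import numpy as np

def crp_state_assignments(n_items, alpha, rng=None):
    """Chinese Restaurant Process sketch: items join an unbounded pool of
    hidden states; the number of occupied states grows with the data
    instead of being fixed in advance."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []            # items currently assigned to each occupied state
    assignments = []
    for _ in range(n_items):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        state = rng.choice(len(probs), p=probs)
        if state == len(counts):   # open a brand-new hidden state
            counts.append(1)
        else:
            counts[state] += 1
        assignments.append(state)
    return assignments, len(counts)

assignments, n_states = crp_state_assignments(n_items=500, alpha=2.0)
print(f"occupied hidden states: {n_states}")
```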
The design of neural network layers plays a crucial role in determining the efficiency and performance of various computer vision tasks. However, most existing layers compromise between fast feature extraction and reasoning abilities, resulting in suboptimal outcomes. In this paper, we propose a novel and efficient operator for representation learning that can dynamically adjust to the underlying data structure. We introduce a general Dynamic Fully-Connected (DFC) layer, a non-linear extension of a fully-connected layer that has a learnable receptive field and is instance-adaptive and spatially aware. We propose to use CP decomposition to reduce the complexity of the DFC layer without compromising its expressivity. We then leverage Summed Area Tables and Modulation to create an adaptive receptive field that can process the input with constant complexity. We evaluate the effectiveness of our method on image classification and other downstream vision tasks using both hierarchical and isotropic architectures. Our results demonstrate that our method outperforms other commonly used layers by a significant margin while keeping a fixed computational budget, thereby establishing a new strategy for efficiently designing neural architectures that capture the multi-scale features of the input without increasing complexity.
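As a point of reference, the sketch below shows only the Summed Area Table primitive mentioned above: once the integral image is computed, the sum over any rectangular window costs four lookups, independently of the window size. The learnable Modulation and the CP-decomposed DFC weights of the proposed layer are omitted, and tensor shapes are assumptions.

```python
import torch

def summed_area_table(x):
    """Integral image over the spatial dims of a (B, C, H, W) tensor."""
    return x.cumsum(dim=-1).cumsum(dim=-2)

def box_sum(sat, y0, x0, y1, x1):
    """Sum of x[..., y0:y1+1, x0:x1+1] in O(1) via four SAT lookups."""
    total = sat[..., y1, x1].clone()
    if y0 > 0:
        total -= sat[..., y0 - 1, x1]
    if x0 > 0:
        total -= sat[..., y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += sat[..., y0 - 1, x0 - 1]
    return total

x = torch.rand(1, 3, 32, 32)
sat = summed_area_table(x)
# Average over a 9x9 window centred at (16, 16); cost is independent of window size.
window_mean = box_sum(sat, 12, 12, 20, 20) / 81.0
```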
Construction and fitting of Statistical Deformable Models (SDMs) lies at the core of the computer vision and image analysis disciplines. SDMs can be used to estimate an object's shape, pose, parts, and landmarks using only static imagery captured from monocular cameras. One of the first and most popular families of SDMs is that of Active Appearance Models (AAMs), which use a generative parameterization of object appearance and shape. The fitting process of AAMs is usually conducted by solving a non-linear optimization problem. In this talk I will start with a brief introduction to AAMs and continue with a description of supervised methods for AAM fitting. Subsequently, under this framework, I will motivate current techniques developed in my group that capitalize on the combined power of Deep Convolutional Neural Networks (DCNNs) and Recurrent Neural Networks (RNNs) for optimal deformable object modeling and fitting. Finally, I will show how we can extract the dense shape of objects by building and fitting 3D Morphable Models. Examples will be given using my group's publicly available Menpo toolbox (http://www.menpo.org/).
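For readers unfamiliar with the generative parameterization mentioned above, here is a minimal sketch of a PCA-based linear shape model in the spirit of an AAM: a shape is generated as the mean shape plus a linear combination of basis shapes. The appearance model and the non-linear fitting procedure are omitted, and the class and variable names are hypothetical.

```python
import numpy as np

class LinearShapeModel:
    """Minimal PCA shape model: shape = mean + params @ basis.
    An AAM builds an analogous linear model for shape-normalised
    appearance (texture), which is omitted here."""

    def __init__(self, shapes, n_components=10):
        # shapes: (N, 2L) array, each row holding L landmark (x, y) pairs
        self.mean = shapes.mean(axis=0)
        centred = shapes - self.mean
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        self.basis = vt[:n_components]              # (n_components, 2L)

    def instance(self, params):
        """Generate a shape from low-dimensional parameters."""
        return self.mean + params @ self.basis

    def project(self, shape):
        """Least-squares estimate of the parameters for a given shape."""
        return (shape - self.mean) @ self.basis.T
```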
The de facto algorithm for facial landmark estimation involves running a face detector and subsequently fitting a deformable model within the returned bounding box. This pipeline suffers from two basic problems: i) the detection and deformable fitting steps are performed independently, so the detector might not provide the best-suited initialisation for the fitting step; ii) face appearance varies hugely across poses, which makes deformable face fitting very challenging, so distinct models have to be used (e.g., one for profile and one for frontal faces). In this work, we propose the first, to the best of our knowledge, joint multi-view convolutional network that handles large pose variations across faces in-the-wild and elegantly bridges the face detection and facial landmark localisation tasks. Existing joint face detection and landmark localisation methods focus only on a very small set of landmarks. By contrast, our method can detect and align a large number of landmarks for semi-frontal (68 landmarks) and profile (39 landmarks) faces. We evaluate our model on a plethora of datasets, including standard static image datasets such as IBUG, 300W, COFW, and the latest Menpo Benchmark for both semi-frontal and profile faces. A significant improvement over state-of-the-art methods on deformable face tracking is witnessed on the 300VW benchmark. We also demonstrate state-of-the-art results for face detection on the FDDB and MALF datasets.
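Purely as an illustration of the multi-view idea (this is not the paper's architecture), the toy module below maps a shared feature to a face score, a view probability, and two landmark sets, 68 points for semi-frontal faces and 39 for profile faces. All layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ToyMultiViewHead(nn.Module):
    """Illustrative multi-view head: one shared feature feeds a face score,
    a semi-frontal/profile view classifier, and two landmark regressors."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.face_score = nn.Linear(feat_dim, 1)
        self.view_logits = nn.Linear(feat_dim, 2)       # semi-frontal vs profile
        self.frontal_pts = nn.Linear(feat_dim, 68 * 2)  # 68 landmarks
        self.profile_pts = nn.Linear(feat_dim, 39 * 2)  # 39 landmarks

    def forward(self, feat):
        return {
            "score": torch.sigmoid(self.face_score(feat)),
            "view": self.view_logits(feat).softmax(dim=-1),
            "landmarks_frontal": self.frontal_pts(feat).view(-1, 68, 2),
            "landmarks_profile": self.profile_pts(feat).view(-1, 39, 2),
        }

head = ToyMultiViewHead()
out = head(torch.randn(4, 256))   # one prediction set per input feature
```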
This paper presents a novel procedure for training a deep convolutional and recurrent neural network that takes into account both the available training data set and information extracted from similar networks trained on other relevant data sets. This information is incorporated into an extended loss function used for training, so that the network achieves improved performance when applied to the other data sets without forgetting the knowledge learned from the original data set. Facial expression and emotion recognition in-the-wild is the test-bed application used to demonstrate the improved performance achieved with the proposed approach. In this framework, we provide an experimental study on categorical emotion recognition using datasets from a very recent emotion recognition challenge.
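The exact extended loss is defined in the paper; the snippet below is only a hedged sketch of the general idea: a supervised term on the current dataset is combined with a soft-label term that keeps the new network's predictions close to those of a reference network trained on a related dataset. The weighting `lam` and the `temperature` are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F

def extended_loss(logits, targets, ref_logits, lam=0.5, temperature=2.0):
    """Illustrative combined objective: cross-entropy on the current data
    plus a KL term pulling the predictions towards those of a reference
    network trained on another relevant dataset."""
    task_loss = F.cross_entropy(logits, targets)
    soft_ref = F.softmax(ref_logits / temperature, dim=-1)
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    transfer_loss = F.kl_div(log_probs, soft_ref, reduction="batchmean") * temperature ** 2
    return task_loss + lam * transfer_loss
```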
Objective. The patterns of brain activity associated with different brain processes can be used to identify different brain states and make behavioural predictions. However, the relevant features are not readily apparent and accessible. Our aim is to design a system for learning informative latent representations from multichannel recordings of ongoing EEG activity. Approach. We propose a novel differentiable decoding pipeline consisting of learnable filters and a pre-determined feature extraction module. Specifically, we introduce filters parameterized by generalized Gaussian functions that offer a smooth derivative for stable end-to-end model training and allow for learning interpretable features. For the feature module, we use signal magnitude and functional connectivity estimates. Main results. We demonstrate the utility of our model on a new EEG dataset of unprecedented size (721 subjects), where we identify consistent trends of music perception and related individual differences. Furthermore, we train and apply our model on two additional datasets, specifically for emotion recognition on SEED and workload classification on the simultaneous task EEG workload dataset. The discovered features align well with previous neuroscience studies and offer new insights, such as marked differences in the functional connectivity profile between the left and right temporal areas during music listening, which agrees with the specialisation of the temporal lobes regarding music perception proposed in the literature. Significance. The proposed method offers strong interpretability of the learned features while reaching accuracy levels similar to those achieved by black-box deep learning models. This improved trustworthiness may promote the use of deep learning models in real-world applications. The model code is available at https://github.com/SMLudwig/EEGminer/.
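As a rough illustration of the learnable-filter idea (the actual parameterization used in the paper may differ), the module below applies a frequency-domain filter whose magnitude response is a generalized Gaussian with learnable centre, width, and shape; the response is smooth in all parameters, so it can be trained end to end. Parameter names, default values, and the FFT-based application are assumptions.

```python
import torch
import torch.nn as nn

class GeneralizedGaussianBandFilter(nn.Module):
    """Sketch: a band filter whose magnitude response over frequency is a
    generalized Gaussian with learnable centre, width, and shape."""

    def __init__(self, centre_hz=10.0, width_hz=4.0, shape=2.0):
        super().__init__()
        self.centre = nn.Parameter(torch.tensor(float(centre_hz)))
        self.width = nn.Parameter(torch.tensor(float(width_hz)))
        self.shape = nn.Parameter(torch.tensor(float(shape)))

    def forward(self, x, fs):
        # x: (batch, channels, time) EEG segment sampled at fs Hz
        spec = torch.fft.rfft(x, dim=-1)
        freqs = torch.fft.rfftfreq(x.shape[-1], d=1.0 / fs).to(x.device)
        response = torch.exp(
            -((freqs - self.centre).abs() / self.width.abs()) ** self.shape.abs()
        )
        return torch.fft.irfft(spec * response, n=x.shape[-1], dim=-1)
```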
Multi-Task Learning (MTL) has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm, such as a DNN. MTL is based on the assumption that the tasks under consideration are related; it therefore exploits shared knowledge to improve performance on each individual task. Tasks are generally considered to be homogeneous, i.e., to refer to the same type of problem. Moreover, MTL is usually based on ground truth annotations with full or partial overlap across tasks. In this work, we deal with heterogeneous MTL, simultaneously addressing detection, classification, and regression problems. We explore task-relatedness as a means for co-training, in a weakly supervised way, tasks whose annotations overlap little or not at all. Task-relatedness is introduced into MTL either explicitly, through prior expert knowledge, or through data-driven studies. We propose a novel distribution matching approach, in which knowledge exchange between tasks is enabled via matching of their predictions' distributions. Based on this approach, we build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks. We develop case studies for: i) continuous affect estimation, action unit detection, and basic emotion recognition; ii) attribute detection and face identification. We illustrate that co-training via task-relatedness alleviates negative transfer. Since FaceBehaviorNet learns features that encapsulate all aspects of facial behavior, we conduct zero-/few-shot learning to perform tasks beyond the ones it has been trained for, such as compound emotion recognition. In a very large experimental study utilizing 10 databases, we illustrate that our approach outperforms, by large margins, the state-of-the-art in all tasks and all databases, even those that have not been used in its training.
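The precise coupling losses are specified in the paper; the snippet below is only a hedged sketch of the distribution matching idea: when prior knowledge (expert-defined or data-driven) says a task-A prediction should follow from a task-B output, the divergence between the network's task-A distribution and the distribution implied by task B is penalised. The mapping from task B to task A is assumed to be given.

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(pred_task_a, coupled_from_task_b):
    """Illustrative coupling term between two related tasks: KL divergence
    between the network's task-A distribution and the distribution
    implied, via prior knowledge, by the task-B output."""
    log_p_a = F.log_softmax(pred_task_a, dim=-1)
    q_from_b = coupled_from_task_b.clamp_min(1e-8)
    q_from_b = q_from_b / q_from_b.sum(dim=-1, keepdim=True)
    return F.kl_div(log_p_a, q_from_b, reduction="batchmean")
```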
Developing powerful deformable face models requires massive annotated face databases on which techniques can be trained, validated, and tested. Manual annotation of each facial image in terms of landmarks requires a trained expert, and the workload is usually enormous. Fatigue is one reason why annotations are sometimes inaccurate. This is why the majority of existing facial databases provide annotations for only a relatively small subset of the training images. Furthermore, there is hardly any correspondence between the annotated landmarks across different databases. These problems make cross-database experiments almost infeasible. To overcome these difficulties, we propose a semi-automatic methodology for annotating massive face datasets. This is the first attempt to create a tool suitable for annotating massive facial databases. We employed our tool to create annotations for the MultiPIE, XM2VTS, AR, and FRGC Ver. 2 databases. The annotations will be made publicly available from http://ibug.doc.ic.ac.uk/resources/facial-point-annotations/. Finally, we present experiments which verify the accuracy of the produced annotations.