Visual attention estimation is an active field of research at the crossroads of different disciplines: computer vision, artificial intelligence, and medicine. One of the most common approaches to estimating a saliency map representing attention is based on the observed images. In this paper, we show that visual attention can be retrieved from EEG acquisitions. The results are comparable to traditional predictions from observed images, which is of great interest. For this purpose, a set of signals has been recorded and different models have been developed to study the relationship between visual attention and brain activity. The results are encouraging and comparable with approaches estimating attention from other modalities. The code and dataset considered in this paper have been made available at \url{https://figshare.com/s/3e353bd1c621962888ad} to promote research in the field.
As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to perform visual feature extraction end-to-end during training. This small contribution aims to suggest new ideas to improve the visual processing of traditional convolutional networks for visual question answering (VQA). In this paper, we propose to modulate a CNN augmented with self-attention by a linguistic input. We show encouraging relative improvements, suggesting directions for future research.
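As an illustration of the kind of linguistic modulation described above, here is a minimal FiLM-style sketch in PyTorch, assuming per-channel scale and shift parameters predicted from a question embedding; the module name, dimensions, and placement within the self-attention-augmented CNN are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LinguisticModulation(nn.Module):
    """FiLM-style modulation: scale and shift CNN feature maps with
    parameters predicted from a question embedding.
    Illustrative sketch, not the exact model used in the paper."""
    def __init__(self, question_dim, num_channels):
        super().__init__()
        # predicts per-channel (gamma, beta) from the linguistic input
        self.film = nn.Linear(question_dim, 2 * num_channels)

    def forward(self, feature_maps, question_emb):
        # feature_maps: (B, C, H, W), question_emb: (B, question_dim)
        gamma, beta = self.film(question_emb).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feature_maps + beta
```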
In recent years, deep neural networks have achieved wide success in various application domains. However, they require significant computational and memory resources, which severely hinders their deployment, notably on mobile devices or for real-time applications. Neural networks usually involve a large number of parameters, which correspond to the weights of the network. Such parameters, obtained through a training process, determine the performance of the network. However, they are also highly redundant. Pruning methods notably attempt to reduce the size of the parameter set by identifying and removing the irrelevant weights. In this paper, we examine the impact of the training strategy on pruning efficiency. Two training modalities are considered and compared: (1) fine-tuned and (2) trained from scratch. The experimental results obtained on four datasets (CIFAR10, CIFAR100, SVHN, and Caltech101) and for two different CNNs (VGG16 and MobileNet) demonstrate that a network pre-trained on a large corpus (e.g., ImageNet) and then fine-tuned on a particular dataset can be pruned much more efficiently (up to 80% parameter reduction) than the same network trained from scratch.
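For context, a minimal sketch of the two training modalities compared above, assuming global magnitude pruning with torch.nn.utils.prune and VGG16 from a recent torchvision; the actual pruning criterion, ratios, and training loops used in the paper may differ.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision import models

def magnitude_prune(model, amount=0.8):
    """Globally prune the smallest-magnitude weights of all conv/linear
    layers. Illustrative sketch; the paper's pruning criterion may differ."""
    parameters = [
        (m, "weight") for m in model.modules()
        if isinstance(m, (nn.Conv2d, nn.Linear))
    ]
    prune.global_unstructured(parameters,
                              pruning_method=prune.L1Unstructured,
                              amount=amount)
    return model

# (1) fine-tuned: start from ImageNet weights, fine-tune on the target
#     dataset (loop omitted), then prune
pretrained = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# (2) from scratch: random initialization, trained directly on the
#     target dataset (loop omitted), then prune
scratch = models.vgg16(weights=None)
```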
Visual attention estimation is an active field of research at the crossroads of different disciplines: computer vision, deep learning, and medicine. One of the most common approaches to estimating a saliency map representing attention is based on the observed images. In this paper, we show that visual attention can be retrieved from EEG acquisitions. The results are comparable to traditional predictions from observed images, which is of great interest. Since image-based saliency estimation is participant-independent, estimation from EEG could take subject specificity into account. For this purpose, a set of signals has been recorded, and different models have been developed to study the relationship between visual attention and brain activity. The results are encouraging and comparable with approaches estimating attention from other modalities. Being able to predict a visual saliency map from EEG could help research studying the relationship between brain activity and visual attention. It could also help in various applications: vigilance assessment during driving, neuromarketing, and the diagnosis and treatment of visual attention-related diseases. For the sake of reproducibility, the code and dataset considered in this paper have been made publicly available to promote research in the field.
Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. Researchers have already proposed several well-performing solutions for this task, but most focus on enhancing embeddings through different approaches such as triplet loss, quadruplet loss, data augmentation, and edge extraction. In this work, we tackle the problem from various angles. We start by examining the training data quality and show some of its limitations. Then, we introduce a Relative Triplet Loss (RTL), an adapted triplet loss that overcomes those limitations through loss weighting based on anchor similarity. Through a series of experiments, we demonstrate that replacing a triplet loss with RTL outperforms the previous state-of-the-art without the need for any data augmentation. In addition, we demonstrate why batch normalization is better suited for SBIR embeddings than l2-normalization and show that it significantly improves the performance of our models. We further investigate the model capacity required for the photo and sketch domains and demonstrate that the photo encoder requires a higher capacity than the sketch encoder, which validates the hypothesis formulated in [34]. Then, we propose a straightforward approach to efficiently train small models, such as ShuffleNetV2 [22], with only a marginal loss of accuracy through knowledge distillation. The same approach used with larger models enabled us to outperform previous state-of-the-art results and achieve a recall of 62.38% at k = 1 on The Sketchy Database [30].
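To illustrate the idea of weighting a triplet loss, here is a minimal sketch; the exact RTL weighting scheme is defined in the paper, and the function name and the pair_weight argument, assumed to encode anchor similarity, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_triplet_loss(anchor, positive, negative, pair_weight, margin=0.2):
    """Triplet loss whose contribution is weighted per triplet.
    'pair_weight' is assumed to encode how similar the anchor is to the
    negative it is contrasted with; the actual RTL weighting is defined
    in the paper and may differ from this sketch."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    per_triplet = F.relu(d_pos - d_neg + margin)
    return (pair_weight * per_triplet).mean()
```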
Introducing sparsity in a convnet has proven an efficient way to reduce its complexity while keeping its performance almost intact. Most of the time, sparsity is introduced using a three-stage pipeline: 1) training the model to convergence, 2) pruning the model, 3) fine-tuning the pruned model to recover performance. The last two steps are often performed iteratively, leading to reasonable results but also to a time-consuming process. In our work, we propose to remove the first step of the pipeline and to combine the two others in a single training-pruning cycle, allowing the model to jointly learn the optimal weights while being pruned. We do this by introducing a novel pruning schedule, named One-Cycle Pruning (OCP), which starts pruning from the beginning of training and continues until its very end. Experiments conducted on a variety of combinations of architectures (VGG-16, ResNet-18), datasets (CIFAR-10, CIFAR-100, Caltech-101), and sparsity values (80%, 90%, 95%) show that OCP not only consistently outperforms common pruning schedules such as One-Shot, Iterative, and Automated Gradual Pruning, but also drastically reduces the required training budget. Moreover, experiments following the Lottery Ticket Hypothesis show that OCP allows finding higher-quality and more stable pruned networks.
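A minimal sketch of a pruning schedule spanning the entire training run, in the spirit of what is described above; the polynomial shape and parameter names are illustrative assumptions, not the exact OCP schedule.

```python
def pruning_schedule(step, total_steps, final_sparsity=0.9, power=3):
    """Target sparsity at a given training step. Sparsity grows from 0 at
    the first step to 'final_sparsity' at the last step, so pruning is
    spread over the whole training run. The polynomial shape below is an
    illustrative assumption, not the exact One-Cycle Pruning schedule."""
    progress = min(step / total_steps, 1.0)
    return final_sparsity * (1 - (1 - progress) ** power)

# At every step, the smallest-magnitude weights are masked so that the
# model reaches pruning_schedule(step, total_steps) sparsity before the
# next weight update.
```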
Recent advances in video manipulation techniques have made synthetic media creation more accessible than ever before. Nowadays, video editing is so realistic that we cannot rely exclusively on our senses to assess the veracity of media content. With the amount of manipulated videos doubling every six months, we need sophisticated tools to process the huge amount of media shared all over the internet and to remove manipulated videos as fast as possible, thus reducing potential harm such as fueling disinformation or reducing trust in mainstream media. In this paper, we tackle the problem of face manipulation detection in video sequences, targeting modern facial manipulation techniques. Our method involves two networks: (1) a face identification network, extracting the faces contained in a video, and (2) a manipulation recognition network, considering the face as well as its neighbouring context to find potential artifacts indicating that the face was manipulated. More particularly, we propose to make use of neural network compression techniques such as pruning and knowledge distillation to create a lightweight solution able to rapidly process streams of videos. Our approach is validated on the DeepFake Detection Dataset, consisting of videos produced by 5 different manipulation techniques and reflecting the organic content found on the internet, and compared to state-of-the-art deepfake detection approaches.
FasterAI is a PyTorch-based library aiming to facilitate the use of deep neural network compression techniques such as sparsification, pruning, knowledge distillation, or regularization. The library is built with the purpose of enabling quick implementation and experimentation. More particularly, compression techniques leverage the callback systems of libraries such as fastai and PyTorch Lightning to provide a user-friendly and high-level API. The main asset of FasterAI is its simplicity of use, which remains lightweight yet powerful. Indeed, because it was developed in a very granular way, users can create thousands of unique experiments by using different combinations of parameters. In this paper, we focus on the sparsifying capabilities of FasterAI, which represent the core of the library. Performing sparsification of a neural network in FasterAI only requires a single additional line of code in the traditional training loop, yet allows performing state-of-the-art techniques such as Lottery Ticket Hypothesis experiments.
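As an illustration of this single-line usage, here is a sketch of a fastai training loop with FasterAI's sparsification callback; the parameter values and helper names (large_final, one_cycle) are taken from the library's documented callback interface as an assumption and may differ between versions.

```python
from fastai.vision.all import *
from fasterai.sparse.all import *

# dls (DataLoaders) and model are defined as in any fastai training script
learn = Learner(dls, model, metrics=accuracy)

# The single additional line: attach the sparsification callback
# (names below are assumed from the documented interface).
sp_cb = SparsifyCallback(sparsity=50, granularity='weight',
                         context='local', criteria=large_final,
                         schedule=one_cycle)

learn.fit_one_cycle(10, cbs=sp_cb)
```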
The advent of sparsity-inducing techniques in neural networks has been of great help in the last few years. Indeed, those methods have made it possible to find lighter and faster networks, able to perform more efficiently in resource-constrained environments such as mobile devices or heavily requested servers. Such sparsity is generally imposed on the weights of neural networks, reducing the footprint of the architecture. In this work, we go one step further by imposing sparsity jointly on the weights and on the input data. This can be achieved through a three-step process: 1) impose a certain structured sparsity on the weights of the network; 2) trace back the input features corresponding to zeroed blocks of weights; 3) remove the useless weights and input features and retrain the network. Performing pruning both on the network and on the input data not only allows for extreme reduction in terms of parameters and operations but can also serve as an interpretation process. Indeed, with the help of data pruning, we now know which input features are useful for the network to keep its performance. Experiments conducted on a variety of architectures and datasets (an MLP validated on MNIST and CIFAR10/100, and ConvNets, VGG16 and ResNet18, validated on CIFAR10/100 and Caltech101 respectively) show that it is possible to achieve additional gains in terms of total parameters and FLOPs by performing pruning on input data, while also increasing accuracy.
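A minimal sketch of steps 2) and 3) for an MLP, assuming structured sparsity has already zeroed entire columns of the first layer's weight matrix; the helper name and the exact tracing procedure (and its ConvNet counterpart) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def prune_input_features(first_layer: nn.Linear, X: torch.Tensor):
    """After structured sparsity has zeroed entire columns of the first
    layer's weight matrix, input features feeding only zeroed columns can
    be dropped. Illustrative sketch of steps 2-3 for an MLP; the paper's
    exact procedure may differ."""
    W = first_layer.weight.data                  # (out_features, in_features)
    kept = (W.abs().sum(dim=0) > 0)              # input columns still contributing
    new_layer = nn.Linear(int(kept.sum()), W.shape[0],
                          bias=first_layer.bias is not None)
    new_layer.weight.data = W[:, kept].clone()
    if first_layer.bias is not None:
        new_layer.bias.data = first_layer.bias.data.clone()
    # Return the smaller layer and the input data with useless features removed;
    # the network is then retrained (step 3).
    return new_layer, X[:, kept]
```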