One of the major issues in multimedia forensics is the identification of video acquisition devices. Most of the relevant state-of-the-art solutions rely on either visual or audio analysis, using feature arrays that are highly correlated with the characteristics of the respective camera or microphone. In this work, we present a multi-modal approach that uses both video and audio information to improve the detection accuracy. For this purpose, microphone detection based on the blind estimation of the frequency response is complemented with video camera detection based on a set of video features related to the Color Filter Array interpolation. Experimental results show that the combined approach improves the overall classification accuracy over the mono-modal cases.
Automated Teller Machines (ATMs) represent the most used system for withdrawing cash. The European Central Bank reported more than 11 billion cash withdrawals and loading/unloading transactions on European ATMs in 2019. Although ATMs have undergone various technological evolutions, Personal Identification Numbers (PINs) are still the most common authentication method for these devices. Unfortunately, the PIN mechanism is vulnerable to shoulder-surfing attacks performed via hidden cameras installed near the ATM to record the PIN pad. To overcome this problem, people have become used to covering the typing hand with the other hand. While such users probably believe this behavior is safe enough to protect against the mentioned attacks, there is no clear assessment of this countermeasure in the scientific literature. This paper proposes a novel attack to reconstruct PINs entered by victims covering the typing hand with the other hand. We consider the setting where the attacker can access an ATM PIN pad of the same brand/model as the target one. Afterward, the attacker uses that model to infer the digits pressed by the victim while entering the PIN. Our attack owes its success to a carefully selected deep learning architecture that can infer the PIN from the typing hand position and movements. We run a detailed experimental analysis including 58 users. With our approach, we can guess 30% of the 5-digit PINs within three attempts, the number usually allowed by an ATM before blocking the card. We also conducted a survey with 78 users, who managed to reach an accuracy of only 7.92% on average for the same setting. Finally, we evaluate a shielding countermeasure that proved to be rather ineffective unless the whole keypad is shielded.
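To make the "three attempts" criterion concrete, the following minimal sketch ranks candidate PINs from per-keystroke digit probabilities. It is only an illustration and not the attack pipeline of the paper: the input `digit_probs` (one row of ten probabilities per keystroke, for instance the softmax output of some classifier), the function name `top_k_pins`, and the assumption that keystrokes are scored independently are all hypothetical.

```python
import heapq
import itertools

import numpy as np


def top_k_pins(digit_probs, k=3):
    """Return the k most likely PINs as strings, best first.

    digit_probs: array of shape (n_keystrokes, 10); row i holds the
    (hypothetical) probability of each digit 0-9 for keystroke i.
    Keystrokes are assumed independent, so a PIN's score is the sum
    of the log-probabilities of its digits.
    """
    digit_probs = np.asarray(digit_probs, dtype=float)
    # Work in the log domain; clip to avoid log(0).
    log_probs = np.log(np.clip(digit_probs, 1e-12, None))

    # Only the k best digits per position can appear in the k best PINs,
    # so the search space shrinks from 10**n to k**n candidates.
    best_digits = np.argsort(-log_probs, axis=1)[:, :k]

    candidates = []
    for combo in itertools.product(*best_digits):
        score = sum(log_probs[pos, digit] for pos, digit in enumerate(combo))
        candidates.append((score, "".join(str(d) for d in combo)))

    return [pin for _, pin in heapq.nlargest(k, candidates)]
```

The pruning step relies on a simple observation: if a PIN uses the digit ranked r-th at some position, swapping in any of the r-1 better digits yields a PIN that scores at least as well, so (ties aside) that PIN cannot rank above r-th overall. An attack counts as successful under the three-attempt criterion when the true PIN appears in `top_k_pins(digit_probs, k=3)`.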
The reconstruction of 3D point cloud models from unordered and uncalibrated sets of images has recently been a hot topic in computer vision. Most of the proposed solutions rely on Structure-from-Motion algorithms, and their performance is significantly affected by the processing order (called track) of the considered images. This order is computed according to a distance (or similarity) metric between pairs of images, which is usually highly noisy. The paper proposes an image ordering strategy that models the distances between images as a Euclidean distance matrix and applies a rank-based denoising algorithm to refine the metric values. Experimental results show that the accuracy of the final 3D model is noticeably improved.
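The idea of rank-based denoising of a Euclidean distance matrix can be sketched as follows. The abstract does not specify the algorithm, so this example substitutes a standard technique, the rank truncation used in classical multidimensional scaling: a matrix of squared distances between points in a `dim`-dimensional space corresponds to a positive semidefinite Gram matrix of rank at most `dim`, and discarding the remaining eigenvalues removes part of the noise. The function name `denoise_edm` and the embedding dimension `dim` are assumptions introduced for illustration.

```python
import numpy as np


def denoise_edm(noisy_sq_dist, dim):
    """Approximate a noisy squared-distance matrix by a valid EDM.

    noisy_sq_dist: (n, n) array of noisy squared pairwise distances.
    dim: assumed dimension of the space in which the items are embedded.
    Returns an (n, n) squared-distance matrix generated by n points
    in a dim-dimensional space.
    """
    d = np.asarray(noisy_sq_dist, dtype=float)
    n = d.shape[0]

    # Enforce the trivial EDM properties: symmetry and zero diagonal.
    d = 0.5 * (d + d.T)
    np.fill_diagonal(d, 0.0)

    # Double centering turns squared distances into a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d @ centering

    # Keep the dim largest eigenvalues (clipped at zero): rank truncation.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]
    top_vals = np.clip(eigvals[order], 0.0, None)
    points = eigvecs[:, order] * np.sqrt(top_vals)

    # Rebuild squared distances from the recovered point configuration.
    sq_norms = np.sum(points ** 2, axis=1)
    denoised = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(denoised, 0.0)
    return np.clip(denoised, 0.0, None)
```

In an image ordering setting, the refined matrix returned by `denoise_edm` would replace the raw pairwise distances before the processing order is computed. How `dim` should be chosen for image similarity data is left open here.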
One of the latest innovations in the world of multimedia technologies is the application of Distributed Source Coding (DSC) theory to the robust transmission of video sequences. However, these DSC-based video encoders are usually characterized by a lower compression gain with respect to their hybrid counterparts. In this work, we investigate achieving H.264-like high compression efficiency with a DSC-based approach without constraints on encoding complexity. In this way, highly non-stationary video data are modelled through coarse low-cost Motion Estimation, and this allows us to obtain a good compression efficiency together with a certain robustness to channel errors and losses. Experimental results show that, in the presence of losses, the presented algorithm achieves better quality than H.264/AVC.
The capability of associating an image with its geographical location is a significant concern in journalism and digital forensics. Given the availability of geo-tagged satellite imagery for most of the Earth's surface, retrieving the location of a generic picture can be addressed as a cross-view image matching between aerial and ground views. In this paper, we outline some initial steps toward the development of a fully unsupervised algorithm for ground-to-aerial image matching, exploiting the view-invariant adjacency relationships of the landmarks appearing in both views. We introduce a graph-based strategy that, given a set of pre-extracted landmarks, localizes the viewpoint of a ground-level 360-degree image within a broad aerial view of the same area, by matching the respective landmark graphs according to a specifically designed likelihood model.
The recent integration of generative neural strategies and audio processing techniques has fostered the spread of speech synthesis and transformation algorithms. This capability proves to be harmful in many legal and informative processes (news, biometric authentication, audio evidence in courts, etc.). Thus, the development of efficient detection algorithms is both crucial and challenging due to the heterogeneity of forgery techniques. This work investigates the discriminative role of silent parts in synthetic speech detection and shows how first-digit statistics extracted from MFCC coefficients can efficiently enable a robust detection. The proposed procedure is computationally lightweight and effective on many different algorithms, since it does not rely on large neural detection architectures, and obtains an accuracy above 90% in most of the classes of the ASVSpoof dataset.
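As a rough illustration of what first-digit statistics are, the sketch below computes the distribution of leading significant digits of a set of coefficients and measures how far it lies from Benford's law, a common reference distribution for first digits. It is not the feature extraction of the paper: the input `coeffs` is assumed to hold MFCC values already computed for the silent frames by some upstream tool, and the function names and the choice of a Kullback-Leibler divergence are hypothetical.

```python
import numpy as np


def first_digit_histogram(coeffs):
    """Relative frequency of the leading digits 1-9 in coeffs.

    coeffs: array of any shape (assumed here to be MFCC values of
    silent frames). Zeros and non-finite values are ignored.
    """
    values = np.asarray(coeffs, dtype=float).ravel()
    mag = np.abs(values[np.isfinite(values) & (values != 0.0)])
    if mag.size == 0:
        return np.zeros(9)

    # Scale each magnitude into [1, 10) and take its integer part.
    exponent = np.floor(np.log10(mag))
    digits = np.floor(mag / 10.0 ** exponent).astype(int)
    # Guard against floating-point edge cases near powers of ten.
    digits = np.clip(digits, 1, 9)

    counts = np.bincount(digits, minlength=10)[1:10]
    return counts / counts.sum()


def benford_pmf():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    d = np.arange(1, 10)
    return np.log10(1.0 + 1.0 / d)


def benford_divergence(freq, eps=1e-12):
    """Kullback-Leibler divergence from Benford's law (one possible feature)."""
    p = np.asarray(freq, dtype=float) + eps
    q = benford_pmf() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

The nine frequencies, optionally together with the divergence, form a very small feature vector that could be fed to a lightweight classifier, which is in line with the low computational cost claimed in the abstract.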