Video conferencing applications play an important role in our day-to-day lives. They enable people to meet, work, and collaborate remotely, especially in circumstances where physical meetings are not possible (e.g., pandemics, long distances, etc.). However, such applications may also invade people's privacy, for instance by disclosing sensitive information. For this dataset, we recorded several video conferencing sessions with diverse real and virtual backgrounds, varying subjects, lighting conditions, and so on. If you want to use this dataset for non-commercial purposes, please cite the following papers:

@article{nowroozi2020survey,
  title     = {A survey of machine learning techniques in adversarial image forensics},
  author    = {Nowroozi, Ehsan and Dehghantanha, Ali and Parizi, Reza M and Choo, Kim-Kwang Raymond},
  journal   = {Computers \& Security},
  pages     = {102092},
  year      = {2020},
  publisher = {Elsevier}
}

@article{DBLP:journals/corr/abs-2106-15130,
  author     = {Mauro Conti and Simone Milani and Ehsan Nowroozi and Gabriele Orazi},
  title      = {Do Not Deceive Your Employer with a Virtual Background: {A} Video Conferencing Manipulation-Detection System},
  journal    = {CoRR},
  volume     = {abs/2106.15130},
  year       = {2021},
  url        = {https://arxiv.org/abs/2106.15130},
  eprinttype = {arXiv},
  eprint     = {2106.15130},
  timestamp  = {Mon, 05 Jul 2021 15:15:50 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2106-15130.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
State-of-the-art multimodal semantic segmentation approaches combining LiDAR and color data are usually designed on top of asymmetric information-sharing schemes and assume that both modalities are always available. Regrettably, this strong assumption may not hold in real-world scenarios, where sensors are prone to failure or can face adverse conditions (night-time, rain, fog, etc.) that make the acquired information unreliable. Moreover, these architectures tend to fail in continual learning scenarios. In this work, we re-frame the task of multimodal semantic segmentation by enforcing a tightly-coupled feature representation and a symmetric information-sharing scheme, which allows our approach to work even when one of the input modalities is missing. This makes our model reliable even in safety-critical settings, such as autonomous driving. We evaluate our approach on the SemanticKITTI dataset, comparing it with our closest competitor. We also introduce an ad-hoc continual learning scheme and report results in a class-incremental continual learning scenario that prove the effectiveness of the approach in this setting as well.
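As a rough illustration of the symmetric information-sharing idea, the sketch below shows a fusion block that exchanges features between the RGB and LiDAR branches in both directions and degrades gracefully when one input is missing. The module, its layer choices, and its names are hypothetical and only meant to convey the concept, not the paper's actual architecture.

```python
# Hypothetical sketch of a symmetric information-sharing block for RGB/LiDAR
# fusion that tolerates a missing modality. Names and layers are illustrative
# only and do not reproduce the architecture described in the abstract.
import torch
import torch.nn as nn

class SymmetricFusion(nn.Module):
    """Exchanges features between two modality branches in both directions."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_to_lidar = nn.Conv2d(channels, channels, kernel_size=1)
        self.lidar_to_rgb = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_lidar):
        # If one modality is unavailable, fall back to the other branch alone,
        # so the network still produces a usable shared representation.
        if f_rgb is None:
            return f_lidar, f_lidar
        if f_lidar is None:
            return f_rgb, f_rgb
        # Symmetric exchange: each branch receives a projection of the other.
        rgb_out = f_rgb + self.lidar_to_rgb(f_lidar)
        lidar_out = f_lidar + self.rgb_to_lidar(f_rgb)
        return rgb_out, lidar_out
```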
HTTP Adaptive Streaming (HAS) has become a predominant technique for delivering video over the Internet. Because it adapts to changing network conditions, it may introduce video quality variations that negatively impact the user's Quality of Experience (QoE). In this paper, we propose Days of Future Past, an optimization-based Adaptive Bitrate (ABR) algorithm over HTTP/3. Days of Future Past takes advantage of an optimization model and of HTTP/3 features, namely (i) stream multiplexing and (ii) request cancellation. We design a Mixed Integer Linear Programming (MILP) model that determines the optimal video qualities of both the next segment to be requested and the segments currently located in the buffer. If better qualities for buffered segments are found, the client sends the corresponding HTTP GET requests to retrieve them. Thanks to the stream multiplexing feature of QUIC, multiple segments (i.e., retransmitted segments) might be downloaded simultaneously to upgrade buffered but not yet played segments and avoid quality decreases. HTTP/3's request cancellation is used whenever retransmitted segments would arrive at the client after their playout time. The experimental results show that our proposed method is able to improve the QoE by up to 33.9%.
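The request-cancellation rule can be illustrated with a minimal sketch: a retransmission is kept only if its estimated completion time precedes the segment's playout deadline. The function name, the throughput-based estimate, and the safety margin below are simplifying assumptions, not the paper's MILP formulation.

```python
# Illustrative sketch of the request-cancellation rule: cancel a retransmitted
# segment if it is expected to miss its playout deadline. The throughput-based
# estimate and the margin are hypothetical simplifications.
def should_cancel(segment_size_bits: float,
                  throughput_bps: float,
                  time_until_playout_s: float,
                  safety_margin_s: float = 0.2) -> bool:
    """Return True if the retransmitted segment is expected to miss playout."""
    estimated_download_s = segment_size_bits / max(throughput_bps, 1e-9)
    return estimated_download_s + safety_margin_s > time_until_playout_s

# Example: a 4 Mbit segment over a 6 Mbps link with 0.5 s left before playout
# would be cancelled (0.67 s + margin > 0.5 s), freeing bandwidth for the
# next regular segment.
print(should_cancel(4e6, 6e6, 0.5))  # True
```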
Most past and current video coders partition the input frames into regular blocks of pixels that are approximated by a motion estimation unit and coded via a block-based transform. Better performance can be obtained by adapting the size of the approximated region to the geometry and the characteristics of the objects captured by the camera. This paper presents a novel coding scheme for video+depth signals that combines a 3D object identification unit with an object-oriented motion estimation strategy. Object identification is obtained via a joint luminance-depth oversegmentation that partitions the input scene into superpixels. The procedure can be easily replicated at the decoder and therefore does not require coding and transmitting object masks. The scheme outperforms the rate-distortion performance of H.264/AVC by 2 dB, with a reasonable increase in the complexity of motion estimation and segmentation.
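As a rough sketch of a joint luminance-depth oversegmentation, the snippet below clusters a stacked (luma, depth) image into superpixels using SLIC from scikit-image. The actual segmentation procedure of the codec may differ; the normalization, the depth weighting, and the parameter values are purely illustrative assumptions.

```python
# Minimal sketch of a joint luminance-depth oversegmentation using SLIC
# (scikit-image >= 0.19 for the channel_axis argument). This only illustrates
# the idea that the same partition can be recomputed at the decoder from
# decoded data; it is not the codec's actual procedure.
import numpy as np
from skimage.segmentation import slic

def joint_superpixels(luma: np.ndarray, depth: np.ndarray,
                      n_segments: int = 400, depth_weight: float = 1.0):
    """Oversegment a frame using both luminance and (weighted) depth."""
    # Normalize both channels to [0, 1] so neither dominates the clustering.
    luma_n = (luma - luma.min()) / (np.ptp(luma) + 1e-9)
    depth_n = (depth - depth.min()) / (np.ptp(depth) + 1e-9)
    stacked = np.dstack([luma_n, depth_weight * depth_n])
    return slic(stacked, n_segments=n_segments, compactness=0.1,
                channel_axis=-1, start_label=0)
```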
Microsoft Kinect had a key role in the development of consumer depth sensors, being the device that brought depth acquisition to the mass market. Despite the success of this sensor, with the introduction of the second generation Microsoft completely changed the technology behind the sensor, moving from structured light to Time-of-Flight. This paper presents a comparison of the data provided by the first- and second-generation Kinect in order to explain the improvements obtained with the switch of technology. After a detailed analysis of the accuracy of the two sensors under different conditions, two sample applications, i.e., 3D reconstruction and people tracking, are presented and used to compare the performance of the two sensors.
In the current age, users consume multimedia content in very heterogeneous scenarios in terms of network, hardware, and display capabilities. A naive solution to this problem is to encode multiple independent streams, each covering a different possible client requirement, with an obvious negative impact on both storage and computational requirements. These drawbacks can be avoided by using codecs that enable scalability, i.e., the ability to generate a progressive bitstream, containing a base layer followed by multiple enhancement layers, that allows the same bitstream to be decoded into multiple reconstructions matching different visualization specifications. While scalable coding is a well-known and addressed feature in conventional image and video codecs, this paper focuses on a new and very different problem, namely the development of scalable coding solutions for deep learning-based Point Cloud (PC) coding. The peculiarities of this 3D representation make it hard to implement flexible solutions that do not compromise the other functionalities of the codec. This paper proposes a joint quality and resolution scalability scheme, named Scalable Resolution and Quality Hyperprior (SRQH), that, contrary to previous solutions, can model the relationship between latents obtained with models trained for different RD tradeoffs and/or at different resolutions. Experimental results obtained by integrating SRQH into the emerging JPEG Pleno learning-based PC coding standard show that SRQH allows decoding the PC at different qualities and resolutions from a single bitstream, while incurring only a limited RD penalty and complexity increase w.r.t. non-scalable JPEG PCC, which would require one bitstream per coding configuration.
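A heavily simplified way to picture the scalability mechanism is a predictor that maps decoded base-layer latents to an estimate of the enhancement-layer latents, so that only the prediction residual has to be transmitted. The sketch below assumes voxelized latents and an arbitrary small network; the real SRQH design is considerably more elaborate, and all names here are hypothetical.

```python
# Heavily simplified sketch of quality scalability between two sets of
# latents: the enhancement-layer latents are predicted from the decoded
# base-layer latents and only the residual is coded. Not the actual SRQH
# architecture; names and layer choices are hypothetical.
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Predicts higher-quality latents from base-layer latents."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, base_latents: torch.Tensor) -> torch.Tensor:
        return self.net(base_latents)

# Encoder side: residual = enhancement_latents - predictor(base_latents)
# Decoder side: enhancement_latents = predictor(base_latents) + residual,
# so the enhancement layer only carries the (cheaper-to-code) residual.
```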
Nowadays, given the availability of cheap devices and powerful editing software, video tampering is a relatively easy task. Video sequences can be tampered with by performing, e.g., temporal splicing. However, if the sequences spliced together do not share the same frame rate, they have to be temporally interpolated beforehand. This operation is often performed using motion-compensated interpolators, which minimize visual artifacts. In this paper we propose a detector for this kind of interpolation. Moreover, the detector is capable of identifying the interpolation factor used, allowing an analyst to uncover the original frame rate of a sequence. The method relies on the analysis of the correlation introduced by the filter adopted by the interpolator. Results show that detection is successful, provided that the number of observed interpolated frames is large enough. Moreover, tests on compressed sequences obtained from television broadcasts validate the method in a real-world scenario.
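The intuition behind the detector can be sketched as follows: interpolated frames are unusually well predicted by their temporal neighbors, so a per-frame residual signal becomes quasi-periodic with the interpolation pattern, and its dominant spectral peak hints at the interpolation factor. The residual measure and the peak picking below are simplified stand-ins for the correlation analysis used in the paper.

```python
# Simplified illustration of estimating the interpolation period from the
# periodicity of a per-frame prediction residual. Not the paper's detector;
# the residual measure and peak picking are simplifying assumptions.
import numpy as np

def interpolation_period(frames: np.ndarray) -> int:
    """frames: (T, H, W) grayscale sequence; returns the dominant period in frames."""
    f = frames.astype(np.float64)
    # Residual of each frame w.r.t. the average of its two neighbours:
    # interpolated frames are well predicted, so their residual is low.
    residual = np.array([
        np.mean(np.abs(f[t] - 0.5 * (f[t - 1] + f[t + 1])))
        for t in range(1, len(f) - 1)
    ])
    residual -= residual.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(residual))
    spectrum[0] = 0.0                                # ignore leftover DC energy
    peak = int(np.argmax(spectrum))                  # dominant periodicity bin
    return int(round(len(residual) / max(peak, 1)))  # period expressed in frames
```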