Human action recognition is a research hotspot in industry, with applications such as intelligent security and surveillance of abnormal worker behavior. Skeleton-based action recognition has drawn considerable attention in recent years, as the skeleton is robust to background clutter and directly related to actions. Most studies only exploit 3DCNNs to handle multiple frames or introduce optical flow to represent temporal information, without considering the distinctions among the temporal features contained in different frames. In this paper, we extend the well-known attention mechanism, the Squeeze-and-Excitation (SE) block, to the temporal dimension, which we term the Temporal Squeeze-and-Excitation (TSE) block. Instead of the channel reduction used in the vanilla SE block, we adopt channel augmentation to accommodate the smaller number of frames, which gives the TSE block sufficient model capacity to capture motion information. By embedding the TSE block into existing backbones, our model achieves good results on the NTU RGB+D dataset. Ablation experiments on the HMDB51 dataset are conducted to explore the characteristics of the TSE block.
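As a rough illustration of the idea (not the paper's released implementation), the sketch below shows how such a frame-wise squeeze-and-excitation could look in PyTorch; the augmentation ratio, pooling choice, and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSE(nn.Module):
    """Sketch of a Temporal Squeeze-and-Excitation (TSE) block.

    Operates on clip features of shape (N, C, T, H, W): squeeze the channel
    and spatial dimensions into one descriptor per frame, run it through an
    augment-then-project bottleneck (channel augmentation, since T is small),
    and rescale every frame by its learned excitation weight.
    """

    def __init__(self, num_frames: int, augment_ratio: int = 4):
        super().__init__()
        hidden = num_frames * augment_ratio  # augmentation instead of SE-style reduction
        self.excite = nn.Sequential(
            nn.Linear(num_frames, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_frames),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Squeeze: average over channels and spatial positions -> (N, T)
        frame_descriptor = x.mean(dim=(1, 3, 4))
        # Excite: per-frame attention weights in (0, 1), broadcast over C, H, W
        weights = self.excite(frame_descriptor).view(n, 1, t, 1, 1)
        return x * weights


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 28, 28)          # (N, C, T, H, W)
    print(TemporalSE(num_frames=8)(clip).shape)   # torch.Size([2, 64, 8, 28, 28])
```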
Operation abnormalities of fused magnesium furnaces (FMFs), e.g., the semi-molten condition, can degrade product quality and operation performance. These abnormalities can even lead to accidents caused by leakage of the ultrahigh-temperature fusing fluids. Therefore, it is essential to identify the semi-molten abnormality in a timely and accurate manner. In view of the spatiotemporal characteristics of the image sequences of the furnace shell under abnormal conditions of the FMFs, and the strong disturbances caused by water mist, white spots, and flame fluctuation on top of the furnace, this article establishes a novel deep learning architecture for operation abnormality diagnosis that is robust to the disturbances of the FMFs. The new scheme is composed of two parts, i.e., a predictive neural network (PredNet) for disturbance processing and a deep three-dimensional convolutional recurrent neural network (3DCRNN) for abnormality diagnosis. First, PredNet-based unsupervised learning is combined with image residual extraction for disturbance processing. Second, using the clean image sequences after disturbance processing, a new deep 3DCRNN that integrates a three-dimensional CNN (3DCNN) and long short-term memory is proposed for enhanced spatiotemporal feature extraction and semi-molten abnormality diagnosis. The 3DCRNN overcomes the limitation of conventional 3DCNNs, which focus on local spatiotemporal feature extraction and fail to capture long-term changes. Experimental results using image sequences collected from a real FMF demonstrate the effectiveness of the proposed method.
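For readers who want a concrete picture of a 3DCNN-plus-LSTM pipeline of this kind, a minimal PyTorch sketch follows; the layer sizes, the single-channel input, and the two-class head (normal vs. semi-molten) are assumptions rather than the architecture reported in the article.

```python
import torch
import torch.nn as nn

class Simple3DCRNN(nn.Module):
    """Illustrative 3DCNN + LSTM pipeline in the spirit of a 3DCRNN.

    A shallow 3D convolutional stem extracts local spatiotemporal features
    from the (disturbance-suppressed) image sequence, and an LSTM aggregates
    them over time so that long-term changes of the furnace shell are kept.
    """

    def __init__(self, in_channels: int = 1, hidden: int = 128, num_classes: int = 2):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),           # keep temporal length
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),            # -> (N, 64, T, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (N, C, T, H, W) residual image sequence after disturbance processing
        feats = self.conv3d(clips).squeeze(-1).squeeze(-1)  # (N, 64, T)
        feats = feats.permute(0, 2, 1)                      # (N, T, 64)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])                     # abnormality logits


if __name__ == "__main__":
    x = torch.randn(4, 1, 16, 64, 64)
    print(Simple3DCRNN()(x).shape)  # torch.Size([4, 2])
```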
We present a general, fast, and practical solution for interpolating novel views of diverse real-world scenes given a sparse set of nearby views. Existing generic novel view synthesis methods rely on time-consuming scene geometry pre-computation or redundant sampling of the entire space for neural volumetric rendering, limiting overall efficiency. Instead, we incorporate learned multi-view stereo (MVS) priors into the neural volume rendering pipeline and improve rendering efficiency by reducing the number of sampling points: fewer but more important points are sampled under the guidance of depth probability distributions extracted from the learned MVS architecture. Based on this probability-guided sampling, we develop a neural volume rendering module that effectively integrates source-view information with the learned scene structures. We further propose confidence-aware refinement to improve the rendering results in uncertain, occluded, and unreferenced regions. Moreover, we build a four-view camera system for holographic display and provide a real-time version of our framework for a free-viewpoint experience, where novel view images at a spatial resolution of 512×512 can be rendered at around 20 fps on a single RTX 3090 GPU. Experiments show that our method achieves 15 to 40 times faster rendering than state-of-the-art baselines, with strong generalization capacity and comparable high-quality novel view synthesis performance.
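The sketch below illustrates, in simplified form, how depth-probability-guided point sampling along a ray can be realized (inverse-transform sampling from a per-ray depth distribution); the hypothesis count, sample count, and exact sampling rule are assumptions for illustration only.

```python
import torch

def probability_guided_samples(depth_hyps: torch.Tensor,
                               depth_prob: torch.Tensor,
                               num_samples: int) -> torch.Tensor:
    """Toy version of depth-probability-guided sampling along rays.

    depth_hyps: (R, D) depth hypotheses per ray (e.g., from an MVS cost volume).
    depth_prob: (R, D) probability of each hypothesis (softmax over cost).
    Returns (R, num_samples) depths concentrated around likely surfaces
    instead of uniformly covering the whole frustum.
    """
    # Inverse-transform sampling from the per-ray discrete depth distribution
    cdf = torch.cumsum(depth_prob, dim=-1)
    u = torch.rand(depth_prob.shape[0], num_samples, device=depth_prob.device)
    idx = torch.searchsorted(cdf, u).clamp(max=depth_hyps.shape[-1] - 1)
    samples = torch.gather(depth_hyps, dim=-1, index=idx)
    return torch.sort(samples, dim=-1).values


if __name__ == "__main__":
    rays, hyps = 1024, 64
    depth_hyps = torch.linspace(2.0, 6.0, hyps).expand(rays, hyps)
    depth_prob = torch.softmax(torch.randn(rays, hyps), dim=-1)
    pts = probability_guided_samples(depth_hyps, depth_prob, num_samples=16)
    print(pts.shape)  # torch.Size([1024, 16])
```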
Research in light field reconstruction focuses on synthesizing novel views with the assistance of depth information. In this paper, we present a learning-based light field reconstruction approach that fuses a set of sheared epipolar plane images (EPIs). We start by showing that a patch in a sheared EPI exhibits a clear structure when the shear value equals the depth of that patch. Taking advantage of this pattern, a convolutional neural network (CNN) is trained to evaluate the sheared EPIs and output a reference score for fusing them. The proposed CNN is designed to learn the degree of similarity between an input sheared EPI and the ground-truth EPI, so no depth information is required for network training or inference. We demonstrate the high performance of the proposed method through evaluations on synthetic scenes, real-world scenes, and challenging microscope light fields. We also show a further application of the proposed network to depth inference.
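The toy PyTorch sketch below illustrates only the score-then-fuse step: several integer-sheared versions of an EPI are scored by a small CNN and fused with softmax weights. The shear range, network layers, and fusion rule are assumptions, and the reconstruction stage of the full pipeline is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shear_epi(epi: torch.Tensor, shift: int) -> torch.Tensor:
    """Shear an EPI of shape (V, S) by shifting each angular row by
    shift * (v - center) pixels (integer shear for simplicity)."""
    v_dim = epi.shape[0]
    center = v_dim // 2
    rows = [torch.roll(epi[v], shifts=shift * (v - center), dims=0) for v in range(v_dim)]
    return torch.stack(rows, dim=0)


class EPIScoreNet(nn.Module):
    """Tiny stand-in for the scoring CNN: maps a sheared EPI to a scalar that
    should be high when the EPI exhibits a clear (well-aligned) structure."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, epi: torch.Tensor) -> torch.Tensor:
        return self.net(epi.unsqueeze(0).unsqueeze(0)).squeeze()


if __name__ == "__main__":
    epi = torch.rand(9, 128)                     # (angular, spatial) EPI
    shears = [-2, -1, 0, 1, 2]                   # candidate shear values
    scorer = EPIScoreNet()
    sheared = torch.stack([shear_epi(epi, s) for s in shears])   # (K, V, S)
    scores = torch.stack([scorer(e) for e in sheared])           # (K,)
    weights = F.softmax(scores, dim=0).view(-1, 1, 1)
    fused = (weights * sheared).sum(dim=0)       # score-weighted fusion
    print(fused.shape)                           # torch.Size([9, 128])
```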
Light field (LF) cameras record both the intensity and directions of light rays, and encode 3D scenes into 4D LF images. Recently, many convolutional neural networks (CNNs) have been proposed for various LF image processing tasks. However, it is challenging for CNNs to effectively process LF images since the spatial and angular information are highly intertwined with varying disparities. In this paper, we propose a generic mechanism to disentangle this coupled information for LF image processing. Specifically, we first design a class of domain-specific convolutions to disentangle LFs along different dimensions, and then leverage these disentangled features by designing task-specific modules. Our disentangling mechanism well incorporates the LF structure prior and effectively handles 4D LF data. Based on the proposed mechanism, we develop three networks (i.e., DistgSSR, DistgASR and DistgDisp) for spatial super-resolution, angular super-resolution and disparity estimation. Experimental results show that our networks achieve state-of-the-art performance on all three tasks, which demonstrates the effectiveness, efficiency, and generality of our disentangling mechanism. Project page: https://yingqianwang.github.io/DistgLF/.
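As a minimal sketch of the disentangling idea on a macro-pixel image, the code below contrasts a dilated spatial convolution with a strided angular convolution; the channel count and angular resolution are placeholder values, not the released DistgLF configuration.

```python
import torch
import torch.nn as nn

class DistgConvs(nn.Module):
    """Minimal illustration of disentangling convolutions on a macro-pixel
    image (MacPI): the 4D LF (U, V, H, W) is packed into a 2D map of size
    (U*H, V*W), where each A x A macro-pixel holds the angular samples of one
    spatial location (A = angular resolution)."""

    def __init__(self, channels: int = 16, ang_res: int = 5):
        super().__init__()
        # Spatial conv: dilation = A reaches the same angular view of
        # neighbouring spatial positions -> extracts spatial information.
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3,
                                      dilation=ang_res, padding=ang_res)
        # Angular conv: kernel = stride = A covers exactly one macro-pixel
        # -> extracts angular information for each spatial location.
        self.angular_conv = nn.Conv2d(channels, channels, kernel_size=ang_res,
                                      stride=ang_res)

    def forward(self, macpi: torch.Tensor):
        return self.spatial_conv(macpi), self.angular_conv(macpi)


if __name__ == "__main__":
    A, H, W = 5, 32, 32
    macpi = torch.randn(1, 16, A * H, A * W)
    spa_feat, ang_feat = DistgConvs(16, A)(macpi)
    print(spa_feat.shape, ang_feat.shape)
    # torch.Size([1, 16, 160, 160]) torch.Size([1, 16, 32, 32])
```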
In unsupervised domain adaptation, most existing methods rely on global features that may fail to capture fine-grained features, resulting in suboptimal transfer learning performance. Although some approaches incorporate inter-subdomain relationships across different domains within the same category, the prediction accuracy on the target domain remains unsatisfactory. To address this challenge, we propose a self-training approach that integrates pseudo labels. However, the pseudo labels generated in this iterative process are prone to noise, which can disperse the target features due to the underlying domain differences. This paper presents Prototype Pseudo-Denoising Adaptation (PPDA), a novel method for unsupervised domain adaptation. By utilizing feature centroids, known as prototypes, our approach tackles two critical issues in the adaptation process: we leverage the feature distances to the prototypes to enrich the information provided by pseudo labels, and this integration refines the estimation of pseudo-label probabilities, enabling online correction during training. Experimental evaluation on object recognition tasks shows the significant performance improvement achieved by the proposed method.
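A toy sketch of the prototype-based rectification step is given below: distances to class prototypes form a second class distribution that is fused with the classifier's softmax output to correct noisy pseudo labels. The fusion rule and temperature are assumptions for illustration, not the exact PPDA formulation.

```python
import torch
import torch.nn.functional as F

def prototype_rectified_pseudo_labels(features: torch.Tensor,
                                      classifier_probs: torch.Tensor,
                                      prototypes: torch.Tensor,
                                      temperature: float = 1.0) -> torch.Tensor:
    """Toy sketch of prototype-based pseudo-label correction.

    features:         (N, D) target-domain features.
    classifier_probs: (N, K) softmax outputs used as noisy pseudo labels.
    prototypes:       (K, D) per-class feature centroids (e.g., running means).
    """
    # Squared Euclidean distance from every feature to every prototype: (N, K)
    dist = torch.cdist(features, prototypes, p=2) ** 2
    proto_probs = F.softmax(-dist / temperature, dim=-1)
    # Rectify: keep mass where classifier and prototype distributions agree
    fused = classifier_probs * proto_probs
    return fused / fused.sum(dim=-1, keepdim=True)


if __name__ == "__main__":
    N, D, K = 8, 256, 10
    feats = torch.randn(N, D)
    probs = F.softmax(torch.randn(N, K), dim=-1)
    protos = torch.randn(K, D)
    rectified = prototype_rectified_pseudo_labels(feats, probs, protos)
    print(rectified.shape, rectified.sum(dim=-1))   # (8, 10), rows sum to one
```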
For the densely-sampled light field (LF) reconstruction problem, existing approaches focus on a depth-free framework to achieve non-Lambertian performance. However, they are trapped in the "either aliasing or blurring" trade-off, i.e., pre-filtering the aliasing components (caused by the angular sparsity of the input LF) always leads to a blurry result. In this paper, we address this challenge by introducing an elaborately designed epipolar plane image (EPI) structure within a learning-based framework. Specifically, we start by analytically showing that decreasing the spatial scale of an EPI addresses the aliasing problem more efficiently than simply adopting pre-filtering. Accordingly, we design a Laplacian Pyramid EPI (LapEPI) structure that contains both a low-spatial-scale EPI (for aliasing) and high-frequency residuals (for blurring) to resolve the trade-off. We then propose a novel network architecture for the LapEPI structure, termed LapEPI-net. To ensure non-Lambertian performance, we adopt a transfer-learning strategy that first pre-trains the network with natural images and then fine-tunes it with unstructured LFs. Extensive experiments demonstrate the high performance and robustness of the proposed approach in tackling the aliasing-or-blurring problem as well as non-Lambertian reconstruction.
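The following sketch shows one plausible way to build such a LapEPI decomposition, downsampling an EPI only along the spatial axis and keeping the per-level high-frequency residuals; the number of levels and the interpolation filters are assumptions, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid_epi(epi: torch.Tensor, levels: int = 2):
    """Toy construction of a Laplacian-pyramid EPI (LapEPI) representation.

    epi: (1, 1, V, S) single-channel EPI. Returns the coarsest (low spatial
    scale) EPI plus one high-frequency residual per level. Only the spatial
    axis is rescaled; the angular axis is kept, matching the idea that the
    low-scale EPI suppresses aliasing while the residuals retain detail.
    """
    residuals = []
    current = epi
    for _ in range(levels):
        v, s = current.shape[-2], current.shape[-1]
        low = F.interpolate(current, size=(v, s // 2), mode="area")   # spatial downsample
        up = F.interpolate(low, size=(v, s), mode="bilinear",
                           align_corners=False)                       # back to original size
        residuals.append(current - up)                                # high-frequency residual
        current = low
    return current, residuals


if __name__ == "__main__":
    epi = torch.rand(1, 1, 9, 256)                  # (batch, channel, angular, spatial)
    low_scale, res = laplacian_pyramid_epi(epi)
    print(low_scale.shape, [r.shape for r in res])
    # torch.Size([1, 1, 9, 64]) [torch.Size([1, 1, 9, 256]), torch.Size([1, 1, 9, 128])]
```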