3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.
The progress in researches on reaction mechanism and kinetics of ethanol steam reforming for hydrogen production using different catalysts was reviewed.For the researches on the kinetics of ethanol steam reforming for hydrogen production,there were two kind kinetics models that were empirical models in the form of power function rate equation and mechanism models in the form of hyperbolic rate equation.The mechanism models based on surface reaction as rate controlling step(RDS) were divided into Langmuir-Hinshelwood mechanism(L-H mechanism) models and Eley-Rideal mechanism(E-R mechanism) models.LangmuirHinshelwood-Hougen-Watson mechanism(LHHW mechanism) models that belong to L-H mechanism were also presented.
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
With the decrease of mobile electronic product volume and weight, fingerprint packaging products used for unlocking need thinner structure. Warpage may occur in ultrathin plastic seal products which seriously affects the producibility of the product. Because of the difference of thermal expansion coefficient of each material in fingerprint packaging products, warping deformation is easy to occur after curing. This paper studies the key factors (material characteristics, product structure, etc.) which affect the warpage of fingerprint products. For a fingerprint packaging product, ANSYS is used Simulation analysis software, the material parameters and product structure are simulated and optimized, and the experimental results show that the combination of simulation and experiment can effectively control and optimize the warpage of the package, which is of great significance to ensure the production of fingerprint products.
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.
We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau$ = {$o_0$, $a_0$, $\dots$} and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.
The principles and characteristics of superheated steam drying was described.The research advancements of its applications in food drying was reviewed,including stimulant research and experimental research,emphasizing different materials dried on several different driers use superheated steam,and dried some materials combined superheated steam drying and other drying technology.Finally,the propose of study in future was given.
Highly conflicting evidence that may lead to the counter-intuitive results is one of the challenges for information fusion in Dempster-Shafer evidence theory. To deal with this issue, evidence conflict is investigated based on belief divergence measuring the discrepancy between evidence. In this paper, the pignistic probability transform belief χ2 divergence, named as BBχ2 divergence, is proposed. By introducing the pignistic probability transform, the proposed BBχ2 divergence can accurately quantify the difference between evidence with the consideration of multi-element sets. Compared with a few belief divergences, the novel divergence has more precision. Based on this advantageous divergence, a new multi-source information fusion method is devised. The proposed method considers both credibility weights and information volume weights to determine the overall weight of each evidence. Eventually, the proposed method is applied in target recognition and fault diagnosis, in which comparative analysis indicates that the proposed method can realize the highest accuracy for managing evidence conflict.
Biodiesel was a environment-friendly and renewable energy.Traditional biodiesel production technologies of homogeneous catalysts had deficiency,such as difficult to separate catalysts and bring about some environmental problems.Several solid base catalysts for biodiesel production were reviewed,and the characteristic was analyzed.The direction of the development of heterogeneous base catalyst for biodiesel production in the future was presented.
Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interests in generative modeling. Fueled by its flexibility in the formulation and strong modeling power of the latent space, recent works built upon it have made interesting attempts aiming at the interpretability of text modeling. However, latent space EBMs also inherit some flaws from EBMs in data space; the degenerate MCMC sampling quality in practice can lead to poor generation quality and instability in training, especially on data with complex latent structures. Inspired by the recent efforts that leverage diffusion recovery likelihood learning as a cure for the sampling issue, we introduce a novel symbiosis between the diffusion models and latent space EBMs in a variational learning framework, coined as the latent diffusion energy-based model. We develop a geometric clustering-based regularization jointly with the information bottleneck to further improve the quality of the learned latent space. Experiments on several challenging tasks demonstrate the superior performance of our model on interpretable text modeling over strong counterparts.