Xiaojian Ma

Beijing Institute for General Artificial Intelligence

Author Statistics

Papers

Citation

H-Index

i-10 index

Co-Authors

Shaofei Cai

Chinese Academy of Sciences

Song‐Chun Zhu

Beijing Academy of Artificial Intelligence

Siyuan Huang

Shanghai Artificial Intelligence Laboratory

Qing Li

Hong Kong Polytechnic University

Xiuli Han

Zhengzhou University

Anji Liu

Peking University

Chun Chang

Peking University

Haowei Lin

Beijing Academy of Artificial Intelligence

Yitao Liang

Chun Chang

Zhengzhou University

Cooperative Institutions

Chinese Academy of Sciences

Peking University

Tsinghua University

Zhejiang University

Shanghai Jiao Tong University

University of Chinese Academy of Sciences

Sichuan University

University of Science and Technology of China

Huazhong University of Science and Technology

Zhengzhou University

Research Field

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

arXiv (Cornell University) (2023)

Ziyu Zhu Xiaojian Ma Yixin Chen Zhidong Deng Siyuan Huang

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.

Closed captioning

3D modeling

3d model

10.48550/arxiv.2308.04352

Citations (1)

Research status and outlook on mechanism and kinetics of ethanol steam reforming for hydrogen production

Natural Gas Chemical Industry (2013)

Xiaojian Ma

Research progress on the reaction mechanism and kinetics of ethanol steam reforming for hydrogen production over different catalysts is reviewed. Kinetic studies of ethanol steam reforming use two kinds of models: empirical models in the form of power-function rate equations, and mechanism models in the form of hyperbolic rate equations. The mechanism models, which take the surface reaction as the rate-determining step (RDS), are divided into Langmuir-Hinshelwood (L-H) mechanism models and Eley-Rideal (E-R) mechanism models. Langmuir-Hinshelwood-Hougen-Watson (LHHW) mechanism models, which belong to the L-H mechanism, are also presented.
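
As a rough illustration of the model families named in this abstract, the generic textbook forms can be written as below. The symbols ($k$, $C_i$, $n_i$, $K_A$, $K_B$) and the choice of a bimolecular, single-site reaction are assumptions made for illustration; they are not taken from the reviewed paper, which may use different expressions for specific catalysts.

```latex
% Empirical power-law (power-function) rate equation
r = k \prod_i C_i^{\,n_i}

% Langmuir-Hinshelwood type: surface reaction between adsorbed A and adsorbed B as the RDS
r = \frac{k \, K_A K_B \, C_A C_B}{\left(1 + K_A C_A + K_B C_B\right)^{2}}

% Eley-Rideal type: adsorbed A reacts directly with gas-phase B
r = \frac{k \, K_A \, C_A C_B}{1 + K_A C_A}
```

The adsorption terms in the denominator are what give the mechanism models their hyperbolic shape, in contrast to the purely multiplicative power-law form.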

Citations (0)

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

arXiv (Cornell University) (2024)

Yue Fan Xiaojian Ma Rujie Wu Yuntao Du Jiaqi Li

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.

10.48550/arxiv.2403.11481

Citations (0)

Study on Warpage after Post Solidifying of Ultrathin Fingerprint Package Products

Ning Sun Xiaojian Ma Qu Fang Qiang Wei Huan Yang

As mobile electronic products shrink in volume and weight, the fingerprint packaging products used for unlocking need thinner structures. Warpage may occur in ultrathin plastic-sealed products, which seriously affects the producibility of the product. Because the materials in fingerprint packaging products have different coefficients of thermal expansion, warping deformation readily occurs after curing. This paper studies the key factors (material characteristics, product structure, etc.) that affect the warpage of fingerprint products. For one fingerprint packaging product, the ANSYS simulation software is used to simulate and optimize the material parameters and product structure. The experimental results show that combining simulation and experiment can effectively control and optimize the warpage of the package, which is of great significance for ensuring the production of fingerprint products.

Image warping

Package on package

10.1109/icept52650.2021.9568133

Citations (0)

SQA3D: Situated Question Answering in 3D Scenes

arXiv (Cornell University) (2022)

Xiaojian Ma Silong Yong Zilong Zheng Qing Li Yitao Liang

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.

Benchmark (surveying)

Commonsense reasoning

10.48550/arxiv.2210.07474

Citations (15)

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

arXiv (Cornell University) (2024)

Zihao Wang Shaofei Cai Zhancun Mu Haowei Lin Ceyao Zhang

We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.

Lexical analysis

10.48550/arxiv.2407.00114

Citations (0)

Research Advancements of Superheated Steam Drying of Food

Shipin yanjiu yu kaifa (2008)

Xiaojian Ma

The principles and characteristics of superheated steam drying are described. Research advances in its application to food drying are reviewed, including simulation research and experimental research, with emphasis on the drying of different materials in several types of dryers using superheated steam, and on the drying of some materials by combining superheated steam drying with other drying technologies. Finally, directions for future study are proposed.

Citations (0)

Conflict Management Method Based on a New Belief Divergence in Evidence Theory

IEICE Transactions on Information and Systems (2024)

Yin Zhu Xiaojian Ma Hang Wang

Highly conflicting evidence, which may lead to counter-intuitive results, is one of the challenges for information fusion in Dempster-Shafer evidence theory. To deal with this issue, evidence conflict is investigated based on belief divergence, which measures the discrepancy between pieces of evidence. In this paper, the pignistic probability transform belief χ² divergence, named the BBχ² divergence, is proposed. By introducing the pignistic probability transform, the proposed BBχ² divergence can accurately quantify the difference between pieces of evidence while taking multi-element sets into consideration. Compared with a few existing belief divergences, the novel divergence is more precise. Based on this divergence, a new multi-source information fusion method is devised. The proposed method considers both credibility weights and information volume weights to determine the overall weight of each piece of evidence. Finally, the proposed method is applied to target recognition and fault diagnosis, where comparative analysis indicates that it achieves the highest accuracy in managing evidence conflict.
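
The two ingredients this abstract names, the pignistic probability transform and a χ²-type divergence, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's definition of its proposed divergence: the function names `pignistic` and `chi2_divergence` are hypothetical, the transform assumes no mass on the empty set, and the divergence shown is the plain Pearson χ² divergence between two probability distributions.

```python
def pignistic(mass):
    """Pignistic transform: BetP(x) = sum over focal sets A containing x of m(A) / |A|.

    `mass` maps frozenset focal elements to their mass. Assumes m(empty set) = 0.
    """
    betp = {}
    for focal, m in mass.items():
        if not focal:
            continue  # assumption: the empty set carries no mass
        share = m / len(focal)  # split the mass evenly among the set's elements
        for x in focal:
            betp[x] = betp.get(x, 0.0) + share
    return betp


def chi2_divergence(p, q):
    """Pearson chi-square divergence: sum over x of (p(x) - q(x))^2 / q(x).

    Illustrative only. Terms where q(x) = 0 are skipped rather than treated
    as infinite, which is a simplification.
    """
    keys = set(p) | set(q)
    return sum(
        (p.get(k, 0.0) - q[k]) ** 2 / q[k]
        for k in keys
        if q.get(k, 0.0) > 0
    )


# Two hypothetical bodies of evidence over the frame {"a", "b"}
m1 = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}

p1, p2 = pignistic(m1), pignistic(m2)
print(p1, p2, chi2_divergence(p1, p2))
```

Splitting the mass of a multi-element set evenly among its members is what lets a divergence defined on ordinary probabilities take multi-element focal sets into account.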

Divergence (linguistics)

Dempster–Shafer theory

Information fusion

Kullback–Leibler divergence

Sensor Fusion

10.1587/transinf.2023edp7102

Citations (0)

Research Advancement of Solid Basic Catalysts for Biodiesel Production

Technology & Development of Chemical Industry (2008)

Xiaojian Ma

Biodiesel is an environmentally friendly and renewable energy source. Traditional biodiesel production technologies based on homogeneous catalysts have deficiencies, such as difficulty in separating the catalysts and the environmental problems they cause. Several solid base catalysts for biodiesel production are reviewed and their characteristics analyzed. Future directions for the development of heterogeneous base catalysts for biodiesel production are presented.

Environmentally Friendly

Citations (0)

Latent Diffusion Energy-Based Model for Interpretable Text Modeling

arXiv (Cornell University) (2022)

Peiyu Yu Sirui Xie Xiaojian Ma Baoxiong Jia Bo Pang

Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interests in generative modeling. Fueled by its flexibility in the formulation and strong modeling power of the latent space, recent works built upon it have made interesting attempts aiming at the interpretability of text modeling. However, latent space EBMs also inherit some flaws from EBMs in data space; the degenerate MCMC sampling quality in practice can lead to poor generation quality and instability in training, especially on data with complex latent structures. Inspired by the recent efforts that leverage diffusion recovery likelihood learning as a cure for the sampling issue, we introduce a novel symbiosis between the diffusion models and latent space EBMs in a variational learning framework, coined as the latent diffusion energy-based model. We develop a geometric clustering-based regularization jointly with the information bottleneck to further improve the quality of the learned latent space. Experiments on several challenging tasks demonstrate the superior performance of our model on interpretable text modeling over strong counterparts.

Interpretability

Leverage (statistics)

Regularization

10.48550/arxiv.2206.05895

Citations (6)