Huiying Lan

Nanchang University

Author Statistics

Papers

Citation

H-Index

i-10 index

Research Trends

Author Order

Document Type

Co-Authors

Yunji Chen

Institute of Computing Technology

Zidong Du

Chinese Academy of Sciences

Tianshi Chen

Cambricon (China)

Shaoli Liu

Yantai University

Qi Guo

Chinese Academy of Sciences

Jinhua Tao

State Key Laboratory of Remote Sensing Science

Shijin Zhang

Affiliated Eye Hospital of Wenzhou Medical College

Lei Zhang

Fraunhofer Research Institution for Microsystems and Solid State Technologies

Lingjie Xu

National Medical Products Administration

Linshan Jiang

National University of Singapore

Cooperative Institutions

Chinese Academy of Sciences

Institute of Computing Technology

University of Chinese Academy of Sciences

Cambricon (China)

Beijing Institute of Technology

Tsinghua University

Sun Yat-sen University

University of Science and Technology of China

Xi'an Jiaotong University

Southwest University of Science and Technology

Author Statistics

Papers

Citation

H-Index

i-10 index

Research Field

DLPlib: A Library for Deep Learning Processor

Journal of Computer Science and Technology (2017)

Huiying Lan Linyang Wu Xiao Zhang Jinhua Tao Xun-Yu Chen

Generality

10.1007/s11390-017-1722-2

Cite

Citations (5)

An Instruction Set Architecture for Machine Learning

ACM Transactions on Computer Systems (2018)

Yunji Chen Huiying Lan Zidong Du Shaoli Liu Jinhua Tao

Machine Learning (ML) are a family of models for learning from the data to improve performance on a certain task. ML techniques, especially recent renewed neural networks (deep neural networks), have proven to be efficient for a broad range of applications. ML techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which usually are not energy efficient, since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of ML techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an ML technique (such as layers in neural networks) or even an ML as a whole. Although straightforward and easy to implement for a limited set of similar ML techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different ML techniques with sufficient flexibility and efficiency. In this article, we first propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. We then extend the application scope of Cambricon from NN to ML techniques. We also propose an assembly language, an assembler, and runtime to support programming with Cambricon, especially targeting large-scale ML problems. Our evaluation over a total of 16 representative yet distinct ML techniques have demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of ML techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [7] (which can only accommodate three types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks and 7 other ML benchmarks. Compared to the recent prevalent ML accelerator PuDianNao, our Cambricon-based accelerator is able to support all the ML techniques as well as the 10 NNs but with only approximate 5.1% performance loss.

10.1145/3331469

Cite

Citations (12)

Leveraging Subgraph Extraction for Performance Portable Programming Frameworks on DL Accelerators

Lecture notes in computer science (2018)

Xiao Zhang Huiying Lan Tian Zhi

Intuition

Software framework

10.1007/978-3-030-05677-3_21

Cite

Citations (1)

DLIR: An Intermediate Representation for Deep Learning Processors

Lecture notes in computer science (2018)

Huiying Lan Zidong Du

Intrinsics

Representation

10.1007/978-3-030-05677-3_19

Cite

Citations (1)

Assembly language and assembler for deep learning accelerators

High Technology Letters (2019)

Huiying Lan Linyang Wu Han Dong Du Zidong

Assembly language

10.3772/j.issn.1006-6748.2019.04.006

Cite

Citations (1)

Cambricon-X: An accelerator for sparse neural networks

Shijin Zhang Zidong Du Lei Zhang Huiying Lan Shaoli Liu

Neural networks (NNs) have been demonstrated to be useful in a broad range of applications such as image recognition, automatic translation and advertisement recommendation. State-of-the-art NNs are known to be both computationally and memory intensive, due to the ever-increasing deep structure, i.e., multiple layers with massive neurons and connections (i.e., synapses). Sparse neural networks have emerged as an effective solution to reduce the amount of computation and memory required. Though existing NN accelerators are able to efficiently process dense and regular networks, they cannot benefit from the reduction of synaptic weights. In this paper, we propose a novel accelerator, Cambricon-X, to exploit the sparsity and irregularity of NN models for increased efficiency. The proposed accelerator features a PE-based architecture consisting of multiple Processing Elements (PE). An Indexing Module (IM) efficiently selects and transfers needed neurons to connected PEs with reduced bandwidth requirement, while each PE stores irregular and compressed synapses for local computation in an asynchronous fashion. With 16 PEs, our accelerator is able to achieve at most 544 GOP/s in a small form factor (6.38 mm ² and 954 mW at 65 nm). Experimental results over a number of representative sparse networks show that our accelerator achieves, on average, 7.23x speedup and 6.43x energy saving against the state-of-the-art NN accelerator.

Speedup

Hardware acceleration

10.1109/micro.2016.7783723

Cite

Citations (522)

FusionFrame: A Fusion Dataflow Scheduling Framework for DNN Accelerators via Analytical Modeling

Lecture notes in computer science (2025)

Liutao Zheng Huiying Lan Xiang Liu Linshan Jiang Xuehai Zhou

10.1007/978-981-96-1551-3_21

Cite

Citations (0)

ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices

arXiv (Cornell University) (2024)

Xiang Liu Yijun Song Xia Li Yifei Sun Huiying Lan

Deep learning models are increasingly deployed on resource-constrained edge devices for real-time data analytics. In recent years, Vision Transformer models and their variants have demonstrated outstanding performance across various computer vision tasks. However, their high computational demands and inference latency pose significant challenges for model deployment on resource-constraint edge devices. To address this issue, we propose a novel Vision Transformer splitting framework, ED-ViT, designed to execute complex models across multiple edge devices efficiently. Specifically, we partition Vision Transformer models into several sub-models, where each sub-model is tailored to handle a specific subset of data classes. To further minimize computation overhead and inference latency, we introduce a class-wise pruning technique that reduces the size of each sub-model. We conduct extensive experiments on five datasets with three model structures, demonstrating that our approach significantly reduces inference latency on edge devices and achieves a model size reduction of up to 28.9 times and 34.1 times, respectively, while maintaining test accuracy comparable to the original Vision Transformer. Additionally, we compare ED-ViT with two state-of-the-art methods that deploy CNN and SNN models on edge devices, evaluating accuracy, inference time, and overall model size. Our comprehensive evaluation underscores the effectiveness of the proposed ED-ViT framework.

10.48550/arxiv.2410.11650

Cite

Citations (0)

Short-Term Traffic Planning and Forecasting System Based on Vehicle-Road Coordination

Lecture notes in electrical engineering (2020)

Bocheng Liu Peijing Cai Huiying Lan Pei Wang

Traffic planning

10.1007/978-981-15-3250-4_115

Cite

Citations (1)

Addressing Sparsity in Deep Neural Networks

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018)

Xuda Zhou Yunji Chen Zidong Du Shijin Zhang Lei Zhang

Neural networks (NNs) have been demonstrated to be useful in a broad range of applications, such as image recognition, automatic translation, and advertisement recommendation. State-of-the-art NNs are known to be both computationally and memory intensive, due to the ever-increasing deep structure, i.e., multiple layers with massive neurons and connections (i.e., synapses). Sparse NNs have emerged as an effective solution to reduce the amount of computation and memory required. Though existing NN accelerators are able to efficiently process dense and regular networks, they cannot benefit from the reduction of synaptic weights. In this paper, we propose a novel accelerator, Cambricon-X, to exploit the sparsity and irregularity of NN models for increased efficiency. The proposed accelerator features a processing element (PE)-based architecture consisting of multiple PEs. An indexing module efficiently selects and transfers needed neurons to connected PEs with reduced bandwidth requirement, while each PE stores irregular and compressed synapses for local computation in an asynchronous fashion. With 16 PEs, our accelerator is able to achieve at most 544 GOP/s in a small form factor (6.38 mm ² and 954 mW at 65 nm). Experimental results over a number of representative sparse networks show that our accelerator achieves, on average, $7.23\times$ speedup and $6.43\times$ energy saving against the state-of-the-art NN accelerator. We further investigate possibilities of leveraging activation sparsity and multi-issue controller, which improve the efficiency of Cambricon-X. To ease the burden of programmers, we also propose a high efficient library-based programming environment for our accelerator.

Speedup

10.1109/tcad.2018.2864289

Cite

Citations (19)