Resistive random access memory (RRAM) based compute-in-memory (CIM) has shown great potential for deep neural network (DNN) inference. Prior works generally used an off-chip write-verify scheme to tighten the RRAM resistance distribution and off-chip analog-to-digital converter (ADC) references to fine-tune partial-sum quantization edges. Though off-chip techniques are viable for testing purposes, they are unsuitable for practical applications. This work presents an RRAM-CIM macro that features 1) on-chip write-verify to speed up initial weight programming and to periodically refresh cells to compensate for resistance drift under stress, and 2) on-chip ADC reference generation with column-wise tunability to mitigate offsets induced by process variation, guaranteeing CIFAR-10 accuracy of >90%. The design is taped out in a TSMC N40 RRAM process and achieves 36.4 TOPS/W for 1×1b MAC operations on the VGG-8 network.
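The on-chip write-verify referenced above amounts to a program-and-check loop per cell. The Python sketch below illustrates the idea only; the conductance model, tolerance, and pulse increments are illustrative assumptions, not the macro's actual calibration values.

```python
import random

def write_verify(target_g, tol=0.05, max_pulses=20):
    """Iteratively program one RRAM cell until its conductance falls
    within +/- tol of the target (toy conductance model, arbitrary units)."""
    g = random.uniform(0.2, 1.0)               # assumed initial (unprogrammed) conductance
    for pulse in range(max_pulses):
        if abs(g - target_g) <= tol * target_g:
            return g, pulse                     # verify passed: within tolerance
        if g < target_g:
            g += random.uniform(0.02, 0.08)     # SET pulse nudges conductance up
        else:
            g -= random.uniform(0.02, 0.08)     # RESET pulse nudges conductance down
    return g, max_pulses                        # give up after max_pulses iterations

final_g, pulses = write_verify(target_g=0.6)
print(f"programmed to {final_g:.3f} after {pulses} verify iterations")
```

The same loop can be re-run periodically as a refresh to pull drifted cells back into their target window.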
Rapid development in deep neural networks (DNNs) is enabling many intelligent applications. However, on-chip training of DNNs is challenging due to the extensive computation and memory bandwidth requirements. To relieve the memory wall bottleneck, the compute-in-memory (CIM) approach exploits analog computation along the bit lines of the memory array and thus significantly speeds up vector-matrix multiplications. So far, most CIM-based architectures target inference engines only, with training performed offline. In this article, we propose CIMAT, a CIM Architecture for Training. At the bitcell level, we design two transpose SRAM bit cells, 7T and 8T, to implement the bidirectional vector-matrix multiplication needed for feedforward (FF) and backpropagation (BP). Moreover, we design the peripheral circuitry, mapping strategy, and data flow for the BP process and weight update to support on-chip training based on CIM. To further improve training performance, we explore pipeline optimization of the proposed architecture. We utilize advanced 7-nm CMOS technology to design the CIMAT architecture with a 7T/8T transpose SRAM array that supports bidirectional parallel read. We explore the 8-bit training performance of ImageNet on ResNet-18, showing that the 7T-based design can achieve 3.38× higher energy efficiency (~6.02 TOPS/W), a 4.34× higher frame rate (~4,020 fps), and only 50 percent of the chip size compared to the baseline architecture with a conventional 6T SRAM array that supports row-by-row read only. Even better performance is obtained with the 8T-based architecture, which reaches ~10.79 TOPS/W and ~48,335 fps with 74 percent of the chip area of the baseline.
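Functionally, the bidirectional read enabled by the transpose bit cells means the same stored array can be read along either dimension: once for the FF pass and once, transposed, for the BP pass. The NumPy sketch below illustrates only this functional view; the binary values and array dimensions are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(128, 64))       # assumed binary weight array (rows x columns)

x_ff = rng.integers(0, 2, size=128)          # activations applied on rows (FF read)
y_ff = x_ff @ W                              # column-wise MAC -> feedforward partial sums

err_bp = rng.integers(0, 2, size=64)         # errors applied on columns (BP read)
y_bp = W @ err_bp                            # row-wise MAC -> backpropagated partial sums

print(y_ff.shape, y_bp.shape)                # (64,) and (128,): one array, two read directions
```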
This paper presents an ADC-free compute-in-memory (CIM) RRAM-based macro, exploiting fully analog intra-/inter-array computation. The main contributions include: 1) a lightweight input-encoding scheme based on pulse-width modulation (PWM), which improves compute throughput by ~7×; 2) a fully analog data-processing path between sub-arrays without explicit ADCs, which avoids quantization loss and reduces power by a factor of 11.6. The 40nm prototype chip with TSMC RRAM achieves an energy efficiency of 421.53 TOPS/W and a compute efficiency of 360 GOPS/mm² (normalized to binary operations) at 100 MHz.
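A minimal sketch of the PWM input-encoding idea, under the assumption that each multi-bit activation is mapped to a single pulse whose width is proportional to its value (the bit width and time unit below are placeholders, not the chip's actual parameters):

```python
import numpy as np

def pwm_encode(x, bits=4, t_unit=1e-9):
    """Map multi-bit activations to pulse widths proportional to their values."""
    levels = 2 ** bits
    x = np.clip(np.asarray(x), 0, levels - 1)   # keep inputs within the assumed range
    return x * t_unit                           # pulse width in seconds

acts = [3, 15, 7, 0]
print(pwm_encode(acts))
# One PWM pulse per input replaces `bits` bit-serial cycles, which is the
# intuition behind the throughput improvement over bit-serial input encoding.
```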
In the era of big data and artificial intelligence, hardware advancement in throughput and energy efficiency is essential for both cloud and edge computation. Because it merges data storage and computing units, compute-in-memory is becoming one of the most desirable choices for data-centric applications to mitigate the memory wall bottleneck of the von Neumann architecture. In this chapter, the recent architectural designs and underlying circuit/device technologies for compute-in-memory are surveyed. The related design challenges and prospects are also discussed to provide an in-depth understanding of the interactions between algorithms/architectures and circuits/devices. The chapter is organized hierarchically: the overview of the field (Introduction section); the principle of compute-in-memory (section "DNN Basics and Corresponding CIM Principle"); the latest architecture and algorithm techniques including network model, data flow, pipeline design, and quantization approaches (section "Architecture and Algorithm Techniques for CIM"); the related hardware support including embedded memory technologies such as static random access memories and emerging nonvolatile memories, as well as the peripheral circuit designs with a focus on analog-to-digital converters (section "Hardware Implementations for CIM Architecture"); and a summary and outlook of the compute-in-memory architecture (Conclusion section).
3D NAND Flash memory has been proposed as an attractive candidate for deep neural network (DNN) inference engines owing to its ultra-high density and commercially mature fabrication technology. However, the peripheral circuits need to be modified to enable compute-in-memory (CIM), and the chip architectures need to be redesigned for an optimized dataflow. In this work, we present a design of a 3D NAND-CIM accelerator based on the macro parameters of an industry-grade prototype chip. The DNN inference performance is evaluated using the DNN+NeuroSim framework. To exploit the ultra-high density of 3D NAND Flash, both input and weight duplication strategies are introduced to improve throughput. Benchmarking on a variety of VGG and ResNet networks was performed across CIM technology candidates including SRAM, RRAM, and 3D NAND. Compared to similar designs with SRAM or RRAM, the results show that the 3D NAND-based CIM design occupies only 17-24% of the chip size while achieving 1.9-2.7× better energy efficiency for 8-bit precision inference.
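The weight-duplication strategy can be pictured as storing several copies of a layer's weights in otherwise idle capacity and letting each copy serve a slice of the input batch in parallel. The sketch below only emulates that partitioning in software; the layer sizes, batch size, and number of copies are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 64))            # one layer's weight matrix
inputs = rng.standard_normal((8, 256))        # a batch of 8 input vectors

n_copies = 4                                  # assumed spare array capacity -> 4 weight copies
chunks = np.array_split(inputs, n_copies)     # each duplicated copy serves part of the batch

# In hardware all copies compute concurrently; here we just emulate the split.
outputs = np.vstack([chunk @ W for chunk in chunks])
print(outputs.shape)                          # (8, 64): same result, ~n_copies x throughput
```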
Machine learning inference engines are of great interest for smart edge computing. The compute-in-memory (CIM) architecture has shown significant improvements in throughput and energy efficiency for hardware acceleration. Emerging non-volatile memory technologies offer great potential for instant on/off through dynamic power gating. An inference engine is typically pre-trained in the cloud and then deployed to the field, which exposes it to new attack models such as chip cloning and neural network model reverse engineering. In this paper, we propose countermeasures to weight cloning and input-output pair attacks. The first strategy is weight fine-tuning that compensates for the analog-to-digital converter (ADC) offset of a specific chip instance while inducing a significant accuracy drop on cloned chip instances. The second strategy is weight shuffling with fake-row insertion, which allows accurate propagation of the neural network activations only when the correct key is applied.
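A minimal sketch of the shuffle-and-fake-rows idea: genuine and decoy weight rows are stored in a key-defined order, so only a holder of the key can restore the true rows before computing. The row counts and random weights below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.standard_normal((8, 4))                     # true weight rows
fake = rng.standard_normal((4, 4))                  # inserted fake (decoy) rows

key = rng.permutation(W.shape[0] + fake.shape[0])   # secret row ordering: the "key"
stored = np.vstack([W, fake])[key]                  # what actually sits in the array

def infer(x, stored, key=None):
    if key is None:                                 # attacker reading raw rows
        return x @ stored[: W.shape[0]]             # wrong rows -> garbage activations
    unshuffled = stored[np.argsort(key)]            # legitimate user restores row order
    return x @ unshuffled[: W.shape[0]]             # and discards the fake rows

x = rng.standard_normal(8)
print(np.allclose(infer(x, stored, key), x @ W))    # True only with the correct key
```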
Rapid development in deep neural networks (DNNs) is enabling many intelligent applications. However, on-chip training of DNNs is challenging due to the extensive computation and memory bandwidth requirements. To relieve the memory wall bottleneck, the compute-in-memory (CIM) approach exploits analog computation along the bit lines of the memory array and thus significantly speeds up vector-matrix multiplications. So far, most CIM-based prototype chips target inference engines only, with training performed offline. In this work, we design the data flow for the backpropagation (BP) process and weight update to support on-chip training based on CIM. We utilize advanced 7-nm CMOS technology to design the CIM architecture with a 7T transpose SRAM array that supports bidirectional parallel read. We explore the 8-bit training performance of ImageNet on ResNet-18 using this CIM architecture for training, namely CIMAT, showing that it can achieve 3.38× higher energy efficiency (~6.02 TOPS/W), a 4.34× higher frame rate (~4,020 fps), and only 50% of the chip size compared to the baseline architecture with a conventional 6T SRAM array that supports row-by-row read only.
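The weight-update step of the BP data flow reduces to an outer product between buffered input activations and incoming errors, quantized before being written back to the array for 8-bit training. The sketch below shows this arithmetic with placeholder dimensions, learning rate, and quantization scale, none of which are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((128, 64))        # weights held in the transpose SRAM array
a_in = rng.standard_normal(128)           # input activations buffered during the FF pass
delta = rng.standard_normal(64)           # error term arriving from the BP pass
lr = 1e-3                                 # assumed learning rate

# 8-bit training: quantize the gradient before applying it to the stored weights.
grad = np.outer(a_in, delta)
scale = np.abs(grad).max() / 127 + 1e-12  # symmetric 8-bit quantization scale
grad_q = np.round(grad / scale).astype(np.int8)

W -= lr * grad_q * scale                  # weight update written back to the array
print(W.shape, grad_q.dtype)
```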
Compute-in-memory (CIM) is an attractive solution to process the extensive workloads of multiply-and-accumulate (MAC) operations in deep neural network (DNN) hardware accelerators. A simulator with options for various mainstream and emerging memory technologies, architectures, and networks is a great convenience for fast early-stage design-space exploration of CIM hardware accelerators. DNN+NeuroSim is an integrated benchmark framework supporting flexible and hierarchical CIM array design options from the device level, to the circuit level, and up to the algorithm level. In this study, we validate and calibrate the predictions of NeuroSim against post-layout simulations of a 40-nm RRAM-based CIM macro. First, the memory device and CMOS transistor parameters are extracted from the foundry's process design kit (PDK) and employed in the NeuroSim settings; the peripheral modules and operating dataflow are also configured to match the actual chip implementation. Next, the area, critical path, and energy consumption values from the SPICE simulations at the module level are compared with those from NeuroSim. Adjustment factors are introduced to account for transistor sizing and wiring area in the layout, gate switching activity, post-layout performance drop, etc. We show that the prediction from NeuroSim is accurate, with chip-level error under 1% after calibration. Finally, the system-level performance benchmark is conducted with various device technologies and compared with the results before the validation. The general conclusions stay the same after the validation, but the performance degrades slightly due to the post-layout calibration.
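The calibration step can be viewed as scaling each analytical estimate by a module-level adjustment factor derived from the SPICE / post-layout comparison. The sketch below shows only this bookkeeping; the estimate and factor values are placeholders, not the paper's calibrated numbers.

```python
# Illustrative calibration pass over NeuroSim-style analytical estimates.
neurosim_estimate = {"area_um2": 52000.0, "latency_ns": 12.0, "energy_pJ": 340.0}

adjust = {
    "area_um2": 1.15,    # assumed wiring / transistor-sizing overhead seen in layout
    "latency_ns": 1.10,  # assumed post-layout critical-path slowdown
    "energy_pJ": 0.95,   # assumed lower gate switching activity than the default model
}

calibrated = {k: v * adjust[k] for k, v in neurosim_estimate.items()}

for k in neurosim_estimate:
    shift = abs(calibrated[k] - neurosim_estimate[k]) / calibrated[k] * 100
    print(f"{k}: raw={neurosim_estimate[k]:.1f}, calibrated={calibrated[k]:.1f} ({shift:.1f}% shift)")
```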
Compute-in-memory (CIM) is a promising approach that exploits analog computation inside the memory array to speed up the vector-matrix multiplication (VMM) for deep neural network (DNN) inference. SRAM has been demonstrated as a mature candidate for the CIM architecture due to its availability in advanced technology nodes. However, as the weights of the DNN model are stationary in the memory cells, this creates potential threats and vulnerabilities for the inference engine, such as model leakage. This work aims at developing a lightweight yet effective countermeasure to protect the DNN model in the CIM architecture. We modify the 6-transistor SRAM bit cell with dual wordlines to implement an XOR cipher without sacrificing the efficiency of parallel computation. Evaluations at 28 nm show that XOR-CIM provides enhanced security and achieves a 1.4× energy-efficiency improvement with no throughput loss, at only 2.5% area overhead compared to a normal-CIM design without encryption.
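Functionally, the XOR cipher means the cells hold keystream-masked weights and the key is re-applied during the in-memory read, so raw cell contents are useless to an attacker. The sketch below shows only this functional effect (the actual design realizes the XOR in-cell via the dual wordlines); array sizes, binary values, and the per-cell keystream are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
weights = rng.integers(0, 2, size=(4, 8), dtype=np.uint8)     # binary weights of the model
key = rng.integers(0, 2, size=weights.shape, dtype=np.uint8)  # assumed per-cell keystream

stored = weights ^ key                # ciphertext actually held in the SRAM cells

x = rng.integers(0, 2, size=4, dtype=np.uint8)

y_plain = x @ weights                 # reference MAC result with the true weights
y_cim = x @ (stored ^ key)            # on-chip read: key re-applied, correct result
y_attack = x @ stored                 # attacker reading raw cells gets wrong sums

print(np.array_equal(y_plain, y_cim), np.array_equal(y_plain, y_attack))  # True False
```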