Monad: Towards Cost-Effective Specialization for Chiplet-Based Spatial Accelerators
3 Citations · 36 References · 10 Related Papers
Abstract:
Advanced packaging offers a new design paradigm in the post-Moore era, where many small chiplets can be assembled into a large system. Based on heterogeneous integration, a chiplet-based accelerator can be highly specialized for a specific workload, demonstrating extreme efficiency and cost reduction. To fully leverage this potential, it is critical to explore both the architectural design space for individual chiplets and the different integration options for assembling these chiplets, which existing proposals have yet to fully exploit. This paper proposes Monad, a cost-aware specialization approach for chiplet-based spatial accelerators that explores the tradeoffs between PPA and fabrication costs. To evaluate a specialized system, we introduce a modeling framework that considers the non-uniformity in dataflow, pipelining, and communication when executing multiple tensor workloads on different chiplets. We propose to combine the architecture and integration design spaces by uniformly encoding the design aspects of both and exploring them with a systematic ML-based approach. The experiments demonstrate that Monad can achieve an average of 16% and 30% EDP reduction compared with the state-of-the-art chiplet-based accelerators Simba and NN-Baton, respectively.
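As a quick reference for the headline metric, energy-delay product (EDP) rewards designs that are both fast and efficient. The sketch below shows how an EDP reduction like the paper's is computed; all numbers are invented for illustration and are not taken from the paper.

```python
# Illustrative EDP comparison between two design points.
# The energy/delay values below are made up, not Monad's results.

def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_j * delay_s

baseline = edp(2.0, 0.010)    # e.g. a Simba-like baseline (hypothetical)
candidate = edp(1.9, 0.0088)  # a specialized candidate design (hypothetical)

reduction = 1 - candidate / baseline
print(f"EDP reduction: {reduction:.1%}")  # about 16%
```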
The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute a layer or network. Selecting an optimal dataflow for a layer shape can have a large impact on utilization and energy efficiency, but there is a lack of understanding of the choices and consequences of dataflows, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the space of DNN dataflows in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration (DSE) experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
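To give a flavor of the kind of analytical cost modeling MAESTRO performs, the toy sketch below counts the multiply-accumulates of a convolutional layer and derives a compute-bound latency estimate. The formulas are deliberate simplifications for illustration, not MAESTRO's actual model.

```python
# Toy analytical cost sketch: MAC count and compute-bound latency
# for a conv layer. Simplified stand-in for a MAESTRO-style model.

def conv_macs(k, c, r, s, p, q):
    """Total multiply-accumulates for a conv layer with K output
    channels, C input channels, RxS filters, and PxQ output."""
    return k * c * r * s * p * q

def roofline_latency(macs, pes, util=1.0):
    """Compute-bound latency in cycles for `pes` PEs at utilization `util`."""
    return macs / (pes * util)

macs = conv_macs(k=64, c=64, r=3, s=3, p=56, q=56)
print(roofline_latency(macs, pes=256, util=0.8))  # roughly 5.6e5 cycles
```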
This paper presents a Design Space Exploration (DSE) methodology based on a temporal partitioning strategy for mapping massive computational dataflow problems onto FPGAs. In this approach the FPGAs work as co-processors in a hypothetical reconfigurable computing architecture. The temporal partitioning is based on Tabu Search strategies and libraries of IP-cores. This methodology allows design space exploration for optimizing the dataflow implementation on the FPGA and pre-runtime analysis. On synthetic benchmarks, this DSE technique achieves strong performance, in some cases better than other approaches in the literature.
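A minimal tabu-search skeleton for a toy two-way partitioning problem is sketched below; the cost function (load imbalance) and the single-flip move set are invented stand-ins for the paper's temporal-partitioning formulation.

```python
# Tabu-search sketch: assign task areas to two configurations while
# balancing load. Illustrative only; the paper's cost model differs.
import random

def cost(assign, areas):
    load = [0, 0]
    for a, part in zip(areas, assign):
        load[part] += a
    return abs(load[0] - load[1])  # imbalance to minimize

def tabu_search(areas, iters=200, tenure=5, seed=0):
    rng = random.Random(seed)
    assign = [rng.randint(0, 1) for _ in areas]
    best, best_cost = assign[:], cost(assign, areas)
    tabu = {}  # task index -> iteration until which its flip is tabu
    for it in range(iters):
        cand = None
        for i in range(len(areas)):
            nb = assign[:]
            nb[i] ^= 1
            c = cost(nb, areas)
            # skip tabu moves unless they beat the best (aspiration)
            if tabu.get(i, -1) > it and c >= best_cost:
                continue
            if cand is None or c < cand[1]:
                cand = (i, c, nb)
        if cand is None:
            break
        i, c, assign = cand
        tabu[i] = it + tenure
        if c < best_cost:
            best, best_cost = assign[:], c
    return best, best_cost

areas = [7, 3, 5, 4, 6, 2, 8, 1]
_, imbalance = tabu_search(areas)
print(imbalance)
```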
This paper presents a dataflow design methodology and an associated co-exploration environment, focusing on the optimization of buffer sizes. The approach is applicable to dynamic dataflow designs, and its performance is presented and validated by experimental results on the porting of an MPEG-4 Simple Profile decoder to the STM STHORM manycore platform. For the purpose of this work, the decoder has been written using the RVC-CAL dataflow language standardized by ISO/IEC. Starting from this high-level representation, it is demonstrated how the buffer size configuration can be optimized, based on a novel buffer size minimization algorithm suitable for a very general class of dataflow programs.
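The core idea of sizing a dataflow buffer can be sketched for the simplest case, a single edge under a fixed static schedule: the buffer must hold the peak token count observed over one schedule period. The rates and schedule below are invented, and the paper's algorithm handles far more general dynamic dataflow than this toy.

```python
# Toy buffer bound for one dataflow edge: simulate one period of an
# admissible static schedule and track the peak token count.

def min_buffer(schedule, prod_rate, cons_rate):
    tokens, peak = 0, 0
    for actor in schedule:
        if actor == "P":
            tokens += prod_rate   # producer firing emits tokens
        else:  # "C"
            tokens -= cons_rate   # consumer firing absorbs tokens
        peak = max(peak, tokens)
    return peak

# Producer emits 3 tokens per firing, consumer takes 2; schedule PPCCC
print(min_buffer("PPCCC", 3, 2))  # 6
```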
Analysis of trade-offs between energy efficiency and latency is essential to generate designs complying with a given set of constraints. Improvements in FPGA technologies offer a myriad of choices for power and performance optimizations, and various algorithm-intrinsic parameters also affect these objectives. These choices compound the design space, requiring efficient techniques to explore it quickly. Current techniques perform gate/RTL-level or functional-level power modeling, which is slow and hence not scalable. In this work we perform efficient design space exploration using a high-level performance model. We develop a semi-automatic design framework to generate energy-efficiency and latency trade-offs. The framework develops a performance model given a high-level specification of a design with minimal user assistance. It then explores the entire design space to generate the dominating designs with respect to energy efficiency and latency metrics. We illustrate the framework using a convolutional neural network, which has gained significance due to its application in deep learning. We simulate a few designs from the dominating set and show that the performance estimates for the dominating designs are close to the simulated results. We also show that our framework explores 6000 design points per minute on a commodity platform such as a Dell workstation, as opposed to state-of-the-art techniques, which explore 50 to 60 design points per minute.
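Extracting the "dominating designs" mentioned above is a Pareto-dominance filter over (energy, latency) pairs: a design survives only if no other design is at least as good in both metrics. The sketch below uses made-up data points for illustration.

```python
# Pareto filter over (energy, latency) estimates: lower is better
# in both dimensions. Design points below are invented.

def pareto_front(points):
    """Keep points not dominated by any other point."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front

designs = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0)]
print(pareto_front(designs))  # (3.0, 6.0) is dominated by (2.0, 5.0)
```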
Deep neural network (DNN) has achieved spectacular success in recent years. In response to DNN's enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, both at the algorithm level and the parallelization level, makes it hard to arrive at a "one-size-fits-all" hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL codes. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel data movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit parallelism inherent in DNN. Results generated by NNest demonstrate: 1) previously undiscovered accelerator design points that can outperform the state-of-the-art implementation by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different levels of quantization, including the recently proposed binary neural network (BNN).
This paper presents the main features of the TURNUS co-exploration environment, a unified design space exploration framework suitable for heterogeneous parallel systems designed using a high-level dataflow representation. The main functions of this tool are illustrated through the analysis of a video decoder implemented in the RVC-CAL dataflow language.
This paper presents a methodology to perform design space exploration of complex signal processing systems implemented using the CAL dataflow language. In the course of space exploration, the critical path in dataflow programs is first presented, and then analyzed using a new strategy for computational load reduction. These techniques, together with the detection of design bottlenecks, point to the most efficient optimization directions in a complex network. Following this analysis, several new refactoring techniques are introduced and applied to the dataflow program in order to obtain feasible design points in the exploration space. For an MPEG-4 AVC/H.264 decoder software and hardware implementation, the multi-dimensional space can be explored effectively for throughput, resource, and frequency, with real-time decoding ranging from QCIF to HD resolutions.
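The critical path of a dataflow program is, in the simplest acyclic case, the longest weighted path through the actor graph. The sketch below computes it by memoized recursion over successors; the actor names and weights are invented for illustration.

```python
# Toy critical-path computation over a dataflow DAG via memoized
# longest-path search. Actor names/weights are invented.
from functools import lru_cache

def critical_path(weights, edges):
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def longest(u):
        # cost of u plus the heaviest downstream chain
        return weights[u] + max((longest(v) for v in succ.get(u, [])), default=0)

    return max(longest(u) for u in weights)

weights = {"parse": 2, "idct": 5, "mc": 4, "merge": 1}
edges = [("parse", "idct"), ("parse", "mc"), ("idct", "merge"), ("mc", "merge")]
print(critical_path(weights, edges))  # 8: parse -> idct -> merge
```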
Convolutional Neural Networks (CNNs) have gained widespread popularity in the field of computer vision and image processing. Due to huge computational requirements of CNNs, dedicated hardware-based implementations are being explored to improve their performance. Hardware platforms such as Field Programmable Gate Arrays (FPGAs) are widely being used to design parallel architectures for this purpose. In this paper, we analyze Winograd minimal filtering or fast convolution algorithms to reduce the arithmetic complexity of convolutional layers of CNNs. We explore a complex design space to find the sets of parameters that result in improved throughput and power-efficiency. We also design a pipelined and parallel Winograd convolution engine that improves the throughput and power-efficiency while reducing the computational complexity of the overall system. Our proposed designs show up to 4.75× and 1.44× improvements in throughput and power-efficiency, respectively, in comparison to the state-of-the-art design while using approximately 2.67× more multipliers. Furthermore, we obtain savings of up to 53.6% in logic resources compared with the state-of-the-art implementation.
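The arithmetic reduction behind Winograd minimal filtering is easiest to see in one dimension: F(2,3) produces two outputs of a 3-tap convolution with 4 multiplications instead of 6. This is the standard algorithm, shown here as a plain Python sketch.

```python
# 1-D Winograd minimal filtering F(2,3): two outputs of a 3-tap
# convolution using 4 multiplications (m1..m4) instead of 6.

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

print(winograd_f23([1, 2, 3, 4], [1, 2, 3]))  # [14.0, 20.0]
```

The result matches the direct computation: y0 = 1·1 + 2·2 + 3·3 = 14 and y1 = 2·1 + 3·2 + 4·3 = 20.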
Convolution Neural Networks (CNNs) are gaining ground in deep learning and Artificial Intelligence (AI) domains, and they can benefit from rapid prototyping in order to produce efficient and low-power hardware designs. The inference process of a Deep Neural Network (DNN) is considered a computationally intensive process that requires hardware accelerators to operate in real-world scenarios due to the low latency requirements of real-time applications. As a result, High-Level Synthesis (HLS) tools are gaining popularity since they provide attractive ways to reduce design complexity at the register-transfer level (RTL). In this paper, we implement a MobileNetV2 model using a state-of-the-art HLS tool in order to conduct a design space exploration and to provide insights on complex hardware designs which are tailored for DNN inference. Our goal is to combine design methodologies with sparsification techniques to produce hardware accelerators that achieve error metrics within the same order of magnitude as the corresponding state-of-the-art systems while also significantly reducing the inference latency and resource utilization. Toward this end, we apply sparse matrix techniques on a MobileNetV2 model for efficient data representation, and we evaluate our designs in two different weight pruning approaches. Experimental results are evaluated with respect to the CIFAR-10 data set using several different design methodologies in order to fully explore their effects on the performance of the model under examination.
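A common sparse-matrix technique for pruned weights is compressed sparse row (CSR), which stores only the nonzeros plus index arrays. The small hand-written matrix below stands in for real pruned weights; the paper does not specify which sparse format it uses, so CSR here is an illustrative assumption.

```python
# Sketch: compress a pruned (mostly zero) weight matrix into CSR
# form -- values, column indices, and row pointers.

def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr

w = [[0, 2, 0],
     [1, 0, 0],
     [0, 0, 3]]
print(to_csr(w))  # ([2, 1, 3], [1, 0, 2], [0, 1, 2, 3])
```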