GA–EDA: Hybrid Design Space Exploration Engine for Multicore Architecture
Citations: 44, References: 10
Abstract:
The emergence of modern multicore architectures has made runtime reconfiguration of system resources possible. The reconfigurable system resources constitute a design space, and the selection of a suitable configuration of these resources to improve system performance is known as Design Space Exploration (DSE). This reconfiguration capability enables appropriate allocation of system resources to improve efficiency in terms of performance, energy consumption, throughput, and other metrics. Techniques such as exhaustive search of the design space and reliance on the architect's experience are commonly used to optimize system resources toward desired goals. In this work, we hybridize two optimization algorithms, the Genetic Algorithm (GA) and the Estimation of Distribution Algorithm (EDA), for DSE of computer architectures. The hybrid algorithm achieves an optimal balance between two objectives (minimal energy consumption and maximal throughput) using decision variables such as the number of cores, cache size, and operating frequency. The final set of optimal solutions proposed by the GA–EDA hybrid is explored and verified by running benchmark applications from the SPLASH-2 suite on a cycle-level simulator. The significant reduction in energy consumption, without a substantial impact on throughput, validates the use of the GA–EDA hybrid for DSE of multicore architectures. Moreover, the simulation results are compared with those of standalone GA, standalone EDA, and fuzzy logic to show the efficiency of the GA–EDA hybrid algorithm.

Keywords: Benchmark, Design space exploration, Control reconfiguration, Multi-core processor
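The hybrid loop the abstract describes can be sketched roughly as follows. The parameter ranges, the fitness proxies, and the 50/50 GA/EDA offspring split below are illustrative assumptions, not the paper's actual setup (which evaluates configurations on a cycle-level simulator with SPLASH-2 workloads):

```python
import random

# Hypothetical discrete design space; the paper's actual ranges are not given.
CORES = [1, 2, 4, 8, 16]
CACHE_KB = [256, 512, 1024, 2048]
FREQ_MHZ = [800, 1200, 1600, 2000]
SPACE = [CORES, CACHE_KB, FREQ_MHZ]

def fitness(cfg):
    """Toy bi-objective score: reward throughput, penalize energy.
    Stand-in for the cycle-level simulator used in the paper."""
    cores, cache, freq = cfg
    throughput = cores * freq * (1 + 0.1 * (cache / 256))  # proxy
    energy = cores * freq ** 1.5 * 1e-3 + cache * 0.01     # proxy
    return throughput / 1000 - energy / 100                # scalarized

def ga_offspring(p1, p2):
    """Uniform crossover plus single-gene mutation (the GA half)."""
    child = [random.choice(pair) for pair in zip(p1, p2)]
    i = random.randrange(len(child))
    child[i] = random.choice(SPACE[i])
    return child

def eda_offspring(elites):
    """Sample each gene from its empirical distribution over the
    elite set (the EDA half)."""
    return [random.choice([e[i] for e in elites]) for i in range(len(SPACE))]

def hybrid_dse(pop_size=20, generations=30, elite_frac=0.3, seed=1):
    random.seed(seed)
    pop = [[random.choice(dim) for dim in SPACE] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elites = pop[: max(2, int(elite_frac * pop_size))]
        nxt = elites[:]  # elitism: keep the best configurations
        while len(nxt) < pop_size:
            if len(nxt) % 2:  # alternate GA and EDA offspring
                nxt.append(ga_offspring(*random.sample(elites, 2)))
            else:
                nxt.append(eda_offspring(elites))
        pop = nxt
    return max(pop, key=fitness)

best = hybrid_dse()
print(best)
```

The EDA half rebuilds each gene's marginal distribution from the current elites, so the search concentrates probability mass on promising values while the GA half preserves recombination and mutation for diversity.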
A predictive dynamic reconfiguration management service is described here, targeting a new generation of multicore SoCs that embed multiple heterogeneous reconfigurable cores. The main goal of the service is to hide reconfiguration overheads, thus permitting more dynamic reconfiguration. We describe an implementation of the reconfiguration service managing three heterogeneous cores; functional results are presented on generated multithreaded applications.
Keywords: Control reconfiguration, Multi-core processor
Citations: 8
Hardware design processes often involve time-consuming iteration loops, as feedback generally comes from long synthesis runs. This is even more true when multiple implementations must be compared to perform Design Space Exploration (DSE). To accelerate such flows and increase developer agility, closing the gap with software development methodologies, we propose quick feedback-generating transforms based on RTL circuit analysis for faster convergence of exploration. We also introduce a Hardware Construction Language (HCL) based methodology for building explorable circuit generators, and demonstrate its use on a General Matrix Multiply (GEMM) Chisel implementation. We show that using RTL estimation early in the exploration process results in 7× fewer synthesis runs and 4.1× faster convergence than an exhaustive synthesis process, while still achieving state-of-the-art performance when targeting a Xilinx VC709 FPGA.
Keywords: Design space exploration
Citations: 0
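The estimation-guided flow described above can be sketched as a two-stage search: a cheap analytical estimate prunes candidates before the expensive "synthesis" step is run. Both cost models below are hypothetical stand-ins for illustration, not the paper's actual RTL-analysis tooling:

```python
# Two-stage design space exploration: estimate first, synthesize the survivors.
from itertools import product

def quick_estimate(cfg):
    """Fast analytical proxy (stands in for RTL circuit analysis)."""
    unroll, width = cfg
    return unroll * width * 1.1  # crude area estimate

def full_synthesis(cfg):
    """Slow, accurate cost (stands in for an actual synthesis run)."""
    unroll, width = cfg
    return unroll * width + 0.3 * unroll

candidates = list(product([1, 2, 4, 8], [8, 16, 32]))
budget = 200  # area constraint

# Stage 1: keep only candidates whose cheap estimate fits the budget.
shortlist = [c for c in candidates if quick_estimate(c) <= budget]

# Stage 2: run "synthesis" only on the shortlist and pick the largest
# design that truly fits.
feasible = [c for c in shortlist if full_synthesis(c) <= budget]
best = max(feasible, key=lambda c: c[0] * c[1])
print(best, len(shortlist), "of", len(candidates), "synthesized")
```

A pessimistic estimator can wrongly prune feasible designs, so in practice the estimate is usually biased conservatively toward keeping borderline candidates.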
Deep neural networks (DNNs) have achieved spectacular success in recent years. In response to DNNs' enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, at both the algorithm level and the parallelization level, makes it hard to arrive at a one-size-fits-all hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can quickly and accurately estimate the area, performance, and energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL code. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural and micro-architectural parameters. Our proposed data movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit the parallelism inherent in DNNs. Results generated by NNest demonstrate: 1) previously undiscovered accelerator design points that outperform the state-of-the-art implementation by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different levels of quantization techniques, including recently proposed binary neural networks (BNNs).
Keywords: Design space exploration, Memory footprint, Deep neural networks
Citations: 25
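The Pareto frontier curves mentioned in the abstract are the set of non-dominated design points. A minimal sketch of extracting that frontier from (energy, latency) pairs, both to be minimized, with made-up design points:

```python
# Pareto-frontier extraction: a point is on the frontier if no other
# point is at least as good in every objective and strictly different.
def pareto_front(points):
    """Return points not dominated by any other point (lower is better
    in every coordinate)."""
    front = []
    for p in points:
        dominated = any(
            all(q[i] <= p[i] for i in range(len(p))) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (energy, latency) results for four accelerator designs.
designs = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (9.0, 9.0)]
print(pareto_front(designs))
```

Here (9.0, 9.0) is dominated by (8.0, 7.0) and drops out; the remaining three points each trade energy against latency and form the frontier.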
The need for efficient hardware accelerators for image processing kernels is a well-known problem. Unlike the conventional HDL-based design process, High-Level Synthesis (HLS) can directly convert a behavioral (C/C++) description into RTL code, reducing design complexity and design time while giving the user the opportunity for design space exploration. Due to the vast optimization possibilities in HLS, a proper application-level behavioral characterization is necessary to understand the leverage offered by these workloads, especially for facilitating parallel computation. In this work, we present a set of HLS optimization strategies derived by exploiting the most influential HLS characteristics of image processing algorithms. We also present an HLS benchmark suite, ImageSpec, to demonstrate our strategies and their efficiency in optimizing workloads spanning diverse domains within image processing. We show that an average performance-to-hardware gain of 143× can be achieved over the baseline implementation using our optimization strategies.
Keywords: High-Level Synthesis, Design space exploration, Benchmark, Code
Citations: 2
Convolutional Neural Networks (CNNs) have gained widespread popularity in the fields of computer vision and image processing. Due to the huge computational requirements of CNNs, dedicated hardware-based implementations are being explored to improve their performance. Hardware platforms such as Field Programmable Gate Arrays (FPGAs) are widely used to design parallel architectures for this purpose. In this paper, we analyze Winograd minimal filtering (fast convolution) algorithms to reduce the arithmetic complexity of the convolutional layers of CNNs. We explore a complex design space to find the sets of parameters that result in improved throughput and power efficiency. We also design a pipelined and parallel Winograd convolution engine that improves throughput and power efficiency while reducing the computational complexity of the overall system. Our proposed designs show up to 4.75× and 1.44× improvements in throughput and power efficiency, respectively, in comparison to the state-of-the-art design, while using approximately 2.67× more multipliers. Furthermore, we obtain savings of up to 53.6% in logic resources compared with the state-of-the-art implementation.
Keywords: Design space exploration, Convolution, Gate count, Gate array
Citations: 0
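The arithmetic saving behind Winograd minimal filtering is easiest to see in the one-dimensional F(2,3) case: two outputs of a 3-tap convolution computed with 4 multiplications instead of 6. A small sketch of the textbook algorithm, checked against direct convolution:

```python
# 1-D Winograd minimal filtering F(2,3): 4 multiplies replace 6.
def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Reference: sliding dot product (6 multiplications)."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -0.25]
print(winograd_f23(d, g), direct_conv(d, g))  # both: [1.75, 3.0]
```

In a CNN accelerator the filter-dependent factors (g0 + g1 + g2)/2 and (g0 - g1 + g2)/2 are precomputed once per filter, so the saving applies to every input tile; 2-D variants such as F(2x2, 3x3) nest this transform in both dimensions.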
Convolutional Neural Networks (CNNs) are gaining ground in the deep learning and Artificial Intelligence (AI) domains, and they can benefit from rapid prototyping to produce efficient, low-power hardware designs. The inference process of a Deep Neural Network (DNN) is computationally intensive and requires hardware accelerators to operate in real-world scenarios, due to the low-latency requirements of real-time applications. As a result, High-Level Synthesis (HLS) tools are gaining popularity, since they provide attractive ways to reduce the design complexity of register transfer level (RTL) implementations. In this paper, we implement a MobileNetV2 model using a state-of-the-art HLS tool in order to conduct a design space exploration and to provide insights on complex hardware designs tailored for DNN inference. Our goal is to combine design methodologies with sparsification techniques to produce hardware accelerators that achieve error metrics comparable to the corresponding state-of-the-art systems while significantly reducing inference latency and resource utilization. Toward this end, we apply sparse matrix techniques to a MobileNetV2 model for efficient data representation, and we evaluate our designs under two different weight-pruning approaches. Experimental results are evaluated on the CIFAR-10 data set using several design methodologies in order to fully explore their effects on the performance of the model under examination.
Keywords: Design space exploration, High-Level Synthesis, Pruning
Citations: 10
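The sparsification idea above combines two simple steps: prune small weights, then store only the survivors compactly. A minimal sketch of magnitude pruning followed by a compressed sparse row (CSR)-style packing; the weight matrix and threshold are made up, and this is a simplified stand-in for what the paper applies to MobileNetV2:

```python
# Magnitude-based pruning + CSR packing of the surviving nonzeros.
def prune_by_magnitude(matrix, threshold):
    """Zero out weights whose absolute value is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in matrix]

def to_csr(matrix):
    """Pack nonzeros into (values, col_indices, row_pointers)."""
    values, cols, rowptr = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                cols.append(j)
        rowptr.append(len(values))
    return values, cols, rowptr

weights = [[0.9, -0.05, 0.0], [0.02, -0.7, 0.3]]
pruned = prune_by_magnitude(weights, 0.1)
vals, cols, rowptr = to_csr(pruned)
print(vals, cols, rowptr)
```

In a hardware design, the CSR arrays let the datapath skip zero weights entirely, which is where the latency and resource savings come from.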
The systolic array is one of the most popular convolutional neural network accelerator architectures due to its high computation efficiency. Nevertheless, the huge design space and the complicated interactions among design parameters make it hard to find the best configuration for a given application. To overcome this issue, this paper presents NNeed, an evaluation and design space exploration engine for systolic-array CNN accelerators based on extensive dataflow analysis. It uses a highly configurable hardware template to describe accelerator operations in detail. The rapid evaluation provides PPA (power, performance, area) results, pipeline stage analysis, external memory access statistics, and more. NNeed explores the 9-dimensional design space and supports multiple objective functions for design optimization. Experimental results show that NNeed can generate an accelerator configuration with up to 23% and 50% improvements in performance and energy, respectively, compared with a typical handcrafted design.
Keywords: Design space exploration, Dataflow architecture, Design flow, Design methods
Citations: 0
According to the literature, designers spend up to 30% of design time on optimizing data representations in signal processing architectures [13]. Reference implementations, mostly in high-level software languages, use floating-point representations for mathematical calculations, which are in many cases too resource-intensive for FPGA implementations. The task of converting to bit-width-optimized fixed-point representations is tedious and therefore warrants automation. Usually an analytical or simulation-based approach is used for this, but past works tend to overcomplicate their mode of operation and are therefore not commonplace in FPGA design. In this work, we show that a simulation-based approach can be both fast, given modern hardware, and simple enough to be integrated into a modern design flow. We demonstrate and evaluate this using a real-world design from a complex power quality measurement algorithm. Our implementation reduced resource utilization by approximately 80% compared with the bit-widths proposed by a field expert, while retaining the accuracy needed for the target application.
Keywords: Datapath, Design flow, Design space exploration, Electronic design automation, Implementation, Representation
Citations: 0
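The core of a simulation-based bit-width search is simple: quantize a reference signal at increasing fractional bit-widths and keep the smallest width whose worst-case error meets a tolerance. The test signal and tolerance below are illustrative, not from the paper:

```python
# Simulation-based fixed-point bit-width search.
import math

def quantize(x, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def min_frac_bits(samples, tol, max_bits=32):
    """Smallest fractional bit-width whose worst-case rounding error
    over the simulated samples stays within the tolerance."""
    for bits in range(max_bits + 1):
        err = max(abs(quantize(x, bits) - x) for x in samples)
        if err <= tol:
            return bits
    return None

signal = [math.sin(2 * math.pi * k / 64) for k in range(64)]
bits = min_frac_bits(signal, tol=1e-3)
print(bits)
```

Because the worst-case rounding error at b fractional bits is bounded by 2**-(b + 1), the search is monotone in practice and terminates at the first width that passes; real flows simulate the whole datapath rather than a single signal.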
The vast number of transistors available through modern fabrication technology gives architects an unprecedented amount of freedom in chip-multiprocessor (CMP) design. However, such freedom translates into a design space that is impossible to explore fully, or even to any significant fraction, through detailed simulation. In this paper we propose to address this problem using predictive modeling, a well-known machine learning technique. More specifically, we build models that, given only a minute fraction of the design space, are able to accurately predict the behavior of the remaining designs orders of magnitude faster than simulating them. In contrast to previous work, our models can predict performance metrics not only for unseen CMP configurations of a given application, but also for unseen configurations of a new application that was not in the set of applications used to build the model, given only a very small number of results for this new application. We perform extensive experiments to show the efficacy of the technique for exploring the design space of CMPs running parallel applications. The technique is used to predict both energy-delay and execution time. Choosing both explicitly parallel applications and applications parallelized using the thread-level speculation (TLS) approach, we evaluate performance on a CMP design space with about 95 million points, using 18 benchmarks with up to 1000 training points each. For the energy-delay metric, prediction errors range from 2.4% to 4.6% for unseen configurations of the same application and from 3.1% to 4.9% for configurations of new applications.
Keywords: Design space exploration, Performance metric, Multi-core processor
Citations: 45
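The predictive-modeling idea above can be sketched in miniature: simulate a handful of configurations, fit a cheap model, then predict unseen configurations instead of simulating them. The Amdahl-style "simulator" and the single-feature linear model below are illustrative assumptions; the paper uses richer learned models over a roughly 95-million-point space:

```python
# Predict simulator output for unseen configurations from a few samples.
def simulate(cores):
    """Pretend cycle-level simulation: Amdahl-like runtime."""
    serial, parallel = 20.0, 80.0
    return serial + parallel / cores

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Train on a minute fraction of the space; the feature is 1/cores.
train_cores = [1, 2, 8]
xs = [1.0 / c for c in train_cores]
ys = [simulate(c) for c in train_cores]
a, b = fit_linear(xs, ys)

# Predict an unseen 4-core configuration without simulating it.
pred = a * (1.0 / 4) + b
print(round(pred, 3), simulate(4))
```

Here the toy simulator is exactly linear in 1/cores, so three training points recover it perfectly; real design spaces need nonlinear models and cross-application transfer, which is the paper's contribution.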