We present a novel approach to generating functional vectors from assertions for RTL design verification. Our approach combines program-slicing-based design extraction, word-level SAT, and dynamic searching techniques. Through design extraction, vector generation need only consider the parts of the design related to the given assertion, so large practical designs can be handled. Constraint Logic Programming (CLP) naturally models mixed bit-level and word-level constraints, and word-level SAT techniques solve the mixed constraints in a unified framework, which yields high performance. Initial states derived from dynamic simulation can dramatically accelerate the search for functional vectors. A prototype system has been built, and experimental results on public benchmarks and industrial circuits demonstrate the efficiency of our approach and its applicability to large practical designs.
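The design-extraction idea above, keeping only the logic that can influence a given assertion, can be illustrated with a backward traversal of a signal dependency graph. This is a minimal sketch, not the paper's slicing algorithm: the `drivers` map, the signal names, and the function `cone_of_influence` are all hypothetical.

```python
def cone_of_influence(drivers, assertion_signals):
    """Return every signal that can affect the assertion.

    drivers maps each signal to the signals it is computed from
    (a hypothetical, hand-written dependency graph).
    """
    cone = set()
    stack = list(assertion_signals)
    while stack:
        sig = stack.pop()
        if sig in cone:
            continue  # already visited; also guards against feedback loops
        cone.add(sig)
        stack.extend(drivers.get(sig, ()))
    return cone


# Hypothetical toy design: an arbiter plus an unrelated counter and LED.
drivers = {
    "grant": ["req", "state"],
    "state": ["state", "req", "rst"],
    "count": ["count", "en"],
    "led": ["count"],
}

# An assertion that mentions only "grant" never needs the counter logic.
extracted = cone_of_influence(drivers, ["grant"])
```

In this toy case the counter and LED fall outside the cone, so a vector generator working on the extracted design would not have to model them.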
To address the state explosion problem in model checking of parameterized directory-based cache protocols, we put forward the concept of a pseudo-cutoff, a bound on the number of nodes that share the same memory block. Based on an analysis of the inherent characteristics of parallel programs, we deduce the pseudo-cutoff value for relaxed-consistency cache-coherent non-uniform memory architecture (CC-NUMA) systems under certain conditions. We use the pseudo-cutoff to effectively reduce the state space of the parameterized directory-based cache protocol, and present a new scheme to handle the low-probability case of wide sharing. Experimental results across different system scales show that protocol model optimization based on the pseudo-cutoff effectively reduces the state space of the parameterized cache protocol, accelerates verification, and improves the capability to verify large-scale cache protocols.
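To give a rough feel for why bounding the number of sharers shrinks the state space, the following sketch counts only the possible sharer sets of a single memory block, with and without a bound. It is an illustrative assumption, not the paper's model: a real protocol state also includes cache-line states and in-flight messages, and the function name and the example values (16 nodes, bound of 4) are hypothetical.

```python
from math import comb


def sharer_sets(num_nodes, max_sharers=None):
    """Count the distinct sharer sets a directory entry can record.

    Without a bound, every subset of nodes is possible (2**num_nodes).
    With a bound, only subsets of size <= max_sharers are counted.
    """
    if max_sharers is None:
        return 2 ** num_nodes
    return sum(comb(num_nodes, k) for k in range(max_sharers + 1))


unbounded = sharer_sets(16)               # 65536 subsets
bounded = sharer_sets(16, max_sharers=4)  # 2517 subsets
```

Even in this simplified count, the bounded case is more than an order of magnitude smaller, and the gap widens as the node count grows.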
The growing design-productivity gap has made designers shift toward using high-level synthesis (HLS) techniques to generate register-transfer-level designs from high-level languages. Unfortunately, this translation process is very complex and may introduce bugs into the generated design, which can create a mismatch between what a designer intends and what is actually implemented in the circuit. In this paper, we present an equivalence checking method to validate the result of HLS scheduling against the initial high-level program. Finite state machine with data path (FSMD) models are used to represent designs before and after scheduling. The proposed method uses a bisimulation relation approach to prove equivalence. The automatically established bisimulation relation guarantees that for each execution sequence in the design before scheduling, a related and equivalent execution sequence exists in the design after scheduling, and vice versa. Our method provides a unified way to deal with various scheduling optimizations. We have implemented our validation technique and compared it with a state-of-the-art HLS scheduling verification method. The promising results show the effectiveness and efficiency of our method.
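The notion of a bisimulation relation used above can be illustrated on plain finite labeled transition systems. The sketch below is a naive fixpoint computation, not the paper's FSMD-based procedure: it starts from all state pairs and removes any pair whose moves cannot be matched in both directions. The function name and the toy systems are hypothetical.

```python
def bisimilar(trans_a, init_a, trans_b, init_b):
    """Check strong bisimilarity of two finite labeled transition systems.

    trans_x maps every state to a set of (label, next_state) pairs.
    """
    # Start from the full relation and refine it to the greatest bisimulation.
    rel = {(s, t) for s in trans_a for t in trans_b}
    changed = True
    while changed:
        changed = False
        for s, t in list(rel):
            # Every move of s must be matched by an equally labeled move of t.
            fwd = all(
                any(lb == la and (s2, t2) in rel for lb, t2 in trans_b[t])
                for la, s2 in trans_a[s]
            )
            # And every move of t must be matched by a move of s.
            bwd = all(
                any(la == lb and (s2, t2) in rel for la, s2 in trans_a[s])
                for lb, t2 in trans_b[t]
            )
            if not (fwd and bwd):
                rel.discard((s, t))
                changed = True
    return (init_a, init_b) in rel


# A two-state loop, and the same behavior unrolled into four states.
loop = {"s0": {("a", "s1")}, "s1": {("b", "s0")}}
unrolled = {
    "t0": {("a", "t1")}, "t1": {("b", "t2")},
    "t2": {("a", "t3")}, "t3": {("b", "t0")},
}
# A system that differs in its second action.
other = {"u0": {("a", "u1")}, "u1": {("c", "u0")}}

same = bisimilar(loop, "s0", unrolled, "t0")
different = bisimilar(loop, "s0", other, "u0")
```

The unrolled system has more states but the same observable behavior, which loosely mirrors how a scheduled design can differ structurally from its source while remaining equivalent.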
The impact of the pulse quenching effect on the sensitive area is evaluated using three-dimensional technology computer-aided design (TCAD) numerical simulation. Simulation results show that the pulse quenching effect can effectively reduce the sensitive area of PMOS transistors. By adopting the off-state gate isolation technique, the sensitive area is further reduced.
In many-core Networks-on-Chip (NoCs), communication is on the critical path of system performance, and contended synchronization requests may cause a large performance penalty. Unlike conventional algorithm-based approaches, this paper addresses the barrier synchronization problem from the angle of optimizing its communication performance and proposes cooperative communication as a means to achieve efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs. With cooperative communication, routers collaborate with one another to accomplish a fast barrier synchronization task. Cooperative communication is implemented in our router at low cost. In comparative experiments, our approach exhibits high efficiency and good scalability.
High-performance implementation of matrix multiplication is essential for scientific computing. Memory access is likely to be the bottleneck of matrix multiplication. The widely used GotoBLAS GEMM implementation divides the entire matrix into several partitions that are assigned to different cores for parallelization. Traditionally, each core issues a DMA transfer to access its own partition in DRAM. However, issuing an independent DMA transfer for each core cannot efficiently exploit inter-core locality. In addition, multiple concurrent DMA transfers interfere with each other, further reducing DRAM access efficiency. We observe that the same row of neighboring partitions lies in the same DRAM page, which means that there is significant locality inherent in the address layout. We propose the coordinated DMA to exploit this locality efficiently. It invokes one transfer to serve all cores and moves data in a row-major manner to improve DRAM access efficiency. Compared with a baseline design, the coordinated DMA improves bandwidth by 84.8 percent and reduces DRAM energy consumption by 43.1 percent for micro-benchmarks. It also achieves higher performance on the GEMM and Linpack benchmarks. With much lower hardware cost, the coordinated DMA significantly outperforms an out-of-order memory controller.
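The locality observation above can be made concrete with a toy model that counts DRAM page activations for two access orders. This is a sketch under strong assumptions, not the paper's hardware: a single open page, a page that holds exactly one matrix row, and per-core transfers that run one after another rather than concurrently. All function names and sizes are hypothetical.

```python
def count_page_activations(addresses, page_bytes):
    """Count how often the open DRAM page changes for an address stream."""
    activations = 0
    open_page = None
    for addr in addresses:
        page = addr // page_bytes
        if page != open_page:
            activations += 1
            open_page = page
    return activations


def per_core_order(rows, cols, parts, elem_bytes):
    """Each core streams its own column partition, one partition at a time."""
    width = cols // parts
    for p in range(parts):
        for r in range(rows):
            for c in range(p * width, (p + 1) * width):
                yield (r * cols + c) * elem_bytes


def coordinated_order(rows, cols, elem_bytes):
    """One transfer walks the whole matrix in row-major order."""
    for r in range(rows):
        for c in range(cols):
            yield (r * cols + c) * elem_bytes


ROWS, COLS, PARTS, ELEM = 8, 64, 4, 8
PAGE = COLS * ELEM  # assumption: one matrix row fills one DRAM page

independent = count_page_activations(per_core_order(ROWS, COLS, PARTS, ELEM), PAGE)
coordinated = count_page_activations(coordinated_order(ROWS, COLS, ELEM), PAGE)
```

Under these assumptions the per-partition order opens every page once per partition (32 activations), while the row-major order opens each page once (8 activations). Real concurrent transfers interleave less regularly, so the model only indicates the direction of the effect.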
This paper presents a novel method for automatic functional vector generation from RT-level HDL descriptions based on path coverage and constraint solving. Compared with existing methods, this method has the following advantages: 1) it avoids generating redundant constraints, which accelerates the test generation process; 2) it uses decision models to solve the problem of propagating internal values to the primary inputs; 3) it can handle various HDL description styles and various styles of designs. Experimental results on several practical designs show that our method efficiently improves the functional vector generation process. The prototype system has been applied to verify the RTL description of a real 32-bit microprocessor core, and complex bugs that had remained hidden in the RTL description were detected.
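The goal of path-coverage-driven vector generation can be shown on a tiny example: find one input vector for every control path through a design. In the sketch below, exhaustive enumeration over 4-bit inputs stands in for the constraint solver, so it illustrates the objective rather than the paper's method. The design `toy_design` and all names are hypothetical.

```python
from itertools import product


def toy_design(a, b, sel):
    """Hypothetical RTL-like behavior; returns (path taken, output)."""
    path = []
    if sel:
        path.append("sel=1")
        out = (a + b) & 0xF  # 4-bit adder
    else:
        path.append("sel=0")
        out = (a - b) & 0xF  # 4-bit subtractor
    path.append("zero" if out == 0 else "nonzero")
    return tuple(path), out


def vectors_for_all_paths(width=4):
    """Return one input vector (a, b, sel) per distinct path.

    Brute-force search replaces constraint solving here, which only
    works because the input space is tiny.
    """
    found = {}
    values = range(1 << width)
    for a, b, sel in product(values, values, (0, 1)):
        path, _ = toy_design(a, b, sel)
        found.setdefault(path, (a, b, sel))  # keep the first vector per path
    return found


vectors = vectors_for_all_paths()
```

Each entry in `vectors` maps a path to an input that exercises it. A constraint-based generator would derive such inputs from the path conditions directly instead of enumerating the input space.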