Explore Be-Nice Instruction Scheduling in Open64 for an Embedded SMT Processor
Abstract:
An SMT processor can fetch and issue instructions from multiple independent hardware threads at every CPU cycle. Hardware resources are therefore shared among the concurrently-running threads at a very fine grain, which can increase the utilization of the processor pipeline. However, the concurrently-running threads in an SMT processor may interfere with each other and stall the CPU pipeline. We call this kind of pipeline stall an inter-thread stall (ITS for short), or thread interlock. In this paper, we present our study of the ITS problem on an embedded heterogeneous SMT processor. Our experiments demonstrate that, for some test cases, 50% of the total pipeline stalls are caused by ITS. We have therefore developed a new instruction scheduling algorithm called be-nice instruction scheduling, based on Open64 Global Code Motion, to coordinate the conflicts between concurrent threads. The instruction scheduler uses thread interference information (obtained by profiling) as a heuristic to decrease the number of ITSs without sacrificing overall CPU performance. The experimental results show that, for our current test cases, the be-nice instruction scheduler reduces inter-thread stall cycles by 15% and increases the IPC of the critical thread by 2%-3%. The experiments are performed using the Open64 compiler infrastructure.
Keywords: pipeline stall, out-of-order execution, branch predictor, speculative multithreading
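The abstract only names the heuristic, so the following is a minimal illustrative sketch of how profiled interference information might act as a tie-breaker in a list scheduler; all names (Instr, interference_cost, pick_next) are hypothetical, not the actual Open64 Global Code Motion API.

```cpp
// Hypothetical sketch of a "be-nice" tie-breaking heuristic for a list
// scheduler. Names and fields are illustrative, not the paper's code.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Instr {
    int id;
    int critical_path_height;  // classic scheduling priority
    int interference_cost;     // profiled inter-thread stall cost (0 = none)
};

// Pick the next instruction from the ready list: keep the usual
// critical-path priority, but among candidates of equal priority,
// prefer the one with the lowest profiled inter-thread stall cost.
static Instr pick_next(std::vector<Instr>& ready) {
    auto best = std::max_element(
        ready.begin(), ready.end(), [](const Instr& a, const Instr& b) {
            if (a.critical_path_height != b.critical_path_height)
                return a.critical_path_height < b.critical_path_height;
            return a.interference_cost > b.interference_cost;  // "be nice"
        });
    Instr chosen = *best;
    ready.erase(best);
    return chosen;
}

int main() {
    std::vector<Instr> ready = {{0, 5, 9}, {1, 5, 1}, {2, 3, 0}};
    while (!ready.empty())
        std::printf("issue instr %d\n", pick_next(ready).id);
    return 0;
}
```

The design point this illustrates is that interference cost only breaks ties, so the conventional critical-path priority, and with it single-thread performance, is preserved, matching the paper's "without sacrificing overall CPU performance" claim.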
Related Papers:
In this paper, we propose a scalable and transparent parallelization scheme using threads for multi-core processors. The performance achieved by our scheme scales with the number of cores, and the application program is not affected by the actual number of cores. For efficiency, we designed the threads so that they do not suspend and do not start executing until the data they need are available. We implemented our design using three modules: the dependency controller, which manages dependencies among threads; the thread pool, which manages the ready threads; and the thread dispatcher, which fetches threads from the pool and executes them on the cores. Our design and implementation provide efficient thread scheduling with low overhead. Moreover, by hiding the actual number of cores, it achieves transparency. We confirmed the transparency and scalability of our scheme by applying it to an H.264 decoder program. With this scheme, the application program need not be modified even when the number of cores changes to meet disparate requirements. This shortens development time and reduces development cost.
Topics: multi-core processor
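As an illustration of the three-module structure described above (not the authors' actual implementation), here is a compact C++ sketch in which tasks carry dependency counts, become ready only once their inputs are available, and are pulled from a shared ready queue by dispatcher threads. All names (Task, Scheduler, dispatch_loop) are invented for this sketch.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Task {
    std::function<void()> body;
    std::atomic<int> unmet_deps{0};  // dependency-controller state
    std::vector<Task*> dependents;   // tasks unblocked when this one finishes
};

class Scheduler {
    std::queue<Task*> ready_;        // the pool of ready tasks
    std::mutex m_;
    std::condition_variable cv_;
    std::atomic<int> pending_{0};    // tasks not yet executed

public:
    void submit(Task* t) {
        ++pending_;
        if (t->unmet_deps == 0) make_ready(t);
    }
    void make_ready(Task* t) {
        { std::lock_guard<std::mutex> g(m_); ready_.push(t); }
        cv_.notify_one();
    }
    // Dispatcher: fetch a ready task, run it, then release its dependents.
    void dispatch_loop() {
        for (;;) {
            Task* t = nullptr;
            {
                std::unique_lock<std::mutex> g(m_);
                cv_.wait(g, [&] { return pending_ == 0 || !ready_.empty(); });
                if (ready_.empty()) return;   // all work drained
                t = ready_.front();
                ready_.pop();
            }
            t->body();
            for (Task* d : t->dependents)     // a task never starts until
                if (--d->unmet_deps == 0)     // the data it needs are ready
                    make_ready(d);
            if (--pending_ == 0) cv_.notify_all();
        }
    }
};

int main() {
    Task a, b, c;                    // a -> b -> c dependency chain
    a.body = [] { std::puts("parse headers"); };
    b.body = [] { std::puts("decode slice");  };
    c.body = [] { std::puts("output frame");  };
    a.dependents = {&b};  b.unmet_deps = 1;
    b.dependents = {&c};  c.unmet_deps = 1;

    Scheduler s;
    s.submit(&a); s.submit(&b); s.submit(&c);
    std::thread w1(&Scheduler::dispatch_loop, &s);
    std::thread w2(&Scheduler::dispatch_loop, &s);
    w1.join(); w2.join();
}
```

Note how the sketch realizes the paper's two stated properties: tasks never suspend (a task is only enqueued once all its inputs exist), and the number of worker threads is hidden from the task graph.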
Energy consumption is a critical issue in embedded systems design. One way to be energy efficient is to complete execution as early as possible. Multi-threaded processors reduce execution time by exploiting both instruction-level and thread-level parallelism, and offer an effective solution for energy saving. In a typical multi-threaded processor design, whenever the instruction pipeline has to stall on a high-latency operation, execution is switched to another thread so that the computing resources are effectively utilized and processor throughput improves. However, traditional designs use basic scheduling schemes, such as round robin, for thread selection, which is not suitable for real-time execution and is inefficient for a set of threads with unbalanced execution durations. In this paper, we propose 1) a thread scheduling approach that extends the life span of short threads to ensure efficient utilization of processor resources, and 2) a zero-switching-time hardware design, to achieve a minimal execution time for a given set of applications. We demonstrate the effectiveness of our design through experiments.
Topics: processor design
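The abstract does not spell out the scheduling policy; one plausible reading, sketched below purely as an illustration, is to prefer switching to threads with more remaining work, so short threads are stretched out and the processor never runs out of threads to hide latency with. Thread, remaining, and select_thread are invented names, not the paper's design.

```cpp
// Illustrative only: on a pipeline stall, switch to the runnable thread
// with the most remaining work, extending the life span of short threads.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Thread { int id; int remaining; };  // remaining work estimate

static int select_thread(const std::vector<Thread>& runnable) {
    auto it = std::max_element(runnable.begin(), runnable.end(),
        [](const Thread& a, const Thread& b) {
            return a.remaining < b.remaining;   // prefer long threads
        });
    return it->id;
}

int main() {
    std::vector<Thread> ts = {{0, 100}, {1, 10}, {2, 55}};
    std::printf("switch to thread %d\n", select_thread(ts));  // thread 0
}
```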
Related paper (abstract not available). Topics: SIMD, SPMD, Xeon Phi, execution model.
Simultaneous multithreading (SMT) can issue and execute multiple instructions from several independent threads each cycle. It greatly increases the throughput of superscalar processors, but the simultaneous execution of multiple threads also raises some issues, such as conflicts over shared hardware resources. Sharing branch prediction hardware among multiple threads is one such issue, and this sharing can have a great effect on branch prediction accuracy. Studying the effect of branch resolution policies on the performance of SMT processors is important because it can guide SMT processor design. Using an SMT processor simulator, this paper evaluates several well-known branch prediction schemes on an SMT architecture in which each thread executes an independent application, and analyzes the effect of the branch prediction schemes on prediction accuracy and overall processor performance in both single-threaded and multithreaded environments. We conclude that, on such an SMT processor, giving each thread its own branch predictor is a good choice; because each predictor can be small and simple, it also adds little additional hardware cost.
Topics: simultaneous multithreading, branch predictor, microarchitecture, speculative execution, superscalar, speculative multithreading
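The per-thread organization this study recommends can be illustrated with a textbook toy example: one small table of 2-bit saturating counters per hardware thread, so one thread's branches never disturb another's history. This is a generic sketch, not the simulator or predictor configurations used in the paper; kThreads, kEntries, and the PC indexing are arbitrary choices.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kThreads = 2;
constexpr int kEntries = 1024;  // small and simple per-thread table

// 2-bit saturating counters: 0..1 predict not-taken, 2..3 predict taken.
std::array<std::array<uint8_t, kEntries>, kThreads> table{};

bool predict(int tid, uint32_t pc) {
    return table[tid][(pc >> 2) % kEntries] >= 2;
}

void update(int tid, uint32_t pc, bool taken) {
    uint8_t& c = table[tid][(pc >> 2) % kEntries];
    if (taken  && c < 3) ++c;
    if (!taken && c > 0) --c;
}

int main() {
    // Train thread 0 on a taken branch; thread 1's table is untouched.
    for (int i = 0; i < 3; ++i) update(0, 0x400080, true);
    std::printf("thread 0: %d, thread 1: %d\n",
                predict(0, 0x400080), predict(1, 0x400080));  // 1, 0
}
```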
Simultaneous multithreaded (SMT) processors improve instruction throughput by fetching and running instructions from several threads simultaneously in a single cycle. As the number of competing threads increases, instruction throughput is largely determined by the fetch policy. We first describe an ideal fetch model and then propose a new, effective instruction fetch policy for SMT processors based on that ideal model. The basic idea of our policy is to select the two threads with the fewest instructions in the instruction queue and feed each selected thread as many instructions as it needs, up to eight in total. The key advantage of our policy is that it utilizes the fetch bandwidth more effectively than the ICOUNT.2.8 policy, so a significant increase in IPC can be achieved. Execution-driven simulation results show IPC improvements of up to 45%, and 17% on average, over the ICOUNT.2.8 fetch policy.
Topics: fetch
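The selection rule above is concrete enough to sketch. The following illustration (with invented names Thread, iq_count, can_fetch) picks the two threads with the fewest queued instructions and hands out the 8-wide fetch bandwidth greedily, instead of splitting it evenly as ICOUNT.2.8 does.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Thread {
    int id;
    int iq_count;   // instructions currently in the instruction queue
    int can_fetch;  // instructions this thread could fetch this cycle
};

// Returns (thread id, instructions to fetch) pairs, at most 8 in total.
std::vector<std::pair<int, int>> fetch_decision(std::vector<Thread> ts) {
    std::sort(ts.begin(), ts.end(),
              [](const Thread& a, const Thread& b) {
                  return a.iq_count < b.iq_count;  // fewest instructions first
              });
    std::vector<std::pair<int, int>> out;
    int budget = 8;                                // total fetch bandwidth
    for (size_t i = 0; i < ts.size() && i < 2 && budget > 0; ++i) {
        int n = std::min(ts[i].can_fetch, budget); // as many as it needs
        if (n > 0) out.push_back({ts[i].id, n});
        budget -= n;
    }
    return out;
}

int main() {
    for (auto [id, n] : fetch_decision({{0, 12, 5}, {1, 3, 6}, {2, 7, 4}}))
        std::printf("fetch %d instrs from thread %d\n", n, id);
    // Prints: 6 from thread 1, then 2 from thread 2 (8 total).
}
```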
We propose dynamic scheduler designs that improve scheduler scalability and reduce its complexity in SMT processors. Our first design is an adaptation of the recently proposed instruction packing to SMT. Instruction packing opportunistically packs two instructions (possibly from different threads), each with at most one non-ready source operand at the time of dispatch, into the same issue queue entry. Our second design, termed 2OP_BLOCK, takes these ideas one step further and completely avoids dispatching instructions with two non-ready source operands. This technique has several advantages. First, it reduces scheduling complexity (and the associated delays), as the logic needed to support instructions with two non-ready source operands is eliminated. More surprisingly, 2OP_BLOCK simultaneously improves performance, because the same issue queue entry may be reallocated multiple times to instructions with at most one non-ready source (which usually spend fewer cycles in the queue), instead of being hogged by an instruction that enters the queue with two non-ready sources. For schedulers with the capacity to hold 64 instructions, the 2OP_BLOCK design outperforms the traditional queue by 11% on average, and at the same time yields a 10% reduction in overall scheduling delay.
Topics: operand, instruction scheduling
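The 2OP_BLOCK dispatch rule itself reduces to a one-line predicate; here is a trivial, self-contained illustration (Uop and may_dispatch are invented names) of gating out instructions whose two source operands are both non-ready.

```cpp
#include <cstdio>

struct Uop { bool src0_ready, src1_ready; };

// 2OP_BLOCK-style gate: allow dispatch into the issue queue only if at
// most one source operand is non-ready; hold the 2-non-ready case back.
bool may_dispatch(const Uop& u) {
    int not_ready = !u.src0_ready + !u.src1_ready;
    return not_ready <= 1;
}

int main() {
    std::printf("%d %d %d\n",
                may_dispatch({true,  true}),    // 1: both ready
                may_dispatch({false, true}),    // 1: one non-ready, allowed
                may_dispatch({false, false}));  // 0: held at dispatch
}
```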
Related paper (abstract not available). Topics: simultaneous multithreading, speculative multithreading, speculative execution.
Related paper (abstract not available). Topics: instruction scheduling, simultaneous multithreading, superscalar, branch predictor, speculative execution.