Efficient Instruction Schedulers for SMT Processors
Abstract:
We propose dynamic scheduler designs that improve scheduler scalability and reduce scheduler complexity in SMT processors. Our first design adapts the recently proposed instruction packing to SMT: instruction packing opportunistically places two instructions (possibly from different threads), each with at most one non-ready source operand at dispatch, into the same issue queue entry. Our second design, termed 2OP_BLOCK, takes these ideas one step further and avoids dispatching instructions with two non-ready source operands altogether. This technique has several advantages. First, it reduces scheduling complexity (and the associated delays), because the logic needed to support instructions with two non-ready source operands is eliminated. More surprisingly, 2OP_BLOCK simultaneously improves performance: the same issue queue entry can be reallocated multiple times to instructions with at most one non-ready source (which usually spend fewer cycles in the queue) instead of being hogged by an instruction that enters the queue with two non-ready sources. For schedulers with the capacity to hold 64 instructions, the 2OP_BLOCK design outperforms the traditional queue by 11% on average while also reducing the overall scheduling delay by 10%.
Keywords: Operand; Instruction scheduling
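The 2OP_BLOCK dispatch policy can be illustrated with a toy in-order dispatch loop. This is a minimal behavioral sketch, not the paper's hardware design; the `Instr` class and register-tag sets are illustrative assumptions:

```python
from collections import deque

class Instr:
    """A dispatched instruction and the set of source register
    tags that have not yet been produced."""
    def __init__(self, name, sources):
        self.name = name
        self.sources = set(sources)

def dispatch_2op_block(pending, issue_queue, ready_regs, capacity):
    """Dispatch stage under the 2OP_BLOCK policy: an instruction whose
    two source operands are both non-ready stalls at dispatch instead
    of occupying an issue-queue entry; instructions with at most one
    non-ready source dispatch normally, in program order."""
    while pending and len(issue_queue) < capacity:
        instr = pending[0]
        non_ready = [s for s in instr.sources if s not in ready_regs]
        if len(non_ready) >= 2:
            break  # block in-order dispatch until a source becomes ready
        issue_queue.append(pending.popleft())
```

For example, with `ready_regs = {"r1"}`, an `add r1, r2` (one non-ready source) enters the queue, while a following `mul r3, r4` (two non-ready sources) waits at dispatch and never consumes an entry until `r3` or `r4` is produced.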
Recent programmable accelerators are faster and more energy efficient than general-purpose processors, but expose complex hardware/software abstractions to compilers. A key problem is instruction scheduling, which requires sophisticated algorithms for mapping instructions to distributed processing elements, routing operand dependences, and timing operand arrival to enable high throughput.
In this paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data-dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmark and the IMPACT compiler, reveal a substantial reduction in the number of copies between clusters.
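The chain-aware cluster assignment described above can be sketched as a greedy placement pass. This is a simplified illustration of the idea (keep an operand chain in one register file to avoid inter-cluster copies), not the paper's exact list-scheduling algorithm; the dependence-map representation is an assumption:

```python
def assign_clusters(deps, n_clusters=2):
    """Greedy sketch of chain-aware cluster assignment: place each
    instruction in the cluster that already holds most of its operand
    producers, breaking ties toward the least-loaded cluster.
    `deps` maps each instruction to its producers, in topological
    order. Returns the placement and the implied inter-cluster copies."""
    placement, load, copies = {}, [0] * n_clusters, 0
    for instr, srcs in deps.items():
        votes = [0] * n_clusters
        for s in srcs:
            votes[placement[s]] += 1
        # prefer the producers' cluster; otherwise balance the load
        cluster = max(range(n_clusters), key=lambda c: (votes[c], -load[c]))
        placement[instr] = cluster
        load[cluster] += 1
        copies += sum(1 for s in srcs if placement[s] != cluster)
    return placement, copies
```

A dependent chain such as a → b → c lands in one cluster (zero copies), while an independent instruction is steered to the lighter cluster for balance.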
Centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption, and are thus not suitable for consumer electronic devices. The consequence is the emergence of architectures with many interconnected clusters, each with a separate register file and a few functional units. Among the many inter-cluster communication models proposed, the extended operand model extends some of the operand fields of an instruction with a cluster specifier, allowing an instruction to read some of its operands from other clusters without any extra cost. Scheduling for clustered processors involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). A scheduler is responsible for resolving the conflicting requirements of aggressively exploiting the parallelism offered by the hardware and limiting inter-cluster communication to the available slots. This paper proposes an integrated spatial and temporal scheduling algorithm for extended operand clustered VLIW processors and evaluates its effectiveness in improving the run-time performance of the code without a code-size penalty.
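The integrated "when" and "where" decision can be sketched as a single scan that checks both a free functional unit and a free communication slot before committing to a cycle/cluster pair. This is a minimal model assuming one functional unit per cluster, not the paper's algorithm; the `xfer_slots` bookkeeping is an illustrative assumption:

```python
def schedule_op(producers, fu_busy, xfer_slots, horizon=16):
    """Integrated temporal ('when') and spatial ('where') decision,
    sketched for a machine with one functional unit per cluster.
    `producers` is a list of (cluster, cycle) pairs for the operand
    producers; `xfer_slots` maps a cycle to its free inter-cluster
    read slots, the resource the extended operand model consumes
    for each remote operand."""
    earliest = max((cyc + 1 for _, cyc in producers), default=0)
    for cycle in range(earliest, horizon):
        for cluster in sorted(fu_busy):
            if fu_busy[cluster].get(cycle):
                continue  # the single FU in this cluster is taken
            remote = sum(1 for cl, _ in producers if cl != cluster)
            if remote <= xfer_slots.get(cycle, 0):
                fu_busy[cluster][cycle] = True
                xfer_slots[cycle] = xfer_slots.get(cycle, 0) - remote
                return cycle, cluster
    return None  # no feasible slot within the horizon
```

If the producer's own cluster is occupied in the earliest cycle, the operation is placed on another cluster only when a communication slot is free that cycle, reflecting the "limit communication to available slots" constraint.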
Modern DSP processors are capable of performing multiple instructions concurrently. Due to limitations in their internal structure, parallel instructions often impose constraints on the operand data types. In this paper, we propose a novel code generation method that takes both the operand data type assignment and instruction scheduling problems into account. The operand data types can be adjusted during instruction scheduling to create more parallel instructions. A simulated-evolution-based iterative scheduling algorithm is devised to solve the operand data type assignment, instruction scheduling, and register allocation problems in one. Tested benchmark programs indicate that our method generates very efficient code in all cases.
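The simulated-evolution framework used above follows a generic rip-up-and-reschedule loop. The sketch below shows that generic loop only, with hypothetical `goodness` and `reschedule` callbacks standing in for the paper's (unspecified here) cost model and re-placement step:

```python
import random

def simulated_evolution(solution, goodness, reschedule, rounds=10, seed=0):
    """Generic simulated-evolution loop (sketch): each round, every
    placement survives with probability equal to its goodness score
    in [0, 1]; rejected placements are ripped up and re-done by
    `reschedule`, letting the search escape poor early decisions."""
    rng = random.Random(seed)
    for _ in range(rounds):
        keep = {k: v for k, v in solution.items()
                if rng.random() < goodness(k, solution)}
        ripped = set(solution) - set(keep)
        solution = reschedule(keep, ripped)
    return solution
```

In the paper's setting the "placements" would be operand type choices and instruction slots, and `goodness` would reward assignments that enable more parallel pairings; here both callbacks are toy stand-ins.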
Recently, Very Long Instruction Word (VLIW) architectures have gained popularity with the advent of EPIC computing and several embedded processors adopting the VLIW computing model. These architectures do not involve run-time reordering of instructions and have lower hardware complexity than out-of-order processors. The performance of a VLIW processor is dependent on the capability of the compiler to statically detect and exploit instruction-level parallelism. Static scheduling of instructions allows reordering over a larger scope than the scheduling window of a dynamically schedulable processor. However, accurate run-time information is not available at compile-time, and the compiler traditionally preserves the conservative program dependencies while scheduling. Some of the program dependencies can be overcome by speculatively executing instructions. Speculation predicts the execution behavior of instructions thereby increasing the number of instructions executing in parallel. This dissertation describes a conceptual framework that enables speculative execution in VLIW processors using prediction techniques. The framework design aims to facilitate translation of a prediction-driven optimization into the VLIW compiler and architecture. The framework has been applied to develop profile-driven optimizations that predict attributes of instruction data, namely computed values and operand bit-widths. Value prediction predicts an instruction value enabling speculative execution of instructions dependent on the predicted value. Bit-width prediction predicts widths of instruction operands allowing instructions with narrow operands to be packed together on a functional unit. Within each optimization, the framework application integrates static analysis and run-time information in a synergistic way. Profiling is used to estimate run-time behavior during static scheduling, and the cost of profiling is reduced by statically identifying the useful predictions of a program. 
The predictions are performed using novel schemes and captured at run-time by hardware that is able to effectively estimate the dynamic predictability behavior of instructions. The predictions are exploited using compiler transformations that speculate aggressively and recovery schemes that reduce the overhead of misprediction. The efficacy of the optimizations implemented using the framework is evaluated through empirical evaluations on realistic VLIW machine models. The experimental results indicate the framework to be quite effective in addressing the challenges of fine-grain parallelism in VLIW architectures.
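The bit-width prediction idea, pairing instructions with narrow predicted operands onto one functional unit, can be illustrated with a small packing pass. This is an illustrative sketch, not the dissertation's scheme; the `(name, predicted_width)` representation and the first-fit pairing are assumptions:

```python
def pack_by_width(instrs, fu_width=32):
    """Sketch of bit-width-driven packing: two instructions whose
    predicted operand widths sum to at most the functional-unit
    width share one FU slot; the rest issue alone. `instrs` is a
    list of (name, predicted_width) pairs."""
    slots, leftover = [], None
    for name, width in sorted(instrs, key=lambda x: x[1]):
        if leftover and leftover[1] + width <= fu_width:
            slots.append((leftover[0], name))  # pair two narrow ops
            leftover = None
        else:
            if leftover:
                slots.append((leftover[0],))   # issue alone
            leftover = (name, width)
    if leftover:
        slots.append((leftover[0],))
    return slots
```

A misprediction (an operand wider than predicted) would require the recovery schemes the text mentions; this sketch only models the optimistic packing decision.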