Simultaneous multithreading (SMT) is an attractive way of maximizing performance by improving processor utilization. We investigate the behavior of the memory hierarchy under SMT. First, we show that ignoring L2 cache contention leads to a strong overestimation of the achievable performance and may lead to incorrect conclusions. We then explore the impact of various memory hierarchy parameters. We show that the number of supported threads has to be matched to the cache size, that the L1 caches have to be associative, and that small blocks have to be used. The hardware constraints on the design of memory hierarchies should therefore limit the interest of SMT to a few threads.
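To make the contention effect concrete, the C sketch below models a toy direct-mapped L2 shared by two threads whose working sets alias to the same cache sets; the geometry, the address streams, and the strict interleaving are illustrative assumptions, not the configuration simulated here. Run alone, either sweep hits on every pass after the first; interleaved, each thread keeps evicting the other's blocks.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Toy shared L2: direct-mapped, 256 sets of 64-byte blocks (illustrative
 * geometry, not the configuration simulated in the paper). */
#define NSETS 256
#define BLOCK 64

static uint64_t tag[NSETS];
static int      valid[NSETS];

static void reset(void) { memset(valid, 0, sizeof valid); }

/* Returns 1 on hit; on a miss, fills the set, evicting the previous block. */
static int l2_access(uint64_t addr)
{
    uint64_t blk = addr / BLOCK;
    unsigned set = blk % NSETS;
    if (valid[set] && tag[set] == blk)
        return 1;
    valid[set] = 1;
    tag[set]   = blk;
    return 0;
}

int main(void)
{
    long hits, accesses;

    /* Thread 0 alone: its working set exactly fits the cache, so every
     * pass after the first hits. */
    reset();
    hits = accesses = 0;
    for (int pass = 0; pass < 16; pass++)
        for (uint64_t i = 0; i < NSETS; i++, accesses++)
            hits += l2_access(i * BLOCK);
    printf("1 thread : hit ratio %.2f\n", (double)hits / accesses);

    /* Two threads interleaved, with working sets that alias to the same
     * sets: each thread keeps evicting the other's blocks. */
    reset();
    hits = accesses = 0;
    for (int pass = 0; pass < 16; pass++)
        for (uint64_t i = 0; i < NSETS; i++, accesses += 2) {
            hits += l2_access(i * BLOCK);                   /* thread 0 */
            hits += l2_access(i * BLOCK + NSETS * BLOCK);   /* thread 1 */
        }
    printf("2 threads: hit ratio %.2f\n", (double)hits / accesses);
    return 0;
}
```

Even this crude model reproduces the qualitative effect: a per-thread hit ratio that is excellent in isolation can collapse once the threads share the L2.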
To achieve high performance on a single process, superscalar processors now rely on very complex out-of-order execution, and further improvements will require ever more speculation (e.g. value prediction). On the other hand, most operating systems now offer time-shared multiprocess environments. For the moment, most of the time is spent in a single thread, but this should change as computers perform more and more independent tasks; moreover, desktop applications tend to be multithreaded. Many users will then be more concerned with the throughput of the processor on a whole workload than with its performance on a single process. Simultaneous multithreading (SMT) is a promising approach to deliver high throughput from superscalar pipelines. In this paper, we show that when executing four threads on an SMT processor, out-of-order execution yields only a small performance benefit over in-order execution. For application domains where throughput matters more than the ultimate performance of a single application, SMT combined with in-order execution may therefore be a more cost-effective alternative than an aggressive out-of-order superscalar processor, with or without SMT.
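The intuition behind this result can be illustrated with a toy single-issue, in-order pipeline model; the fixed stall latency and round-robin issue policy below are simplifying assumptions, not the simulated microarchitecture. With enough ready threads, the stall cycles of one thread are filled with instructions from the others, providing much of the latency tolerance that out-of-order execution buys for a single thread.

```c
#include <stdio.h>

/* Each thread issues one instruction and then stalls for STALL cycles
 * (e.g. a cache-miss latency); the core issues round-robin from whichever
 * thread is ready. Numbers are illustrative. */
#define STALL 3

static double utilization(int nthreads, long cycles)
{
    long ready_at[8] = {0};   /* cycle at which each thread can issue again */
    long issued = 0;
    int  next = 0;
    for (long c = 0; c < cycles; c++) {
        for (int k = 0; k < nthreads; k++) {   /* round-robin scan */
            int t = (next + k) % nthreads;
            if (ready_at[t] <= c) {
                issued++;
                ready_at[t] = c + 1 + STALL;
                next = (t + 1) % nthreads;
                break;
            }
        }
    }
    return (double)issued / cycles;
}

int main(void)
{
    for (int n = 1; n <= 4; n++)
        printf("%d thread(s): utilization %.2f\n", n, utilization(n, 100000));
    return 0;
}
```

With a 3-cycle stall after every instruction, utilization in this model grows from 0.25 with one thread to 1.00 with four, without any reordering hardware.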
In this paper, we examine the behavior of three of the best-performing branch prediction strategies proposed in the literature when several threads of instructions execute simultaneously. Our simulations show that in a multiprogramming environment, if the sizes of the tables (PHT/BTB) are scaled with the number of active threads, there is very little interference. With parallel workloads, one could have expected a beneficial sharing effect; in fact, the effect depends heavily on the branch predictor, and even in the best case the gains remain very limited. We also show that, for all three predictors, whether in multiprogramming or in parallel processing, if the tables are kept small, conflicts in the BTB induce a significant increase in mispredictions. However, for parallel processing with the gshare scheme, the misprediction ratios for 2 or 4 threads stay below those exhibited by a single thread. Finally, we study the impact of adding one Return Address Stack per context and show that a 12-deep stack per thread is sufficient to greatly enhance branch prediction accuracy.
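For reference, the sketch below implements the gshare scheme referred to above (a branch PC XORed with a global history register indexes a shared table of 2-bit saturating counters), together with a fixed-depth per-thread return address stack. The table size, the per-thread replication of the history register, and the thread count are illustrative assumptions, not the configurations evaluated here.

```c
#include <stdio.h>
#include <stdint.h>

#define PHT_BITS  12
#define PHT_SIZE  (1u << PHT_BITS)
#define NTHREADS  4
#define RAS_DEPTH 12

/* gshare: the PHT of 2-bit counters is shared by all threads; the global
 * history register is replicated per thread (an illustrative choice). */
struct gshare {
    uint8_t  pht[PHT_SIZE];
    uint32_t history[NTHREADS];
};

static unsigned index_of(const struct gshare *g, int tid, uint32_t pc)
{
    return ((pc >> 2) ^ g->history[tid]) & (PHT_SIZE - 1);
}

/* Predict taken when the counter is in one of its two upper states. */
static int predict(const struct gshare *g, int tid, uint32_t pc)
{
    return g->pht[index_of(g, tid, pc)] >= 2;
}

/* Saturating update toward the outcome, then shift it into the history. */
static void update(struct gshare *g, int tid, uint32_t pc, int taken)
{
    unsigned i = index_of(g, tid, pc);
    if (taken  && g->pht[i] < 3) g->pht[i]++;
    if (!taken && g->pht[i] > 0) g->pht[i]--;
    g->history[tid] = ((g->history[tid] << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1);
}

/* Per-thread fixed-depth return address stack; overflow wraps around and
 * overwrites the oldest entry. */
struct ras {
    uint32_t entry[RAS_DEPTH];
    unsigned top;
};

static void ras_push(struct ras *r, uint32_t ret_pc)
{
    r->top = (r->top + 1) % RAS_DEPTH;
    r->entry[r->top] = ret_pc;
}

static uint32_t ras_pop(struct ras *r)
{
    uint32_t pc = r->entry[r->top];
    r->top = (r->top + RAS_DEPTH - 1) % RAS_DEPTH;
    return pc;
}

int main(void)
{
    static struct gshare g;        /* zero-initialized: all counters at 0 */
    struct ras r = {{0}, 0};
    /* Train thread 0 on an always-taken branch: after 12 outcomes the
     * history register saturates to all ones, so the same counter is then
     * trained repeatedly and the prediction flips to taken. */
    for (int i = 0; i < 16; i++)
        update(&g, 0, 0x1000, 1);
    printf("0x1000 predicted %s\n", predict(&g, 0, 0x1000) ? "taken" : "not taken");
    ras_push(&r, 0x2004);          /* call site pushes its return address */
    printf("return predicted to 0x%x\n", (unsigned)ras_pop(&r));
    return 0;
}
```

Because the PHT is shared while the history registers and return address stacks are private, this structure also shows where cross-thread interference can arise: threads can conflict only in the shared counter table, which is why scaling the table sizes with the number of active threads removes most of the interference.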