Distributed shared memory is an architectural technique for providing a global view of memory on a distributed-store parallel machine by introducing mechanisms that copy remote areas of memory when required. One of the major problems of such a system is the performance penalty incurred by waiting for areas of memory to be copied. This can be ameliorated to some extent by using user annotations, compile-time analysis, or run-time prediction to aid pre-fetching of data. This paper proposes a decoupled run-time technique for pre-fetching in a distributed shared memory environment which is applicable in circumstances where static analysis is difficult and the access patterns are sufficiently irregular that run-time prediction may fail. The proposal takes the form of a dual-processor structure in which one processor performs a partial evaluation of the program and thereby anticipates data fetches before they are required by a second processor, which performs the full evaluation.
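The dual-processor idea can be sketched as a small simulation. Everything here is an illustrative assumption rather than a detail from the paper: the pre-fetch evaluator is modelled as running a fixed lookahead distance ahead of the main evaluator, pre-fetch latency is ignored, and the latencies and function names are invented for the example.

```python
# Minimal sketch of decoupled pre-fetching (all names and numbers hypothetical).
# The pre-fetch evaluator partially evaluates the program: it executes only the
# address-generating slice of each step, so it runs LOOKAHEAD accesses ahead of
# the main evaluator, which performs the full evaluation and ideally finds its
# remote pages already copied into the local cache.

LOOKAHEAD = 4          # how far ahead the pre-fetch evaluator runs
REMOTE_LATENCY = 100   # assumed cost of fetching a page from a remote node
LOCAL_LATENCY = 1      # assumed cost of a page already resident locally

def run(access_trace):
    """Simulate the dual-processor structure on a trace of page accesses."""
    cache = set()      # pages copied to the local node
    cycles = 0
    for i, page in enumerate(access_trace):
        # Pre-fetch evaluator: issue the fetch for the access LOOKAHEAD ahead
        # (simplification: the pre-fetched copy arrives instantly).
        if i + LOOKAHEAD < len(access_trace):
            cache.add(access_trace[i + LOOKAHEAD])
        # Main evaluator: pay remote latency only on a local miss.
        if page in cache:
            cycles += LOCAL_LATENCY
        else:
            cycles += REMOTE_LATENCY
            cache.add(page)
    return cycles

trace = [0, 1, 2, 3, 4, 5, 6, 7]
print(run(trace))  # only the first LOOKAHEAD accesses miss; the rest hit locally
```

In this toy model only the first `LOOKAHEAD` accesses pay remote latency, after which the pre-fetch evaluator stays ahead and every access hits the local cache, which is the behaviour the decoupled structure aims for when access patterns defeat static analysis.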
Program optimizations that were once done exclusively by either the architecture or the compiler are now being done by both. This blurred distinction offers opportunities to optimize performance and redefine the compiler-architecture interface. We describe an optimization continuum with compile time and post-run time as end points and show how different classes of optimizations fall within it. Most current commercial compilers are still at the compile-time end point, and only a few research efforts are venturing beyond it. As the gap between architecture and compiler closes, there are also attempts to completely redefine the architecture-compiler interface, to increase both performance and architectural flexibility.
Decoupled pre-fetching is a technique for reducing page-miss overheads in Distributed Shared Memory systems by separating the instructions responsible for data fetching from the main instruction stream and running them on a separate CPU whose function is to predict store accesses ahead of time. This approach differs from other pre-fetching approaches in that the predictions of data usage are obtained dynamically from partial evaluation of the program, which promises considerably better performance in circumstances where the access patterns are irregular and cannot be extracted by static analysis of the program. This paper reviews the techniques of decoupled pre-fetching, with particular emphasis on Cache-Only Memory Architectures (COMA). It then presents a more thorough evaluation of the ideas than has previously been attempted, using some of the SPLASH benchmarks. It is shown that the techniques perform well on some programs but that, as expected, the benefits of pre-fetching are negated when there is a high rate of data invalidation caused by global updating.
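The closing observation, that a high rate of invalidation negates pre-fetching, can be illustrated with a toy model. The model, names, and costs below are assumptions made for the example: a remote write is represented as purging a page from the local cache before it is used, so a pre-fetched copy that is invalidated must be fetched again at full remote latency.

```python
# Illustrative sketch (hypothetical model): global updates invalidate local
# copies, so a page pre-fetched by the lookahead processor may be gone again
# by the time the main evaluator uses it, wasting the pre-fetch entirely.

LOOKAHEAD = 4
REMOTE_LATENCY = 100
LOCAL_LATENCY = 1

def run(access_trace, invalidations):
    """invalidations[i] = set of pages invalidated by remote writers before step i."""
    cache = set()
    cycles = 0
    for i, page in enumerate(access_trace):
        cache -= invalidations.get(i, set())   # global updating purges local copies
        if i + LOOKAHEAD < len(access_trace):  # pre-fetch evaluator runs ahead
            cache.add(access_trace[i + LOOKAHEAD])
        if page in cache:                      # main evaluator's access
            cycles += LOCAL_LATENCY
        else:
            cycles += REMOTE_LATENCY
            cache.add(page)
    return cycles

trace = [0, 1, 2, 3, 4, 5, 6, 7]
print(run(trace, {}))                                   # pre-fetching effective
print(run(trace, {i: {i} for i in range(len(trace))}))  # every page invalidated before use
```

With no invalidations the simulation shows the usual benefit (only the start-up accesses miss); when every page is invalidated just before its use, every access pays full remote latency and the pre-fetch work is wasted, mirroring the paper's finding under heavy global updating.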