Private cache partitioning: A method to reduce the off-chip miss rate of concurrently executing applications in Chip-Multiprocessors
1 Citation · 25 References · 10 Related Papers
Abstract:
When several applications run concurrently on a Chip-Multiprocessor (CMP), allocating the on-chip cache capacity among the competing applications is a problem. Cache partitioning is commonly used to solve it. Existing cache partitioning schemes are either dedicated to shared-cache designs or partition the last-level cache based on limited memory information. This paper presents Private Cache Partitioning (PCP), a low-overhead runtime mechanism that partitions all of the private low-level caches, which are organized as one large shared cache by a distributed directory. The experimental results show that PCP reduces the overall miss rate of the competing applications and improves both throughput and weighted speedup.
Keywords: Cache pollution, Cache invalidation, Smart Cache, Cache-oblivious algorithm, Bus sniffing, Page cache, Speedup
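To make the partitioning decision concrete, the sketch below shows a generic way-partitioning pass driven by per-application miss curves, in the spirit of the utility-based partitioning mentioned among the citing works. It is not the paper's PCP mechanism (which partitions distributed private caches through a directory); the function name, miss counts, and way counts are invented for illustration.

# A minimal, generic sketch of way-based cache partitioning by marginal utility.
# Not the paper's PCP mechanism; the miss_curves values are hypothetical.

def partition_ways(miss_curves, total_ways):
    """Greedily give each way to the application whose misses drop the most.

    miss_curves[app][w] = misses observed if `app` owns w ways (w = 0..total_ways).
    Returns a dict mapping each application to its allocated number of ways.
    """
    alloc = {app: 0 for app in miss_curves}
    for _ in range(total_ways):
        best_app = max(
            alloc,
            key=lambda a: miss_curves[a][alloc[a]] - miss_curves[a][alloc[a] + 1],
        )
        alloc[best_app] += 1
    return alloc

if __name__ == "__main__":
    # Hypothetical miss counts for 0..8 allocated ways (illustrative only).
    curves = {
        "appA": [1000, 600, 400, 300, 250, 220, 200, 190, 185],
        "appB": [800, 780, 760, 500, 300, 200, 150, 120, 100],
    }
    print(partition_ways(curves, total_ways=8))

Related papers: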
The optimal size of a large on-chip cache can differ across programs: at some point, the reduction in cache misses achieved by increasing the cache size hits diminishing returns, while the higher cache latency hurts performance. This paper presents the Amorphous Cache (AC), a reconfigurable L2 on-chip cache aimed at improving performance as well as reducing energy consumption. AC is composed of heterogeneous sub-caches, as opposed to common caches using homogeneous sub-caches. The sub-caches are turned off depending on the application workload to conserve power and minimize latencies. A novel reconfiguration algorithm based on Basic Block Vectors is proposed to recognize program phases, and a learning mechanism is used to select the appropriate cache configuration for each program phase. We compare our reconfigurable cache with existing proposals of adaptive and non-adaptive caches. Our results show that the combination of AC and the novel reconfiguration algorithm provides the best power consumption and performance. For example, on average, it reduces the cache access latency by 55.8%, the cache dynamic energy by 46.5%, and the cache leakage power by 49.3% with respect to a non-adaptive cache.
Citations (6)
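The phase detection this abstract relies on is Basic Block Vector (BBV) classification, a general technique; the sketch below shows only that generic step, not the paper's reconfiguration algorithm or learning mechanism. The distance threshold and per-interval counts are assumed values.

# A minimal sketch of BBV phase classification: intervals whose normalized basic-block
# vectors are close (small Manhattan distance) are treated as the same program phase.

def normalize(bbv):
    total = sum(bbv) or 1
    return [x / total for x in bbv]

def classify_phase(bbv, phase_table, threshold=0.1):
    """Return the id of a known phase close to `bbv`, or register a new phase."""
    v = normalize(bbv)
    for phase_id, rep in enumerate(phase_table):
        if sum(abs(a - b) for a, b in zip(v, rep)) < threshold:
            return phase_id
    phase_table.append(v)
    return len(phase_table) - 1

if __name__ == "__main__":
    phases = []
    # Hypothetical per-interval basic-block execution counts.
    print(classify_phase([90, 5, 5, 0], phases))     # 0: first phase seen
    print(classify_phase([88, 6, 6, 0], phases))     # 0: close to the first interval
    print(classify_phase([10, 10, 40, 40], phases))  # 1: a new phase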
Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are less likely to be re-referenced before they are replaced. In this paper, we describe a technique that dynamically identifies these less-recently-used lines and effectively utilizes the cache frames they occupy to more accurately approximate the global least-recently-used replacement policy while maintaining the fast access time of a direct-mapped cache. We also explore the idea of using these underutilized cache frames to reduce cache misses through data prefetching. In the proposed design, the possible locations that a line can reside in are not predetermined. Instead, the cache is dynamically partitioned into groups of cache lines. Because both the total number of groups and the individual group associativity adapt to the dynamic reference pattern, we call this design the adaptive group-associative cache. Performance evaluation using trace-driven simulations of the TPC-C benchmark and selected programs from the SPEC95 benchmark suite shows that the group-associative cache is able to achieve a hit ratio that is consistently better than that of a 4-way set-associative cache. For some of the workloads, the hit ratio approaches that of a fully-associative cache.
Citations (2)
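The premise of this abstract is that accesses are skewed across cache sets. The toy snippet below only demonstrates such skew with a simple modulo index function; the address stream, block size, and set count are made up for illustration.

# A toy demonstration of set-access skew: a non-uniform address stream concentrates
# accesses on a few sets while other sets stay idle.
from collections import Counter

NUM_SETS, BLOCK = 8, 64

def set_index(addr):
    return (addr // BLOCK) % NUM_SETS

if __name__ == "__main__":
    # A strided stream plus one hot address (illustrative only).
    trace = [0x1000 + 512 * i for i in range(100)] + [0x2040] * 50
    hist = Counter(set_index(a) for a in trace)
    for s in range(NUM_SETS):
        print(f"set {s}: {hist.get(s, 0)} accesses")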
Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms that improve cache performance and create opportunities to improve data availability, which can lead to more predictable cache performance.
The first cache mechanism presented is an intelligent cache replacement policy that utilizes information about dead data and data that is very frequently used. This mechanism is analyzed theoretically to show that the number of misses using intelligent cache replacement is guaranteed to be no more than the number of misses using traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated.
The second cache mechanism presented is cache partitioning that exploits disjoint access sequences, i.e., sequences that do not overlap in the memory space. A theoretical result is proven showing that modifying an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated.
A profile-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. This framework takes a compiled benchmark program and a set of program inputs and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed cache mechanisms have been evaluated using this framework by measuring cache miss rates and Instructions Per Clock (IPC). The results show that the proposed mechanisms are promising for improving cache performance and predictability with a modest increase in silicon area.
Citations (2)
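The first mechanism in this thesis abstract uses information about dead data to guide replacement. The sketch below illustrates the general idea of dead-block hints layered on LRU for a single cache set; it is not the thesis's exact mechanism, and the class name and the dead_after_use hint are invented for illustration.

# LRU replacement augmented with dead-block hints for one cache set:
# lines flagged as dead are evicted before the LRU victim is considered.
from collections import OrderedDict

class SetWithHints:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()            # tag -> dead flag, ordered LRU -> MRU

    def access(self, tag, dead_after_use=False):
        hit = tag in self.lines
        if hit:
            del self.lines[tag]               # will be reinserted at the MRU position
        elif len(self.lines) >= self.ways:
            victim = next((t for t, dead in self.lines.items() if dead), None)
            self.lines.pop(victim if victim is not None else next(iter(self.lines)))
        self.lines[tag] = dead_after_use
        return hit

if __name__ == "__main__":
    s = SetWithHints(ways=2)
    s.access("A")
    s.access("B", dead_after_use=True)        # hint: B will not be reused
    s.access("C")                             # miss: evicts dead line B, not LRU line A
    print(list(s.lines))                      # ['A', 'C']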
As the number of processors sharing a cache increases, conflict misses due to interference amongst competing processes have an increasing impact on the individual performance of processes. Cache partitioning is a method of allocating a cache between concurrently executing processes in order to counteract the effects of inter-process conflicts. However, cache partitioning methods commonly divide a shared cache into private partitions dedicated to a single processor, which can lead to underutilized portions of the cache when set accesses are non-uniform. Our proposed method complements these cache partitioning algorithms by creating an additional partition shared amongst all processors. Underutilized areas of the cache are identified by a monitoring circuit and used for the shared partition. Detection of underutilization is based on the number of unique set accesses for a given allocated way. For a 16-way set-associative cache, the implementation of our method requires 64 bytes of storage overhead per core in addition to that needed for the method that determines the sizes of the private partitions. For the tested system, our method improves performance over the traditional LRU policy for a number of selected benchmark sets by an average of 1.4% and up to 13.3% for a two-core system, and by an average of 1.4% and up to 7.8% for a four-core system, and it improves the performance of a conventional cache partitioning method (Utility-Based Cache Partitioning) by an average of 0.1% and up to 0.5% for both two- and four-core systems.
Citations (4)
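A software sketch of the kind of utilization monitoring this abstract describes: count, per owned way, how many distinct sets were touched during an interval, and treat ways that touched only a few sets as candidates for the shared partition. The threshold and data structures here are assumptions, not the paper's 64-byte-per-core hardware monitor.

# Ways that touch only a few distinct sets during an interval are flagged as
# underutilized and become candidates for the shared partition.

def underutilized_ways(accesses, num_sets, threshold=0.25):
    """accesses: iterable of (way, set_index) pairs observed during one interval."""
    touched = {}
    for way, set_idx in accesses:
        touched.setdefault(way, set()).add(set_idx)
    return [w for w, sets in touched.items() if len(sets) < threshold * num_sets]

if __name__ == "__main__":
    # Way 0 streams over many sets; way 1 keeps hitting the same two sets.
    trace = [(0, s) for s in range(60)] + [(1, 5)] * 40 + [(1, 9)] * 10
    print(underutilized_ways(trace, num_sets=64))   # [1]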
For many microprocessors, cache hit time determines the clock cycle. On the other hand, the cache miss penalty (measured in instruction issue delays) keeps growing. Reconciling a low cache miss ratio with a low cache hit time is therefore an important issue. When caches are virtually indexed, the operating system (or some specific hardware) has to manage data consistency between caches and memory; unfortunately, reconciling physical indexing of the cache with a low cache hit time is very difficult. In this paper, we propose the Direct-mapped Access Set-associative Check (DASC) cache for addressing both difficulties. In a DASC cache, the data array is direct-mapped, so the cache hit time is low, while the tag array is set-associative, and the external miss ratio of a DASC cache is the same as the miss ratio of a set-associative cache. When the size of one associativity degree of the tag array is tied to the minimum page size, a virtually indexed but physically tagged DASC cache correctly handles all difficulties associated with cache consistency. Trace-driven simulations show that, for cache sizes in the range of 16 to 64 Kbytes and page sizes in the range of 4 to 8 Kbytes, a DASC cache is a valuable trade-off allowing fast cache hit time and low cache miss ratio while cache consistency management is performed by hardware.
Citations (16)
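The abstract ties the size of one associativity degree (one way) of the tag array to the minimum page size. Under the assumption that this means cache size / associativity = page size (the usual condition for a virtually indexed, physically tagged cache to avoid synonym problems), the small calculation below lists the tag-array associativity implied by the cache and page sizes mentioned in the abstract.

# A worked example of the sizing constraint, under the assumption stated above.

def required_tag_associativity(cache_size_kb, page_size_kb):
    assert cache_size_kb % page_size_kb == 0
    return cache_size_kb // page_size_kb

if __name__ == "__main__":
    for cache_kb in (16, 32, 64):
        for page_kb in (4, 8):
            ways = required_tag_associativity(cache_kb, page_kb)
            print(f"{cache_kb} KB cache, {page_kb} KB page -> {ways}-way tag array")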
In this paper, we propose three novel cache models using the multiple-valued logic (MVL) paradigm to reduce the cache data storage area and cache energy consumption for embedded systems. Multiple-valued caches have significant potential for compact and power-efficient cache array design. The cache models differ from each other depending on whether they store tag and data in binary, radix-r, or a mix of both. Our analytical study of cache silicon area shows that an embedded system-on-a-chip (SoC) equipped with a multiple-valued cache model can reduce the cache data storage area by up to 6% regardless of cache parameters. Also, our experiments on several embedded benchmarks demonstrate that dynamic cache energy consumption can be reduced by up to 62% in a multiple-valued instruction cache in an embedded SoC.
Citations (4)
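The area argument rests on radix-r digits packing more information per cell than bits. The snippet below only works out the cell-count arithmetic, ceil(n / log2 r); it does not model per-cell area or energy, so it does not reproduce the paper's 6% and 62% figures.

# Cell-count arithmetic behind radix-r storage: an n-bit field needs
# ceil(n / log2(r)) radix-r cells instead of n binary cells.
import math

def cells_needed(bits, radix):
    return math.ceil(bits / math.log2(radix))

if __name__ == "__main__":
    for radix in (2, 4, 8):
        print(f"radix-{radix}: a 20-bit tag needs {cells_needed(20, radix)} cells")
    # radix-2: 20 cells, radix-4: 10 cells, radix-8: 7 cells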
A method to predict block conflicts in direct-mapped caches is proposed, based on block replacement behavior. Accordingly, a cache scheme named the Conflict Prediction (CP) Cache is presented. Its architecture contains a direct-mapped cache, a small fully associative cache, and a conflict prediction table (CPT) that dynamically decides the allocation of blocks fetched from the next level of the memory hierarchy. SPEC95 simulation results show that the performance of the CP cache is always better than that of a traditional direct-mapped cache with twice the size of the CP cache. With the same block size of 32 bytes, the average miss-ratio improvement of an (8+1) kB CP cache with an 8-entry CPT is about 12.2% over a traditional 16 kB direct-mapped cache. Compared with other similar architectures such as the NTS cache and the PCS cache, the CP cache not only requires less hardware and simpler control but also improves both hit ratio and bus traffic.
Citations (0)
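The CP cache pairs a direct-mapped cache with a small fully associative cache and a conflict prediction table. The sketch below models only the structural starting point this resembles, a classic victim-cache arrangement in which blocks evicted from the direct-mapped array are parked in a small fully associative buffer; the paper's CPT-driven allocation policy is not modeled, and the class name and geometry are invented.

# A direct-mapped cache backed by a small fully associative victim buffer.
from collections import OrderedDict

class DirectMappedWithVictim:
    def __init__(self, num_sets=8, victim_entries=4, block=32):
        self.num_sets, self.block = num_sets, block
        self.main = {}                        # set index -> block number
        self.victim = OrderedDict()           # block number -> None, oldest first
        self.victim_entries = victim_entries

    def access(self, addr):
        blk = addr // self.block
        idx = blk % self.num_sets
        if self.main.get(idx) == blk:
            return "hit"
        if blk in self.victim:
            del self.victim[blk]
            outcome = "victim hit"
        else:
            outcome = "miss"
        # Install the block in the direct-mapped array, parking any evicted block.
        evicted, self.main[idx] = self.main.get(idx), blk
        if evicted is not None:
            self.victim[evicted] = None
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)
        return outcome

if __name__ == "__main__":
    c = DirectMappedWithVictim(num_sets=2)
    print(c.access(0x000))   # miss
    print(c.access(0x040))   # miss: conflicts with 0x000, which moves to the victim buffer
    print(c.access(0x000))   # victim hit: found in the fully associative buffer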