An emerging class of applications attempts to use both the CPU and the GPU in a heterogeneous system. These applications achieve their peak performance when the CPU and GPU are used collaboratively. However, along with this gain in performance, power and energy management becomes a greater challenge. In this paper we address the problem of executing applications that utilize both the CPU and GPU in an energy-efficient way. To this end, we propose a power management framework named Airavat that tunes the CPU, GPU, and memory frequencies synergistically in order to improve the energy efficiency of collaborative CPU-GPU applications. Airavat combines machine learning-based prediction models with feedback-based Dynamic Voltage and Frequency Scaling to improve the energy efficiency of such applications. We demonstrate our framework on the NVIDIA Jetson TX1 and observe a 24% improvement in Energy Delay Product (EDP) with negligible performance loss.
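As a rough illustration of the feedback half of such a framework, the sketch below nudges a single frequency knob toward a lower Energy Delay Product; the sysfs path, frequency table, and EDP measurement hook are hypothetical placeholders, not Airavat's actual prediction models or interfaces.

    // Illustrative sketch only: one feedback step that nudges a frequency knob
    // toward a lower Energy Delay Product. The sysfs path, frequency table,
    // and EDP measurement are hypothetical placeholders, not Airavat itself.
    #include <cstdio>
    #include <vector>

    struct FreqKnob {
        const char *sysfs_path;           // placeholder frequency-control node
        std::vector<long> levels_khz;     // available levels, lowest to highest
        std::size_t idx;                  // index of the current level
    };

    static void apply(const FreqKnob &k) {
        if (std::FILE *f = std::fopen(k.sysfs_path, "w")) {
            std::fprintf(f, "%ld", k.levels_khz[k.idx]);   // set target frequency
            std::fclose(f);
        }
    }

    // If the measured EDP got worse after the last change, back off to a higher
    // frequency; otherwise keep probing lower, more efficient levels.
    static void feedback_step(FreqKnob &k, double edp_now, double &edp_prev) {
        if (edp_now > edp_prev && k.idx + 1 < k.levels_khz.size())
            ++k.idx;                      // revert toward a higher frequency
        else if (k.idx > 0)
            --k.idx;                      // try the next lower frequency
        edp_prev = edp_now;
        apply(k);
    }

In a framework like Airavat, one such loop could run per frequency domain (CPU, GPU, memory), seeded by the machine learning-based predictions described above.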
General Purpose GPU (GPGPU) computation relies heavily on a high degree of inherent data parallelism to achieve significant speedups. However, application programs may not be able to fully utilize these parallel computing resources due to intrinsic data dependencies or complex data pointer operations. In this paper, we use aggressive software-based value prediction techniques on GPUs to accelerate programs that lack inherent data parallelism. This class of applications is typically difficult to map to parallel architectures due to the data dependencies and complex data pointers present in the application. Our experimental results show that, despite the overhead incurred by software speculation and the communication overhead between the CPU and GPU, we obtain up to 6.5x speedup on a selected set of kernels taken from the PARSEC and Sequoia benchmark suites.
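A minimal CUDA sketch of the underlying idea follows, assuming a simple loop-carried reduction and a hypothetical chunking and prediction scheme; it is not the paper's actual implementation.

    // Illustrative sketch of software value prediction for a loop-carried
    // dependence acc = acc + f(x[i]): each thread speculatively computes one
    // chunk starting from a predicted seed value; the host later verifies each
    // prediction and sequentially re-executes any mispredicted chunk. The
    // prediction heuristic and chunk size are hypothetical.
    #include <cuda_runtime.h>

    __device__ float f(float x) { return 0.5f * x; }     // placeholder loop body

    __global__ void speculate(const float *x, const float *pred_seed,
                              float *chunk_out, int chunk, int n) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;    // one chunk per thread
        int begin = c * chunk;
        if (begin >= n) return;
        int end = min(begin + chunk, n);
        float acc = pred_seed[c];                         // predicted incoming value
        for (int i = begin; i < end; ++i)
            acc += f(x[i]);
        chunk_out[c] = acc;                               // speculative result
    }

    // Host-side verification (sketch): chunk c's result is valid only if it
    // matches the seed that was predicted for chunk c+1; invalid chunks are
    // recomputed in order using the corrected seeds.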
To prevent information leakage during program execution, modern software cryptographic implementations target constant-time functions, where the number of instructions executed remains the same when program inputs change. However, the underlying microarchitecture behaves differently when processing different data inputs, impacting the execution time of the same instructions. These differences in execution time can covertly leak confidential information through a timing channel. Given the recent reports of covert channels present on commercial microprocessors, a number of microarchitectural features of CPUs have been re-examined from a timing-leakage perspective. Unfortunately, a similarly thorough microarchitectural evaluation of the potential attack surfaces on GPUs has not been performed. Several prior works have considered a timing channel based on the behavior of a GPU's coalescing unit. In this article, we identify a second, finer-grained microarchitectural timing channel, related to the banking structure of the GPU's Shared Memory. By exploiting the timing channel caused by Shared Memory bank conflicts, we develop a differential timing attack that can compromise table-based cryptographic algorithms. We implement our timing attack on an Nvidia Kepler K40 GPU and successfully recover the complete 128-bit encryption key of an Advanced Encryption Standard (AES) GPU implementation using 900,000 timing samples. We also evaluate the scalability of our attack method by attacking an AES implementation that fully occupies the compute resources of the GPU. We extend our timing analysis to other Nvidia architectures: Maxwell, Pascal, Volta, and Turing GPUs. We also discuss countermeasures and experiment with a novel multi-key implementation, evaluating its resistance to our side-channel timing attack and its associated performance overhead.
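The microbenchmark sketch below is not the key-recovery attack; it only illustrates the underlying effect, namely that Shared Memory bank conflicts produce measurable timing differences. The stride parameter and the in-kernel timing method are illustrative choices.

    // Microbenchmark sketch (not the attack itself): shows how shared memory
    // bank conflicts turn into measurable cycle-count differences.
    // Launch with a single warp, e.g. bank_probe<<<1, 32>>>(stride, d_cycles).
    #include <cuda_runtime.h>

    __global__ void bank_probe(int stride, long long *cycles) {
        __shared__ volatile int smem[32 * 33];
        int tid = threadIdx.x;
        for (int i = tid; i < 32 * 33; i += blockDim.x)
            smem[i] = i;                                  // initialize shared memory
        __syncthreads();
        int v = 0;
        long long t0 = clock64();
        for (int r = 0; r < 1024; ++r)
            v += smem[(tid * stride + r) % (32 * 33)];    // bank = word index mod 32
        long long t1 = clock64();
        if (tid == 0 && v != -1)                          // v check keeps loads live
            cycles[0] = t1 - t0;
    }

Comparing stride = 1 (conflict-free) against stride = 32 (all lanes of the warp hitting a single bank on a 32-bank Kepler-class GPU) exposes the cycle-count gap that a differential timing attack can exploit.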
The performance of GPUs is rapidly improving as the top GPU vendors keep pushing the boundaries of process technology. While larger die sizes help improve performance given the nature of parallel workloads, additional architectural improvements can also help by utilizing the available die real estate more efficiently. Introducing a compressed Last Level Cache (LLC) can make better use of die area and can improve memory system performance. With the widespread adoption of high-resolution displays, most modern game developers try to generate high-quality graphics output leveraging state-of-the-art GPUs, all of which greatly increases the amount of data that needs to be processed. These modern graphics workloads need to rely on compression to help save memory bandwidth and improve the performance of the LLC. A compressed LLC can help by increasing the hit rate through logical cache expansion, as well as by providing bandwidth savings through compressed data on the memory bus. In this paper we propose a novel scheme that extends dynamic dictionary-based compression to store compressed data in memory. Current dictionary-based compression schemes need to decompress the data when a cache block is evicted, because the dynamic dictionary entries are not guaranteed to stay the same and data consistency cannot otherwise be maintained. As a result, their bandwidth savings are limited to the logical cache expansion. We propose a dual-dictionary compression scheme (DDC) that maintains data consistency while improving bandwidth savings. Our scheme saves bandwidth by coupling logical cache expansion with compressed data on the memory bus. We achieve bandwidth savings of 18.55% for reads and 11.01% for writes, on average, across a diverse range of graphics workloads.
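The sketch below illustrates plain dictionary-based compression of a 64-byte cache block, the mechanism DDC builds on; the dictionary size, encoding, and miss handling here are hypothetical simplifications, not the dual-dictionary scheme itself.

    // Illustrative sketch of dictionary-based cache-line compression: each
    // 4-byte word of a 64-byte block is either replaced by a short dictionary
    // index (hit) or stored as a literal (miss). Dictionary size and encoding
    // are hypothetical; this is not the DDC scheme itself.
    #include <cstdint>
    #include <vector>

    struct Encoded { bool hit; uint8_t idx; uint32_t literal; };

    std::vector<Encoded> compress_block(const uint32_t block[16],
                                        std::vector<uint32_t> &dict) {
        std::vector<Encoded> out;
        for (int w = 0; w < 16; ++w) {
            bool found = false;
            for (uint8_t i = 0; i < dict.size(); ++i) {
                if (dict[i] == block[w]) {            // hit: emit dictionary index
                    out.push_back({true, i, 0});
                    found = true;
                    break;
                }
            }
            if (!found) {                             // miss: emit literal, grow dictionary
                out.push_back({false, 0, block[w]});
                if (dict.size() < 64) dict.push_back(block[w]);
            }
        }
        return out;   // compressed size ~ hits * index bits + misses * 32 bits
    }

Per the abstract, DDC's contribution is keeping such dictionary-compressed blocks consistent past eviction, so that the same encoding also yields bandwidth savings on the memory bus rather than only logical cache expansion.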
As transistor scaling becomes increasingly difficult, scaling the core count on a single GPU chip has also become extremely challenging. As the volume of data to process in today's increasingly parallel workloads continues to grow unbounded, we need scalable solutions that can keep up with this growing demand. To meet the needs of modern-day parallel applications, multi-GPU systems offer a promising path to deliver high performance and large memory capacity. However, multi-GPU systems suffer from performance issues associated with GPU-to-GPU communication and data sharing, which severely limit their benefits. Programming multi-GPU systems has been made considerably simpler with the advent of Unified Memory, which enables runtime migration of pages to the GPU on demand. Current multi-GPU systems rely on a first-touch Demand Paging scheme, where memory pages are migrated from the CPU to the GPU on the first GPU access to a page. The data-sharing nature of GPU applications makes deploying an efficient programmer-transparent mechanism for inter-GPU page migration challenging. Therefore, following the initial CPU-to-GPU page migration, the page is pinned on that GPU. Future accesses to this page from other GPUs happen at a cache-line granularity: pages are not transferred between GPUs without significant programmer intervention. We observe that this mechanism suffers from two major drawbacks: 1) imbalance in the page distribution across multiple GPUs, and 2) inability to move a page to the GPU that uses it most frequently. Both of these problems lead to load imbalance across GPUs, degrading the performance of the multi-GPU system. To address these problems, we propose Griffin, a holistic hardware-software solution that improves the performance of NUMA multi-GPU systems. Griffin introduces programmer-transparent modifications to both the IOMMU and the GPU architecture, supporting efficient runtime page migration based on locality information. In particular, Griffin employs a novel mechanism to detect and move pages between GPUs at runtime, increasing the frequency with which accesses are resolved locally, which in turn improves performance. To ensure better load balancing across GPUs, Griffin employs a Delayed First-Touch Migration policy that ensures pages are evenly distributed across multiple GPUs. Our results on a diverse set of multi-GPU workloads show that Griffin can achieve up to a 2.9× speedup on a multi-GPU system, while incurring low implementation overhead.
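A simplified software model of a delayed first-touch, locality-driven placement policy is sketched below; the thresholds, counters, and data structures are hypothetical and stand in for Griffin's actual IOMMU and GPU hardware support.

    // Simplified model of delayed first-touch placement plus locality-driven
    // migration: a page is not placed until it has been touched a few times,
    // and later moves to the GPU that accesses it most. All thresholds and
    // structures are hypothetical, not Griffin's hardware mechanisms.
    #include <array>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kNumGpus        = 4;
    constexpr int kPlaceThreshold = 4;    // touches before first placement
    constexpr int kMigrateMargin  = 16;   // extra remote touches before migrating

    struct PageState {
        int owner = -1;                              // -1: still resident on the CPU
        std::array<int, kNumGpus> touches{};         // per-GPU access counters
    };

    // Called on every GPU access to a page (sketch); returns the GPU the page
    // should live on after this access, or -1 while it remains on the CPU.
    int on_access(std::unordered_map<uint64_t, PageState> &pages,
                  uint64_t page_addr, int gpu) {
        PageState &p = pages[page_addr];
        ++p.touches[gpu];
        int best = gpu;
        for (int g = 0; g < kNumGpus; ++g)
            if (p.touches[g] > p.touches[best]) best = g;
        if (p.owner < 0 && p.touches[best] >= kPlaceThreshold)
            p.owner = best;                          // delayed first placement
        else if (p.owner >= 0 && best != p.owner &&
                 p.touches[best] >= p.touches[p.owner] + kMigrateMargin)
            p.owner = best;                          // locality-driven migration
        return p.owner;
    }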
Soft errors due to cosmic particles are a growing reliability threat for VLSI systems. In this paper we analyze the soft error vulnerability of FPGAs used in storage systems. Since the reliability requirements of these high-performance storage subsystems are very stringent, the reliability of the FPGA chips used in the design of such systems plays a critical role in the overall system reliability. We validate the projections produced by our analytical model against field error rates obtained from actual field failure data of a large FPGA-based design used in the logical unit module board of a commercial storage system. This comparison confirms that the projections obtained from our analytical tool are accurate: there is an 81% overlap between the FIT rate range obtained with our analytical modeling framework and that observed in the field failure data.
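For context, the sketch below shows a back-of-the-envelope device-level FIT estimate of the kind such an analytical model refines; the per-bit rate, memory size, and derating factor are placeholder values, not numbers from the paper or its field data.

    // Back-of-the-envelope FIT estimate (placeholder numbers only; not the
    // paper's analytical model or field data). FIT = failures per 1e9 device-hours.
    #include <cstdio>

    int main() {
        double fit_per_mbit = 100.0;   // assumed raw soft error rate, FIT per Mbit
        double config_mbits = 50.0;    // assumed FPGA configuration/block memory size
        double derating     = 0.10;    // assumed architectural/logical derating factor
        double device_fit   = fit_per_mbit * config_mbits * derating;
        std::printf("Estimated device soft-error rate: %.1f FIT\n", device_fit);
        return 0;
    }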