Computing is becoming increasingly heterogeneous, with accelerators like GPUs tightly integrated with CPUs on the same die. Extending the CPU's virtual addressing mechanism to these accelerators is a key step in making accelerators easily programmable. In this work, we analyze, using real-system measurements, shared virtual memory across the CPU and an integrated GPU. We make several key observations and highlight the consequent research opportunities: (1) servicing a TLB miss from the GPU can be an order of magnitude slower than servicing one from the CPU, so it is imperative to support many concurrent TLB misses to hide this larger latency; (2) divergence in memory accesses impacts the GPU's address translation more than the rest of the memory hierarchy, so research into address translation mechanisms tolerant to this effect is needed; and (3) page faults from the GPU are considerably slower than those from the CPU, and software-hardware co-design is essential for efficient implementation of page faults from throughput-oriented accelerators like GPUs. We present a detailed measurement study of a commercially available integrated APU that illustrates these effects and motivates future research opportunities.
Steadily increasing main memory capacities require corresponding increases in the processor's translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a smaller TLB, but most OSes and applications do not fully utilize the large-page support in current hardware. Recent work has shown that, while not guaranteed, some virtual-to-physical page mappings exhibit "contiguous" spatial locality, in which consecutive virtual pages map to consecutive physical pages. Such locality provides opportunities to coalesce "adjacent" TLB entries for increased reach. We observe that, beyond simple adjacent-entry coalescing, many more translations exhibit "clustered" spatial locality, in which a group or cluster of nearby virtual pages map to a similarly clustered set of physical pages. In this work, we provide a detailed characterization of the spatial locality among virtual-to-physical translations. Based on this characterization, we present a multi-granular TLB organization that significantly increases effective reach and substantially reduces miss rates while requiring no additional OS support. Our evaluation shows that the multi-granular design outperforms both conventional TLBs and the recently proposed coalesced TLB technique.
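To make the clustering idea concrete, the following minimal C sketch models a single clustered TLB entry that covers a small group of pages sharing one virtual-to-physical offset. The cluster size, field widths, and indexing here are illustrative assumptions, not the paper's hardware organization.

```c
/* Sketch of a clustered TLB entry: one entry covers a cluster of
 * CLUSTER pages whose virtual-to-physical offset is identical,
 * tracked with a validity bitmap. Parameters are assumptions. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define CLUSTER 8   /* pages per cluster (assumed) */

typedef struct {
    uint64_t vpn_base;   /* first VPN of the cluster */
    uint64_t pfn_base;   /* PFN that vpn_base maps to */
    uint8_t  valid;      /* bit i set => vpn_base+i maps to pfn_base+i */
} cluster_tlb_entry;

/* Returns true and sets *pfn if this entry translates vpn. */
static bool cluster_lookup(const cluster_tlb_entry *e, uint64_t vpn, uint64_t *pfn)
{
    if (vpn < e->vpn_base || vpn - e->vpn_base >= CLUSTER)
        return false;                         /* outside this cluster */
    uint64_t off = vpn - e->vpn_base;
    if (!(e->valid & (1u << off)))
        return false;                         /* page not clustered with the base */
    *pfn = e->pfn_base + off;                 /* clustered mapping: same offset */
    return true;
}

int main(void)
{
    cluster_tlb_entry e = { 0x1000, 0x8000, 0xFF };  /* fully clustered example */
    uint64_t pfn;
    if (cluster_lookup(&e, 0x1003, &pfn))
        printf("VPN 0x1003 -> PFN 0x%llx\n", (unsigned long long)pfn);
    return 0;
}
```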
Virtual memory (VM) is critical to the usability and programmability of hardware accelerators. Unfortunately, implementing accelerator VM efficiently is challenging because area and power constraints make it difficult to employ the large multi-level TLBs used in general-purpose CPUs. Recent research proposals advocate a number of restrictions on virtual-to-physical address mappings in order to reduce the TLB size or increase its reach. However, such restrictions are unattractive because they forgo many of the original benefits of traditional VM, such as demand paging and copy-on-write. We propose SPARTA, a divide-and-conquer approach to address translation. SPARTA splits address translation into accelerator-side and memory-side parts. The accelerator-side translation hardware consists of a tiny TLB covering only the accelerator's cache hierarchy (if any), while translation for main memory accesses is performed by shared memory-side TLBs. Performing translation for memory accesses on the memory side allows SPARTA to overlap data fetch with translation and avoids replicating TLB entries for data shared among accelerators. To further improve the performance and efficiency of memory-side translation, SPARTA logically partitions the memory space, delegating translation to small and efficient per-partition translation hardware. Our evaluation on index-traversal accelerators shows that SPARTA virtually eliminates translation overhead, reducing it by over 30x on average (up to 47x) and improving performance by 57%. At the same time, SPARTA requires minimal accelerator-side translation hardware, reduces the total number of TLB entries in the system, gracefully scales with memory size, and preserves all key VM functionality.
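The C sketch below illustrates the memory-side half of this split: the virtual page number selects an owning partition, and each partition holds a small translation structure consulted near the data it serves. The partition count, interleaving function, and TLB geometry are assumptions made for illustration, not SPARTA's actual parameters.

```c
/* Sketch of memory-side, partitioned translation: a VPN is routed to one
 * of several memory-side partitions, each with its own small TLB, so
 * translation proceeds near the target data rather than in a large
 * accelerator-side TLB. Sizes and the interleaving are assumptions. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define N_PART       8     /* memory-side partitions (assumed) */
#define PART_TLB_SZ 64     /* entries per partition TLB (assumed) */

typedef struct { uint64_t vpn, pfn; bool valid; } tlb_entry;
static tlb_entry part_tlb[N_PART][PART_TLB_SZ];

/* Route a virtual page to its owning memory-side partition. */
static unsigned partition_of(uint64_t vpn) { return (unsigned)(vpn % N_PART); }

/* Direct-mapped lookup in the owning partition's small TLB. */
static bool memside_lookup(uint64_t vpn, uint64_t *pfn)
{
    tlb_entry *e = &part_tlb[partition_of(vpn)][(vpn / N_PART) % PART_TLB_SZ];
    if (e->valid && e->vpn == vpn) { *pfn = e->pfn; return true; }
    return false;   /* miss: the partition would walk the page table here */
}

static void memside_fill(uint64_t vpn, uint64_t pfn)
{
    tlb_entry *e = &part_tlb[partition_of(vpn)][(vpn / N_PART) % PART_TLB_SZ];
    e->vpn = vpn; e->pfn = pfn; e->valid = true;
}

int main(void)
{
    uint64_t pfn;
    memside_fill(0x12345, 0xABCDE);
    if (memside_lookup(0x12345, &pfn))
        printf("VPN 0x12345 -> PFN 0x%llx in partition %u\n",
               (unsigned long long)pfn, partition_of(0x12345));
    return 0;
}
```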
Successfully preserving virtual memory will require rearchitecting the hardware-software interface so that these layers operate in tandem rather than at odds with one another. Encouragingly, there is evidence that both chip vendors and OS designers are willing to innovate at this layer, as seen in a recent implementation of CPU TLB coalescing techniques and in rapid changes to GPU address-translation hardware. But several important open problems persist, and new ones are presenting themselves rapidly. As just one example, recent work by Javier Picorel and colleagues looks at the challenges posed by address translation on near-memory accelerators. The bottom line is that these trends present both an opportunity and a challenge for researchers in computer systems. The evolving landscape of hardware and software means that the virtual memory abstraction is in flux, but also that simple mechanisms to mitigate the address translation wall are likely to be useful in real-world systems and products.
Modern computer systems include numerous compute elements, from CPUs to GPUs to accelerators. Harnessing their full potential requires well-defined, properly implemented memory consistency models (MCMs) and low-level system functionality such as virtual memory and address translation (AT). Unfortunately, it is difficult to specify and implement hardware-OS interactions correctly; in the past, mismatches between hardware and OS specifications have resulted in implementation bugs in commercial processors. In an effort to resolve this verification gap, this paper makes the following contributions. First, we present COATCheck, an address translation-aware framework for specifying and statically verifying memory ordering enforcement at the microarchitecture and operating system levels. We develop a domain-specific language for specifying ordering enforcement, for including ordering-related OS events and hardware micro-operations, and for programmatically enumerating happens-before graphs. Using a fast and automated static constraint solver, COATCheck can efficiently analyze interesting and important memory ordering scenarios for modern, high-performance, out-of-order processors. Second, we show that previous work on Virtual Address Memory Consistency (VAMC) does not capture every translation-related ordering scenario of interest, and that some such scenarios even fall outside the traditional scope of consistency. We therefore introduce the term transistency model to describe the superset of consistency that captures all translation-aware sets of ordering rules.
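As a rough illustration of what enumerating happens-before graphs means operationally, the C sketch below checks a set of ordering edges among events for a cycle, which is the condition under which a candidate execution is ruled out. This is a generic acyclicity check, not COATCheck's domain-specific language or constraint solver.

```c
/* Generic happens-before acyclicity check: ordering edges among
 * micro-operations/OS events form a graph, and a candidate execution
 * is forbidden iff that graph contains a cycle. Illustrative only. */
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define MAX_NODES 16

static bool edge[MAX_NODES][MAX_NODES];   /* edge[a][b]: a happens before b */
static int  state[MAX_NODES];             /* 0 = unvisited, 1 = on stack, 2 = done */

static bool dfs_has_cycle(int n, int u)
{
    state[u] = 1;
    for (int v = 0; v < n; v++) {
        if (!edge[u][v]) continue;
        if (state[v] == 1) return true;               /* back edge => cycle */
        if (state[v] == 0 && dfs_has_cycle(n, v)) return true;
    }
    state[u] = 2;
    return false;
}

/* Returns true if the happens-before graph over n events is cyclic,
 * i.e. the candidate execution is not observable. */
static bool forbidden(int n)
{
    memset(state, 0, sizeof state);
    for (int u = 0; u < n; u++)
        if (state[u] == 0 && dfs_has_cycle(n, u))
            return true;
    return false;
}

int main(void)
{
    /* Tiny example: three events with a cyclic ordering 0 -> 1 -> 2 -> 0. */
    edge[0][1] = edge[1][2] = edge[2][0] = true;
    printf("cycle (execution forbidden): %s\n", forbidden(3) ? "yes" : "no");
    return 0;
}
```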
The last chapter presented recent research efforts targeting more efficient address translation, focusing on hardware optimizations. In this chapter, we study techniques that require hardware-software co-design. As in the previous chapter, the following discussion presents a non-exhaustive list of recent research. In fact, there is an interesting body of recent work that focuses on purely software optimizations to improve VM performance. While this work is certainly relevant to graduate students exploring this area, it requires a detailed discussion of core operating system design and implementation issues that is beyond the scope of this synthesis lecture. Nevertheless, we briefly point students to two general streams of recent work on purely software topics:
The VM subsystem is generally on the critical path of every instruction and data reference. Efficient support for VM is therefore important enough that most modern architectures are willing to dedicate hardware to make it as efficient as possible. In this chapter, we dive into some details of the design space of the hardware stack that makes up a modern VM subsystem. We cover both architectural details (such as the contents of the ISA-defined page table entry format) and microarchitectural details (such as the physical layouts of the TLBs).
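As a concrete example of an ISA-defined page table entry format, the short C sketch below decodes a few well-known fields of an x86-64 4 KiB PTE (present, writable, user-accessible, accessed, dirty, no-execute, and the physical frame number in bits 12 through 51). The example entry value itself is made up for illustration.

```c
/* Decoding an x86-64 4 KiB page table entry: bit positions follow the
 * architecture manuals; the sample value is illustrative, not real data. */
#include <stdint.h>
#include <stdio.h>

#define PTE_P    (1ULL << 0)    /* present */
#define PTE_RW   (1ULL << 1)    /* writable */
#define PTE_US   (1ULL << 2)    /* user-accessible */
#define PTE_A    (1ULL << 5)    /* accessed */
#define PTE_D    (1ULL << 6)    /* dirty */
#define PTE_NX   (1ULL << 63)   /* no-execute */
#define PTE_PFN(e) (((e) >> 12) & 0xFFFFFFFFFFULL)  /* frame number, bits 12..51 */

int main(void)
{
    uint64_t pte = 0x000000012345A067ULL;  /* example entry */
    printf("present=%d writable=%d user=%d accessed=%d dirty=%d nx=%d pfn=0x%llx\n",
           !!(pte & PTE_P), !!(pte & PTE_RW), !!(pte & PTE_US),
           !!(pte & PTE_A), !!(pte & PTE_D), !!(pte & PTE_NX),
           (unsigned long long)PTE_PFN(pte));
    return 0;
}
```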
Large pages have long been used to mitigate address translation overheads on big-memory systems, particularly in virtualized environments where TLB miss overheads are severe. We show, however, that far from being a panacea, large pages are used sparingly by modern virtualization software. This is because large pages often preclude lightweight memory management, a cost that can outweigh their translation lookaside buffer (TLB) benefits. For example, they reduce opportunities to deduplicate memory among virtual machines in overcommitted systems, interfere with lightweight memory monitoring, and hamper the agility of virtual machine (VM) migrations. While many of these problems are most severe in overcommitted systems with scarce memory resources, they can (and often do) arise in cloud deployments generally. In response, virtualization software often (though it does not have to) splinters guest operating system (OS) large pages into small system physical pages, sacrificing address translation performance for overall system-level benefits. We introduce simple hardware that bridges this fundamental conflict, using speculative techniques to group contiguous, aligned small-page translations such that they approach the address translation performance of large pages. Our Generalized Large-page Utilization Enhancements (GLUE) allow hypervisors to splinter large pages for agile memory management while retaining almost all of the TLB performance of unsplintered large pages.
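The C sketch below captures the intuition behind this kind of speculation: if the small pages within a 2 MiB-aligned region were splintered from a single large page, their translations remain contiguous and aligned, so one known mapping lets hardware interpolate a candidate translation for a neighboring page and verify it later with a page table walk. The region size and function names are illustrative assumptions, not GLUE's implementation.

```c
/* Speculative interpolation of a small-page translation under a
 * contiguity-and-alignment assumption within one 2 MiB region.
 * The speculated PFN must still be verified by a page table walk. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_2MB 512ULL   /* 4 KiB pages per 2 MiB region */

/* Given one known VPN->PFN mapping inside a 2 MiB region, speculate
 * the PFN of another VPN in the same region. Returns false if the
 * regions differ or the implied region base is not 2 MiB aligned. */
static bool speculate_pfn(uint64_t known_vpn, uint64_t known_pfn,
                          uint64_t vpn, uint64_t *spec_pfn)
{
    if (vpn / PAGES_PER_2MB != known_vpn / PAGES_PER_2MB)
        return false;                        /* different 2 MiB regions */
    uint64_t region_pfn = known_pfn - (known_vpn % PAGES_PER_2MB);
    if (region_pfn % PAGES_PER_2MB != 0)
        return false;                        /* base not 2 MiB aligned */
    *spec_pfn = region_pfn + (vpn % PAGES_PER_2MB);
    return true;                             /* speculative: verify via walk */
}

int main(void)
{
    /* Known: VPN 0x40003 -> PFN 0x80003 inside one 2 MiB region. */
    uint64_t spec;
    if (speculate_pfn(0x40003, 0x80003, 0x400A0, &spec))
        printf("speculated PFN for VPN 0x400A0: 0x%llx\n", (unsigned long long)spec);
    return 0;
}
```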