We propose a path for extending the technology roadmap when currently considered technology boosters (e.g., strain, high-kappa/metal gate) reach their limits and physical gate length can no longer be effectively scaled down. By judiciously engineering the device parasitic resistance and parasitic capacitance, and considering the impact of the interconnect wiring capacitance, we propose scenarios of selective device structure scaling that will enable technology scaling and contacted gate pitch scaling for several generations beyond the currently perceived limits.
Quantization has emerged as an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory bandwidth and computation cycles with little loss of accuracy. Integer GEMM (General Matrix Multiplication) is critical to running quantized DNN models efficiently, as GEMM operations often dominate the computations in these models. Various approaches have been developed by leveraging techniques such as vectorization and memory layout to improve the performance of integer GEMM. However, these existing approaches are not fast enough in certain scenarios. We developed NGEMM, a compiler-based GEMM implementation for accelerating lower-precision training and inference. NGEMM makes better use of the vector units by avoiding unnecessary vector computation that is introduced during tree reduction. We compared NGEMM's performance with state-of-the-art BLAS libraries such as MKL. Our experimental results showed that NGEMM outperformed the MKL non-pack and pack versions by an average of 1.86x and 1.16x, respectively. We have applied NGEMM to a number of production services in Microsoft.
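To make the role of integer GEMM concrete, the following is a minimal NumPy sketch of a quantized matrix product: float operands are mapped to int8, multiplied with int32 accumulation, and rescaled. The symmetric per-tensor quantization scheme and the function names (`quantize_symmetric`, `int8_gemm`, `quantized_matmul`) are assumptions made for illustration. The sketch does not reproduce NGEMM's compiler-generated kernels or its handling of tree reduction.

```python
import numpy as np

def quantize_symmetric(x):
    """Map a float array to int8 with a single per-tensor scale (assumed scheme)."""
    qmax = 127
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.rint(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int8_gemm(a_q, b_q):
    """Integer GEMM: int8 operands, products accumulated in int32."""
    # Widening before the product keeps the accumulated sums from overflowing int8.
    return a_q.astype(np.int32) @ b_q.astype(np.int32)

def quantized_matmul(a, b):
    """Approximate a @ b through an integer GEMM followed by one rescale."""
    a_q, a_scale = quantize_symmetric(a)
    b_q, b_scale = quantize_symmetric(b)
    acc = int8_gemm(a_q, b_q)
    return acc.astype(np.float64) * (a_scale * b_scale)
```

The integer product is exact; all approximation error comes from rounding the operands to 8 bits, which is why the accumulator is kept at a wider type than the inputs.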
This paper presents a theoretical study of the CO2-based transcritical Rankine cycle (CTRC) for engine waste heat recovery, involving comparison and selection of four CTRC configurations for two engine types, namely a gasoline engine and a diesel engine. The results of the configuration comparison show that the CTRC configuration with both a preheater and a regenerator may be more suitable for both engine types with a water-cooling system. If only the waste heat of the exhaust gas is recovered, the regenerated CTRC configuration may be more appropriate. The results of the engine type comparison show that engine load has a smaller effect on CTRC performance for the gasoline engine than for the diesel engine. In particular, this paper also considers the effect of CTRC weight when evaluating the final CTRC output, which is significant for a vehicle engine. A critical weight is found for each of the two engines at 100% engine load, 215 kg for the gasoline engine and 998 kg for the diesel engine, which is the upper limit for the CTRC weight design. When the weight effect is considered, the diesel engine may be the more suitable recovery target compared with the gasoline engine, owing to the more stable response of its output performance to the CTRC weight.
In this paper, we present a scalable, numerically stable, high-performance tridiagonal solver. The solver is based on the SPIKE algorithm for partitioning a large matrix into small independent matrices, which can be solved in parallel. For each small matrix, our solver applies a general 1-by-1 or 2-by-2 diagonal pivoting algorithm, which is also known to be numerically stable. Our paper makes two major contributions. First, our solver is the first numerically stable tridiagonal solver for GPUs. Our solver provides comparable quality of stable solutions to Intel MKL and Matlab, at speed comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE. It is also scalable to multiple GPUs and CPUs. Second, we present and analyze two key optimization strategies for our solver: a high-throughput data layout transformation for memory efficiency, and a dynamic tiling approach for reducing the memory access footprint caused by branch divergence.
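As an illustration of the partitioning idea, the sketch below solves a tridiagonal system SPIKE-style in NumPy: each diagonal block is solved independently, a small reduced system couples the block boundaries, and the block solutions are then corrected. The function name `spike_tridiagonal_solve`, the equal-size partitions, and the use of dense `np.linalg.solve` for each block are assumptions made for brevity. In particular, the dense solve is a stand-in for the diagonal pivoting algorithm described above, and nothing here reflects the GPU data layout or tiling optimizations.

```python
import numpy as np

def spike_tridiagonal_solve(lower, diag, upper, rhs, num_parts):
    """Solve a tridiagonal system by SPIKE-style partitioning (illustrative sketch).

    lower[i] is A[i, i-1] (lower[0] unused), diag[i] is A[i, i],
    upper[i] is A[i, i+1] (upper[-1] unused). Assumes every diagonal
    block is nonsingular and has at least two rows.
    """
    n = len(diag)
    assert n % num_parts == 0, "sketch assumes equal-size partitions"
    m = n // num_parts
    y, v, w = [], [], []
    for p in range(num_parts):  # these block solves are independent of each other
        s, e = p * m, (p + 1) * m
        block = (np.diag(diag[s:e]) + np.diag(upper[s:e - 1], 1)
                 + np.diag(lower[s + 1:e], -1))
        cols = np.zeros((m, 3))
        cols[:, 0] = rhs[s:e]
        if p < num_parts - 1:
            cols[-1, 1] = upper[e - 1]  # coupling to the next block
        if p > 0:
            cols[0, 2] = lower[s]       # coupling to the previous block
        sol = np.linalg.solve(block, cols)
        y.append(sol[:, 0]); v.append(sol[:, 1]); w.append(sol[:, 2])

    # Reduced system in the first and last unknown of every block.
    R = np.eye(2 * num_parts)
    g = np.zeros(2 * num_parts)
    for p in range(num_parts):
        top, bot = 2 * p, 2 * p + 1
        g[top], g[bot] = y[p][0], y[p][-1]
        if p < num_parts - 1:
            R[top, 2 * (p + 1)] = v[p][0]
            R[bot, 2 * (p + 1)] = v[p][-1]
        if p > 0:
            R[top, 2 * (p - 1) + 1] = w[p][0]
            R[bot, 2 * (p - 1) + 1] = w[p][-1]
    z = np.linalg.solve(R, g)

    # Correct each block solution using the neighbouring boundary values.
    x = np.empty(n)
    for p in range(num_parts):
        xp = y[p].copy()
        if p < num_parts - 1:
            xp -= v[p] * z[2 * (p + 1)]
        if p > 0:
            xp -= w[p] * z[2 * (p - 1) + 1]
        x[p * m:(p + 1) * m] = xp
    return x
```

Only the reduced system, of size twice the number of partitions, is solved serially in this sketch; the per-block solves and the final correction are the parts that can run in parallel.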
We present our recent work on using diblock copolymer directed self-assembly for the fabrication of silicon MOSFETs. Instead of using self-assembly to assemble the entire device, we plan initially to use self-assembly to perform one critical step of the complex MOSFET process flow. Initial results of using PS-b-PMMA to define a hexagonal array of pores with a diameter of 20 nm for contact hole patterning will be described. Potential integration issues for making MOSFETs will also be addressed.
We have experimentally demonstrated block copolymer lithography in conjunction with optical lithography features on dimensional scales close to the natural pitch of the self-assembling block copolymer. Within this context, the inherent self-assembled shape, size and arrangement will self-adjust to accommodate the external confinement. This added flexibility of directed self-assembly of aperiodic patterns can potentially be used for patterning contact holes for random logic circuit layout.
The rising pressure for simultaneously improving performance and reducing power is driving more diversity into all aspects of computing devices. An algorithm that is well-matched to the target hardware can run multiple times faster and more energy efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers have been faced with the challenge of determining the appropriate algorithm for each potential combination of target device and data. This paper presents DySel, a novel runtime system for automating such determination for kernel-based data parallel programming models such as OpenCL, CUDA, OpenACC, and C++AMP. These programming models cover many applications that demand high performance in mobile, cloud and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. The test deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% overhead in the worst observed case when compared to an oracle. We show four major use cases where DySel provides significantly more consistent performance without tedious effort from the developer.
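The micro-profiling idea can be sketched in a few lines of Python. This is a hypothetical, CPU-only illustration, not DySel's runtime: each candidate is timed on its own small slice of the real input, the slice results are kept as part of the final output, and the fastest candidate processes the remainder. The function `select_and_run`, the `profile_fraction` parameter, and the two example candidates are all assumptions introduced here.

```python
import time

def select_and_run(candidates, data, profile_fraction=0.05):
    """Pick the fastest candidate by timing each on a distinct slice of `data`.

    `candidates` maps a name to a function that takes a list and returns a
    list of per-element results; all candidates must compute the same thing.
    """
    chunk = max(1, int(len(data) * profile_fraction))
    results, timings, pos = [], {}, 0
    for name, kernel in candidates.items():
        piece = data[pos:pos + chunk]
        if not piece:
            break  # input too small to profile every candidate
        start = time.perf_counter()
        out = kernel(piece)
        timings[name] = (time.perf_counter() - start) / len(piece)
        results.extend(out)  # profiled work is kept, not thrown away
        pos += len(piece)
    if not timings:
        return None, results
    best = min(timings, key=timings.get)
    results.extend(candidates[best](data[pos:]))
    return best, results

# Two hypothetical candidates that compute the same element-wise square.
def square_loop(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out

def square_comprehension(xs):
    return [x * x for x in xs]
```

Calling `select_and_run({"loop": square_loop, "comprehension": square_comprehension}, list(range(200)))` returns the name of the selected candidate together with the full result list. Which candidate wins depends on the machine and the input, which is the point of profiling at run time.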
The use of block copolymer self-assembly for device fabrication in the semiconductor industry has been envisioned for over a decade. Early work by the groups of Hawker, Russell, and Nealey [1-2] has shown a high degree of dimensional control of the self-assembled features over large areas with a high degree of ordering. The exquisite dimensional control at nanometer-scale feature sizes is one of the most attractive properties of block copolymer self-assembly. At the same time, device and circuit fabrication for the semiconductor industry requires accurate placement of desired features at irregular positions on the chip. The need to coax the self-assembled features into circuit-layout-friendly locations is a roadblock for introducing self-assembly into semiconductor manufacturing. Directed self-assembly (DSA) and the use of topography to direct the self-assembly (graphoepitaxy) have shown great promise in solving the placement problem [3-4]. In this paper, we review recent progress in using block copolymer directed self-assembly for patterning sub-20 nm contact holes for practical circuits.