Li‐Wen Chang

Hudson Institute

Author Statistics

Papers

Citation

H-Index

i-10 index

Co-Authors

Wen‐mei Hwu

University of Illinois Urbana-Champaign

H.‐S. Philip Wong

Stanford University

Juan Gómez-Luna

ETH Zurich

Izzat El Hajj

American University of Beirut

Hee-Seok Kim

University of Washington Tacoma

Xinyu Bao

Beijing Forestry University

Christopher Rodrigues

Huawei Technologies (United States)

He Yi

Central South University

Lingfeng Shi

University of Science and Technology of China

I-Jui Sung

University of Illinois Urbana-Champaign

Cooperative Institutions

Liechtenstein Institute

John Wiley & Sons (United States)

Hudson Institute

University of Illinois Urbana-Champaign

National Taiwan University

Central South University

Tianjin University

National Taiwan University Hospital

Chinese Academy of Sciences

Stanford University

Research Field

A Study of Energy Saving Control Strategy for an Integrated Environment Control System Applied to Ship Hull Painting

Journal of Marine Science and Technology (2013)

Tzong-Shing Lee, Li‐Wen Chang, Yew-Khoy Chuah

Mode (computer interface)

10.6119/jmst-012-0430-4

Citations (0)

Selective Device Structure Scaling and Parasitics Engineering: A Way to Extend the Technology Roadmap

IEEE Transactions on Electron Devices (2009)

Lan Wei, Jie Deng, Li‐Wen Chang, Keunwoo Kim, Ching-Te Chuang

We propose a path for extending the technology roadmap when currently considered technology boosters (e.g., strain, high-kappa/metal gate) reach their limits and physical gate length can no longer be effectively scaled down. By judiciously engineering the device parasitic resistance and parasitic capacitance, and considering the impact of the interconnect wiring capacitance, we propose scenarios of selective device structure scaling that will enable technology scaling and contacted gate pitch scaling for several generations beyond the currently perceived limits.

Parasitic extraction

Parasitic capacitance

Parasitic element

Metal gate

10.1109/ted.2008.2010573

Citations (31)

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

arXiv (Cornell University) (2019)

Wenlei Bao, Li‐Wen Chang, Yang Chen, Ke Deng, Amit Agarwal

Quantization has emerged as an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory bandwidth and computation cycles with little loss of accuracy. Integer GEMM (General Matrix Multiplication) is critical to running quantized DNN models efficiently, as GEMM operations often dominate the computations in these models. Various approaches have been developed by leveraging techniques such as vectorization and memory layout to improve the performance of integer GEMM. However, these existing approaches are not fast enough in certain scenarios. We developed NGEMM, a compiler-based GEMM implementation for accelerating lower-precision training and inference. NGEMM makes better use of the vector units by avoiding unnecessary vector computation that is introduced during tree reduction. We compared NGEMM's performance with state-of-the-art BLAS libraries such as MKL. Our experimental results showed that NGEMM outperformed the MKL non-pack and pack versions by an average of 1.86x and 1.16x, respectively. We have applied NGEMM to a number of production services in Microsoft.

10.48550/arxiv.1910.00178

Citations (1)
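
The NGEMM abstract above centers on integer GEMM for quantized models. As a rough illustration of that idea only, the sketch below quantizes floats to 8-bit integers, multiplies with integer accumulation, and rescales the result. The function names and the symmetric per-matrix scale are assumptions made for this example; this is not NGEMM's compiler-based implementation and it uses no vectorization.

```python
def quantize(matrix, scale):
    """Map floats to signed 8-bit integers with a symmetric scale (hypothetical helper)."""
    return [[max(-128, min(127, round(x / scale))) for x in row] for row in matrix]


def int_gemm(a_q, b_q):
    """Multiply two integer matrices, accumulating in Python ints (a stand-in for int32)."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            a_ik = a_q[i][k]
            for j in range(cols):
                out[i][j] += a_ik * b_q[k][j]
    return out


def dequantize(c_q, scale_a, scale_b):
    """Rescale integer products back to floats using the two input scales."""
    return [[v * scale_a * scale_b for v in row] for row in c_q]
```

A production kernel would pack the operands and use vector instructions; the loops here only show where the low-precision arithmetic sits.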

Comparison and Selection Research of CO₂-Based Transcritical Rankine Cycle Using for Gasoline and Diesel Engine's Waste Heat Recovery

Heat Transfer Engineering (2017)

Gequn Shu, Lingfeng Shi, Hua Tian, Li‐Wen Chang

This paper presents a theoretical study of the CO₂-based transcritical Rankine cycle (CTRC) for engine waste heat recovery, involving comparison and selection of four CTRC configurations for two engine types, namely a gasoline engine and a diesel engine. The results of the configuration comparison show that the CTRC configuration with both a preheater and a regenerator may be more suitable for both engine types with a water-cooling system. If only the waste heat of the exhaust gas is recovered, the regenerated CTRC configuration may be more appropriate. The results of the engine type comparison show that engine load has a smaller effect on CTRC performance for the gasoline engine than for the diesel engine. In particular, this paper also considers the effect of CTRC weight when evaluating the final CTRC output, which is significant for a vehicle engine. A critical weight is found for the two engines based on 100% engine load, 215 kg for the gasoline engine and 998 kg for the diesel engine, which is the upper limit for the CTRC weight design. When the weight effect is considered, the diesel engine may be the more suitable recovery target compared with the gasoline engine, because its output performance responds more stably to the CTRC weight.

Engine efficiency

Rankine cycle

Thermal efficiency

Naturally aspirated engine

Heat Engine

Engine power

10.1080/01457632.2017.1325678

Citations (15)

A scalable, numerically stable, high-performance tridiagonal solver using GPUs

IEEE International Conference on High Performance Computing, Data, and Analytics (2012)

Li‐Wen Chang, John A. Stratton, Hee-Seok Kim, Wen‐mei Hwu

In this paper, we present a scalable, numerically stable, high-performance tridiagonal solver. The solver is based on the SPIKE algorithm for partitioning a large matrix into small independent matrices, which can be solved in parallel. For each small matrix, our solver applies a general 1-by-1 or 2-by-2 diagonal pivoting algorithm, which is also known to be numerically stable. Our paper makes two major contributions. First, our solver is the first numerically stable tridiagonal solver for GPUs. Our solver provides comparable quality of stable solutions to Intel MKL and Matlab, at speed comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE. It is also scalable to multiple GPUs and CPUs. Second, we present and analyze two key optimization strategies for our solver: a high-throughput data layout transformation for memory efficiency, and a dynamic tiling approach for reducing the memory access footprint caused by branch divergence.

Solver

Memory footprint

Tridiagonal matrix algorithm

10.5555/2388996.2389033

Citations (46)
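
For readers unfamiliar with the "Tridiagonal matrix algorithm" keyword listed for this paper, the sketch below shows the classic serial version (the Thomas algorithm). It is deliberately the simpler, unpivoted method, not the SPIKE partitioning or diagonal pivoting scheme the abstract describes, and it can break down on matrices that are not diagonally dominant, which is the kind of instability the paper's pivoting is meant to avoid. The argument layout (all four lists of length n, with `lower[0]` and `upper[n-1]` unused) is an assumption of this example.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting.

    lower[i] multiplies x[i-1], diag[i] multiplies x[i], upper[i] multiplies x[i+1].
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The forward sweep is inherently sequential, which is one reason a GPU solver would first split the matrix into independent blocks.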

Diblock copolymer directed self-assembly for CMOS device fabrication

Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE (2006)

Li‐Wen Chang, H.‐S. Philip Wong

We present our recent work on using diblock copolymer directed self-assembly for the fabrication of silicon MOSFETs. Instead of using self-assembly to assemble the entire device, we plan, as a first step, to use self-assembly to perform one critical step of the complex MOSFET process flow. Initial results of using PS-b-PMMA to define pores in a hexagonal array with a diameter of 20 nm for contact hole patterning will be described. Potential integration issues for making MOSFETs will also be addressed.

10.1117/12.661028

Citations (18)

Experimental demonstration of aperiodic patterns of directed self-assembly by block copolymer lithography for random logic circuit layout

International Electron Devices Meeting (2010)

Li‐Wen Chang, Xinyu Bao, Chris Bencher, H.‐S. Philip Wong

We have experimentally demonstrated block copolymer lithography in conjunction with optical lithography features on dimensional scales close to the natural pitch of the self-assembling block copolymer. Within this context, the inherent self-assembled shape, size and arrangement will self-adjust to accommodate the external confinement. This added flexibility of directed self-assembly of aperiodic patterns can potentially be used for patterning contact holes for random logic circuit layout.

Aperiodic graph

10.1109/iedm.2010.5703468

Citations (24)

DySel

ACM SIGPLAN Notices (2016)

Li‐Wen Chang, Hee-Seok Kim, Wen‐mei Hwu

The rising pressure for simultaneously improving performance and reducing power is driving more diversity into all aspects of computing devices. An algorithm that is well-matched to the target hardware can run multiple times faster and more energy efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers have been faced with the challenge of determining the appropriate algorithm for each potential combination of target device and data. This paper presents DySel, a novel runtime system for automating such determination for kernel-based data parallel programming models such as OpenCL, CUDA, OpenACC, and C++AMP. These programming models cover many applications that demand high performance in mobile, cloud and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. The test-deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% of overhead in the worst observed case when compared to an oracle. We show four major use cases where DySel provides significantly more consistent performance without tedious effort from the developer.

Profiling (computer programming)

Kernel (algebra)

10.1145/2954679.2872373

Citations (0)
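
The micro-profiling idea in the DySel abstract (try each candidate kernel on a small portion of the real data, keep that output, then run the winner on the rest) can be sketched in a few lines. Everything here is an assumption made for illustration: the `select_and_run` name, the choice to give each candidate its own consecutive slice, wall-clock timing on the host, and the two toy kernels. It is not DySel's runtime and does not target OpenCL or CUDA.

```python
import time


def select_and_run(kernels, data, sample_size):
    """Time each candidate on its own slice, then run the fastest on the remainder."""
    results = []
    best_kernel, best_time = None, float("inf")
    offset = 0
    for kernel in kernels:
        chunk = data[offset:offset + sample_size]
        if not chunk:
            break  # no data left to profile with
        start = time.perf_counter()
        out = kernel(chunk)
        elapsed = time.perf_counter() - start
        results.extend(out)  # profiling output is kept, not thrown away
        offset += len(chunk)
        per_item = elapsed / len(chunk)
        if per_item < best_time:
            best_kernel, best_time = kernel, per_item
    if best_kernel is None:
        best_kernel = kernels[0]
    results.extend(best_kernel(data[offset:]))
    return best_kernel, results


def square_loop(values):
    """Hypothetical candidate kernel: explicit loop."""
    out = []
    for v in values:
        out.append(v * v)
    return out


def square_comprehension(values):
    """Hypothetical candidate kernel: list comprehension."""
    return [v * v for v in values]
```

Because every candidate must produce the same element-wise output, the final result is the same whichever kernel wins; only the running time differs.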

Block copolymer directed self-assembly enables sublithographic patterning for device fabrication

Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE (2012)

H.‐S. Philip Wong, Chris Bencher, He Yi, Xinyu Bao, Li‐Wen Chang

The use of block copolymer self-assembly for device fabrication in the semiconductor industry has been envisioned for over a decade. Early works by the groups of Hawker, Russell, and Nealey [1-2] have shown a high degree of dimensional control of the self-assembled features over large areas with high degree of ordering. The exquisite dimensional control at nanometer-scale feature sizes is one of the most attractive properties of block copolymer self-assembly. At the same time, device and circuit fabrication for the semiconductor industry requires accurate placement of desired features at irregular positions on the chip. The need to coax the self-assembled features into circuit layout friendly location is a roadblock for introducing self-assembly into semiconductor manufacturing. Directed self-assembly (DSA) and the use of topography to direct the self-assembly (graphoepitaxy) have shown great promise in solving the placement problem [3-4]. In this paper, we review recent progress in using block copolymer directed self-assembly for patterning sub-20 nm contact holes for practical circuits.

Nanometre

Semiconductor device fabrication

10.1117/12.918312

Citations (41)

Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

Li‐Wen Chang

Software portability

Citations (2)