While wireless sensor networks can generically be used for a wide variety of applications, breakthrough innovations are most often achieved when driven by a genuine need or application, with its specific system-level and science-related requirements and objectives. Hence, our work focuses on the development of wireless sensor network system-on-chip devices and supporting software for volcano monitoring, which we call the Sensor Network for Active Volcanoes (SNAV). In this paper we present preliminary results of our research and development work on intelligent sensor networks for monitoring hazardous environments, in particular the SNAV system-on-chip design for active volcano monitoring.
We show that aggregated model updates in federated learning may be insecure. An untrusted central server may disaggregate user updates from sums of updates across participants given repeated observations, enabling the server to recover privileged information about individual users' private training data via traditional gradient inference attacks. Our method revolves around reconstructing participant information (e.g., which rounds of training users participated in) from aggregated model updates by leveraging summary information from device analytics commonly used to monitor, debug, and manage federated learning systems. Our attack is parallelizable, and we successfully disaggregate user updates in settings with up to thousands of participants. We quantitatively and qualitatively demonstrate significant improvements in the capability of various inference attacks on the disaggregated updates. Our attack enables the attribution of learned properties to individual users, violating anonymity, and shows that a determined central server may undermine the secure aggregation protocol to break individual users' data privacy in federated learning.
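The core disaggregation idea can be illustrated with a toy linear-algebra sketch (not the paper's actual pipeline): if the server can reconstruct from device analytics which users contributed to each round, and each user's update is roughly stationary across rounds, the per-user updates can be recovered from the round sums by least squares. All dimensions and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_rounds, dim = 8, 40, 16

# Ground-truth per-user updates (assumed roughly constant across rounds).
user_updates = rng.normal(size=(num_users, dim))

# Participation matrix reconstructed from device analytics:
# P[r, u] = 1 if user u contributed to round r's aggregate.
P = (rng.random((num_rounds, num_users)) < 0.5).astype(float)

# What the server observes: only the per-round sums of updates.
aggregates = P @ user_updates

# Disaggregation: solve P @ X ~= aggregates for the per-user updates X.
recovered, *_ = np.linalg.lstsq(P, aggregates, rcond=None)

print("max abs recovery error:", np.abs(recovered - user_updates).max())
```

Once the individual updates are isolated, standard gradient inference attacks can be applied to each one in parallel.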
Recent efforts to address microprocessor power dissipation through aggressive supply voltage scaling and power management require that designers be increasingly cognizant of power supply variations. These variations, primarily due to fast changes in supply current, can be attributed to architectural gating events that reduce power dissipation. In order to study this problem, the authors propose a fine-grain, parameterizable model for power-delivery networks that allows system designers to study localized, on-chip supply fluctuations in high-performance microprocessors. Using this model, the authors analyze voltage variations in the context of next-generation chip-multiprocessor (CMP) architectures using both real applications and synthetic current traces. They find that the activity of distinct cores in CMPs presents several new design challenges when considering power supply noise, and they describe potentially problematic activity sequences that are unique to CMP architectures.
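A minimal lumped-element sketch (a single RLC stage, not the authors' fine-grain on-chip grid model) shows the kind of supply droop triggered by an abrupt current step such as a core waking from a gated state; all component values are illustrative only.

```python
# Illustrative lumped power-delivery model: Vdd source -> series R and L ->
# on-die node with decoupling capacitance C, drawing load current i_load(t).
VDD, R, L, C = 1.0, 1e-3, 1e-11, 1e-7    # volts, ohms, henries, farads (illustrative)
dt, steps = 1e-11, 20000                 # 10 ps step, 200 ns of simulated time

def i_load(t):
    return 0.0 if t < 50e-9 else 10.0    # 10 A current step: a core un-gates

v_node, i_ind = VDD, 0.0
min_v = VDD
for n in range(steps):
    t = n * dt
    # Forward-Euler updates for inductor current and node voltage.
    di = (VDD - R * i_ind - v_node) / L * dt
    dv = (i_ind - i_load(t)) / C * dt
    i_ind += di
    v_node += dv
    min_v = min(min_v, v_node)

print(f"worst-case droop: {(VDD - min_v) * 1000:.1f} mV")
```

The same structure, replicated into a grid and parameterized per region, is the shape of model the abstract describes for studying localized on-chip fluctuations.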
Always-on subsystems in mobile/Internet of Things (IoT) SoCs process a variety of real-time sensor data with deep neural network (DNN) classification workloads under a heavily constrained energy budget. This can be achieved with robust, low-voltage circuits and specialized hardware accelerators. We present a 16-nm always-on DNN processor, which consists primarily of a microcontroller and a DNN accelerator with on-chip SRAM for the model weights. The design operates robustly from 0.4 to 1 V, with calibration-free automatic voltage/frequency tuning provided by tracking small non-zero Razor timing-error rates. A novel timing-error-driven, synchronization-free adaptive clocking scheme significantly reduces the adaptation latency, providing resilience to fast on-chip supply noise and reducing margins. To accommodate the tight energy constraints of always-on IoT workloads, we implement a multi-cycle SRAM read scheme that allows the memory voltage to scale at iso-throughput, improving energy efficiency across the entire operating range. The wide operating range allows for high performance at 1.36 GHz, low power consumption down to 750 μW, and state-of-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense or 1.81 TOPS/W sparse.
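The calibration-free tuning idea can be approximated as a simple control loop: nudge the supply voltage down while the observed Razor error rate stays below a small non-zero target, and back off when it is exceeded. The thresholds, step sizes, and hook names below are illustrative assumptions, not the chip's actual controller.

```python
# Illustrative error-rate-driven voltage tuning loop (not the on-chip
# controller). read_error_counter() and set_vdd() are hypothetical hooks.

TARGET_RATE = 1e-6                          # target Razor timing-error rate
WINDOW_CYCLES = 1_000_000                   # observation window per step
V_MIN, V_MAX, V_STEP = 0.40, 1.00, 0.005    # volts

def tune_step(vdd, read_error_counter, set_vdd):
    """One adaptation step: sample the error rate, then adjust Vdd."""
    errors = read_error_counter(WINDOW_CYCLES)   # errors seen in the window
    rate = errors / WINDOW_CYCLES
    if rate > TARGET_RATE:
        vdd = min(V_MAX, vdd + V_STEP)           # too many errors: add margin
    else:
        vdd = max(V_MIN, vdd - V_STEP)           # error-free-ish: shave margin
    set_vdd(vdd)
    return vdd, rate
```

Because the loop tracks measured error rates rather than a calibrated voltage table, it needs no per-part characterization, which is the property the abstract highlights.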
The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially in deep learning. Deep learning has been the pillar algorithm that has led the advancement of learning patterns from a vast amount of labeled data, i.e., supervised learning. However, for unsupervised learning, Bayesian methods often work better than deep learning. Bayesian modeling and inference work well with unlabeled or limited data, can leverage informative priors, and yield interpretable models. Despite being an important branch of machine learning, Bayesian inference has generally been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires deep understanding of the workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian inference, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques are able to increase Bayesian inference performance by 5.8× on average over the naive assignment and execution of the workloads.
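The computation-elision idea, skipping work that no longer improves inference quality, can be sketched generically as an early-termination check on a sampler's running quality estimate. The diagnostic, callbacks, and thresholds below are placeholder assumptions, not the mechanism proposed in the paper.

```python
import numpy as np

def run_with_elision(draw_batch, log_joint, max_batches=100,
                     batch_size=500, tol=1e-3):
    """Run an iterative sampler, eliding further batches once the running
    estimate of the expected log-joint stops improving by more than `tol`.
    draw_batch(n) and log_joint(samples) are hypothetical callbacks."""
    samples, prev_score = [], -np.inf
    batches_run = 0
    for _ in range(max_batches):
        samples.append(draw_batch(batch_size))
        batches_run += 1
        score = log_joint(np.concatenate(samples)).mean()
        if abs(score - prev_score) < tol:
            break                           # remaining batches are elided
        prev_score = score
    return np.concatenate(samples), batches_run
```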
Advances in technology have allowed DRAM-like structures, called embedded DRAM (eDRAM), to be integrated on chip. This technology has already been successfully implemented in some GPUs and other graphics-intensive SoCs, such as game consoles. IBM's most recent processor, POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is not accessed again before being evicted. Based on that observation, we propose a placement scheme in which re-accessed data blocks are stored in fast, but costly in terms of area and power, SRAM banks, while eDRAM banks store data blocks that have just arrived at the NUCA cache or were demoted from an SRAM bank. We show that a well-balanced SRAM/eDRAM NUCA cache can achieve performance similar to a NUCA cache composed only of SRAM banks, while reducing area by 15% and power consumption by 10%. Furthermore, we also explore several alternatives to exploit the area reduction gained by using the hybrid architecture, resulting in an overall performance improvement of 4%.
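The placement scheme can be illustrated with a toy two-tier model: incoming blocks fill into eDRAM, a hit in eDRAM promotes the block to SRAM, and SRAM victims are demoted back to eDRAM. Tier sizes and the LRU replacement below are illustrative, not the paper's NUCA configuration.

```python
from collections import OrderedDict

class HybridNUCA:
    """Toy SRAM/eDRAM placement model (illustrative; LRU in both tiers)."""
    def __init__(self, sram_blocks=4, edram_blocks=12):
        self.sram = OrderedDict()    # fast tier: re-accessed blocks
        self.edram = OrderedDict()   # dense tier: new or demoted blocks
        self.sram_cap, self.edram_cap = sram_blocks, edram_blocks

    def access(self, addr):
        if addr in self.sram:                    # SRAM hit
            self.sram.move_to_end(addr)
            return "sram_hit"
        if addr in self.edram:                   # eDRAM hit: promote to SRAM
            del self.edram[addr]
            self._insert_sram(addr)
            return "edram_hit"
        self._insert_edram(addr)                 # miss: fill into eDRAM first
        return "miss"

    def _insert_sram(self, addr):
        self.sram[addr] = True
        if len(self.sram) > self.sram_cap:       # demote SRAM victim to eDRAM
            victim, _ = self.sram.popitem(last=False)
            self._insert_edram(victim)

    def _insert_edram(self, addr):
        self.edram[addr] = True
        if len(self.edram) > self.edram_cap:     # evict eDRAM victim off chip
            self.edram.popitem(last=False)
```

Blocks that are never re-referenced thus die cheaply in eDRAM, while the small SRAM tier is reserved for the blocks that earn it.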
Building domain-specific accelerators for autonomous unmanned aerial vehicles (UAVs) is challenging due to a lack of systematic methodology for designing onboard compute. Balancing a computing system for a UAV requires considering both the cyber (e.g., sensor rate, compute performance) and physical (e.g., payload weight) characteristics that affect overall performance. Iterating over the many component choices results in a combinatorial explosion in the number of possible combinations: from tens of thousands to billions, depending on implementation details. Manually selecting combinations of these components is tedious and expensive. To navigate the cyber-physical design space efficiently, we introduce AutoPilot, a framework that automates full-system UAV co-design. AutoPilot uses Bayesian optimization to navigate a large design space and automatically select a combination of autonomy algorithm and hardware accelerator while considering the cross-product effect of other cyber and physical UAV components. We show that the AutoPilot methodology consistently outperforms general-purpose hardware selections like Xavier NX and Jetson TX2, as well as dedicated hardware accelerators built for autonomous UAVs, across a range of representative scenarios (three different UAV types and three deployment environments). Designs generated by AutoPilot increase the number of missions on average by up to 2.25x, 1.62x, and 1.43x for nano-, micro-, and mini-UAVs, respectively, over the baselines. Our work demonstrates the need for holistic full-UAV co-design to achieve maximum overall UAV performance and the need for automated flows to simplify the design process for autonomous cyber-physical systems.
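A minimal sketch of the co-design loop, using scikit-optimize as a stand-in Bayesian optimizer and a toy mission-count objective; the design knobs, the physics/compute coupling, and every numeric constant are illustrative assumptions rather than AutoPilot's actual models.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer

# Toy cyber-physical design space (illustrative knobs, not AutoPilot's).
space = [
    Categorical([64, 128, 256], name="pe_array"),     # accelerator size
    Integer(15, 60, name="fps"),                      # perception frame rate
    Categorical([1.0, 1.5, 2.0], name="battery_kg"),  # battery mass
]

def negative_missions(params):
    pe, fps, battery = params
    # Crude coupling: heavier battery -> more energy but more hover power;
    # bigger accelerator and higher frame rate -> faster flight but more watts.
    energy_wh = 80.0 * battery
    hover_w = 120.0 + 60.0 * battery
    compute_w = 0.02 * pe + 0.05 * fps
    flight_time_h = energy_wh / (hover_w + compute_w)
    velocity = min(10.0, 0.1 * fps)                   # m/s, capped
    missions = flight_time_h * 3600 * velocity / 400.0  # assume 400 m per mission
    return -missions                                  # gp_minimize minimizes

res = gp_minimize(negative_missions, space, n_calls=30, random_state=0)
print("best design:", res.x, "missions per charge:", -res.fun)
```

The real framework evaluates candidates with autonomy-algorithm and accelerator models in the loop, but the overall shape, a surrogate-guided search over a mixed cyber-physical space, is the same.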
Always-on classifiers for sensor data require a very wide operating range to support a variety of real-time workloads and must operate robustly at low supply voltages. We present a 16-nm always-on wake-up controller with a fully-connected (FC) deep neural network (DNN) accelerator that operates from 0.4 to 1 V. Calibration-free automatic voltage/frequency tuning is provided by tracking small non-zero Razor timing-error rates, and a novel timing-error-driven, sync-free fast adaptive clocking scheme provides resilience to on-chip supply voltage noise. The model access burden of neural networks is relaxed using a multi-cycle SRAM read, which allows memory voltage to be reduced at iso-throughput. The wide operating range allows for high performance at 1.36 GHz, low power consumption down to 750 μW, and state-of-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense, or 1.81 TOPS/W sparse.
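The intuition behind the multi-cycle SRAM read can be captured with a back-of-the-envelope energy model: if the datapath only consumes a weight word every k cycles, the SRAM can take k cycles per access at a lower memory voltage without losing throughput, and access energy scales roughly with the square of that voltage. All numbers below are illustrative assumptions, not measured silicon values.

```python
# Illustrative C*V^2 model of the multi-cycle SRAM read trade-off.
# Not measured values from the chip; every constant is an assumption.

C_BITLINE = 1.0                            # normalized switched capacitance

def access_energy(v_mem):
    return C_BITLINE * v_mem ** 2          # access energy ~ C * V^2

# Single-cycle read must meet the logic clock at the (assumed) logic voltage.
e_single = access_energy(0.80)

# Spreading the read over 2 cycles relaxes SRAM timing, so the memory rail
# can drop (value assumed); the datapath, which only needs a new word every
# 2 cycles, sees no throughput loss.
e_multi = access_energy(0.55)

print(f"relative SRAM access energy: {e_multi / e_single:.2f}x")
```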
Hardware acceleration in the form of customized datapath and control circuitry tuned to specific applications has gained popularity for its promise to utilize transistors more efficiently. Historicall