Stochastic gradient descent (SGD) is a well-known method for regression and classification tasks. However, it is an inherently sequential algorithm: at each step, the processing of the current example depends on the parameters learned from previous examples. Prior approaches to parallelizing linear learners using SGD, such as Hogwild! and AllReduce, do not honor these dependencies across threads and thus can potentially suffer poor convergence rates and/or poor scalability. This paper proposes SymSGD, a parallel SGD algorithm that, to a first-order approximation, retains the sequential semantics of SGD. Each thread learns a local model in addition to a model combiner, which allows local models to be combined to produce the same result as what a sequential SGD would have produced. This paper evaluates SymSGD's accuracy and performance on 6 datasets on a shared-memory machine, showing up to an 11x speedup over our heavily optimized sequential baseline on 16 cores and, on average, a 2.2x speedup over Hogwild!.
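To make the model-combiner idea concrete, the following is a minimal NumPy sketch for linear least-squares SGD; the update rule, step size, and combination formula are illustrative assumptions, not the paper's exact formulation. Each thread records, alongside its local model, the product of the per-example update Jacobians, so a local model trained from a stale starting point can later be shifted, to first order, to the starting point a sequential run would have handed it:

    import numpy as np

    def sgd_with_combiner(w0, X, y, lr):
        # Run SGD for squared loss on (X, y) starting from w0.
        # Returns the local model and a combiner matrix M, the product of the
        # per-example update Jacobians (I - lr * x x^T).  If the sequential
        # starting point had been w0 + d instead of w0, the result would have
        # been w + M @ d (exactly so here, since this update is linear in w).
        dim = w0.size
        w = w0.copy()
        M = np.eye(dim)
        for x, target in zip(X, y):
            grad = (x @ w - target) * x                 # gradient of 0.5 * (x.w - y)^2
            w = w - lr * grad
            M = (np.eye(dim) - lr * np.outer(x, x)) @ M
        return w, M

    # Two "threads" start from the same w0; thread 2's local model is then
    # adjusted as if it had started from thread 1's output, recovering the
    # result of a fully sequential pass over all examples.
    rng = np.random.default_rng(0)
    d, n = 5, 40
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)
    w0 = np.zeros(d)
    w1, _ = sgd_with_combiner(w0, X[:n // 2], y[:n // 2], lr=0.05)
    w2, M2 = sgd_with_combiner(w0, X[n // 2:], y[n // 2:], lr=0.05)
    combined = w2 + M2 @ (w1 - w0)
    w_seq, _ = sgd_with_combiner(w0, X, y, lr=0.05)
    assert np.allclose(combined, w_seq)

A practical implementation would also need to keep the dense d-by-d combiner affordable in high dimensions; the sketch above only illustrates how the first-order sequential semantics is preserved when combining local models.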
This work provides a theoretical framework for the pose estimation problem using total least squares for vector observations from landmark features. First, the optimization framework is formulated with observation vectors extracted from point cloud features. Then, error-covariance expressions are derived. The attitude and position solutions obtained via the derived optimization framework are proven to reach the bounds defined by the Cramér-Rao lower bound under the small-angle approximation of attitude errors. The measurement data for the simulation of this problem are provided through a series of vector observation scans, and a fully populated observation noise-covariance matrix is assumed as the weight in the cost function to cover the most general case of sensor uncertainty. Here, previous derivations are extended to the pose estimation problem to include more general error correlations than earlier treatments that assume isotropic noise. The proposed solution is simulated in a Monte Carlo framework to validate the error-covariance analysis.
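As a hedged illustration (the symbols and measurement-model convention below are assumptions, not taken verbatim from the derivation), a weighted total least-squares pose cost over $n$ landmark observations $\tilde{\mathbf{b}}_i \approx A\mathbf{r}_i + \mathbf{p}$, with attitude $A$, position $\mathbf{p}$, and a fully populated noise covariance $R$ over the stacked residuals, has the form

$$ J(A,\mathbf{p}) \;=\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \big(\tilde{\mathbf{b}}_i - A\mathbf{r}_i - \mathbf{p}\big)^{T}\,\big[R^{-1}\big]_{ij}\,\big(\tilde{\mathbf{b}}_j - A\mathbf{r}_j - \mathbf{p}\big), $$

where $[R^{-1}]_{ij}$ is the $3\times 3$ block of the inverse covariance coupling observations $i$ and $j$; under an isotropic noise assumption this collapses to an unweighted sum of squared residuals.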
It has been widely shown that high-throughput computing architectures such as GPUs offer large performance gains compared with their traditional low-latency counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, loss of portability across different architectures, explicit data movement, and challenges in performance optimization. This paper presents novel methods and compiler transformations that increase programmer productivity by enabling users of the language Chapel to provide a single code implementation that the compiler can then use to target not only conventional multiprocessors, but also high-throughput and hybrid machines. Rather than resorting to different parallel libraries or annotations for a given parallel platform, this work leverages a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of providing portability across different parallel architectures. Finally, this work presents experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA on both GPUs and multicore platforms.
The Single-Source Shortest Path (SSSP) problem is to find the shortest paths from a source vertex to all other vertices in a graph. In this paper, we introduce the Dijkstra Strip-Mined Relaxation (DSMR) algorithm, an efficient parallel SSSP algorithm for shared and distributed memory systems. Our results show that DSMR is faster than parallel Δ-Stepping by a factor of up to 1.66.
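For reference, the following is a minimal sketch of the sequential baseline for the SSSP problem stated above, a textbook Dijkstra with a binary heap; it is for intuition only, it is not the DSMR or Δ-Stepping algorithm, and the adjacency-list representation is an assumption:

    import heapq

    def dijkstra(adj, source):
        # Sequential SSSP: adj maps each vertex to a list of
        # (neighbor, non-negative edge weight) pairs.
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                      # stale heap entry
            for v, w in adj.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd              # relax edge (u, v)
                    heapq.heappush(heap, (nd, v))
        return dist

    print(dijkstra({"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}, "a"))
    # {'a': 0, 'b': 2, 'c': 3}

Parallel SSSP algorithms such as Δ-Stepping relax the strict priority order of this loop (processing vertices in buckets of width Δ) to expose concurrency.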
The broad landscape of new technologies currently being explored makes the current times very exciting for computer systems research. The community is actively researching an extensive set of topics, ranging from the small (e.g., energy-independent embedded devices) to the large (e.g., brain-scale deep learning), simultaneously addressing technology discontinuities (End of Moore's Law and EnergyWall), new challenges in security and privacy, and the rise of artificial intelligence (AI).
While industry is applying some of these technologies, its efforts are necessarily focused on only a few areas, and on relatively short-term horizons. This offers academic researchers the opportunity to attack the problems with a broader and longer-term view. Further, in recent times, the computer systems community has started to pay increasing attention to non-performance measures, such as security, complexity, and power. To make progress in this multi-objective world, the composition of research teams needs to change. Teams have to become inter-disciplinary, enabling the flow of ideas across computing fields.
While many research directions are interesting, this report outlines a few high-priority areas where inter-disciplinary research is likely to have a high payoff:
a) Developing the components for a usable planet-scale Internet of Things (IoT), with provably energy-efficient devices. This report envisions a highly-available, geographically distributed, heterogeneous large-scale IoT system with the same efficiency, maintainability, and usability as today's data centers. This planet-scale IoT will be populated by many computationally-sophisticated IoT devices that are ultra-low power and operate energy-independently.
b) Rethinking the hardware-software security contract in the age of AI. In light of the recent security vulnerabilities, this report argues for building hardware abstractions that communicate security guarantees, and for allowing software to communicate its security and privacy requirements to the hardware. Further, security and privacy mechanisms should be integrated into the disruptive emerging technologies that support AI.
c) Making AI a truly dependable technology that is usable by all citizens in all settings. As AI frameworks automate an increasing number of critical operations, this report argues for end-to-end dependable AI, where both the hardware and the software are understood and verified. Further, AI needs to turn from a centralized tool into a capability easily usable by all citizens in all settings to meet an ever-expanding range of needs.
d) Developing solutions to tackle extreme complexity, possibly based on formal methods. This report argues for the need to tame the explosion of system complexity and heterogeneity by creating new abstractions and complexity-management solutions. Such solutions need to be accessible to domain experts. An important step towards this goal is to scale out and extend formal methods for the real world.
This report also describes other, related research challenges.
Event-series pattern matching is a major component of large-scale data analytics pipelines, enabling a wide range of system diagnostics tasks. A precursor to pattern matching is an expensive ``shuffle the world'' stage wherein data are ordered by time and shuffled across the network. Because many existing systems treat the pattern matching engine as a black box, they are unable to optimize the entire data analytics pipeline, and in particular, this costly shuffle. This paper demonstrates how to optimize such queries. We first translate an expressive class of regular-expression-like patterns to relational queries so that they can benefit from decades of progress in relational optimizers, and then we introduce the technique of abstract pattern matching, a linear-time preprocessing step which, adapting ideas from symbolic execution and abstract interpretation, discards events from the input that are guaranteed not to appear in successful matches. Abstract pattern matching first computes a conservative representation of the output-relevant domain of every transition in a pattern based on the (unary) predicates of that transition. It then further refines these domains based on the structure of the pattern (i.e., paths through the pattern) as well as any of the pattern's join predicates across transitions. The outcome is an abstract filter that, when applied to the original stream, excludes events that are guaranteed not to participate in a match. We implemented and applied abstract pattern matching in COSMOS/Scope on an industrial benchmark, where we obtained up to 3 orders of magnitude reduction in shuffled data and a 1.23x average speedup in total processing time.
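As a hedged, toy illustration of the filtering idea (invented names, not the COSMOS/Scope implementation), each transition's unary predicate induces a conservative per-transition domain, and the union of those domains already yields an abstract filter that safely discards events which cannot participate in any match:

    # Toy sketch: a pattern is a sequence of transitions, each with a unary
    # predicate over events.  Any event satisfying no transition's predicate
    # can never appear in a successful match, so it is safe to drop it before
    # the expensive shuffle/match stage.  (The real technique further refines
    # these domains using paths through the pattern and join predicates
    # across transitions; that refinement is omitted here.)

    def make_abstract_filter(transition_predicates):
        def keep(event):
            return any(pred(event) for pred in transition_predicates)
        return keep

    # Hypothetical pattern: a Login event followed by a Failure with code >= 500.
    pattern = [
        lambda e: e["kind"] == "Login",
        lambda e: e["kind"] == "Failure" and e["code"] >= 500,
    ]
    events = [
        {"kind": "Login",   "code": 0},
        {"kind": "Click",   "code": 0},    # dropped: matches no transition
        {"kind": "Failure", "code": 503},
        {"kind": "Failure", "code": 404},  # dropped: fails the unary predicate
    ]
    keep = make_abstract_filter(pattern)
    print([e for e in events if keep(e)])  # only the Login and the 503 Failure survive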
Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication, a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the standard NVIDIA Collective Communication Library (NCCL), allowing it to be a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives like Allgather, Alltoall, and Allreduce that are up to $3\times$ faster than NCCL. Using TACCL's algorithms speeds up the end-to-end training of an internal mixture of experts model by $17\%$. By decomposing the optimization problem into parts and leveraging the symmetry in multi-GPU topologies, TACCL synthesizes collectives for up to 80 GPUs in less than 3 minutes, at least two orders of magnitude faster than other synthesis-based state-of-the-art collective communication libraries.
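For intuition about what one such collective computes (this says nothing about how TACCL schedules it over a profiled topology; the code below is an illustrative simulation, not TACCL or NCCL API usage), a ring Allgather can be sketched as follows:

    def ring_allgather(chunks):
        # Simulate a ring Allgather over n ranks: rank i starts with chunks[i],
        # and after n-1 steps every rank holds all n chunks.  At each step,
        # every rank forwards the chunk it most recently received to its
        # right-hand neighbor.
        n = len(chunks)
        buffers = [[None] * n for _ in range(n)]
        for r in range(n):
            buffers[r][r] = chunks[r]
        for step in range(n - 1):
            for r in range(n):
                src = (r - step) % n          # index of the chunk rank r sends this step
                dst = (r + 1) % n
                buffers[dst][src] = buffers[r][src]
        return buffers

    print(ring_allgather(["a", "b", "c", "d"]))
    # every rank's buffer ends up as ['a', 'b', 'c', 'd']

A synthesizer like TACCL instead searches over which rank sends which chunk over which link and when, so that the schedule matches the measured bandwidths and latencies of the actual topology rather than assuming a uniform ring.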