Smartphone technologies have evolved quickly in recent years, offering end users the computing power and networking capabilities needed to run useful network and multimedia applications. However, due to their limited physical size and battery capacity, the current generation of smartphones cannot yet fulfill the requirements of the sophisticated applications that personal computers are capable of running. One way to address this problem is to reduce the workload on the smartphone as much as possible by offloading portions of an application to a server. This solution is particularly attractive today, as cloud computing provides the needed server resources at relatively low cost. This paper proposes a novel, lightweight application migration mechanism that lets smartphone users suspend the execution of applications and offload them to the cloud. The authors also developed a framework that executes Android applications efficiently on virtual phones in the cloud backed by virtual storage. The paper discusses the migration mechanism and evaluates its effectiveness on an Android smartphone. This approach may effectively offload the workload of Android applications even over a low-speed mobile network.
The quantum approximate optimization algorithm (QAOA) is a popular quantum algorithm for finding approximate solutions to combinatorial optimization problems. QAOA can be evaluated on physical quantum computers or on virtual quantum computers simulated by classical machines, with virtual ones favored for their noise-free execution and ready availability. Nevertheless, running QAOA on virtual quantum computers suffers from slow simulation speed when the combinatorial optimization problems require large-scale quantum circuit simulation (QCS). In this paper, we propose techniques to accelerate QCS for QAOA by using mathematical optimizations to compress quantum operations, incorporating efficient bitwise operations to further lower the computational complexity, and leveraging different levels of parallelism in modern multi-core processors, with a case study showing the effectiveness of the approach on max-cut problems.
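To illustrate the kind of bitwise indexing and shared-memory parallelism the abstract refers to, the sketch below applies a single-qubit gate to a state vector, pairing amplitudes that differ only in the target qubit via bit manipulation and parallelizing over the pairs with OpenMP. This is a minimal illustration, not the paper's implementation; the gate matrix and target qubit are assumed inputs.

```cpp
// Sketch: apply a 2x2 single-qubit gate to a state vector of n qubits.
// Bit tricks select amplitude pairs that differ only in the target qubit,
// and OpenMP parallelizes over the pairs. Illustrative only; not the
// paper's implementation.
#include <complex>
#include <cstdint>
#include <vector>
#include <omp.h>

using amp = std::complex<double>;

void apply_single_qubit_gate(std::vector<amp>& state, int target,
                             const amp g[2][2]) {
    const uint64_t n = state.size();        // n = 2^(number of qubits)
    const uint64_t bit = 1ULL << target;    // mask of the target qubit
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < static_cast<int64_t>(n / 2); ++i) {
        // Shift the upper bits of i left by one to leave a 0 at the target
        // position; setting that bit gives the partner amplitude.
        uint64_t lo = ((static_cast<uint64_t>(i) >> target) << (target + 1))
                      | (static_cast<uint64_t>(i) & (bit - 1));
        uint64_t hi = lo | bit;
        amp a0 = state[lo], a1 = state[hi];
        state[lo] = g[0][0] * a0 + g[0][1] * a1;
        state[hi] = g[1][0] * a0 + g[1][1] * a1;
    }
}
```

For max-cut instances, the QAOA cost layer is diagonal in the computational basis, so it can be applied in a single phase-rotation pass over the amplitudes; this is the sort of structure that mathematical compression of operations can exploit.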
Simulation is a common approach for assisting system design and optimization. For system-wide optimization, energy and computational resources are often the two most critical limitations. Modeling the energy states of each hardware component and the time spent in each state is needed for accurate energy and performance prediction, and tracking software execution in a realistic operating environment with properly modeled input/output is key to accurate prediction. However, conventional approaches can have difficulties in practice. First, for a complex system such as an Android smartphone, building a cycle-accurate simulation environment is no easy task. Second, for I/O-intensive applications, a slow simulation would significantly alter the application behavior and change its performance profile. Third, conventional software profiling tools generally do not work on simulators, which makes performance analysis of complicated software difficult, e.g., Java applications executed by the Dalvik virtual machine. Recently, virtual machine technologies have been widely used to emulate a variety of computer systems. While virtual machines do not model the hardware components of the emulated system, we can ease the effort of building a simulation environment by leveraging the virtual machine infrastructure and adding performance and power models. Moreover, multiple sets of performance and energy models can be selectively used to verify whether the speed of the simulated system affects the software behavior. Finally, performance monitoring facilities can be integrated to work with profiling tools. We believe this approach helps overcome the aforementioned difficulties. We have prototyped such a framework, and our case studies showed that the information provided by our tools is useful for software optimization and system design for Android smartphones.
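As a rough illustration of the state-based energy modeling the abstract calls for, the sketch below sums power times residency over the power states of a component. The state names and numbers are placeholders that would come from measurements of the actual platform; this is not the prototype framework's code.

```cpp
// Minimal sketch of a state-based energy model: each component reports the
// time it spent in each power state, and the estimated energy is the sum of
// power x residency over all states. The states and power numbers below are
// hypothetical placeholders, not measured values.
#include <cstdio>
#include <map>
#include <string>

struct ComponentModel {
    std::map<std::string, double> power_mw;     // state -> average power (mW)
    std::map<std::string, double> residency_s;  // state -> time in state (s)

    double energy_mj() const {                  // energy in millijoules
        double e = 0.0;
        for (const auto& [state, p] : power_mw) {
            auto it = residency_s.find(state);
            if (it != residency_s.end()) e += p * it->second;
        }
        return e;
    }
};

int main() {
    ComponentModel wifi{{{"idle", 3.0}, {"active", 250.0}},   // placeholder
                        {{"idle", 58.5}, {"active", 1.5}}};   // placeholder
    std::printf("estimated Wi-Fi energy: %.1f mJ\n", wifi.energy_mj());
    return 0;
}
```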
Monte Carlo methods are often used to solve computational problems with randomness. Random sampling avoids deterministic results, but intensive computation is required to obtain accurate estimates. Several attempts have been made to boost the performance of Monte Carlo based algorithms by taking advantage of parallel computers. In this paper, we use the photonic simulation application MCML as a case study to 1) parallelize the Monte Carlo method with OpenMP and vectorization, 2) compare the parallelization techniques, and 3) evaluate the parallelized programs on platforms with the Xeon Phi processor. In particular, the OpenMP version incorporates a vectorization technique that utilizes the AVX-512 vector instructions on the Xeon Phi processor. Our experimental results show that the OpenMP code achieves up to a 345x speedup on the Xeon Phi processor compared with the original code running on the Xeon E5 processor.
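The toy kernel below illustrates the OpenMP-plus-SIMD pattern described in the abstract, applied to averaging photon free-path lengths sampled as -ln(u)/mu_t, a standard step-size rule in photon-transport codes. It is not the MCML code itself; the batch size, photon count, and interaction coefficient are arbitrary.

```cpp
// Sketch of the OpenMP + SIMD pattern: per-thread random-number generation in
// scalar code, then a vectorizable inner loop over a batch of uniforms.
// Illustrative only; not the parallelized MCML program from the paper.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>
#include <omp.h>

int main() {
    const int    batch     = 4096;                  // uniforms per batch
    const long   n_batches = 25'000;
    const long   n_photons = n_batches * (long)batch;
    const double mu_t      = 10.0;                  // total interaction coefficient
    double sum = 0.0;

    #pragma omp parallel reduction(+ : sum)
    {
        std::mt19937_64 rng(12345 + omp_get_thread_num());   // per-thread RNG
        std::uniform_real_distribution<double> uni(1e-12, 1.0);
        std::vector<double> u(batch);

        #pragma omp for schedule(static)
        for (long b = 0; b < n_batches; ++b) {
            for (int i = 0; i < batch; ++i) u[i] = uni(rng);  // scalar RNG
            double local = 0.0;
            #pragma omp simd reduction(+ : local)             // vectorized loop
            for (int i = 0; i < batch; ++i) local += -std::log(u[i]) / mu_t;
            sum += local;
        }
    }
    std::printf("mean free path ~ %f\n", sum / n_photons);
    return 0;
}
```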
The time-domain modeling of large-scale cosite interference problems, which arise in wireless communication channel analysis and design studies, is discussed. It is pointed out that using the conventional finite-difference time-domain (FDTD) method for such problems typically results in long, computationally burdensome simulations that limit the ability of this technique to provide an efficient CAD-oriented tool for commercial and military applications. To accelerate these simulations, the use of wavelet-based time-domain solvers along with parallelization techniques is proposed.
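For context, the sketch below shows a conventional 1-D FDTD (Yee leapfrog) update: every cell is updated at every time step, and the time step is tied to the cell size by the Courant limit, which is where the computational burden noted above comes from. The grid size, material constants, and source are placeholders, and the wavelet-based solvers proposed in the paper are not shown.

```cpp
// Conventional 1-D FDTD update loop (free space), for illustration of the
// per-cell, per-time-step cost only. Grid size, constants, and the source
// term are arbitrary placeholders.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int    nx = 2000, nt = 5000;
    const double c0 = 3.0e8, dx = 1.0e-3;
    const double dt = 0.5 * dx / c0;             // Courant-limited time step
    const double eps0 = 8.854e-12, mu0 = 1.2566e-6;
    std::vector<double> Ez(nx, 0.0), Hy(nx, 0.0);

    for (int t = 0; t < nt; ++t) {
        for (int i = 0; i < nx - 1; ++i)         // update magnetic field
            Hy[i] += dt / (mu0 * dx) * (Ez[i + 1] - Ez[i]);
        for (int i = 1; i < nx; ++i)             // update electric field
            Ez[i] += dt / (eps0 * dx) * (Hy[i] - Hy[i - 1]);
        Ez[nx / 2] += std::sin(2.0e9 * t * dt);  // hypothetical soft source
    }
    std::printf("Ez at probe: %g\n", Ez[nx / 4]);
    return 0;
}
```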
We have developed a hierarchical performance bounding methodology that attempts to explain the performance of loop-dominated scientific applications on particular systems. The Kendall Square Research KSR1 is used as a running example. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory port, floating-point, instruction issue, and a loop-carried dependence pseudo-unit. We propose a workload characterization and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with these bounds focuses attention on areas for improvement and indicates how much improvement might be attainable. We delineate a comprehensive approach to modeling and improving application performance on the KSR1. Application of this approach is being automated for the KSR1 with a series of tools including K-MA and K-MACSTAT (which enable the calculation of the MACS hierarchy of performance bounds), K-Trace (which allows parallel code to be instrumented to produce a memory reference trace), and K-Cache (which simulates inter-cache communications based on a memory reference trace).
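The general idea behind such throughput-based bounds can be sketched as follows: divide the work the loop assigns to each modeled unit by that unit's peak throughput, and take the largest quotient as a lower bound on time per iteration, hence an upper bound on performance. The unit names and numbers below are hypothetical, not KSR1 measurements, and the sketch does not reproduce the MACS hierarchy itself.

```cpp
// Minimal sketch of a throughput-based performance bound: the most heavily
// loaded unit determines a lower bound on cycles per iteration. Workload and
// throughput numbers are hypothetical placeholders.
#include <cstdio>

struct Unit {
    const char* name;
    double ops_per_iter;     // operations the workload maps to this unit
    double peak_ops_per_cy;  // unit's peak throughput (ops per cycle)
};

int main() {
    Unit units[] = {
        {"memory port",       3.0, 1.0},   // placeholder workload characterization
        {"floating-point",    2.0, 1.0},
        {"instruction issue", 8.0, 2.0},
    };
    double bound_cycles = 0.0;
    const char* bottleneck = "";
    for (const Unit& u : units) {
        double cy = u.ops_per_iter / u.peak_ops_per_cy;
        if (cy > bound_cycles) { bound_cycles = cy; bottleneck = u.name; }
    }
    std::printf("bound: %.2f cycles/iteration, limited by %s\n",
                bound_cycles, bottleneck);
    return 0;
}
```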
Machine-to-machine (M2M) communications have recently been employed in various application domains, including smart homes, surveillance, remote control, and healthcare, where heterogeneous devices interact with each other via heterogeneous networks. The architecture of an M2M system is critical to its cost and performance, especially for applications with strict real-time requirements. In addition, energy consumption is important for devices powered by batteries. Given the heterogeneity of devices and networks, developing and evaluating the cost, performance, and energy consumption of an M2M system can be very challenging, as it requires the developer to deal with issues that do not exist in conventional systems. This chapter introduces a framework for evaluating the performance of M2M systems via simulation. The framework enables the user to quickly model an M2M system by running the M2M application over virtual machines and virtual network devices. The timing models in the virtual machines and the virtual network devices are design parameters taken from the actual system. The simulation results reveal the details of execution and estimate the energy consumption of the M2M system. As illustrated by the case studies in this chapter, such a tool lets the developer explore the design space in search of cost-effective or energy-efficient designs.
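As an example of the kind of parameterized timing and energy model a virtual network device might carry, the sketch below computes per-message latency as transmission plus propagation delay and radio energy as transmit power times air time. The link parameters are hypothetical placeholders, not values taken from the framework described in the chapter.

```cpp
// Sketch of a parameterized link timing/energy model for a virtual network
// device. The parameters are hypothetical design inputs, not measured values.
#include <cstdio>

struct LinkModel {
    double bandwidth_bps;   // link bandwidth (bits per second)
    double propagation_s;   // one-way propagation delay (seconds)
    double tx_power_w;      // radio transmit power (watts)

    double latency_s(double bits) const {
        return bits / bandwidth_bps + propagation_s;   // transmission + propagation
    }
    double tx_energy_j(double bits) const {
        return tx_power_w * (bits / bandwidth_bps);    // power x air time
    }
};

int main() {
    LinkModel link{250'000.0, 5e-6, 0.03};   // placeholder 802.15.4-like link
    double bits = 1024 * 8;                  // a 1 KiB sensor report
    std::printf("latency: %.3f ms, energy: %.3f mJ\n",
                link.latency_s(bits) * 1e3, link.tx_energy_j(bits) * 1e3);
    return 0;
}
```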
Understanding program behavior and data dependencies is important when designing and accelerating applications. However, conventional profiling tools are insufficient for tracking the functions and loops of programs due to compiler optimizations and probe effects. To minimize probe effects, virtual platforms with timing simulation are used to monitor the profiled program, which also provides the flexibility to evaluate future platforms. Nevertheless, the profiling information is not collected at the function or loop level for programmers to analyze and to discover performance issues. This paper proposes a stack-pointer-based method with a later loop-entry detection scheme to overcome the difficulties of detecting functions and loops in programs running on a virtual platform. With detailed performance counters and memory access patterns recorded along with the loop-call context tree, this paper also presents a framework that collects traces for detailed analysis of both the control flow and the data flow of a program. The experimental results demonstrate the ability of the developed tool to collect and profile a program in loop-call context tree form and to enable further analysis of thread-level parallelism and data dependencies between functions and loops.
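One way to picture a stack-pointer-based scheme (illustrative only, not the paper's exact method) is a shadow call stack driven by events from the virtual platform: each observed call records the callee address and the stack pointer, and any frame whose recorded stack pointer lies below the current one is popped, so function exits are recognized even when the matching return instruction is not directly observed.

```cpp
// Illustrative shadow call stack maintained from instruction events reported
// by a virtual platform. Because returns, tail calls, or longjmp may be
// missed, frames are retired whenever the current stack pointer rises above
// the stack pointer recorded at call time. Not the paper's exact scheme.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Frame {
    uint64_t entry_pc;  // address of the function entered
    uint64_t sp;        // stack pointer right after the call
};

class ShadowStack {
    std::vector<Frame> frames_;
public:
    void on_call(uint64_t callee_pc, uint64_t sp) {
        frames_.push_back({callee_pc, sp});
    }
    // Called on every simulated instruction (or basic block) with current SP.
    void on_step(uint64_t sp) {
        while (!frames_.empty() && sp > frames_.back().sp) {
            std::printf("exit function at 0x%llx\n",
                        (unsigned long long)frames_.back().entry_pc);
            frames_.pop_back();   // SP moved above the frame: function exited
        }
    }
    std::size_t depth() const { return frames_.size(); }
};
```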