This research presents a systematic methodology for producing accurate power models for single instruction set architecture (ISA) heterogeneous processors. We use hardware event counters from the processor's performance monitoring unit (PMU) to accurately capture CPU state, and ordinary least squares (OLS) regression, assisted by automated event selection algorithms, to compute the power models. Several estimators for single-thread and multi-thread benchmarks are proposed, capable of performing power predictions across different frequency levels of one processor, as well as between the heterogeneous processors, with less than 3% error. The models are compared to related work, showing a significant improvement in accuracy and good computational efficiency, which makes them suitable for run-time deployment.
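As a minimal sketch of the core regression step, assuming synthetic data and illustrative event names (not the paper's actual selected predictors), an OLS power model over PMU event rates can be fitted as follows:

```python
import numpy as np

# Hypothetical training data: each row holds PMU event rates sampled
# over one benchmark interval (events/cycle); columns are illustrative.
events = np.array([
    # instr_retired  l2_refills  bus_accesses
    [0.82, 0.010, 0.031],
    [1.10, 0.004, 0.012],
    [0.45, 0.025, 0.060],
    [0.95, 0.008, 0.020],
])
measured_power = np.array([1.91, 2.10, 1.52, 1.98])  # Watts (invented)

# Ordinary least squares: power ~ b0 + b1*e1 + b2*e2 + b3*e3.
X = np.hstack([np.ones((events.shape[0], 1)), events])  # add intercept
coeffs, *_ = np.linalg.lstsq(X, measured_power, rcond=None)

def predict_power(event_rates):
    """Estimate power (W) from a vector of PMU event rates."""
    return coeffs[0] + coeffs[1:] @ np.asarray(event_rates)

print(predict_power([0.9, 0.01, 0.03]))
```

In the paper's methodology the choice of event columns is itself automated; here the predictors are simply fixed by hand to keep the sketch short.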
In this paper, we investigate the application of early-exit strategies to quantized neural networks with binarized weights, mapped to low-cost FPGA SoC devices. The increasing complexity of network models means that hardware reuse and heterogeneous execution are needed, and this opens the opportunity to evaluate the prediction confidence level early on. We apply the early-exit strategy to a network model suitable for ImageNet classification that combines weights with floating-point and binary arithmetic precision. The experiments show an improvement in inference speed of around 20% using an early-exit network, compared with using a single primary neural network, with a negligible accuracy drop of 1.56%.
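A minimal sketch of the early-exit decision follows, with a hypothetical confidence threshold and placeholder sub-networks; the paper's actual exit criterion and network structure may differ:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # assumed value; tuned per deployment

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def classify(image, early_stage, full_network):
    """Run the cheap early-exit branch first; fall back to the full
    network only when the early prediction is not confident enough.
    `early_stage` and `full_network` are placeholders for the binarized
    sub-network and the complete model."""
    probs = softmax(early_stage(image))
    if probs.max() >= CONFIDENCE_THRESHOLD:
        return int(probs.argmax())          # confident: exit early
    return int(softmax(full_network(image)).argmax())

# Toy usage with stand-in models producing random logits.
rng = np.random.default_rng(0)
print(classify(None, lambda x: rng.normal(size=10),
               lambda x: rng.normal(size=10)))
```

The speed-up comes from how often the threshold test succeeds: every early exit skips the expensive full-precision stages entirely.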
Non-functional properties, such as energy, time, and security (ETS), are becoming increasingly important for the programming of Cyber-Physical Systems (CPS). This paper describes TeamPlay, a research project funded under the EU Horizon 2020 programme between January 2018 and June 2021. TeamPlay aimed to provide the system designer with a toolchain for developing embedded applications where ETS properties are first-class citizens, allowing the developer to reflect directly on energy, time, and security properties at the source code level. In this paper we give an overview of the TeamPlay methodology, introduce the challenges and solutions of our approach, and summarise the results achieved. Overall, applying the TeamPlay methodology led to improvements of up to 18% in performance and 52% in energy usage over traditional approaches.
This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad-core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources, and all compute units run asynchronously, which allows for improved performance on irregular workloads. ENEAC achieves up to a 17% performance improvement when using all platform resources compared to using just the FPGA accelerators, and up to an 865% performance increase compared to using just the CPU. The workflow uses existing commercial tools and C/C++ as a single programming language for both accelerator design and CPU programming, for improved productivity and ease of verification.
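The following is a minimal sketch of the shared-queue dispatch idea, where idle compute units pull the next available task; all worker names and task counts are purely illustrative, and ENEAC's actual scheduler and FPGA interface are not shown:

```python
import queue
import threading

tasks = queue.Queue()

def worker(name, run_task):
    # Each compute unit pulls the next task as soon as it becomes free,
    # so faster units naturally absorb more of an irregular workload.
    while True:
        task = tasks.get()
        if task is None:              # sentinel: stop this worker
            break
        run_task(name, task)

def run_on_cpu(name, task):
    print(f"{name} ran task {task}")

def run_on_fpga(name, task):          # stand-in for a real accelerator call
    print(f"{name} ran task {task}")

units = [threading.Thread(target=worker, args=(f"cpu{i}", run_on_cpu))
         for i in range(4)]           # quad-core Cortex-A53
units.append(threading.Thread(target=worker, args=("fpga0", run_on_fpga)))
for u in units:
    u.start()
for t in range(20):                   # enqueue illustrative tasks
    tasks.put(t)
for _ in units:                       # one sentinel per compute unit
    tasks.put(None)
for u in units:
    u.join()
```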
Graph processing is an area that has received significant attention in recent years due to the substantial expansion of industries relying on data analytics. Alongside its vital role in finding relations in social networks, graph processing is also widely used in transportation to find optimal routes and in biological networks to analyse sequences. The main bottleneck in graph processing is irregular memory access rather than computational intensity. Since computational intensity is not the driving factor, we propose a method to perform graph processing more efficiently at the edge; current cloud computing solutions remain costly and suffer from latency issues. The results demonstrate the benefits of a dedicated sparse graph processing algorithm over dense graph processing when analysing data with low density. As graph datasets grow exponentially, traversal algorithms such as breadth-first search (BFS), fundamental to many graph processing applications and metrics, become more costly to compute. Our work reviews other implementations of breadth-first search designed for low-power systems and proposes a solution that applies a set of enhancements to achieve up to 9.2x better performance in terms of MTEPS than other state-of-the-art solutions, at a power usage of 2.32 W.
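As an illustration of the traversal at the heart of this work, a minimal BFS over a CSR (compressed sparse row) graph is sketched below; the edge count it returns is the quantity behind the MTEPS metric (millions of traversed edges per second, i.e. edges traversed divided by runtime in microseconds). The paper's optimised implementation is, of course, far more elaborate:

```python
from collections import deque

def bfs(row_ptr, col_idx, source):
    """Breadth-first search over a graph stored in CSR form, the layout
    typically used for sparse graph processing. Returns hop distances
    from `source` (-1 = unreachable) and the number of edges traversed."""
    n = len(row_ptr) - 1
    dist = [-1] * n
    dist[source] = 0
    frontier = deque([source])
    edges_traversed = 0
    while frontier:
        u = frontier.popleft()
        for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
            edges_traversed += 1
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist, edges_traversed

# Tiny example graph: 0->1, 0->2, 1->3, 2->3
row_ptr = [0, 2, 3, 4, 4]
col_idx = [1, 2, 3, 3]
print(bfs(row_ptr, col_idx, 0))   # ([0, 1, 1, 2], 4)
```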
This paper investigates the application to embedded GPUs of a robust CPU-based power modelling methodology that performs an automatic search of explanatory events derived from performance counters. A 64-bit Tegra TX1 SoC is configured with DVFS enabled, and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then compared with a simpler unified model that uses a single set of model coefficients for all frequency and voltage points of interest. To obtain this unified model, a number of experiments are conducted to extract information on idle, clock, and static power, so that power usage can be derived from a single reference equation. The results show that the unified model offers competitive accuracy, with an average 5% error using four explanatory variables on the test data set, and is capable of correctly predicting the impact of voltage, frequency, and temperature on power consumption. This model could be used to replace direct power measurements when these are unavailable due to hardware limitations, or for worst-case analysis in emulation platforms.
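A minimal sketch of what such a single reference equation could look like, assuming the classical CMOS decomposition into a static term plus dynamic terms scaling with V²f; the paper's exact functional form and coefficient values are not reproduced here:

```python
def unified_power(v, f_mhz, event_rates, coeffs, static_power, clock_coeff):
    """Illustrative unified model: static power plus dynamic terms that
    scale with V^2 * f, following classical CMOS power behaviour. This
    only shows how one coefficient set can cover all V/f operating
    points; it is not the paper's fitted equation."""
    dynamic = clock_coeff + sum(b * e for b, e in zip(coeffs, event_rates))
    return static_power + (v ** 2) * f_mhz * dynamic

# Hypothetical coefficients and a 0.9 V / 998 MHz operating point.
print(unified_power(v=0.9, f_mhz=998.0,
                    event_rates=[0.8, 0.02],   # e.g. inst/cycle, misses/cycle
                    coeffs=[1.2e-3, 9.0e-3],
                    static_power=0.35, clock_coeff=6.0e-4))
```

Because voltage and frequency appear explicitly in the equation, a single coefficient set can track power across every DVFS point instead of requiring one model per point.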
Energy modelling can enable energy-aware software development and assist the developer in meeting an application's energy budget. Although many energy models for embedded processors exist, most do not account for processor-specific configurations, nor are they suitable for static energy consumption estimation. This paper introduces a set of comprehensive energy models for Arm's Cortex-M0 processor, ready to support energy-aware development of edge computing applications using either profiling- or static-analysis-based energy consumption estimation. We use a commercially representative physical platform together with a custom-modified Instruction Set Simulator to obtain the physical data and system state markers used to generate the models. The models account for different processor configurations, which all have a significant impact on the execution time and energy consumption of edge computing applications. Unlike existing works, which target a very limited set of applications, all developed models are generated and validated using a wide range of benchmarks from a variety of emerging IoT application areas, including machine learning, and have a prediction error of less than 5%.
Heterogeneous processors, formed by binary-compatible CPU cores with different microarchitectures, enable energy reductions by better matching processing capabilities to software application requirements. This new hardware platform requires novel techniques to manage power and energy in order to fully utilize its capabilities, particularly regarding the mapping of workloads to appropriate cores. In this paper, we validate relevant published work on power modelling for heterogeneous systems and propose a new approach for developing run-time power models that uses a hybrid set of physical predictors, performance events, and CPU state information. We demonstrate the accuracy of this approach compared with the state of the art and its applicability to energy-aware scheduling. Our results are obtained on a commercially available platform built around the Samsung Exynos 5 Octa SoC, which features the ARM big.LITTLE heterogeneous architecture.
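As an illustration of how such run-time models support energy-aware scheduling, the hypothetical sketch below picks the cluster with the lowest predicted energy; the model functions, runtime estimates, and numbers are invented for the example and are not the paper's fitted models:

```python
def pick_cluster(power_models, predicted_runtime, features):
    """Choose the core cluster with the lowest predicted energy.
    `power_models` maps a cluster name to a callable returning predicted
    power (W); `predicted_runtime` maps it to an estimated execution
    time (s). Energy = predicted power * predicted time."""
    energy = {c: power_models[c](features) * predicted_runtime[c]
              for c in power_models}
    return min(energy, key=energy.get), energy

# Toy linear models keyed on instructions per cycle (ipc).
models = {"big":    lambda f: 1.2 + 2.0 * f["ipc"],
          "little": lambda f: 0.3 + 0.5 * f["ipc"]}
runtimes = {"big": 1.0, "little": 2.8}   # LITTLE is slower but frugal
print(pick_cluster(models, runtimes, {"ipc": 0.9}))
```

In this toy case the LITTLE cluster wins despite a 2.8x longer runtime, which is exactly the kind of trade-off a run-time power model lets a scheduler quantify.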
Energy modeling can enable energy-aware software development and assist the developer in meeting an application's energy budget. Although many energy models for embedded processors exist, most do not account for processor-specific configurations, nor are they suitable for static energy consumption estimation. This paper introduces a comprehensive energy model for Arm's Cortex-M0 processor, ready to support energy-aware development of edge computing applications using either profiling- or static-analysis-based energy consumption estimation. The model accounts for the Frequency, PreFetch, and WaitState processor configurations, which all have a significant impact on the execution time and energy consumption of edge computing applications. All model variants have a prediction error of less than 5%.
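A minimal sketch of how a static-analysis flow might consume such a per-configuration model; the power table and all numbers below are invented for illustration, not the paper's fitted model data:

```python
def cortex_m0_energy(cycles, f_hz, config_power_mw):
    """Static-analysis style estimate: energy (J) = average power for
    the active configuration * execution time, where time follows from
    a statically derived cycle count and the clock frequency."""
    return config_power_mw / 1000.0 * (cycles / f_hz)

# Hypothetical average-power table keyed by (frequency MHz, prefetch
# enabled, flash wait states); a real model would be fitted from
# hardware measurements for each configuration.
power_table_mw = {
    (48, True, 1):  14.0,
    (24, False, 2):  6.5,
}

cfg = (48, True, 1)
print(cortex_m0_energy(cycles=1_200_000, f_hz=cfg[0] * 1e6,
                       config_power_mw=power_table_mw[cfg]))
```

This structure makes the configuration dependence explicit: changing PreFetch or WaitState settings changes both the cycle count and the power entry, so the model captures their combined effect on energy.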