Proving the equivalence between SQL queries is a fundamental problem in database research. Existing solvers model queries using algebraic representations and convert such representations into first-order logic formulas so that query equivalence can be verified by solving a satisfiability problem. The main challenge lies in "unbounded summations", which commonly appear in a query's algebraic representation to model SQL features such as projection and aggregate functions. Unfortunately, existing solvers handle unbounded summations in an ad-hoc manner based on heuristics or syntactic comparison, which severely limits the set of queries they can support. This paper develops a new SQL equivalence prover called SQLSolver, which handles unbounded summations in a principled way. Our key insight is to use the theory of LIA^*, which extends linear integer arithmetic formulas with unbounded sums and provides algorithms to translate an LIA^* formula into a LIA formula that can be decided by existing SMT solvers. We augment the basic LIA^* theory to handle several complex scenarios (such as nested unbounded summations) that arise from modeling real-world queries. We evaluate SQLSolver with 359 equivalent query pairs derived from the SQL rewrite rules in Calcite and Spark SQL. SQLSolver successfully proves 346 of them, significantly outperforming existing provers.
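To make the satisfiability reduction concrete, here is a toy sketch using the z3 SMT solver's Python bindings. It covers only filter-only queries and is not SQLSolver's actual U-expression/LIA^* pipeline; the table and column names are illustrative. Equivalence is checked by asking whether any tuple can make the two WHERE clauses disagree:

```python
# Minimal sketch: reducing query equivalence to SMT unsatisfiability.
# Requires: pip install z3-solver
from z3 import Int, And, Not, Solver, unsat

x = Int("x")  # a column value of some tuple in table t

# WHERE clauses of the two candidate queries:
#   Q1: SELECT * FROM t WHERE x > 0 AND x > 1
#   Q2: SELECT * FROM t WHERE x > 1
q1 = And(x > 0, x > 1)
q2 = x > 1

# The queries are equivalent iff no tuple makes the predicates disagree.
s = Solver()
s.add(Not(q1 == q2))
print("equivalent" if s.check() == unsat else "not equivalent")
```

SQLSolver's contribution is precisely what this sketch omits: encoding operators such as projection and aggregation introduces unbounded summations over tuples, which the augmented LIA^* theory then reduces to plain LIA.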
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine for a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying PowerInfer's design is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. The evaluation shows that PowerInfer outperforms llama.cpp by up to 11.69× while retaining model accuracy across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance comparable to that of a high-end server-grade A100 GPU, reaching 82% of its token generation rate on a single consumer-grade RTX 4090 GPU.
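A minimal sketch of the hot/cold split, assuming activation frequencies gathered by offline profiling; the names (`freq`, `gpu_budget`, `forward`) are illustrative, not PowerInfer's API:

```python
# Toy sketch of the hot/cold neuron split. Neurons whose offline activation
# frequency is highest are pinned to the GPU; the rest stay on the CPU.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 10_000
# Power-law-like activation frequencies observed during offline profiling.
freq = rng.pareto(a=2.0, size=n_neurons)

gpu_budget = 1_000                       # how many neurons fit in GPU memory
hot = np.argsort(freq)[-gpu_budget:]     # most frequently activated neurons
cold = np.setdiff1d(np.arange(n_neurons), hot)

def forward(x, weights, predicted_active):
    """Compute only the rows the activation predictor marks as active,
    routing hot rows to the GPU partition and cold rows to the CPU one."""
    act_hot = np.intersect1d(predicted_active, hot)    # GPU partition
    act_cold = np.intersect1d(predicted_active, cold)  # CPU partition
    out = np.zeros(weights.shape[0])
    for part in (act_hot, act_cold):    # placeholder for the two devices
        out[part] = weights[part] @ x
    return out

W = rng.standard_normal((n_neurons, 64))
x = rng.standard_normal(64)
y = forward(x, W, predicted_active=rng.choice(n_neurons, size=500, replace=False))
```

The point of the split is that the small hot partition serves most activations from fast GPU memory, while the rarely touched cold partition never has to be transferred to the GPU at all.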
This paper presents POLUS, a software maintenance tool capable of iteratively evolving running software into newer versions. POLUS's primary goal is to increase the dependability of contemporary server software, which is frequently disrupted either by external attacks or by scheduled upgrades. To make POLUS both practical and powerful, we design and implement it to retain backward binary compatibility, support multithreaded software, and recover already-tainted state of running software, while providing good usability and very low runtime overhead. To demonstrate the applicability of POLUS, we report our experience using it to dynamically update three prevalent server applications: vsftpd, sshd, and the Apache HTTP server. Performance measurements show that POLUS incurs negligible runtime overhead: less than 1% performance degradation (5% in one case). The time to apply an update is also minimal.
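As a loose analogy only (POLUS itself patches native binaries in place rather than Python objects), the sketch below shows the two ingredients the abstract describes: installing a new code version at a quiescent point and running a state-repair hook for already-tainted state. All names here are hypothetical:

```python
# Illustrative sketch of dynamic update via an indirection table swapped
# atomically at a safe point, plus a state-repair hook.
import threading

class UpdatableService:
    def __init__(self, handler):
        self._handler = handler          # current version of the code
        self._lock = threading.Lock()    # safe point: no request mid-flight

    def handle(self, request):
        with self._lock:
            return self._handler(request)

    def update(self, new_handler, repair_state=None):
        """Install a new version; optionally run a state-repair function
        to fix state corrupted by the old (buggy) version."""
        with self._lock:                 # wait until a quiescent point
            if repair_state:
                repair_state(self)
            self._handler = new_handler

svc = UpdatableService(lambda r: r.upper())        # v1
svc.update(lambda r: r.lower())                    # v2, installed without restart
print(svc.handle("Request"))                       # "request"
```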
IME (input method editor) apps are the primary means of interaction on mobile touch screen devices and thus are usually granted with access to a wealth of private user input. In order to understand the (in)security of mobile IME apps, this paper first performs a systematic study and uncovers that many IME apps may (intentionally or unintentionally) leak users' sensitive data to the outside world (mainly due to the incentives of improving the user's experience). To thwart the threat of sensitive information leakage while retaining the benefits of an improved user experience, this paper then proposes I-BOX, an app-transparent oblivious sandbox that minimizes sensitive input leakage by confining untrusted IME apps to predefined security policies. Several key challenges have to be addressed due to the proprietary and closed-source nature of most IME apps and the fact that an IME app can arbitrarily store and transform user input before sending it out. By designing system-level transactional execution, I-BOX works seamlessly and transparently with IME apps. Specifically, I-BOX first checkpoints an IME app's state before the first keystroke of an input, monitors and analyzes the user's input, and rolls back the state to the checkpoint if it detects the potential danger that sensitive input may be leaked. A proof of concept I-BOX prototype has been built for Android and tested with a set of popular IME apps. Experimental results show that I-BOX is able to thwart the leakage of sensitive input for untrusted IME apps, while incurring very small runtime overhead and little impact on user experience.
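The checkpoint/monitor/rollback loop can be illustrated with a toy model (heavily simplified: I-BOX checkpoints a real Android IME process at the system level, and the sensitive-data policy shown here is a stand-in):

```python
# Toy model of a checkpoint/monitor/rollback cycle around an input session.
import copy
import re

SENSITIVE = re.compile(r"\d{16}")        # stand-in policy: a 16-digit card number

class ImeSandbox:
    def __init__(self, ime_state):
        self.state = ime_state
        self._checkpoint = None

    def begin_input(self):
        # Checkpoint the IME's state before the first keystroke.
        self._checkpoint = copy.deepcopy(self.state)

    def on_keystrokes(self, typed_text):
        self.state["buffer"] = self.state.get("buffer", "") + typed_text
        if SENSITIVE.search(self.state["buffer"]):
            self.state = self._checkpoint   # roll back: the IME never keeps the input
            return "rolled back"
        return "ok"

box = ImeSandbox({"buffer": ""})
box.begin_input()
print(box.on_keystrokes("hello"))               # ok
print(box.on_keystrokes("1234123412341234"))    # rolled back
```

Because the rollback restores the pre-input checkpoint, whatever the untrusted IME stored or transformed during the sensitive input session is discarded wholesale, without needing to understand its internal format.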
The confidentiality of tenants' data faces high risk from hardware attacks and privileged malicious software. Hardware-based memory encryption is one of the promising means of providing strong guarantees of data security. AMD has recently introduced its new memory encryption hardware, SME and SEV, which can selectively encrypt memory regions in a fine-grained manner, e.g., by setting the C-bits in the page table entries. More importantly, SEV further supports encrypted virtual machines. This, intuitively, provides a new opportunity to protect data confidentiality in guest VMs against an untrusted hypervisor in the cloud environment. In this paper, we first provide a security analysis of the (in)security of SEV and uncover a set of security issues with using SEV as a means to defend against an untrusted hypervisor. Based on the study, we then propose a software-based extension to the SEV feature, namely Fidelius, to address those issues while retaining performance efficiency. Fidelius separates the management of critical resources from service provisioning and revokes the untrusted hypervisor's permission to access specific resources. By adopting a sibling-based protection mechanism with non-bypassable memory isolation, Fidelius embraces both security and efficiency, as it introduces no new layer of abstraction. Meanwhile, Fidelius reuses the SEV API to provide full VM life-cycle protection, including two sets of para-virtualized I/O interfaces to encode the I/O data, which is not considered in the SEV hardware design. A detailed and quantitative security analysis shows its effectiveness in protecting tenants' data from a variety of attack surfaces, and the performance evaluation confirms the performance efficiency of Fidelius.
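For illustration, marking a page as encrypted under SME/SEV amounts to setting the C-bit in its page-table entry. The sketch below assumes the C-bit is bit 47, as on early EPYC parts; the actual position is CPU-specific and reported via CPUID function 0x8000001F:

```python
# Sketch: marking a guest page as encrypted by setting the SEV C-bit in a
# page-table entry. Bit position 47 is an assumption for illustration.
C_BIT = 1 << 47

def encrypt_pte(pte: int) -> int:
    """Return the PTE with the C-bit set, so hardware encrypts the page."""
    return pte | C_BIT

def is_encrypted(pte: int) -> bool:
    return bool(pte & C_BIT)

pte = 0x0000_0001_2345_6003   # physical frame plus present/writable flags
print(hex(encrypt_pte(pte)), is_encrypted(encrypt_pte(pte)))
```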
Natural graphs with skewed distributions raise unique challenges to distributed graph computation and partitioning. Existing graph-parallel systems usually use a "one-size-fits-all" design that uniformly processes all vertices, and consequently either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that skewed distributions in natural graphs also necessitate differentiated processing of high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communication and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (i.e., hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve the cache locality of inter-node graph accesses, PowerLyra also provides a locality-conscious data layout optimization. PowerLyra is implemented based on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (at least 1.24X) on real-world graphs and up to 3.26X (at least 1.49X) on synthetic graphs, and is much faster than other systems like GraphX and Giraph while consuming much less memory. Porting hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
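A minimal sketch of the hybrid-cut idea (simplified from PowerLyra's implementation; the threshold value and hash placement are illustrative): in-edges of a low-degree vertex are placed together by hashing the target, while in-edges of a high-degree vertex are spread across partitions by hashing the source:

```python
# Sketch of hybrid-cut: edge-cut-style placement for low-degree targets,
# vertex-cut-style placement for high-degree targets.
def hybrid_cut(edges, in_degree, num_parts, threshold=100):
    """edges: iterable of (src, dst); in_degree: dict mapping dst -> in-degree."""
    assignment = {}
    for src, dst in edges:
        if in_degree[dst] <= threshold:          # low-degree: keep in-edges together
            assignment[(src, dst)] = hash(dst) % num_parts
        else:                                    # high-degree: spread by source
            assignment[(src, dst)] = hash(src) % num_parts
    return assignment

edges = [(0, 1), (2, 1), (3, 4), (5, 4)]
print(hybrid_cut(edges, in_degree={1: 2, 4: 2}, num_parts=2, threshold=1))
```

Keeping a low-degree vertex's edges on one partition avoids the replication overhead of vertex-cut, while hashing a high-degree vertex's edges by source spreads its heavy workload, which is exactly the differentiated treatment the abstract argues for.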
Hardware-assisted enclaves with memory encryption have been widely adopted in prevailing architectures, e.g., Intel SGX/TDX, AMD SEV, and ARM CCA. However, existing enclave designs fall short in supporting efficient cooperation among cross-node enclaves (i.e., across machines), because the range of hardware memory protection is limited to a single node. A naive approach is to leverage cryptography at the application level and transfer data between nodes through secure channels (e.g., SSL). However, this incurs orders-of-magnitude overhead due to expensive encryption/decryption, especially for distributed applications with large data transfers, e.g., MapReduce and graph computing. A secure and efficient mechanism for distributed secure memory is necessary but still missing. This paper proposes the Migratable Merkle Tree (MMT), a design enabling efficient distributed secure memory to support distributed confidential computing. MMT sets up an integrity forest for distributed memory across multiple nodes. It allows an enclave to securely delegate an MMT closure, which contains both the data and the metadata of a subtree, to a remote enclave. By reusing the memory encryption mechanisms of existing enclaves, our design achieves secure data transfer without software re-encryption. We have implemented a prototype of MMT and a trusted firmware for management, and further applied MMT to real-world distributed applications. The evaluation results show that, compared with existing systems using the AES-NI instructions, MMT achieves up to a 13x speedup in data transfer and a 12%-58% improvement in the end-to-end performance of MapReduce and PageRank.
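As a software analogy of the closure idea (MMT itself operates on hardware integrity trees over encrypted memory, and a real closure covers a whole subtree), the sketch below builds a Merkle tree and extracts, for one leaf, exactly the sibling hashes a remote party needs in order to re-verify the leaf against the root, without touching the rest of the tree:

```python
# Toy Merkle tree: build, extract an authentication path, verify remotely.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build(leaves):
    """Return the list of levels, leaves first, root level last."""
    levels = [[h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def closure(levels, leaf_idx):
    """Sibling hashes from leaf to root: the metadata needed to re-verify."""
    path, idx = [], leaf_idx
    for level in levels[:-1]:
        path.append(level[idx ^ 1])   # sibling at this level
        idx //= 2
    return path

def verify(leaf, leaf_idx, path, root):
    digest, idx = h(leaf), leaf_idx
    for sibling in path:
        digest = h(digest, sibling) if idx % 2 == 0 else h(sibling, digest)
        idx //= 2
    return digest == root

leaves = [b"page0", b"page1", b"page2", b"page3"]
levels = build(leaves)
root = levels[-1][0]
print(verify(b"page1", 1, closure(levels, 1), root))  # True
```

The key property mirrored here is that delegation ships data plus just enough integrity metadata; since both enclaves already share hardware memory encryption, no software re-encryption of the payload is required.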
A crash dump, or core dump, is the typical way to save a memory image on a system crash for future offline debugging and analysis. However, for typical server machines with abundant memory, the time of core dump can significantly increase the mean time to repair (MTTR) by delaying the reboot-based recovery, while not dumping the failure context for analysis risks recurring crashes on the same problems. In this paper, we propose several optimization techniques for core dump in virtualized environments, in order to shorten the MTTR of consolidated virtual machines during crashes. First, we parallelize the crash dump with the rebooting of the crashed VM, by dynamically reclaiming and allocating memory between the crashed VM and the newly spawned VM. Second, we use the virtual machine management layer to introspect the critical data structures of the crashed VM and filter unused memory out of the dump. Finally, we implement disk I/O rate control between the core dump and the newly spawned VM according to a user-tuned rate control policy, balancing the time of the crash dump against the quality of service of the recovery VM. We have implemented a working prototype, Vicover, that optimizes core dump on the system crash of a virtual machine in Xen, to minimize the MTTR of core dump and recovery as a whole. In our experiment on a virtualized TPC-W server, Vicover shortens the downtime caused by the crash dump by around 5X.
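The third technique, rate control between the dump and the recovery VM, can be sketched as a token bucket; the class and parameters below are illustrative, and Vicover's actual policy is enforced in the virtualization layer:

```python
# Sketch: capping the dump's disk bandwidth so the recovery VM keeps its QoS.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate, self.capacity = rate_bytes_per_s, burst_bytes
        self.tokens, self.last = burst_bytes, time.monotonic()

    def throttle(self, nbytes):
        """Block until `nbytes` of dump I/O is allowed under the policy."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

bucket = TokenBucket(rate_bytes_per_s=50 * 2**20, burst_bytes=8 * 2**20)  # 50 MiB/s
for chunk in (b"\0" * 2**20 for _ in range(4)):   # dump 4 MiB in 1 MiB chunks
    bucket.throttle(len(chunk))                   # cap dump bandwidth
    # writing `chunk` to the dump file would go here
```

Tuning the rate trades dump completion time against the disk bandwidth left for the newly spawned VM, which is exactly the balance the policy exposes to the user.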
Trusted Execution Environments (TEEs), such as Intel SGX/TDX, AMD SEV-SNP, and ARM TrustZone/CCA, have been widely adopted in prevailing architectures. However, these TEEs typically do not treat I/O isolation (e.g., defending against malicious DMA requests) as a first-class citizen, which may degrade I/O performance. Traditional methods, such as using the IOMMU or software I/O, can degrade throughput by at least 20% for I/O-intensive workloads. The main reason is that the isolation requirements for I/O devices differ from those for CPUs. This paper proposes a novel I/O isolation mechanism for TEEs, named sIOPMP (scalable I/O Physical Memory Protection), with three key features. First, we design a multi-stage tree-based checker that supports more than 1,000 hardware regions. Second, we classify devices into hot and cold, and support an unlimited number of devices with mountable entries. Third, we propose a remapping mechanism that switches devices between hot and cold status for dynamic I/O workloads. Evaluation results show that sIOPMP introduces only negligible performance overhead for both benchmarks and real-world workloads, and improves network throughput by 20%-38% compared with the IOMMU-based mechanisms or software I/O adopted in TEEs.
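A rough software model of a multi-stage region check (sIOPMP's checker is implemented in hardware; the entry format and device sets below are assumptions for illustration): a coarse sorted stage narrows the lookup before a fine-grained entry decides the DMA request:

```python
# Sketch: two-stage lookup for DMA permission checks over many regions.
import bisect

class RegionChecker:
    """Top stage: sorted, coarse region bases. Leaf stage: fine-grained
    entries, so thousands of regions are checked without a linear scan."""
    def __init__(self, regions):
        # regions: sorted list of (base, size, allowed_device_ids)
        self.bases = [base for base, _, _ in regions]
        self.regions = regions

    def dma_allowed(self, device_id, addr, length):
        i = bisect.bisect_right(self.bases, addr) - 1   # coarse stage
        if i < 0:
            return False
        base, size, devices = self.regions[i]           # fine-grained entry
        return addr + length <= base + size and device_id in devices

checker = RegionChecker([(0x1000, 0x1000, {7}), (0x8000, 0x4000, {7, 9})])
print(checker.dma_allowed(7, 0x1800, 0x100))   # True
print(checker.dma_allowed(9, 0x1800, 0x100))   # False: device 9 not allowed
```

In this model, the hot/cold classification would correspond to which devices' entries are resident in the checker at any time, with cold devices' entries mounted on demand by the remapping mechanism.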