ASPLOS '21: 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Latest articles from this conference

Bang Di, Jiawen Liu, Hao Chen, Dong Li
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446744

Abstract:
Debugging persistent memory (PM) programs faces a fundamental tradeoff between performance overhead and bug coverage (comprehensiveness). Large performance overhead or limited bug coverage makes debugging infeasible or ineffective for PM programs. We present PMDebugger, a debugger that detects crash consistency bugs in PM programs. Unlike prior work, PMDebugger is fast, flexible and comprehensive. The design of PMDebugger is driven by a characterization that shows how three fundamental operations in PM programs (store, cache writeback and fence) typically occur in PM programs. PMDebugger uses a hierarchical design composed of PM debugging-specific data structures, operations and bug-detection algorithms (rules). We generalize nine rules to detect crash-consistency bugs for various PM persistency models. Compared with a state-of-the-art detector (XFDetector) and an industry-quality detector (Pmemcheck), PMDebugger leads to 49.3x and 3.4x speedup on average. Compared with another state-of-the-art detector (PMTest) optimized for high performance, PMDebugger achieves comparable performance (within a factor of 2), without heavily relying on programmer annotations, and detects 38 more bugs on ten applications. PMDebugger also identifies more bugs than XFDetector and Pmemcheck. PMDebugger detects 19 new bugs in a real application (memcached) and two new bugs from Intel PMDK.
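
As a minimal, hedged illustration of the store/writeback/fence ordering that such detectors reason about (generic x86 PM code, not PMDebugger's own interface), consider the following C sketch:

    /* Illustrative sketch of a crash-consistency bug, assuming the x86
     * clwb/sfence persistence model; compile with -mclwb. */
    #include <immintrin.h>   /* _mm_clwb, _mm_sfence */
    #include <stdint.h>

    struct __attribute__((aligned(64))) pm_record {   /* lives in a PM mapping */
        uint64_t value;
        char     pad[56];     /* keep 'valid' on a separate cache line */
        uint64_t valid;       /* commit flag: recovery trusts 'value' when set */
    };

    /* BUGGY: 'valid' may reach PM before 'value' is persisted, so a crash in
     * between leaves a record that claims to be valid but holds stale data. */
    void update_buggy(struct pm_record *r, uint64_t v)
    {
        r->value = v;
        r->valid = 1;         /* nothing orders the two updates' persistence */
    }

    /* CORRECT: persist 'value' (cache writeback + fence) before publishing it. */
    void update_ok(struct pm_record *r, uint64_t v)
    {
        r->value = v;
        _mm_clwb(&r->value);  /* write the cache line back toward PM */
        _mm_sfence();         /* order the writeback before the commit store */
        r->valid = 1;
        _mm_clwb(&r->valid);
        _mm_sfence();
    }

A trace-based detector of the kind described above flags update_buggy because the store to r->valid is not separated from the store to r->value by a writeback and a fence.
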
Xiangyu Zhang, Ramin Bashizade, Yicheng Wang, Alvin R. Lebeck
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446697

Abstract:
Statistical machine learning often uses probabilistic models and algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated with specialized hardware by exploiting parallelism and optimizing the design using various approximation techniques. Current methodologies for evaluating correctness of probabilistic accelerators are often incomplete, mostly focusing only on end-point result quality ("accuracy"). It is important for hardware designers and domain experts to look beyond end-point "accuracy" and be aware of how hardware optimizations impact statistical properties. This work takes a first step toward defining metrics and a methodology for quantitatively evaluating correctness of probabilistic accelerators. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator and surface design issues that cannot be exposed using only application end-point result quality. We demonstrate the benefits of this framework to guide design space exploration in a case study showing that statistical robustness comparable to floating-point software can be achieved with limited precision, avoiding floating-point hardware overheads.
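
One widely used convergence diagnostic, given here as a hedged example of the second pillar rather than the paper's exact metric suite, is the Gelman-Rubin potential scale reduction factor, sketched in C for m chains of n samples each:

    /* Sketch: the Gelman-Rubin potential scale reduction factor (R-hat),
     * a common convergence diagnostic for MCMC output. */
    #include <math.h>

    /* chains[i][j] = sample j of chain i, for m chains of n samples each */
    double gelman_rubin(const double *const *chains, int m, int n)
    {
        double chain_mean[m], grand_mean = 0.0;
        double W = 0.0, B = 0.0;

        for (int i = 0; i < m; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++) s += chains[i][j];
            chain_mean[i] = s / n;
            grand_mean += chain_mean[i] / m;
        }
        for (int i = 0; i < m; i++) {
            double ss = 0.0;
            for (int j = 0; j < n; j++) {
                double d = chains[i][j] - chain_mean[i];
                ss += d * d;
            }
            W += ss / (n - 1) / m;                    /* within-chain variance */
            double d = chain_mean[i] - grand_mean;
            B += (double)n * d * d / (m - 1);         /* between-chain variance */
        }
        double var_plus = (double)(n - 1) / n * W + B / n;
        return sqrt(var_plus / W);                    /* ~1.0 once chains mix */
    }

Values near 1.0 indicate the chains have mixed; diagnostics of this kind can surface hardware-induced sampling issues that end-point accuracy alone would miss.
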
Dmitrii Ustiugov, Plamen Petrov, Marios Kogias, Edouard Bugnion, Boris Grot
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446714

Abstract:
Serverless computing has seen rapid adoption due to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers structure their services as a collection of functions, sporadically invoked by various events like clicks. The high inter-arrival time variability of function invocations motivates providers to start new function instances upon each invocation, leading to significant cold-start delays that degrade user experience. To reduce cold-start latency, the industry has turned to snapshotting, whereby an image of a fully-booted function is stored on disk, enabling faster invocation than booting a function from scratch. This work introduces vHive, an open-source framework for serverless experimentation with the goal of enabling researchers to study and innovate across the entire serverless stack. Using vHive, we characterize a state-of-the-art snapshot-based serverless infrastructure, based on the industry-leading Containerd orchestration framework and Firecracker hypervisor technologies. We find that the execution time of a function started from a snapshot is 95% higher, on average, than when the same function is memory-resident. We show that the high latency is attributable to frequent page faults as the function's state is brought from disk into guest memory one page at a time. Our analysis further reveals that functions access the same stable working set of pages across different invocations of the same function. By leveraging this insight, we build REAP, a light-weight software mechanism for serverless hosts that records functions' stable working set of guest memory pages and proactively prefetches it from disk into memory. Compared to baseline snapshotting, REAP reduces cold-start delays by 3.7x, on average.
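
The prefetching idea can be sketched in a few lines of C; the file layout and the recorded offsets[] array below are illustrative assumptions, not vHive/REAP's actual format:

    /* Sketch: eagerly populate a snapshot's recorded working set instead of
     * faulting pages in from disk one at a time. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    /* guest_mem: the mmap'ed guest-memory snapshot
     * offsets:   page-aligned offsets recorded during a profiling invocation */
    void prefetch_working_set(int snapshot_fd, uint8_t *guest_mem,
                              const uint64_t *offsets, size_t n_pages)
    {
        /* Copy every recorded page into guest memory up front, replacing
           n_pages random, on-demand major faults with bulk reads. */
        for (size_t i = 0; i < n_pages; i++) {
            uint8_t page[PAGE_SIZE];
            if (pread(snapshot_fd, page, PAGE_SIZE, (off_t)offsets[i]) != PAGE_SIZE)
                continue;
            memcpy(guest_mem + offsets[i], page, PAGE_SIZE);
        }
        /* A simpler but less precise alternative is madvise(MADV_WILLNEED)
           on the recorded ranges, letting the kernel read them ahead. */
    }
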
Boris Pismenny, Haggai Eran, Aviad Yehezkel, Liran Liss, Dan Tsafrir
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446732

Abstract:
CPUs routinely offload to NICs network-related processing tasks like packet segmentation and checksum computation. NIC offloads are advantageous because they free valuable CPU cycles. But their applicability is typically limited to layer≤4 protocols (TCP and lower), and they are inapplicable to layer-5 protocols (L5Ps) that are built on top of TCP. This limitation is caused by a misfeature we call “offload dependence,” which dictates that L5P offloading additionally requires offloading the underlying layer≤4 protocols and related functionality: TCP, IP, firewall, etc. This dependence hinders innovation, because it implies hard-wiring the complicated, ever-changing implementation of the lower-level protocols. We propose “autonomous NIC offloads,” which eliminate offload dependence. Autonomous offloads provide a lightweight software-device architecture that accelerates L5Ps without having to migrate the entire layer≤4 TCP/IP stack into the NIC. A main challenge that autonomous offloads address is coping with out-of-sequence packets. We implement autonomous offloads for two L5Ps: (i) NVMe-over-TCP zero-copy and CRC computation, and (ii) HTTPS authentication, encryption, and decryption. Our autonomous offloads increase throughput by up to 3.3x, and they deliver CPU consumption and latency that are as low as 0.4x and 0.7x, respectively. Their implementation is already upstreamed in the Linux kernel, and they will be supported in the next generation of Mellanox NICs.
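
The out-of-sequence challenge can be sketched as follows; the context structure and the nic_resync_request helper are hypothetical stand-ins, not the upstream Linux or Mellanox interfaces:

    /* Sketch: an autonomous CRC offload only stays in hardware while TCP
     * payload arrives in order; on reordering or retransmission the host
     * computes in software and asks the device to resynchronize later. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct l5p_offload_ctx {
        uint32_t expected_seq;   /* next TCP sequence number the NIC expects */
        bool     hw_in_sync;     /* NIC currently tracking the L5P record stream */
    };

    /* Bitwise CRC-32C: the software fallback path. */
    static uint32_t sw_crc32c(uint32_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        crc = ~crc;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
        }
        return ~crc;
    }

    /* Hypothetical stand-in for the driver/device handshake that lets the NIC
     * re-acquire the stream at a later, well-defined record boundary. */
    static void nic_resync_request(struct l5p_offload_ctx *ctx, uint32_t next_seq)
    {
        ctx->expected_seq = next_seq;
        ctx->hw_in_sync   = true;   /* optimistic: assume the device catches up here */
    }

    void rx_segment(struct l5p_offload_ctx *ctx, uint32_t seq,
                    const void *payload, size_t len, uint32_t *crc)
    {
        if (ctx->hw_in_sync && seq == ctx->expected_seq) {
            ctx->expected_seq = seq + (uint32_t)len;   /* NIC already covered this data */
            return;
        }
        /* Out of sequence: fall back to software for this segment. */
        ctx->hw_in_sync = false;
        *crc = sw_crc32c(*crc, payload, len);
        nic_resync_request(ctx, seq + (uint32_t)len);
    }
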
Yishen Chen, Charith Mendis, Michael Carbin, Saman Amarasinghe
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446692

Abstract:
Vector instructions are ubiquitous in modern processors. Traditional compiler auto-vectorization techniques have focused on targeting single instruction multiple data (SIMD) instructions. However, these auto-vectorization techniques are not sufficiently powerful to model non-SIMD vector instructions, which can accelerate applications in domains such as image processing, digital signal processing, and machine learning. To target non-SIMD instructions, compiler developers have resorted to complicated, ad hoc peephole optimizations, expending significant development time while still coming up short. As vector instruction sets continue to rapidly evolve, compilers cannot keep up with these new hardware capabilities. In this paper, we introduce Lane Level Parallelism (LLP), which captures the model of parallelism implemented by both SIMD and non-SIMD vector instructions. We present VeGen, a vectorizer generator that automatically generates a vectorization pass to uncover target-architecture-specific LLP in programs while using only instruction semantics as input. VeGen decouples, yet coordinates, automatically generated target-specific vectorization utilities with its target-independent vectorization algorithm. This design enables us to systematically target non-SIMD vector instructions that until now required ad hoc coordination between different compiler stages. We show that VeGen can use non-SIMD vector instructions effectively, for example, achieving a 3× speedup (compared to LLVM’s vectorizer) on x265’s idct4 kernel.
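
A small, hedged example of the kind of non-SIMD instruction LLP is meant to capture (written directly with intrinsics, not VeGen output): x86's pmaddwd multiplies adjacent 16-bit lanes and sums each pair, a cross-lane pattern that a pure SIMD model cannot express as a single operation.

    /* Sketch: _mm_madd_epi16 (pmaddwd) multiplies adjacent 16-bit lanes and
     * sums each pair into a 32-bit lane. Illustration only. */
    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Scalar reference: dot products of adjacent pairs. */
    void madd_scalar(const int16_t a[8], const int16_t b[8], int32_t out[4])
    {
        for (int i = 0; i < 4; i++)
            out[i] = (int32_t)a[2*i] * b[2*i] + (int32_t)a[2*i+1] * b[2*i+1];
    }

    /* Vector version: one instruction performs all four pairwise dot products. */
    void madd_vector(const int16_t a[8], const int16_t b[8], int32_t out[4])
    {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(va, vb));
    }
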
Giuseppe Antonio Di Luna, Davide Italiano, Luca Massarelli, Sebastian Österlund, Cristiano Giuffrida, Leonardo Querzoni
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446695

Abstract:
Despite the advancements in software testing, bugs still plague deployed software and result in crashes in production. When debugging issues (sometimes caused by “heisenbugs”), there is a need to interpret core dumps and reproduce the issue offline on the same deployed binary. This requires the entire toolchain (compiler, linker, debugger) to correctly generate and use debug information. Little attention has been devoted to checking that such information is correctly preserved by modern toolchains’ optimization stages. This is particularly important as managing debug information in optimized production binaries is non-trivial, often leading to toolchain bugs that may hinder post-deployment debugging efforts. In this paper, we present Debug2, a framework to find debug information bugs in modern toolchains. Our framework feeds random source programs to the target toolchain and surgically compares the debugging behavior of their optimized/unoptimized binary variants. Such differential analysis allows Debug2 to check invariants at each debugging step and detect bugs from invariant violations. Our invariants are based on the (in)consistency of common debug entities, such as source lines, stack frames, and function arguments. We show that, while simple, this strategy yields powerful cross-toolchain and cross-language invariants, which can pinpoint several bugs in modern toolchains. We have used Debug2 to find 23 bugs in the LLVM toolchain (clang/lldb), 8 bugs in the GNU toolchain (GCC/gdb), and 3 bugs in the Rust toolchain (rustc/lldb), with 14 bugs already fixed by the developers.
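
A minimal illustration of the kind of discrepancy such differential analysis surfaces (an ordinary C program, not a Debug2-generated test case):

    /* Illustrative program: stepping through the -O0 and -O2 builds under a
     * debugger and comparing, at each step, the reported source line and the
     * availability of 'tmp' is the kind of invariant check described above.
     *
     *   gcc -g -O0 example.c -o example.O0
     *   gcc -g -O2 example.c -o example.O2
     */
    #include <stdio.h>

    int compute(int x)
    {
        int tmp = x * 2;      /* at -O2, 'tmp' is often folded away: a debugger
                                 may report it as <optimized out>, and breakpoints
                                 on this line may move or disappear */
        return tmp + 1;
    }

    int main(void)
    {
        printf("%d\n", compute(20));
        return 0;
    }
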
Sihang Liu, Suyash Mahar, Baishakhi Ray, Samira Khan
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446691

Abstract:
Persistent Memory (PM) technology combines the persistence of storage with performance approaching that of DRAM. Programs taking advantage of PM must ensure data remains recoverable after a failure (e.g., power outage), and therefore, are susceptible to crash consistency bugs that lead to incorrect recovery after a failure. Prior works have provided tools, such as Pmemcheck, PMTest, and XFDetector, that detect these bugs by checking whether the trace of PM accesses violates the program’s crash consistency guarantees. However, detection of crash consistency bugs highly depends on test cases—a bug can only be detected if the buggy program path has been executed. Therefore, using a test case generator is necessary to effectively detect crash consistency bugs. Fuzzing is a common test case generation approach that requires minimal knowledge about the program. We identify that PM programs have special requirements for fuzzing. First, a PM program maintains a persistent state on PM images. Therefore, the fuzzer needs to efficiently generate valid images as part of the test case. Second, these PM images can also be the result of a previous crash, which requires the fuzzer to generate crash images as well. Finally, PM programs can have various procedures, but only those performing PM operations can lead to crash consistency issues. Thus, an efficient fuzzer should target those relevant regions. In this work, we provide PMFuzz, a test case generator for PM programs that meets these new requirements. Our evaluation shows that PMFuzz covers 4.6× more PM-related paths compared to AFL++, a widely-used fuzzer. Further, test cases generated by PMFuzz discovered 12 new real-world bugs in PM programs which have already been extensively tested by prior PM testing works.
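
The "crash image" requirement can be sketched as follows; the trace structure and names are illustrative assumptions, not PMFuzz's actual mechanism:

    /* Sketch of generating a "crash image": replay a recorded trace of persisted
     * PM updates only up to a random cut-off point, yielding an image that a
     * real crash could have produced. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct pm_update {            /* one persisted store from the recorded trace */
        uint64_t offset;          /* offset into the PM image */
        uint8_t  data[8];
        uint8_t  len;
    };

    /* Copy the clean base image, then apply only the first 'cut' updates. */
    void make_crash_image(uint8_t *image, const uint8_t *base, size_t image_size,
                          const struct pm_update *trace, size_t n_updates)
    {
        memcpy(image, base, image_size);
        size_t cut = (size_t)rand() % (n_updates + 1);   /* simulated crash point */
        for (size_t i = 0; i < cut; i++)
            memcpy(image + trace[i].offset, trace[i].data, trace[i].len);
        /* Feeding such images (alongside normal inputs) to the program's recovery
           path exercises post-crash states a conventional fuzzer would miss. */
    }
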
Thomas Benjamin, Jesse Elwell, Jeffrey A. Eitel, Angelo Sapello, Abhrajit Ghosh

Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446729

Abstract:
Side-channel attacks such as Spectre rely on properties of modern CPUs that permit discovery of microarchitectural state via timing of various operations. The Weird Machine concept is an increasingly popular model for characterization of emergent execution that arises from side-effects of conventional computing constructs. In this work we introduce Microarchitectural Weird Machines (µWM): code constructions that perform computation by means of side effects and conflicts between microarchitectural entities such as branch predictors and caches. The results of such computations are observed as timing variations. We demonstrate how µWMs can be used as a powerful obfuscation engine where computation operates based on events unobservable to conventional anti-obfuscation tools based on emulation, debugging, and static and dynamic analysis techniques. We demonstrate that µWMs can be used to reliably perform arbitrary computation by implementing a SHA-1 hash function. We then present a practical example in which we use a µWM to obfuscate malware code such that its passive operation is invisible to an observer with full power to view the architectural state of the system, until the code receives a trigger. When the trigger is received, the malware decrypts and executes its payload. To show the effectiveness of obfuscation we demonstrate its use in the concealment and subsequent execution of a payload that exfiltrates a shadow password file, and a payload that creates a reverse shell.
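
A hedged sketch of the primitive such constructions build on, encoding a bit in cache state and recovering it by timing an access (flush+reload); this is only the building block, not the paper's µWM circuits:

    /* Sketch: a "weird register" whose value lives in cache state and is read
     * back by timing a memory access. */
    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush, __rdtscp */

    static uint8_t weird_reg[4096];      /* one cache line used as a 1-bit store */

    static void wr_write(int bit)
    {
        _mm_clflush(weird_reg);                   /* reset: evict the line */
        if (bit)
            (void)*(volatile uint8_t *)weird_reg; /* set: the load caches the line */
    }

    static int wr_read(void)
    {
        unsigned int aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*(volatile uint8_t *)weird_reg;     /* note: the probe re-caches the line */
        uint64_t t1 = __rdtscp(&aux);
        return (t1 - t0) < 100;   /* heuristic threshold: fast access => cached => 1 */
    }
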
Adrien Ghosn, Marios Kogias, Mathias Payer, James R. Larus, Edouard Bugnion
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; https://doi.org/10.1145/3445814.3446728

Abstract:
Programming languages and systems have failed to address the security implications of the increasingly frequent use of public libraries to construct modern software. Most languages provide tools and online repositories to publish, import, and use libraries; however, this double-edged sword can incorporate a large quantity of unknown, unchecked, and unverified code into an application. The risk is real, as demonstrated by malevolent actors who have repeatedly inserted malware into popular open-source libraries. This paper proposes a solution: enclosures, a new programming language construct for library isolation that provides a developer with fine-grained control over the resources that a library can access, even for libraries with complex inter-library dependencies. The programming abstraction is language-independent and could be added to most languages. These languages would then be able to take advantage of hardware isolation mechanisms that are effective across language boundaries. The enclosure policies are enforced at run time by LitterBox, a language-independent framework that uses hardware mechanisms to provide uniform and robust isolation guarantees, even for libraries written in unsafe languages. LitterBox currently supports both Intel VT-x (with general-purpose extended page tables) and the emerging Intel Memory Protection Keys (MPK). We describe an enclosure implementation for the Go and Python languages. Our evaluation demonstrates that the Go implementation can protect sensitive data in real-world applications constructed using complex untrusted libraries with deep dependencies. It requires minimal code refactoring and incurs acceptable performance overhead. The Python implementation demonstrates LitterBox’s ability to support dynamic languages.
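
For reference, a minimal sketch of the Intel MPK primitive that LitterBox builds on, using the Linux pkey API; this illustrates the hardware mechanism only, not the enclosure language construct or LitterBox's runtime (error handling omitted; requires MPK-capable hardware and glibc >= 2.27):

    /* Sketch: sensitive data is tagged with a protection key whose access
     * rights are dropped around a call into untrusted library code. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    static char *secret;   /* global so "untrusted" code could, in principle, reach it */

    static void untrusted_library_call(void)
    {
        /* Stand-in for a call into an untrusted library. While the pkey's access
           rights are disabled, any attempt to read or write 'secret' here would
           fault with SIGSEGV instead of leaking the data. */
    }

    int main(void)
    {
        /* A page holding a secret, protected by its own pkey. */
        secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        strcpy(secret, "api-token");

        int pkey = pkey_alloc(0, 0);
        pkey_mprotect(secret, 4096, PROT_READ | PROT_WRITE, pkey);

        pkey_set(pkey, PKEY_DISABLE_ACCESS);   /* revoke access (a register write,
                                                  no syscall) before untrusted code */
        untrusted_library_call();
        pkey_set(pkey, 0);                     /* restore access afterwards */
        return 0;
    }
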