- Sponsor: ACM SIGOPS
Welcome to the 30th ACM Symposium on Operating Systems Principles (SOSP 2024)! We are delighted to present these 43 papers, which reflect the broad range of topics that make up modern computer systems research today, including file and storage systems, memory systems, distributed systems, verification, security, system support for machine learning, microservices, fault tolerance and reliability, debugging, and, of course, operating systems.
Proceeding Downloads
Autobahn: Seamless high speed BFT
Today's practical, high performance Byzantine Fault Tolerant (BFT) consensus protocols operate in the partial synchrony model. However, existing protocols are inefficient when deployments are indeed partially synchronous. They deliver either low latency ...
SWARM: Replicating Shared Disaggregated-Memory Data in No Time
- Antoine Murat,
- Clément Burgelin,
- Athanasios Xygkis,
- Igor Zablotchi,
- Marcos Kawazoe Aguilera,
- Rachid Guerraoui
Memory disaggregation is an emerging data center architecture that improves resource utilization and scalability. Replication is key to ensuring the fault tolerance of applications, but replicating shared data in disaggregated memory is hard. We propose ...
Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection
Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by some unusual faulty events. While fault injection testing is becoming popular, existing solutions are ...
If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems
- Bogdan Alexandru Stoica,
- Utsav Sethi,
- Yiming Su,
- Cyrus Zhou,
- Shan Lu,
- Jonathan Mace,
- Madanlal Musuvathi,
- Suman Nath
Retry---the re-execution of a task on failure---is a common mechanism to enable resilient software systems. Yet, despite its commonality and long history, retry remains difficult to implement and test.
Guided by our study of real-world retry issues, we ...
Tiered Memory Management: Access Latency is the Key!
The emergence of tiered memory architectures has led to a renewed interest in memory management. Recent works on tiered memory management innovate on mechanisms for access tracking, page migration, and dynamic page size determination; however, they all ...
Fast & Safe IO Memory Protection
IO memory protection mechanisms prevent malicious and/or buggy IO devices from executing errant transfers into memory. Modern servers achieve this using an IOMMU---IO devices operate on virtual addresses, and the IOMMU translates virtual addresses to ...
CHIME: A Cache-Efficient and High-Performance Hybrid Index on Disaggregated Memory
Disaggregated memory (DM) is a widely discussed datacenter architecture in academia and industry. It decouples computing and memory resources from monolithic servers into two network-connected resource pools. Range indexes are widely adopted by storage ...
Aceso: Achieving Efficient Fault Tolerance in Memory-Disaggregated Key-Value Stores
Disaggregated memory (DM) has garnered increasing attention due to its high resource utilization. Fault tolerance is critical for key-value (KV) stores on DM since machine failures are common in datacenters. Existing KV stores on DM are generally based on ...
Reducing Energy Bloat in Large Model Training
Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during ...
Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
To speed up computation, deep neural networks (DNNs) usually rely on highly optimized tensor operators. Despite their effectiveness, tensor operators are often defined empirically with ad hoc semantics. This hinders the analysis and optimization across ...
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- Hao Ge,
- Fangcheng Fu,
- Haoyang Li,
- Xuanyu Wang,
- Sheng Lin,
- Yujie Wang,
- Xiaonan Nie,
- Hailin Zhang,
- Xupeng Miao,
- Bin Cui
Training of large-scale deep learning models necessitates parallelizing the model and data across numerous devices, and the choice of parallelism strategy substantially depends on workload characteristics such as memory consumption, computation cost, and ...
Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource elasticity during ...
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a big impact on training throughput. Utilizing spare GPU servers to mitigate ...
OZZ: Identifying Kernel Out-of-Order Concurrency Bugs with In-Vivo Memory Access Reordering
Kernel concurrency bugs are notoriously difficult to identify, yet their consequences severely threaten the reliability and security of the entire system. Especially in the kernel, developers must consider not only locks but also memory barriers to ...
Fast, Flexible, and Practical Kernel Extensions
The ability to safely extend OS kernel functionality is a longstanding goal in OS design, with the widespread use of the eBPF framework in Linux and Windows demonstrating the benefits of such extensibility. However, existing solutions for kernel ...
Skyloft: A General High-Efficient Scheduling Framework in User Space
Skyloft is a general and highly efficient user-space scheduling framework. It leverages user-mode interrupts to deliver and process hardware timer interrupts directly in user space. This capability enables Skyloft to achieve μs-scale preemption. Skyloft offers a ...
Fast Core Scheduling with Userspace Process Abstraction
We introduce uProcess, a pure userspace process abstraction that enables CPU cores to be rescheduled among applications at sub-microsecond timescale without trapping into the kernel. We achieve this by constructing a special privileged mode in userspace ...
LazyLog: A New Shared Log Abstraction for Low-Latency Applications
Shared logs offer linearizable total order across storage shards. However, they enforce this order eagerly upon ingestion, leading to high latencies. We observe that in many modern shared-log applications, while linearizable ordering is necessary, it is ...
BIZA: Design of Self-Governing Block-Interface ZNS AFA for Endurance and Performance
- Shushu Yi,
- Shaocong Sun,
- Li Peng,
- Yingbo Sun,
- Ming-Chang Yang,
- Zhichao Cao,
- Qiao Li,
- Myoungsoo Jung,
- Ke Zhou,
- Jie Zhang
The all-flash array (AFA) has become one of the most popular storage forms in diverse computing domains. While traditional AFA implementations adopt the block interface to integrate seamlessly with most existing software, this interface hinders the host from ...
Morph: Efficient File-Lifetime Redundancy Management for Cluster File Systems
- Timothy Kim,
- Sanjith Athlur,
- Saurabh Kadekodi,
- Francisco Maturana,
- Dax Delvira,
- Arif Merchant,
- Gregory R. Ganger,
- K. V. Rashmi
Many data services tune and change the redundancy configurations of files over their lifetimes to address changes in data temperature and latency requirements. Unfortunately, changing redundancy configurations (transcoding) is IO-intensive. The Morph cluster file ...
Reducing Cross-Cloud/Region Costs with the Auto-Configuring MACARON Cache
An increasing demand for cross-cloud and cross-region data access is bringing forth challenges related to high data transfer costs and latency. In response, we introduce Macaron, an auto-configuring cache system designed to minimize cost for remote data ...
Dirigent: Lightweight Serverless Orchestration
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. The current approach of building FaaS ...
Unifying serverless and microservice workloads with SigmaOS
Many cloud applications use both serverless functions, for bursts of stateless parallel computation, and container orchestration, for long-running microservices and tasks that need to interact. Ideally a single platform would offer the union of these ...
Caribou: Fine-Grained Geospatial Shifting of Serverless Applications for Sustainability
Sustainability in computing is critical as environmental concerns rise. The cloud industry's carbon footprint is significant and rapidly growing. We show that dynamic geospatial shifting of cloud workloads to regions with lower carbon emission energy ...
TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
- Jialiang Huang,
- MingXing Zhang,
- Teng Ma,
- Zheng Liu,
- Sixing Lin,
- Kang Chen,
- Jinlei Jiang,
- Xia Liao,
- Yingdi Shan,
- Ning Zhang,
- Mengting Lu,
- Tao Ma,
- Haifeng Gong,
- YongWei Wu
Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To ...
Verus: A Practical Foundation for Systems Verification
- Andrea Lattuada,
- Travis Hance,
- Jay Bosamiya,
- Matthias Brun,
- Chanhee Cho,
- Hayley LeBlanc,
- Pranav Srinivasan,
- Reto Achermann,
- Tej Chajed,
- Chris Hawblitzel,
- Jon Howell,
- Jacob R. Lorch,
- Oded Padon,
- Bryan Parno
Formal verification is a promising approach to eliminate bugs at compile time, before they ship. Indeed, our community has verified a wide variety of system software. However, much of this success has required heroic developer effort, relied on bespoke ...
Practical Verification of System-Software Components Written in Standard C
Systems code is challenging to verify, because it uses constructs (like raw pointers, pointer arithmetic, and bit twiddling) that are hard for tools to reason about. Existing approaches either sacrifice programmer friendliness, by demanding significant ...
Icarus: Trustworthy Just-In-Time Compilers with Symbolic Meta-Execution
- Naomi Smith,
- Abhishek Sharma,
- John Renner,
- David Thien,
- Fraser Brown,
- Hovav Shacham,
- Ranjit Jhala,
- Deian Stefan
Just-in-time (JIT) compilers make JavaScript run efficiently by replacing slow JavaScript interpreter code with fast machine code. However, this efficiency comes at a cost: bugs in JIT compilers can completely subvert all language-based (memory) safety ...
SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference
The proliferation of machine learning together with the rapid evolution of the hardware ecosystem has led to a surge in demand for model inference on a variety of hardware. Decision-tree-based models are the most popular models on tabular data. This ...
Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10
As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). This allows each core ...