- research-article, March 2019
FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems
- Jeffrey F. Lukman,
- Huan Ke,
- Cesar A. Stuardo,
- Riza O. Suminto,
- Daniar H. Kurniawan,
- Dikaimin Simon,
- Satria Priambada,
- Chen Tian,
- Feng Ye,
- Tanakorn Leesatapornwongsa,
- Aarti Gupta,
- Shan Lu,
- Haryadi S. Gunawi
EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019, Article No.: 20, Pages 1–16
https://doi.org/10.1145/3302424.3303986
We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability to overcome the path/state-space explosion problem in testing workloads with ...
- research-article, October 2018
Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
- Haryadi S. Gunawi,
- Riza O. Suminto,
- Russell Sears,
- Casey Golliher,
- Swaminathan Sundararaman,
- Xing Lin,
- Tim Emami,
- Weiguang Sheng,
- Nematollah Bidokhti,
- Caitie McCaffrey,
- Deepthi Srinivasan,
- Biswaranjan Panda,
- Andrew Baptist,
- Gary Grider,
- Parks M. Fields,
- Kevin Harms,
- Robert B. Ross,
- Andree Jacobson,
- Robert Ricci,
- Kirk Webb,
- Peter Alvaro,
- H. Birali Runesha,
- Mingzhe Hao,
- Huaicheng Li
ACM Transactions on Storage (TOS), Volume 14, Issue 3, Article No.: 23, Pages 1–26
https://doi.org/10.1145/3242086
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, ...
- research-article, December 2017
Rivulet: a fault-tolerant platform for smart-home applications
Middleware '17: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, Pages 41–54
https://doi.org/10.1145/3135974.3135988
Rivulet is a fault-tolerant distributed platform for running smart-home applications; it can tolerate failures typical for a home environment (e.g., link losses, network partitions, sensor failures, and device crashes). In contrast to existing cloud-...
- research-article, September 2017
PBSE: a robust path-based speculative execution for degraded-network tail tolerance in data-parallel frameworks
- Riza O. Suminto,
- Cesar A. Stuardo,
- Alexandra Clark,
- Huan Ke,
- Tanakorn Leesatapornwongsa,
- Bo Fu,
- Daniar H. Kurniawan,
- Vincentius Martin,
- Maheswara Rao G. Uma,
- Haryadi S. Gunawi
SoCC '17: Proceedings of the 2017 Symposium on Cloud Computing, Pages 295–308
https://doi.org/10.1145/3127479.3131622
We reveal loopholes of Speculative Execution (SE) implementations under a unique fault model: node-level network throughput degradation. This problem appears in many data-parallel frameworks such as Hadoop MapReduce and Spark. To address this, we ...
- research-article, May 2017
Scalability Bugs: When 100-Node Testing is Not Enough
- Tanakorn Leesatapornwongsa,
- Cesar A. Stuardo,
- Riza O. Suminto,
- Huan Ke,
- Jeffrey F. Lukman,
- Haryadi S. Gunawi
HotOS '17: Proceedings of the 16th Workshop on Hot Topics in Operating Systems, Pages 24–29
https://doi.org/10.1145/3102980.3102985
We highlight the problem of scalability bugs, a new class of bugs that appear in "cloud-scale" distributed systems. Scalability bugs are latent bugs that are cluster-scale dependent, whose symptoms typically surface in large-scale deployments, but not ...
- research-article, October 2016
Why Does the Cloud Stop Computing?: Lessons from Hundreds of Service Outages
- Haryadi S. Gunawi,
- Mingzhe Hao,
- Riza O. Suminto,
- Agung Laksono,
- Anang D. Satria,
- Jeffry Adityatama,
- Kurnia J. Eliazar
SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing, Pages 1–16
https://doi.org/10.1145/2987550.2987583
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed 1247 headline news and public post-mortem reports that detail 597 unplanned outages that occurred within a 7-year span from 2009 to 2015. We analyzed outage duration, ...