Bandle: Asynchronous State Machine Replication Made Efficient

Published: 22 April 2024

Abstract

State machine replication (SMR) uses consensus as its core component for reaching agreement among a group of processes, in order to provide fault-tolerant services. Most SMR protocols, such as Paxos and Raft, are designed in the partial synchrony model. Partially synchronous protocols rely on timing assumptions to elect a special role (such as the leader), which may become the performance bottleneck under a heavy workload. From an engineering perspective, partially synchronous protocols have to wait for a pre-defined period of time and implement a (complicated) failover mechanism in order to replace the faulty leader. In contrast, asynchronous protocols are immune to such problems.
This paper presents Bandle, a simple and highly efficient asynchronous SMR protocol. Instead of electing a special role, Bandle evenly assigns sequence numbers to each process and proceeds in a leaderless manner. We further propose a binary agreement protocol, referred to as FlashBA, which decides whether a given proposal can be committed. FlashBA is inspired by Ben-Or's randomized algorithm but leverages a promise mechanism to achieve optimal latency (i.e., one message delay in the best case). An empirical study on the Amazon EC2 platform shows that Bandle delivers exceptional performance when deployed within a data center and across the globe.
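The idea of evenly assigning sequence numbers to processes can be illustrated with a minimal sketch. The abstract does not state the exact assignment rule, so the round-robin (modulo) mapping below is an assumption, and the names `owner_of` and `slots_owned_by` are hypothetical helpers introduced only for illustration.

```python
from typing import List


def owner_of(seq: int, n: int) -> int:
    """Return the id of the process that owns sequence number `seq`.

    Assumption: ownership is round-robin over n processes (ids 0..n-1).
    This is an illustrative rule, not necessarily the one Bandle uses.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if seq < 0:
        raise ValueError("seq must be non-negative")
    return seq % n


def slots_owned_by(pid: int, n: int, limit: int) -> List[int]:
    """List the sequence numbers below `limit` owned by process `pid`."""
    if not 0 <= pid < n:
        raise ValueError("pid must be in range 0..n-1")
    return list(range(pid, limit, n))


if __name__ == "__main__":
    n = 3
    for seq in range(6):
        print(f"seq {seq} -> process {owner_of(seq, n)}")
    print(slots_owned_by(1, n, 10))
```

Under this assumed rule every process owns the same share of the sequence space, so no single process has to order all proposals. Deciding whether a given slot's proposal commits is the job of the binary agreement step, which this sketch does not model.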

Published In

EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems
April 2024
1245 pages
ISBN:9798400704376
DOI:10.1145/3627703

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. State machine replication
  2. asynchrony
  3. consensus

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSFC

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
