A short introduction to failure detectors for asynchronous distributed systems

Published: 01 March 2005

Abstract

Since the first version of Chandra and Toueg's seminal paper "Unreliable failure detectors for reliable distributed systems" appeared in 1991, the failure detector concept has been studied extensively. This is not surprising, as failure detection is pervasive in the design, analysis and implementation of many fault-tolerant distributed algorithms that constitute the core of distributed system middleware. The literature on this topic is mostly technical and appears mainly in theoretically inclined journals and conferences. The aim of this paper is to offer an introductory survey of the failure detector concept for readers who are not familiar with it and want to quickly understand its aim, its basic principles, its power and its limitations. To this end, the paper first describes the motivations that underlie the concept, and then surveys several distributed computing problems, showing how each can be solved with the help of an appropriate failure detector. This short paper thus presents motivations, concepts, problems, definitions and algorithms. It does not contain proofs. It is aimed at people who want to understand the basics of failure detectors.


Published In

ACM SIGACT News, Volume 36, Issue 1
March 2005
101 pages
ISSN: 0163-5700
DOI: 10.1145/1052796

Publisher

Association for Computing Machinery

New York, NY, United States
