Work on Failure Detectors at Cornell University by M. Aguilera, T. Chandra, W. Chen, B. Deianov, V. Hadzilacos and S. Toueg.
(This work is partially supported by the National Science Foundation under grants CCR-9402896 and CCR-9711403.)
<Abstract> We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties --- completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [CHT96].
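To make the completeness/accuracy trade-off concrete, here is a minimal Python sketch (ours, not from the paper) of the familiar timeout-based detector that outputs lists of suspects: completeness holds because a crashed process stops sending and is eventually suspected forever, while accuracy can be violated whenever a correct but slow process exceeds the timeout, which is exactly what makes such detectors unreliable.

    import time

    class TimeoutDetector:
        """Illustrative sketch only; the class name and interface are
        assumptions made for this example, not the paper's definitions."""

        def __init__(self, processes, timeout):
            self.timeout = timeout
            self.last_heard = {p: time.monotonic() for p in processes}

        def heard_from(self, p):
            # Any message from p counts as evidence that p is still alive.
            self.last_heard[p] = time.monotonic()

        def suspects(self):
            # Completeness: a crashed process stops updating last_heard and is
            # eventually (and permanently) suspected. Accuracy: a correct but
            # slow process may be wrongly suspected, i.e., a mistake.
            now = time.monotonic()
            return {p for p, t in self.last_heard.items()
                    if now - t > self.timeout}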
A preliminary version appeared in the 11th Annual ACM Symposium on Principles of Distributed Computing (PODC), August 1992, 147-158.
<Abstract> We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In [CT96], we proved that <>W, a failure detector that provides surprisingly little information about which processes have crashed, is sufficient to solve Consensus in asynchronous systems with a majority of correct processes. In this paper, we prove that to solve Consensus, any failure detector has to provide at least as much information as <>W. Thus, <>W is indeed the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes.
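The two properties that make <>W "eventually weak" can be stated over the suspect lists it outputs; the following Python sketch (our encoding, not the paper's formalism) checks a finite-trace proxy for them, with the usual caveat that an "eventually" property can only be falsified, never fully confirmed, on a finite prefix.

    def weak_properties_hold(trace, crashed, correct):
        """Finite-trace proxy for <>W's guarantees, checked on the last
        observed state; names and encoding are ours, for illustration.
        trace: list of dicts mapping each process to its current suspect set."""
        final = trace[-1]
        # Weak completeness: every crashed process is suspected by some
        # correct process (possibly a different one for each).
        completeness = all(any(p in final[q] for q in correct) for p in crashed)
        # Eventual weak accuracy: some correct process is suspected by no
        # correct process.
        accuracy = any(all(p not in final[q] for q in correct) for p in correct)
        return completeness and accuracy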
<Abstract> We present a Consensus algorithm that combines randomization and unreliable failure detection, two well-known techniques for solving Consensus in asynchronous systems with crash failures. This hybrid algorithm combines advantages from both approaches: it guarantees deterministic termination if the failure detector is accurate, and probabilistic termination otherwise. In executions with no failures or failure detector mistakes, the most likely ones in practice, Consensus is reached in only two asynchronous rounds.
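As a loose illustration of the hybrid idea (our sketch, not the algorithm from the paper), a process's rule for updating its estimate in one round might look as follows in Python: follow the coordinator while the failure detector trusts it, and fall back on a coin flip when it does not, which is what yields deterministic termination under accuracy and probabilistic termination otherwise.

    import random

    def next_estimate(coordinator_value, coordinator_suspected,
                      coin=random.random):
        # Deterministic path: the failure detector looks accurate, so adopt
        # the coordinator's proposal.
        if not coordinator_suspected and coordinator_value is not None:
            return coordinator_value
        # Randomized fallback (Ben-Or style): progress no longer depends on
        # the detector being right; termination now holds with probability 1.
        return 0 if coin() < 0.5 else 1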
<Abstract> We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show that, among failure detectors that output lists of suspects, the weakest one that can be used to solve this problem is <>P, a failure detector that cannot be implemented. To overcome this difficulty, we introduce an implementable failure detector called Heartbeat and show that it can be used to achieve quiescent reliable communication. Heartbeat is novel: in contrast to typical failure detectors, it does not output lists of suspects and it is implementable without timeouts. With Heartbeat, many existing algorithms that tolerate only process crashes can be transformed into quiescent algorithms that tolerate both process crashes and message losses. This can be applied to consensus, atomic broadcast, k-set agreement, atomic commitment, etc.
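The contrast with suspect lists can be sketched in a few lines of Python (ours, assuming a fully connected network; the paper's construction also forwards heartbeats along network paths): Heartbeat outputs one unbounded, nondecreasing counter per process, and no timeout is ever consulted.

    class Heartbeat:
        """Illustrative sketch: the output is a vector of counters, not a
        list of suspects. A correct, reachable process's counter grows
        forever; a crashed process's counter eventually stops changing."""

        def __init__(self, processes):
            self.count = {p: 0 for p in processes}

        def on_heartbeat(self, sender):
            self.count[sender] += 1  # monotonically nondecreasing output

        def output(self):
            return dict(self.count)  # applications read counters, not suspects

A retransmission module can then resend a message m to p each time p's counter increases and stop once p acknowledges m; if p has crashed, its counter eventually stops changing and the sender falls silent, which is the quiescence property.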
<Abstract> We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice --- those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3d time and with 4n messages, where d is the maximum message delay and n is the number of processes in the system.
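For the variant that uses stable storage, the key step is that a process logs its round number and current estimate before replying, so that after a crash and recovery it can rejoin without violating agreement; a hedged Python sketch (file format and names are ours, not the paper's) is:

    import json, os

    def persist_state(path, round_no, estimate):
        """Log the consensus-critical state to stable storage before acting
        on it; illustrative only, under the stated assumptions."""
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"round": round_no, "estimate": estimate}, f)
            f.flush()
            os.fsync(f.fileno())  # force the write to stable storage
        os.replace(tmp, path)     # atomically install the new state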
<Abstract> Uniform Reliable Broadcast is a communication primitive that requires that if a process delivers a message, then all correct processes also deliver this message. In a PODC'99 paper, Halpern and Ricciardi use Knowledge Theory to determine what failure detectors are necessary to implement this primitive in asynchronous systems with process crashes and lossy links that are fair. In this paper, we revisit this problem using a different approach, and provide a result that is simpler, more intuitive, and, in a precise sense, more general.
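For intuition about why this primitive needs information about failures at all, recall the textbook majority-based delivery rule (a standard sketch, not this paper's construction): with fair lossy links, a process may safely deliver m once more than n/2 processes have relayed m, because if a majority of processes is correct, at least one relayer is correct and will keep forwarding m until every correct process receives it.

    def can_deliver(relayers, n):
        """Textbook majority rule for Uniform Reliable Broadcast; assumes a
        majority of correct processes. Names are ours, for illustration."""
        return len(relayers) > n // 2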
An extended abstract will appear in the International Conference on Dependable Systems and Networks (ICDSN/FTCS-30), June 2000.
<Abstract> We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies (a) how fast the failure detector detects actual failures, and (b) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyse its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we briefly explain how to make our failure detector adaptive, so that it automatically reconfigures itself when there is a change in the probabilistic behavior of the network.
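As an illustration of the flavor of the new algorithm (a simplified sketch in its spirit; the parameter names are ours): the monitored process sends heartbeat i at time i*eta, and the detector deems heartbeat i fresh until time i*eta + delta, trusting the process exactly while some received heartbeat is still fresh. Tuning eta and delta is then what trades detection time against the rate and duration of false detections.

    def trusts(now, eta, delta, latest_seq_received):
        """Freshness-point rule, simplified for illustration: heartbeat i is
        sent at time i*eta and stays fresh until i*eta + delta. Trust the
        monitored process at time `now` iff the newest heartbeat received
        is still fresh. Parameters and names are assumptions of this sketch."""
        if latest_seq_received is None:
            return False  # nothing received yet: do not trust
        return now < latest_seq_received * eta + delta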