Work on Failure Detectors at Cornell University by M. Aguilera, T. Chandra, W. Chen, B. Deianov, V. Hadzilacos and S. Toueg.
(This work is partially supported by the National Science Foundation under grants CCR-9402896 and CCR-9711403.)
<Abstract> We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties --- completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [CHT96].
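To make the completeness/accuracy trade-off concrete, here is a minimal Python sketch (ours, not from the paper) of the familiar timeout-based detector that outputs lists of suspects: completeness holds because a crashed process stops sending and is eventually suspected forever, while accuracy can be violated whenever a correct but slow process exceeds the timeout, which is exactly what makes such detectors unreliable.

    import time

    class TimeoutDetector:
        """Illustrative sketch only; the class name and interface are
        assumptions made for this example, not the paper's definitions."""

        def __init__(self, processes, timeout):
            self.timeout = timeout
            self.last_heard = {p: time.monotonic() for p in processes}

        def heard_from(self, p):
            # Any message from p counts as evidence that p is still alive.
            self.last_heard[p] = time.monotonic()

        def suspects(self):
            # Completeness: a crashed process stops updating last_heard and is
            # eventually (and permanently) suspected. Accuracy: a correct but
            # slow process may be wrongly suspected, i.e., a mistake.
            now = time.monotonic()
            return {p for p, t in self.last_heard.items()
                    if now - t > self.timeout}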
A preliminary version appeared in the 11th Annual ACM Symposium on Principles of Distributed Computing (PODC), August 1992, 147-158.
<Abstract> We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In [CT96], we proved that <>W, a failure detector that provides surprisingly little information about which processes have crashed, is sufficient to solve Consensus in asynchronous systems with a majority of correct processes. In this paper, we prove that to solve Consensus, any failure detector has to provide at least as much information as <>W. Thus, <>W is indeed the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes.
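The two properties that make <>W "eventually weak" can be stated over the suspect lists it outputs; the following Python sketch (our encoding, not the paper's formalism) checks a finite-trace proxy for them, with the usual caveat that an "eventually" property can only be falsified, never fully confirmed, on a finite prefix.

    def weak_properties_hold(trace, crashed, correct):
        """Finite-trace proxy for <>W's guarantees, checked on the last
        observed state; names and encoding are ours, for illustration.
        trace: list of dicts mapping each process to its current suspect set."""
        final = trace[-1]
        # Weak completeness: every crashed process is suspected by some
        # correct process (possibly a different one for each).
        completeness = all(any(p in final[q] for q in correct) for p in crashed)
        # Eventual weak accuracy: some correct process is suspected by no
        # correct process.
        accuracy = any(all(p not in final[q] for q in correct) for p in correct)
        return completeness and accuracy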
<Abstract> We present a Consensus algorithm that combines randomization and unreliable failure detection, two well-known techniques for solving Consensus in asynchronous systems with crash failures. This hybrid algorithm combines advantages from both approaches: it guarantees deterministic termination if the failure detector is accurate, and probabilistic termination otherwise. In executions with no failures or failure detector mistakes, the most likely ones in practice, Consensus is reached in only two asynchronous rounds.
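As a loose illustration of the hybrid idea (our sketch, not the algorithm from the paper), a process's rule for updating its estimate in one round might look as follows in Python: follow the coordinator while the failure detector trusts it, and fall back on a coin flip when it does not, which is what yields deterministic termination under accuracy and probabilistic termination otherwise.

    import random

    def next_estimate(coordinator_value, coordinator_suspected,
                      coin=random.random):
        # Deterministic path: the failure detector looks accurate, so adopt
        # the coordinator's proposal.
        if not coordinator_suspected and coordinator_value is not None:
            return coordinator_value
        # Randomized fallback (Ben-Or style): progress no longer depends on
        # the detector being right; termination now holds with probability 1.
        return 0 if coin() < 0.5 else 1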
<Abstract> We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show that, among failure detectors that output lists of suspects, the weakest one that can be used to solve this problem is <>P, a failure detector that cannot be implemented. To overcome this difficulty, we introduce an implementable failure detector called Heartbeat and show that it can be used to achieve quiescent reliable communication. Heartbeat is novel: in contrast to typical failure detectors, it does not output lists of suspects and it is implementable without timeouts. With Heartbeat, many existing algorithms that tolerate only process crashes can be transformed into quiescent algorithms that tolerate both process crashes and message losses. This can be applied to consensus, atomic broadcast, k-set agreement, atomic commitment, etc.
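The contrast with suspect lists can be sketched in a few lines of Python (ours, assuming a fully connected network; the paper's construction also forwards heartbeats along network paths): Heartbeat outputs one unbounded, nondecreasing counter per process, and no timeout is ever consulted.

    class Heartbeat:
        """Illustrative sketch: the output is a vector of counters, not a
        list of suspects. A correct, reachable process's counter grows
        forever; a crashed process's counter eventually stops changing."""

        def __init__(self, processes):
            self.count = {p: 0 for p in processes}

        def on_heartbeat(self, sender):
            self.count[sender] += 1  # monotonically nondecreasing output

        def output(self):
            return dict(self.count)  # applications read counters, not suspects

A retransmission module can then resend a message m to p each time p's counter increases and stop once p acknowledges m; if p has crashed, its counter eventually stops changing and the sender falls silent, which is the quiescence property.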
<Abstract> We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice --- those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3d time and with 4n messages, where d is the maximum message delay and n is the number of processes in the system.
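For the variant that uses stable storage, the key step is that a process logs its round number and current estimate before replying, so that after a crash and recovery it can rejoin without violating agreement; a hedged Python sketch (file format and names are ours, not the paper's) is:

    import json, os

    def persist_state(path, round_no, estimate):
        """Log the consensus-critical state to stable storage before acting
        on it; illustrative only, under the stated assumptions."""
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"round": round_no, "estimate": estimate}, f)
            f.flush()
            os.fsync(f.fileno())  # force the write to stable storage
        os.replace(tmp, path)     # atomically install the new state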
<Abstract> Uniform Reliable Broadcast is a communication primitive that requires that if a process delivers a message, then all correct processes also deliver this message. In a PODC'99 paper, Halpern and Ricciardi use Knowledge Theory to determine what failure detectors are necessary to implement this primitive in asynchronous systems with process crashes and lossy links that are fair. In this paper, we revisit this problem using a different approach, and provide a result that is simpler, more intuitive, and, in a precise sense, more general.
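For intuition about why this primitive needs information about failures at all, recall the textbook majority-based delivery rule (a standard sketch, not this paper's construction): with fair lossy links, a process may safely deliver m once more than n/2 processes have relayed m, because if a majority of processes is correct, at least one relayer is correct and will keep forwarding m until every correct process receives it.

    def can_deliver(relayers, n):
        """Textbook majority rule for Uniform Reliable Broadcast; assumes a
        majority of correct processes. Names are ours, for illustration."""
        return len(relayers) > n // 2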
An extended abstract will appear in the International Conference on Dependable Systems and Networks (ICDSN/FTCS-30), June 2000.
<Abstract> We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies (a) how fast the failure detector detects actual failures, and (b) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyse its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we briefly explain how to make our failure detector adaptive, so that it automatically reconfigures itself when there is a change in the probabilistic behavior of the network.
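As an illustration of the flavor of the new algorithm (a simplified sketch in its spirit; the parameter names are ours): the monitored process sends heartbeat i at time i*eta, and the detector deems heartbeat i fresh until time i*eta + delta, trusting the process exactly while some received heartbeat is still fresh. Tuning eta and delta is then what trades detection time against the rate and duration of false detections.

    def trusts(now, eta, delta, latest_seq_received):
        """Freshness-point rule, simplified for illustration: heartbeat i is
        sent at time i*eta and stays fresh until i*eta + delta. Trust the
        monitored process at time `now` iff the newest heartbeat received
        is still fresh. Parameters and names are assumptions of this sketch."""
        if latest_seq_received is None:
            return False  # nothing received yet: do not trust
        return now < latest_seq_received * eta + delta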