Example: A Membership Protocol



next up previous
Next: Protocol Properties and Up: A Framework for Protocol Previous: Common Protocol Interface

Example: A Membership Protocol

 

The Horus membership protocol, MBRSHIP, shows most of the special features of the Horus Common Protocol Interface. Consider a group of communicating processes. Because of various conditions, not all member processes in the group can communicate with each other at all times. Processes may crash, or the network may partition. Thus a process may not be assured that a message it sends is received by all destination members. Nor can a process be assured that a message it receives is received by other members in the destination set. This introduces a collection of failure scenarios that is difficult to deal with.

The MBRSHIP layer simulates an environment for the members of a group in which members can only fail (they cannot be slow or get disconnected) and messages do not get lost. Each member has a notion of the current view, which is an ordered list of the members. Each member in the current view is guaranteed either to accept that same view, or to be removed from that view. Messages sent in the current view are delivered to the surviving members of the current view, and messages received in the current view are received by all surviving members in the current view. This is called virtual synchrony, because all members that can communicate appear to see a failure at the same logical time, significantly reducing the number of failure scenarios.

Virtual synchrony is best understood as a simulation of fail-stop behavior-members excluded from the view may still be alive. When communication is restored, views may be merged using the merge downcall. Only if MBRSHIP were used with a perfect failure detector would this simulation be ``accurate.'' MBRSHIP relies only on reliable, FIFO ordering of messages.

At the heart of the MBRSHIP layer is the flush protocol. The flush protocol is run when a member crash is detected, or when views merge. One of the members (usually the oldest surviving member of the oldest view) is elected as the coordinator of the flushgif (see Figure 2). The coordinator broadcasts a FLUSH message to the (surviving) members in its view. All members first return any messages from failed members that are not known to have been delivered everywhere. These messages are called unstable (note that it is necessary that all members log all unstable messages). Finally, each member returns a FLUSH_OK reply message. Subsequently, the members ignore messages that they may receive from supposedly failed members, and await another VIEW installation.

  
Figure 2: This picture shows four processes: A, B, C, and D. D crashes right after sending a message M, and only C received a copy. After the crash is detected, A starts the flush protocol by multicasting to B and C. C sends a copy of M to A, which forwards it to B. After A has received replies from everyone, it installs a new view by multicasting.

Upon receiving all FLUSH_OK replies, the coordinator broadcasts any messages from failed members that are still unstable. At this point a new view may be installed. When all messages stabilize, the flush is completed. If processes fail during the process, a new round of the flush protocol may start up immediately.

Although the MBRSHIP layer is able to do its own failure recovery, it allows for external failure detection. In this case, an external service picks up communication problem-reports and other failure information, and decides whether a process is to be considered faulty or not. The output of this service can be fed to all instances of the MBRSHIP layer, so that the corresponding groups have the same (consistent) view of the environment.

The MBRSHIP and MERGE layers raise an interesting issue concerning the handling of partitioning failures in Horus. We return to this question below, in Section 9.



next up previous
Next: Protocol Properties and Up: A Framework for Protocol Previous: Common Protocol Interface



Robbert VanRenesse
Mon May 15 12:16:43 EDT 1995