The Byzantine Generals Problem

L. Lamport, R. Shostak and M. Pease. ACM Transactions on Programming Languages and Systems, 4(3):382-401, July 1982

Notes by Indranil Gupta, March 08, 1999.

Adapted from

original notes by Xun Wilson Huang.
original notes by Lawrence Kesteloot.

Thanks to: Xun Wilson Huang, Lawrence Kesteloot.

Problem Statement: Byzantine Generals Problem(BGP)

The setting: There are n generals, one of them the commanding general. Generals can send (and receive) messages from other generals.

The problem: Develop a communication protocol for the commanding general to send an order to the n-1 lieutenant generals so that

All loyal lieutenants obey the same order
If the commanding general is loyal, every loyal lieutenant obeys the order he sends.

Adversary: Any of the generals could be traitors i.e., could send inconsistent messages regarding the order to other generals.

Impossibility Results

For n = 3 generals and 1 traitor, there is no solution (protocol). This is because a loyal lieutenant cannot distinguish who is the traitor when he gets conflicting information from the commander and the other lieutenant. Let's call this the 3-Generals Problem.
BGP for n < 3m+1 generals and m traitors can be reduced to the 3 - generals problem, with each of the Byzantine generals simulating at most m lieutenants and taking the same decision as the loyal lieutenants they simulate. Thus BGP for n < 3m+1 and m traitors is not solvable.
Reaching approximation is as hard as reaching agreement.

I. A solution with oral messages for n > 3m

A solution for BGP with n > 3m and upto m traitors, is given.

Oral message system properties:

A1. Every message that is sent is delivered correctly. -> No message loss.
A2. The receiver of a message knows who sent it. -> Completely connected network with reliable links(due to A1).
A3. The absence of a message can be detected. -> Synchronous system only.

Every general can send a message to every other general.

Solution in brief:

uses a function majority which takes in a set of values and returns the value that is the majority among them (a possible implementation - median of the values).
uses 'rounds' in each of which a general broadcasts the value he has received in the earler round to all the other generals through whom the value has not passed before he received it.
when returning from the round, for each j, any two loyal lieutenants receive the same vector of values {v1, ... v(n-1)}. As the majority of the loyal lieutenants' values in these is ensured, applying the majority function on {v1, ... v(n-1)} to obtain vn preserves the above fact (that any two loyal lieutenants receive the same vector of values {v1, ... vn}). This ensures that BGP is solved.

Note: If the commander is not a traitor, we can be done in 2 rounds. If the commander is a traitor, you may need upto m+1 rounds.

II. A solution with (unforgable) signed messages

The difficulty of BGP is in the ability of a traitor lieutenant to lie about the commander's order. If we can restrict this ability by making the following assumptions, BGP is solvable with any number of traitors as long as their maximum number is known.

Signed messages:

A4. In addition to the 3 assumptions made in the solution with oral messages, we add the following assumption.
1. A loyal general's signature cannot be forged, any alteration can be detected. -> can drop a message, but can't change it
2. Any one can verify the authenticity of a signature. -> no one can fool a general

Again, assume a fully connected message graph among the generals.

Solution in brief:

Uses a majority-like function called choice.

The solution:

the commander sends a signed order to lieutenants
if a lieutenant receives an order from some one (either from commander directly, or from other lieutenants), he verifies it and then puts it in a set V if it's not already there. Relay the order if there are less than m distinct signatures on the order.
Everyone halts at round m+2, and use choice(V) as the desired action

The algorithm is to make all loyal lieutenants keep the same set of V, thus choice(V) is the same. If the commander is loyal, the algorithm works because all loyal lieutenants have the correct order by round 1 and by unforgablity no more orders can be produced. If the commander is not loyal, by running the algorithm to round m+1, at least one loyal lieutenant will get the order before round m( because there are only m traitors). And that loyal lieutenant will send it to all others. In short, if one loyal lieutenant gets an order, all loyal lieutenants will get it in the next round.

III. IV. Relaxing the assumption on full-connectivity of the generals graph - extending above solutions

The previous 2 solutions can be extended to relax the assumption that the message graph among the generals is fully connected.

Oral messages: Solution with oral messages is extended to solve BGP with upto m traitors in a p-regular graph with m>0 and p>3m-1.
Unforgable messages: Earlier solution with signed messages solves BGP with upto m traitors in (m+d-1) rounds, where d is the diameter of the subgraph of loyal generals. Assumption here: subgraph of loyal generals is connected (this can be relaxed by relaxing the problem statement of BGP)

Practical use of BGP in building real life systems

The best way to provide faul-tolerant decision-making in redundant systems is by majority voting. A faulty input devices may generate meaningless inputs, but majority voting would ensure that the same meaningless values are used.

For majority voting to yield a reliable system, the following 2 conditions must be satisified

All non-faulty processors must use the same input value
If input unit is non-faulty, then all non-faulty processes use the value it provides

But these are just the requirements of the BGP !

So we can apply the above solutions to the BGP in real-life. Now what about the practicality of the assumptions made by those solutions ?

About A1: In real life, link failures occur. However, link failures are indistinguishable with failures of processors, therefore we can count the link failures as one of the m. Signed message is insensitive to link failures because no message can be forged even if links fail.

About A2: What is actually required is that no traitor can forge a non-faulty process' message. A2 not needed in the solution with signed messages.

About A3: In an asynchronous system, this condition cannot be satisfied. It is usually implemented via time-outs.

About A4: Signing message has 2 aspects:

If processor is non-faulty, then no faulty processor can generate S(M). This can never be solved in real-life - only its probability of failure reduced.

Given M and X, any one can verify if X == S(M). This is doable in real world.

Further Observations

Optimizations for the BGP solution.
- combine messages to reduce the total number of messages.
- reduce the amount of information transferred.
BGP required in the most general undecidable case of process failure.
Solution presented is optimal because Fischer and Lynch have proved that any solution to the BGP necessarily has each lieutenant wait for a message that has passed through the hands of at least m generals after the commander.
Solutions to clock synchronization (needed for the implementation of above BGP solutions) - very similar to the solutions for BGP.
Further impossibility results
- BGP with messages transmitted arbitrarily quickly with upper bound on message transmission delay.
- Consensus with restricting traitors to fail-crash only.
BGP works but is inherently expensive, especially in terms of number of messages O(m !). So it's a trade-off between performance and reliability. If you want more reliability in the most general failure conditions, you have to settle for a (costly) BGP solution. If, however, you can relax the failure conditions in your systems (ex. assume only fail-crash processes in a synchronous system and leave it to God to ensure that), you can go for cheaper solutions.

Critique and Questions

Graph connectivity. Are p-regular topologies that frequent ? Can we extend the BGP solutions to any network topology ? Has it been extended to any other topologies ?
Value of m: How would one obtain a reasonable value for maximum m in a practical system (note that this maximum number is required even in the solution with signed messages).
Synchronous/asynchronous systems: How many synchronous system do we really use (SMP machines, and?) How about asynchronous systems ?
Further work after this paper:
- What other solutions to BGP have been proposed after this paper ?
- Has any attempt been made to extend the BGP solutions to asynchronous systems to ensure 'some degree/probability' of reliability ?
- Answers in next section ;-)
Bounds on best possible BGP solution (in terms of messages) ?

Further readings

Impossibility/necessity results

Fischer, M. J., Lynch, N. A., and Paterson, M. S. ``Impossibility of Distributed Consensus with One Faulty Process,'' J. ACM 32, 2 (April 1985), 374--382.
Dolev, D., Dwork, C., and Stockmeyer, L. ``On the Minimal Synchronism Needed for Distributed Consensus,'' J. ACM 34, 1 (January 1987), 77--97.

Approximate agreement

Bracha, G. ``An O(log n) Expected Rounds Randomized Byzantine Generals Protocol,'' J. ACM 34, 4 (October 1987), 910--920.
Bracha, G. and Toueg, S. ``Asynchronous Consensus and Broadcast Protocols,'' J. ACM 32, 4 (October 1985), 824--840.
Ben-Or, M. ``Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols,'' ACM Symposium on Principles of Distributed Computing, 1983, 27--30.
Dolev, D., Lynch, N. A., Pinter, S. S., Stark, E. W., and Weihl, W. E. ``Reaching Approximate Agreement in the Presence of Faults,'' J. ACM 33, 3 (July 1986), 499--516.
Dolev, D., Ruediger, R., and Strong, H. R. ``Early Stopping in Byzantine Agreement,'' J. ACM 37, 4 (October 1990), 720--741.
Hadzilacos, V. and Halpern, J. Y. ``Message-Optimal Protocols for Byzantine Agreement,'' ACM Symposium on Principles of Distributed Computing, 1991, 309--323.
Halpern, J. Y., Moses, Y., and Waarts, O. ``A Characterization of Eventual Byzantine Agreement,'' ACM Symposium on Principles of Distributed Computing, 1990, 333--346.

Failure detectors

Chandra, T. D., Hadzilacos, V., and Toueg, S. ``The Weakest Failure Detector for Solving Consensus,'' ACM Symposium on Principles of Distributed Computing, 1992, 147--158.
Chandra, T. D. and Toueg, S. ``Unreliable Failure Detectors for Asynchronous Systems,'' ACM Symposium on Principles of Distributed Computing, 1991, 325--340.