Wednesday, June 13, 2007

A brief history of Consensus, 2PC and Transaction Commit.

This is a potted history of consensus, transactions and 2PC. Reading the literature on consensus is difficult because the language changes (consensus was originally called agreement), the results come in an order that isn't logical, and the whole framework for describing distributed algorithms evolved in parallel with the work. Also, there are few books other than Lynch's Distributed Algorithms that cover the subject.

Papers are discussed in the order that makes most sense, not in the order they were published.

The first instance of the consensus problem that I am aware of is in Lamport's "Time, Clocks and the Ordering of Events in a Distributed System" (1978), though it is not explicitly declared as a consensus or agreement problem. In this paper Lamport discusses how messages take a finite time to travel between processors and draws an analogy with Einstein's special relativity. Discussing Einstein's theory with respect to distributed systems is popular recently in the blogsphere, but in 1978 Lamport give a complete analysis with space-time diagrams and all. The issue is that in a distributed system you cannot tell if event A happened before event B, unless A caused B in some way. Each observer can see events happen in a different order, except for events that cause each other, ie there is only a partial ordering of events in a distributed system. Lamport defines the "happens before" relationship and operator, and goes on to give an algorithm that provides a total ordering of events in a distributed system, so that each process sees events in the same order as every other process.

Lamport also introduces the concept of a distributed state machine: start a set of deterministic state machines in the same state and then make sure they process the same messages in the same order. Each machine is now a replica of the others. The key problem is making each replica agree what is the next message to process: a consensus problem. This is what the algorithm for creating a total ordering of events does, it provides an agreed ordering for the delivery of messages. However, the system is not fault tolerant; if one process fails that others have to wait for it to recover.

Around the same time as this paper, Gray described 2PC in "Notes on Database Operating Systems" (1979). Unfortunately 2PC would block if the TM (Transaction Manager) fails at the wrong time. Skeen showed in "NonBlocking Commit Protocols" (1981)that for a distributed transactions you needed a 3 phrase commit algorithm to avoid the blocking problems associated with 2PC. The problem was coming up with a nice 3PC algorithm, this would only take nearly 25 years!

Fischer, Lynch and Paterson showed that distributed consensus was impossible in an asynchronous system with just one faulty process in "Impossibility of distributed consensus with one faulty process" (1985), this famous result is known as the "FLP" result. By this time "consensus" was the name given to the problem of getting a bunch of processors to agree a value. In an asynchronous system (where processors run at arbitrary speeds and messages can take an arbitrarily long time to travel between processors) with a perfect network (all messages are delivered, messages arrive in order and can not be duplicated) distributed consensus is impossible with just one faulty process (even just a fail-stop). The kernel of the problem is that you cannot tell the difference between a process that has stopped and one that is running very slowly, making dealing with faults in an asynchronous system almost impossible. The paper was also important because it demonstrated how to show something was impossible: show that all algorithms that solve the problem must have some property, then show that this property is impossible, ie proof by contradiction. (This approach was only re-learned as Turing used it in the halting problem)

By this stage people realized that a distributed algorithm has two properties: safety and liveness. Safety means nothing bad happens, while liveness means that something good eventually happens. 2PC is an asynchronous consensus algorithm, all processes must agree on either commit or abort for a transaction. 2PC is safe: no bad data is ever written to the databases, but its liveness properties aren't great: if the TM fails at the wrong point the system will block.

Also by this stage people thought of distributed systems as being synchronous (processes run at known rates, and messages are delivered in known bounds of time) or asynchronous (processes run at unknown and arbitrary rates, and messages can take unbounded time to be delivered). The asynchronous case is more general than the synchronous case: an algorithm that works for an asynchronous system will also work for a synchronous system, but not vice versa. You can treat a synchronous system as a special case of an asynchronous system that just happens to have bounds on the time it takes to deliver a message.

Before FLP, there was the "The Byzantine Generals Problem" (1982) paper. In this form of the consensus problem the processes can lie, and they can actively try to deceive other processes. This problem looks harder than the FLP result, but it does have a solution for the synchronous case (though when the Byzantine Generals paper was written the distinction between asynchronous and synchronous systems was not explicit). The solution is expensive in the number of messages exchanged, and the number of rounds of messages required. The problem originally came from the aerospace industry: what would happen if sensors gave false information on an plane (clearly the system could be treated as synchronous).

In 1986 there was a get together of the distributed systems people who were interested in consensus and the transaction people. At the time the best consensus algorithm was the Byzantine Generals, but this was too expensive to use for transactions. Jim Gray wrote up a note on the meeting: "A Comparison of the Byzantine Agreement Problem and the Transaction Commit Problem." (1987) .

The paper contains this in the introduction :-)

"Prior to the conference, it was widely believed that the transaction commit problem faced by distributed systems is a degenerate form of the Byzantine Generals Problem studied by academe. Perhaps the most useful consequence of the conference was to show that these two problems have little in common."

Eventually distributed transactions would be seen as a version of consensus, called uniform consensus (see "Uniform consensus is harder than consensus" (2000)). With uniform consensus all processes must agree on a value, even the faulty ones - a transaction should only commit if all RMs are prepared to commit. Most forms of consensus are only concerned with having the non-faulty processes agree. Uniform consensus is more difficult than general consensus.

Eventually Lamport came up with the Paxos consensus algorithm, described in "The Part-Time Parliament" (submitted in 1990, published 1998). Unfortunately the analogy with Greek democracy failed badly with people finding the paper very difficult to understand, and the paper was ignored until its case was taken up by Butler Lampson in "How to Build a Highly Availability System using Consensus" (1996). This paper provides a good introduction to building fault tolerant systems and Paxos. Later Lamport would publish "Paxos Made Simple (2001).

The kernel of Paxos is that given a fixed number of processes, any majority of them must have at least one process in common. For example given three processes A, B and C the possible majorities are: AB, AC, or BC. If a decision is made when one majority is present eg AB, then at any time in the future when another majority is available at least one of the processes can remember what the previous majority decided. If the majority is AB then both processes will remember, if AC is present then A will remember and if BC is present then B will remember.

Paxos can tolerate lost messages, delayed messages, repeated messages, and messages delivered out of order. It will reach consensus if there is a single leader for long enough that the leader can talk to a majority of processes twice. Any process, including leaders, can fail and restart; in fact all processes can fail at the same time, the algorithm is still safe. There can be more than one leader at a time.

Paxos is an asynchronous algorithm; there are no explicit timeouts. However, it only reaches consensus when the system is behaving in a synchronous way, ie messages are delivered in a bounded period of time; otherwise it is safe. There is a pathological case where Paxos will not reach consensus, in accordance to FLP, but this scenario is relatively easy to avoid in practice.

Clearly dividing systems into synchronous and asynchronous is too broad a distinction, and Dwork, Lynch and Stockmeyer defined partially synchronous systems in "Consensus in the presence of partial synchrony" (1988) . There are two versions of partial synchronous system: in one processes run at speeds within a known range and messages are delivered in bounded time but the actual values are not known a priori; in the other version the range of speeds of the processes and the upper bound for message deliver are known a priori, but they will only start holding at some unknown time in the future. The partial synchronous model is a better model for the real world than either the synchronous or asynchronous model; networks function in a predicatable way most of the time, but occasionally go crazy.

Lamport and Gray went on to apply Paxos to the distributed transaction commit problem in "Consensus on Transaction Commit" (2005). They used Paxos to effectively replicate the TM of 2PC, and used an instance of Paxos for each RM involved in the transaction to agree whether that RM could commit the transaction. On the face of it, using an instance of Paxos per RM looks expensive, but it turns out that it is not. Paxos Commit will complete in two phases for the fault free case, ie it has the same message delay as 2PC, though more messages are exchanged. A third phase is only required if there is a fault, in accordance to the Skeen result. Given 2n+1 TM replicas Paxos Commit will complete with up to n faulty replicas. Paxos Commit does not use Paxos to solve the transaction commit problem directly, ie it is not used to solve uniform consensus, rather it is used to make the system fault tolerant.

Any argument that distributed transactions should not be used because 2PC is blocking is a void, because Paxos Commit addresses the blocking issue.

Recently there has been some discussion of the CAP conjecture: Consistency, Availability and Partition. The conjecture asserts that you cannot have all three in a distributed system: a system that is consistent, that can have faulty processes and that can handle a network partition.

We can examine CAP by equating consistency with consensus. For an asynchronous system we cannot reach consensus with one faulty process, FLP, so we cannot have consistency and availability for an asynchronous system!

Now take a Paxos system with three nodes: A, B and C. We can reach consensus if two nodes are working, ie we can have consistency and availability. Now if C becomes partitioned and C is queried, it cannot respond because it cannot communicate with the other nodes; it doesn't know whether it has been partitioned, or if the other two nodes are down, or if the network is being very slow. The other two nodes can carry on, because they can talk to each other and they form a majority. So for the CAP conjecture, Paxos does not handle a partition because C cannot respond to queries. However, we could engineer our way around this. If we are inside a data center we can use two independent networks (Paxos doesn't mind if messages are repeated). If we are on the internet, then we could have our client query all nodes A, B and C, and if C is partitioned the client can query A or B unless it is partitioned in a similar way to C.

For a synchronous network, if C is partitioned it can learn that it is partitioned if it does not receive messages in a fixed period of time, and thus can declare itself down to the client.

Paxos, Paxos Commit and HTTP/REST have been combined to build a highly available co-allocation system for Grid computing, details of which can be found here HARC, there are also more references in this paper: "Co-Allocation, Fault Tolerance and Grid Computing" (2006).