Wednesday, June 13, 2007

A brief history of Consensus, 2PC and Transaction Commit.

This is a potted history of consensus, transactions and 2PC. Reading the literature on consensus is difficult because the language changes (consensus was originally called agreement), the results come in an order that isn't logical, and the whole framework for describing distributed algorithms evolved in parallel with the work. Also, there are few books other than Lynch's Distributed Algorithms that cover the subject.

Papers are discussed in the order that makes most sense, not in the order they were published.

The first instance of the consensus problem that I am aware of is in Lamport's "Time, Clocks and the Ordering of Events in a Distributed System" (1978), though it is not explicitly declared as a consensus or agreement problem. In this paper Lamport discusses how messages take a finite time to travel between processors and draws an analogy with Einstein's special relativity. Discussing Einstein's theory with respect to distributed systems is popular recently in the blogsphere, but in 1978 Lamport give a complete analysis with space-time diagrams and all. The issue is that in a distributed system you cannot tell if event A happened before event B, unless A caused B in some way. Each observer can see events happen in a different order, except for events that cause each other, ie there is only a partial ordering of events in a distributed system. Lamport defines the "happens before" relationship and operator, and goes on to give an algorithm that provides a total ordering of events in a distributed system, so that each process sees events in the same order as every other process.

Lamport also introduces the concept of a distributed state machine: start a set of deterministic state machines in the same state and then make sure they process the same messages in the same order. Each machine is now a replica of the others. The key problem is making each replica agree what is the next message to process: a consensus problem. This is what the algorithm for creating a total ordering of events does, it provides an agreed ordering for the delivery of messages. However, the system is not fault tolerant; if one process fails that others have to wait for it to recover.

Around the same time as this paper, Gray described 2PC in "Notes on Database Operating Systems" (1979). Unfortunately 2PC would block if the TM (Transaction Manager) fails at the wrong time. Skeen showed in "NonBlocking Commit Protocols" (1981)that for a distributed transactions you needed a 3 phrase commit algorithm to avoid the blocking problems associated with 2PC. The problem was coming up with a nice 3PC algorithm, this would only take nearly 25 years!

Fischer, Lynch and Paterson showed that distributed consensus was impossible in an asynchronous system with just one faulty process in "Impossibility of distributed consensus with one faulty process" (1985), this famous result is known as the "FLP" result. By this time "consensus" was the name given to the problem of getting a bunch of processors to agree a value. In an asynchronous system (where processors run at arbitrary speeds and messages can take an arbitrarily long time to travel between processors) with a perfect network (all messages are delivered, messages arrive in order and can not be duplicated) distributed consensus is impossible with just one faulty process (even just a fail-stop). The kernel of the problem is that you cannot tell the difference between a process that has stopped and one that is running very slowly, making dealing with faults in an asynchronous system almost impossible. The paper was also important because it demonstrated how to show something was impossible: show that all algorithms that solve the problem must have some property, then show that this property is impossible, ie proof by contradiction. (This approach was only re-learned as Turing used it in the halting problem)

By this stage people realized that a distributed algorithm has two properties: safety and liveness. Safety means nothing bad happens, while liveness means that something good eventually happens. 2PC is an asynchronous consensus algorithm, all processes must agree on either commit or abort for a transaction. 2PC is safe: no bad data is ever written to the databases, but its liveness properties aren't great: if the TM fails at the wrong point the system will block.

Also by this stage people thought of distributed systems as being synchronous (processes run at known rates, and messages are delivered in known bounds of time) or asynchronous (processes run at unknown and arbitrary rates, and messages can take unbounded time to be delivered). The asynchronous case is more general than the synchronous case: an algorithm that works for an asynchronous system will also work for a synchronous system, but not vice versa. You can treat a synchronous system as a special case of an asynchronous system that just happens to have bounds on the time it takes to deliver a message.

Before FLP, there was the "The Byzantine Generals Problem" (1982) paper. In this form of the consensus problem the processes can lie, and they can actively try to deceive other processes. This problem looks harder than the FLP result, but it does have a solution for the synchronous case (though when the Byzantine Generals paper was written the distinction between asynchronous and synchronous systems was not explicit). The solution is expensive in the number of messages exchanged, and the number of rounds of messages required. The problem originally came from the aerospace industry: what would happen if sensors gave false information on an plane (clearly the system could be treated as synchronous).

In 1986 there was a get together of the distributed systems people who were interested in consensus and the transaction people. At the time the best consensus algorithm was the Byzantine Generals, but this was too expensive to use for transactions. Jim Gray wrote up a note on the meeting: "A Comparison of the Byzantine Agreement Problem and the Transaction Commit Problem." (1987) .

The paper contains this in the introduction :-)

"Prior to the conference, it was widely believed that the transaction commit problem faced by distributed systems is a degenerate form of the Byzantine Generals Problem studied by academe. Perhaps the most useful consequence of the conference was to show that these two problems have little in common."

Eventually distributed transactions would be seen as a version of consensus, called uniform consensus (see "Uniform consensus is harder than consensus" (2000)). With uniform consensus all processes must agree on a value, even the faulty ones - a transaction should only commit if all RMs are prepared to commit. Most forms of consensus are only concerned with having the non-faulty processes agree. Uniform consensus is more difficult than general consensus.

Eventually Lamport came up with the Paxos consensus algorithm, described in "The Part-Time Parliament" (submitted in 1990, published 1998). Unfortunately the analogy with Greek democracy failed badly with people finding the paper very difficult to understand, and the paper was ignored until its case was taken up by Butler Lampson in "How to Build a Highly Availability System using Consensus" (1996). This paper provides a good introduction to building fault tolerant systems and Paxos. Later Lamport would publish "Paxos Made Simple (2001).

The kernel of Paxos is that given a fixed number of processes, any majority of them must have at least one process in common. For example given three processes A, B and C the possible majorities are: AB, AC, or BC. If a decision is made when one majority is present eg AB, then at any time in the future when another majority is available at least one of the processes can remember what the previous majority decided. If the majority is AB then both processes will remember, if AC is present then A will remember and if BC is present then B will remember.

Paxos can tolerate lost messages, delayed messages, repeated messages, and messages delivered out of order. It will reach consensus if there is a single leader for long enough that the leader can talk to a majority of processes twice. Any process, including leaders, can fail and restart; in fact all processes can fail at the same time, the algorithm is still safe. There can be more than one leader at a time.

Paxos is an asynchronous algorithm; there are no explicit timeouts. However, it only reaches consensus when the system is behaving in a synchronous way, ie messages are delivered in a bounded period of time; otherwise it is safe. There is a pathological case where Paxos will not reach consensus, in accordance to FLP, but this scenario is relatively easy to avoid in practice.

Clearly dividing systems into synchronous and asynchronous is too broad a distinction, and Dwork, Lynch and Stockmeyer defined partially synchronous systems in "Consensus in the presence of partial synchrony" (1988) . There are two versions of partial synchronous system: in one processes run at speeds within a known range and messages are delivered in bounded time but the actual values are not known a priori; in the other version the range of speeds of the processes and the upper bound for message deliver are known a priori, but they will only start holding at some unknown time in the future. The partial synchronous model is a better model for the real world than either the synchronous or asynchronous model; networks function in a predicatable way most of the time, but occasionally go crazy.

Lamport and Gray went on to apply Paxos to the distributed transaction commit problem in "Consensus on Transaction Commit" (2005). They used Paxos to effectively replicate the TM of 2PC, and used an instance of Paxos for each RM involved in the transaction to agree whether that RM could commit the transaction. On the face of it, using an instance of Paxos per RM looks expensive, but it turns out that it is not. Paxos Commit will complete in two phases for the fault free case, ie it has the same message delay as 2PC, though more messages are exchanged. A third phase is only required if there is a fault, in accordance to the Skeen result. Given 2n+1 TM replicas Paxos Commit will complete with up to n faulty replicas. Paxos Commit does not use Paxos to solve the transaction commit problem directly, ie it is not used to solve uniform consensus, rather it is used to make the system fault tolerant.

Any argument that distributed transactions should not be used because 2PC is blocking is a void, because Paxos Commit addresses the blocking issue.

Recently there has been some discussion of the CAP conjecture: Consistency, Availability and Partition. The conjecture asserts that you cannot have all three in a distributed system: a system that is consistent, that can have faulty processes and that can handle a network partition.

We can examine CAP by equating consistency with consensus. For an asynchronous system we cannot reach consensus with one faulty process, FLP, so we cannot have consistency and availability for an asynchronous system!

Now take a Paxos system with three nodes: A, B and C. We can reach consensus if two nodes are working, ie we can have consistency and availability. Now if C becomes partitioned and C is queried, it cannot respond because it cannot communicate with the other nodes; it doesn't know whether it has been partitioned, or if the other two nodes are down, or if the network is being very slow. The other two nodes can carry on, because they can talk to each other and they form a majority. So for the CAP conjecture, Paxos does not handle a partition because C cannot respond to queries. However, we could engineer our way around this. If we are inside a data center we can use two independent networks (Paxos doesn't mind if messages are repeated). If we are on the internet, then we could have our client query all nodes A, B and C, and if C is partitioned the client can query A or B unless it is partitioned in a similar way to C.

For a synchronous network, if C is partitioned it can learn that it is partitioned if it does not receive messages in a fixed period of time, and thus can declare itself down to the client.

Paxos, Paxos Commit and HTTP/REST have been combined to build a highly available co-allocation system for Grid computing, details of which can be found here HARC, there are also more references in this paper: "Co-Allocation, Fault Tolerance and Grid Computing" (2006).


Blogger Unknown said...

The comment "Any argument that distributed transactions should not be used because 2PC is blocking is a void, because Paxos Commit addresses the blocking issue" is not true. Paxos Commit is still a blocking protocol - if an RM votes "prepare" and the network fails or if the leader is not operational for long enough to talk to a majority of processes twice then the RM will be in a blocked state for potentially an indefinite period.

9:54 AM  
Blogger Mark Mc Keown said...

Hi Dean,

Once an RM has issued a prepared and then fails, then it is up to the RM to find out whether to commit or abort the transaction when it recovers. When an RM is down, Paxos Commit does not block, any transactions involving that RM will be aborted.

When a Leader fails, another acceptor should take over as Leader and try and get the transaction commited or aborted.

Paxos Commit will block under the following conditions: if there is not a majority of acceptors available, or if there are two active leaders trying to get different outcomes choosen. The second case is unlikely because any new leader in Paxos Commit will try and get Abort choosen as the outcome for the transaction.

5:39 AM  
Blogger Unknown said...

Hi Mark,

So you agree that Paxos Commit will block under 2 possible conditions. The other situation that the RM waits for potentially an indefinate period from knowing whether a transaction has committed or aborted after it has voted "yes" is if the network is down and can't communicate with any of the acceptors.

My concern with using Paxos Commit is that resource providers may have to wait indefinitely to find out if a tx has committed or aborted while in the prepared state no matter how small the probability. This is especially true when working across administrative/trust domains.


3:57 AM  
Blogger Mark Mc Keown said...

Hi Dean,
Given 5 acceptors with a mean time to failure of 120 days and a mean time to repair of 1 day, then the mean time to Paxos blocking would be 160 years. Whether this value is achievable, and what are the actual circumstances that are required to achieve it, are the key questions.


3:55 AM  
Anonymous Anonymous said...

look this is the "diet" i told you about you should really enter the site :) bye enter the site

7:09 PM  
Anonymous Anonymous said...

hi mate, this is the canadin pharmacy you asked me about: the link

10:19 AM  
Blogger Murat said...


Thank you so much for writing such a clean account of work on consensus. I really enjoyed reading it. The links are also a big plus. I wasn't aware of the consensus on transaction commit paper.

11:53 AM  
Blogger Unknown said...

Hello there!

My comment is off-topic but I was googling for Taerna and security and I ended up in your blog!

I was wondering if you have many progresses in this area or if the Taverna team has provided with any new developments.

I need to implement security in Taverna (and I need it for yesterday!) so if you any development/advice/link/whatever that you wish to share please send it to me:

Thanks in advance,
Pedro Lopes

3:32 AM  
Anonymous Anonymous said...


6:08 AM  
Anonymous Anonymous said...

最豐滿最好之稻穗,便最貼近地面 ..................................................

12:40 AM  
Anonymous Anonymous said...

this kind of blog always useful for blog readers, it helps people during research. your post is one of the same for blog readers.

Thesis Papers Writing

5:26 AM  
Blogger Christina Gomes said...

This is actually true when one leader fails the other leaders will take place all his/her assets.
student loan

10:36 PM  
Blogger Christina Gomes said...

obviously if one leader lose the other one take place.
dixv downloads

1:17 AM  
Blogger James said...

I think you should take help from any experienced person.
Online Muslim Matrimonials

11:16 PM  
Blogger Make Shake said...

Thanks for the explanations and resources you have provided here. I am finding it a bit hard to follow it but I believe I will get there. vehicle insurance companies

10:25 AM  
Blogger luck said...

This is a good post. This post gives truly quality information. I’m definitely going to look into it. Really very useful tips are provided here. Thank you so
Round Rock Homes

7:40 PM  
Blogger Dembala said...

I have joined your rss feed and sit up for in the hunt for extra of your excellent post. Also, I have shared your web site in my social networks
my car insurance quotes

7:37 AM  
Blogger Unknown said...

hello!! Very interesting discussion glad that I came across such informative post. Keep up the good work friend. Glad to be part of your net community.
advertising | top advertising agencies in Pakistan | Marketing Agency | Advertisement

3:58 AM  
Blogger Adam Dziedzic said...

Good read, thank you! Regarding the FLP theorem, your statement about the network assumptions "messages arrive in order" is incorrect. The paper on FLP: "Impossibility of Distributed Consensus with One Faulty Process" assumes that: "messages can be delayed, arbitrarily long, and delivered out of order". Moreover, the out of order message delivery assumption is used in the proof.

2:08 PM  
Blogger Martin said...

Thanks SS-web Solution! For remove my stress that the SEO best or a SEM Best. Now Know the both are important when it comes to marketing your business, but there is a difference between the two. This blog post will discuss what SEO and SEM are, how they work together for online success, and why you need them in your marketing strategy…….Read More.

12:33 AM  
Blogger SHF Collection said...

Great website, looks very clean and organized. Keep up the good work!
l shaped sofa design

1:12 PM  

Post a Comment

<< Home