Opinion
Architecture and Hardware Last byte

Q&A: Divide and Conquer

Leslie Lamport on Byzantine generals, clocks, and other tools for reasoning about concurrent systems.
Posted
  1. Article
  2. Author
2013 ACM A.M. Turing Award recipient Leslie Lamport

After following a somewhat circuitous path to the field, ACM A.M. Turing Award recipient Leslie Lamport has spent much of his career pioneering methods to help computer scientists realize the promise of distributed systems. In industrial research positions at Massachusetts Computer Associates, SRI International, DEC, and most recently at Microsoft Research, Lamport defined problems and developed both algorithms and formal modeling and verification protocols that are fundamental to the theory of distributed systems. His solutions are as elegant as they are practical.

You majored in math at MIT, then continued in the subject in graduate school at Brandeis. When did you start working with computers?

When I studied math, the field was under the influence of a mathematician named G.H. Hardy who, as I understand it, is responsible for the view that math should be pure, abstract, and as divorced as possible from the real world. That appealed to me initially, but when I went to grad school, I began to feel differently. I then took a break and spent four years teaching math at Marlboro College, a small liberal arts college in Vermont. There, I got interested in physics and went back to grad school intending to get into mathematical physics. Meanwhile, I supported myself by working at Massachusetts Computer Associates, which everyone called Compass. That is where I wound up doing what is now called computer science. I don’t think it had a name at that time.

How did you come to be involved with parallel computing?

Compass got a contract to build the FORTRAN compiler for the ILLIAC computer. The ILLIAC was a computer with 64 processors, and they needed a way for ordinary FORTRAN to be able to make use of 64 processors executing in parallel. I did the theory and the basic algorithms. I think that was probably the first thing I did that might be called real computer science.

At Compass, you read a paper by Paul Johnson and Bob Thomas that introduced the idea of using message timestamps to order the commands that are issued in a distributed system, and realized their algorithm violated causality. The resulting paper, "Time, Clocks and the Ordering of Events in a Distributed System," is still among the field’s most-cited.

In that paper, I introduced a causality ordering, in which one event precedes another if it can causally affect it. I thus modified their way of keeping clocks so that if one message could causally affect another message, it would wind up with a lower timestamp. Once you have this ordering of messages, you can use it to implement a state machine. A state machine is a mathematical abstraction of a digital system as something with a state that takes inputs; an input causes it to change to a new state and generate an output. So I realized that you could implement an arbitrary distributed system by describing what it is supposed to do as a state machine, using an algorithm that implements an arbitrary state machine in a distributed system. The simplest example I could think of was to write a state machine which solved Dijkstra’s mutual exclusion problem.

Several years later, you popularized the so-called Byzantine Generals Problem—in which a group of generals, some of whom may be traitors, have to reach a common decision—and invented a solution based on digital signatures.

The "Time, Clocks" paper introduced the idea of implementing a distributed system using a state machine, but that algorithm assumed no failures. Yet we build distributed systems because we expect things to fail. These days, when we talk about failures, we mean computers crashing. At the time, that body of practice did not exist, so I considered arbitrary failures, where a computer might produce a wrong answer.

Most people thought that to handle one bad computer, you had to have a total of three and use majority voting. But in a distributed system, a computer might tell one computer that the answer is 5, and the other computer that the answer is 2. So you actually need four computers.

However, if you use digital signatures, you can use three computers. If one process signs those messages, those two other processes can tell that it is faulty, because it signed messages with two different values. The original paper didn’t talk about Byzantine generals; it talked about interactive consistency. But I thought this was a really important problem, and I realized that giving it this catchy notion of the Byzantine generals would make the problem more popular, so I wrote another paper with the same authors, Marshall Pease and Robert Shostak.


The Parliament of Paxos story "was a complete disaster. That paper would have died except for Butler Lampson, who understood its practical significance."


Another famous metaphor of yours is the Part-time Parliament on the fictional island of Paxos.

The people at DEC were building a distributed file system. When they explained the properties they wanted—an asynchronous algorithm that would maintain consistency despite any number of non-Byzantine faults, and would make progress if any majority of the processors were working—I thought it was impossible. So I set about trying to prove that it could not be done, and I failed by discovering the algorithm, which was independently discovered by Brian Oki and Barbara Liskov around the same time.

Yet a fully asynchronous system must either never make a mistake, but perhaps never get anything done, or else be guaranteed to get something done but perhaps make some mistakes—the so-called "safety" and "liveness" properties that you coined.

Safety properties assert that the program does not do anything bad, while liveness properties assert that it eventually does something. Paxos guarantees that, assuming only crash failures, the system never does anything wrong, but in principle it is possible for it to never succeed in doing anything. In practice, as long as the system is not behaving completely erratically, things will eventually get done.

So based on the success of the Byzantine generals, you invented the story about Paxos and the parliament.

It was a complete disaster. That paper would have died except for Butler Lampson, who understood its practical significance. Eventually, other people caught on.

Let’s talk about LaTeX, the typesetting system you designed. If I understand correctly, you were writing a book yourself and were unhappy with the macros available at the time for TeX?

The system that I had been using was called Scribe. And it was really easy to use, and then TeX came along, and TeX was very hard to use, but it was also much more powerful. So my basic idea was to try to build macros that would make TeX as easy to use as Scribe.

You must have learned a lot about typography.

I think people have this sense that the purpose of typographic design is to make something look nice, whereas the real purpose is to make it easier to read.

You also introduced—and continue to refine—a powerful formalism for describing and reasoning about concurrent systems known as Temporal Logic of Actions, or TLA.

Amir Pnueli introduced temporal logic to computer science, and people quickly started using it as a specification language. I soon realized it was exactly what you wanted to use to reason about liveness.

Amir’s original logic was beautiful and simple, but it was not powerful enough to express the things that we wanted to express in writing specifications.

The approach most people took was to develop more complicated temporal operators, but I realized that the best way to specify safety properties is with my old friend the state machine. And then I had the wonderful idea of extending Amir’s logic not by extending the temporal operators, but by extending the class of formulas to which you apply the temporal operators. So I was able to describe the state machine in a very straightforward and practical way as a temporal logic formula, and then adding the liveness condition was just another part of the formula, so I could describe the entire specification as a single temporal logic formula, but in a practical way.

Back to Top

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More