Q&A: The Power of Distribution

Drawn to the subject by its elegance, MIT professor Nancy Lynch has spent her career making sense of computational complexity while establishing the theoretical foundations of distributed computing. The FLP impossibility proof, among her best-known results, helped define the limitations of distributed systems. Input-output automata offered a valuable framework for verifying distributed algorithms. More recently, she has helped develop algorithms for dynamic networks and, during a fellowship year at the Radcliffe Institute for Advanced Study, begun to investigate a distributed approach to biological systems.

You were born into modest circumstances in Borough Park, Brooklyn. What drew you to math and computer science?

I don’t come from an academic family. But I did well in math and got into Hunter High School, which was, at the time, a school for gifted girls—it is now co-ed. At Hunter, I had a wonderful mentor, Dr. Harry Ruderman, who adopted me as his protégé and encouraged me to explore advanced math problems. Then I went to Brooklyn College, took the Putnam exam, and got a great deal of attention and encouragement from the math faculty because I ranked in the top 80 or so nationwide. And after all that, I got into MIT with an NSF graduate fellowship.

Was it at MIT you were introduced to the field of theoretical computer science?

Right away, I lucked into taking Hartley Roger’s course on recursive functions. I also took Seymour Papert’s course on automata theory. I took other classes, of course—traditional math courses like algebra and analysis—but when it came time to choosing a research project, it seemed like all the other topics were already very well developed, and that it would be hard to make a big contribution. So at that point, two years in, I moved toward the newer areas of computational complexity theory and algorithms, where there was much more opportunity to have an impact. I was lucky enough to join a new and active group working in these areas, led by Albert Meyer and Mike Fischer.

After you finished your Ph.D., you had a series of jobs in the math departments of Tufts, the University of Southern California, and Florida International.

Yes, my husband and I had a two-body problem, so we kind of moved around. But math departments were not hiring very much, so in 1977, I went to Georgia Tech as an associate professor of computer science. At Georgia Tech, I was surrounded by applied computer scientists, so I abandoned working in abstract complexity theory and started looking at computer systems. Distributed systems were just beginning to be important at that time, and there were other people at Georgia Tech who were interested in building them. I decided there must be some interesting mathematics to be developed, and began to work on developing a theory for distributed systems.

That work put you back in touch with Michael Fischer, with whom you had worked at MIT.

Yes, Mike and I started working together on this, going back and forth between Georgia Tech and the University of Washington, where he was at the time. We made a lot of progress on this new theory very quickly, and in 1981, on the strength of that work, I went to MIT on a sabbatical, got a tenured offer the next year, and stayed. And I have been here ever since.

Your most famous result in distributed computing is the so-called FLP impossibility proof of 1985, which proves that asynchronous systems cannot reach consensus in the presence of one or more failures. Can you talk about how you reached it?

We were studying different models of distributed computing, both synchronous and asynchronous. In synchronous models, computation occurs in lock-step rounds. In asynchronous models, there is no common notion of time, and processes can move at arbitrarily different speeds.

Researchers like Leslie Lamport, Danny Dolev, and Ray Strong were studying consensus problems in the presence of failures in synchronous models, in the form of Byzantine agreement. They were also studying fault-tolerant clock synchronization. From that problem, I defined an easier problem of "approximate agreement" on real values, where everybody starts from a real value and has to agree on some value that is in the range of all the other values. We studied that first in synchronous models, and then we saw we could extend the result to asynchronous models. Putting it all together, it seemed pretty natural to consider the problem of exact agreement in asynchronous systems.

Another impetus was the then-current work on database transaction commits. This is a critical example of a practical problem of exact agreement on whether a transaction should commit or abort. It is important in practice for the solution to tolerate some failures, though not necessarily Byzantine failures—just simple stopping failures. And an asynchronous model would be appropriate, because you couldn’t realistically assume absolute bounds on the message delays.

How did your work proceed from there?

At first I thought that we might come up with an algorithm for the asynchronous case of this problem, like we had for approximate agreement. But our attempts failed, so we started trying to find an impossibility result. We went back and forth, working on both directions. We narrowed in on the solution relatively quickly—it didn’t take more than a few weeks. Formulating the ideas nicely, in terms of concepts like bivalence, came a bit later.

When did you realize FLP’s significance?

I think we understood the practical significance for transactions relatively quickly, but we did not predict the impact it would have on later research. Theoreticians have developed many results that extend FLP to other problems, and many results that circumvent the limitation using such methods as randomization and failure detectors. Most interestingly, I think, is that FLP triggered the development of algorithms that established a clear separation of requirements for fault-tolerant consensus problems: safety properties of agreement and validity, which are required to hold always, and termination properties, which are required to hold during stable periods. These algorithms are not only interesting theoretically, but provide interesting guidelines for development of practical fault-tolerant systems.

In the 1980s, you also began work on input-output, or I/O, automata, which are used to model distributed algorithms.

Mark Tuttle and I developed the I/O automata modeling framework for asynchronous distributed systems early on, in 1987. We had some asynchronous distributed algorithms and we wanted to prove that they worked, but we were doing a lot of work to define our models and found that we were repeating that work in different papers. So we stepped back and developed a rigorous math model for systems with interacting components.

Later, you extended the work to cover synchronous systems, as well.

The I/O automata framework doesn’t deal with timing, so we defined another model, the Timed I/O Automata model, to cover synchronization. This is what we use as the foundation of our work on algorithms for mobile systems and wireless networks. My student Roberto Segala also worked with me to develop probabilistic versions, which are useful for describing randomized algorithms and security protocols.

So you have various frameworks that support the description of individual components in a system, and can then be used to produce a model for the entire system.

I don’t think the effort in developing these models is done yet. It would be nice to combine all the frameworks into one that includes discrete, continuous, timed, and probabilistic features, which is what’s needed to understand modern systems.

Let’s talk about some of your more recent work.

For the past 10 years or so, my group and I have been working on distributed algorithms for dynamic networks, in which the network changes over time because participating nodes can join, leave, fail, recover, and move, all while the algorithm is operating. We have designed algorithms that maintain consistent data, synchronize clocks, compute functions, and coordinate robots. We have also worked quite a bit recently on low-level wireless communication issues—managing contention among different senders in wireless networks.

Are there certain techniques, principles, or characteristics you have found helpful, or does every fickle network bring its own set of problems?

Some common techniques emerge. For example, we try to implement abstraction layers, which are basically simpler models, over more complex models. You could have a Virtual Node layer that adds fixed nodes at known locations to a mobile wireless network and makes it easier to write higher-level algorithms. Or a Reliable Local Broadcast layer that masks issues of contention management in wireless networks, producing a more reliable substrate for writing higher-level algorithms.

Various algorithmic techniques do also recur, such as quorum-based reliable data management, random methods for information dissemination, and back-off protocols for scheduling transmissions.

You have also begun working on biologically inspired distributed algorithms. Can you talk a bit about that work?

I’m really just starting on this, but the idea is that biological systems are a lot like distributed algorithms. Why? Because they consist of many components, interacting to accomplish a common task, and communicating mainly with nearby components. So I’m reading about, for example, self-organizing systems of insects and bacteria, systems of cells during development, and neural networks, and I’m trying to apply a distributed algorithms viewpoint. It’s too early to see what will emerge, but surely we can define models, state problems, describe systems at different levels of abstraction as distributed algorithms, analyze the algorithms, and maybe even prove lower bounds.

Q&A: The Power of Distribution

DOI

March 2013 Issue

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.

Q&A: The Power of Distribution

DOI

March 2013 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.