News

Researchers Simplify Parallel Programming

Parallel computing has become increasingly important as chipmakers put more and more processor cores on individual chips.
Figure. Technicians at Argonne National Laboratory work on MIRA, the fifth-fastest supercomputer in the world (as of July 2014).

Using statistical models, computer scientists have shown that certain kinds of parallel computation are not as difficult as previously thought. Researchers from the Massachusetts Institute of Technology (MIT), Microsoft Research, and the Israel Institute of Technology (Technion) showed that for a large class of non-blocking parallel programs, lock-free algorithms will perform fully as well as the more complex wait-free algorithms.

“What we have shown is that you really don’t need to design these wait-free algorithms,” said Nir Shavit, a computer science professor at MIT. “Most methods that are lock-free are also wait-free.”

The two types of algorithms are part of a class of techniques used in shared-memory systems that are non-blocking, meaning concurrent threads in a program are not halted indefinitely by mutual exclusion. “Lock-free” now generally refers to algorithms that guarantee some concurrent operation will always make progress; that is, there is guaranteed progress somewhere in the system, regardless of scheduling. The more rigorous “wait-free” algorithms go further, guaranteeing that every thread makes progress in a finite number of its own steps, no matter how the threads are scheduled.
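
To make the distinction concrete, here is a minimal C++ sketch (an illustration, not code from the researchers' paper) of two ways to increment a shared counter: a lock-free compare-and-swap loop, in which some thread always succeeds but an unlucky thread can in principle retry indefinitely, and a wait-free fetch-and-add, in which every thread finishes in a bounded number of its own steps.

```c++
#include <atomic>

std::atomic<long> counter{0};

// Lock-free: a compare-and-swap retry loop. The system as a whole always
// makes progress (some thread's CAS succeeds), but a particular thread
// can keep losing the race and retrying.
void increment_lock_free() {
    long seen = counter.load();
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
        // On failure, 'seen' is refreshed with the current value; retry.
    }
}

// Wait-free: an atomic fetch-and-add completes in a bounded number of
// steps for the calling thread, regardless of what other threads do
// (assuming the hardware provides fetch-and-add as a single instruction).
void increment_wait_free() {
    counter.fetch_add(1);
}
```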

While wait-free seems better on theoretical grounds, it is designed to handle worst-case conditions that very rarely occur, the researchers found. Using a technique called Markov chain lifting, they modeled the average behavior of multiple threads across an entire system over time to simulate the threads’ “expected asymptotic behavior.” The result showed that for many problems, the two kinds of algorithm will perform about the same. (Nevertheless, wait-free algorithms will still be the right choice for real-time systems where processes have hard deadlines, the researchers said.)

Maurice Herlihy, a professor of computer science at Brown University who was not involved in the work, said the kind of analysis the researchers applied may have broader implications elsewhere in computer science. “I think this kind of thing could be applied to a lot of situations, where theory tends to focus on worst case, rather than expected case,” he said.

The World Goes Parallel

Parallel computation has become increasingly important in recent years, moving from the realm of scientific supercomputing into corporate database servers and into the cellphones and personal computers of users. Limits on energy consumption, heat dissipation, and chip technology are preventing the speeds of individual processors from increasing at the rate we have seen over the past decades. To increase overall performance, chipmakers are putting more and more processor cores on a single chip. Yet using those processors efficiently requires the often-difficult task of restructuring a program to run in multiple concurrent threads.

In some cases, such as in graphics processing, that is relatively easy, as each processor can be allocated its own separate portion of data. However, many programs contain instructions that need to share data, and programmers often “lock” that shared data during program execution so it is protected in memory while it is being worked on.

Programmers have employed such locks for many years, but they exact a penalty. The operations inside a single lock, which might span many tens of instructions, can become a bottleneck to the overall execution of the system. One goal in these shared-memory systems is to reduce the sequential bottlenecks caused by locks, Shavit said. They can be minimized by use of fine-grained locking, which locks up fewer instructions at a time. Yet the use of lock-free algorithms opens up the possibility of parallelism at the lowest possible level of granularity, at the level of individual machine instructions, he said.
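
As a rough illustration of that granularity trade-off (a hypothetical C++ sketch, not drawn from any of the systems discussed here), the fragment below contrasts a single coarse lock over an entire table, which serializes every thread, with per-bucket locks, which only serialize threads that touch the same bucket.

```c++
#include <array>
#include <list>
#include <mutex>
#include <utility>

// Coarse-grained: one lock guards the whole table, so every operation
// is serialized behind it.
class CoarseMap {
    std::mutex lock_;
    std::array<std::list<std::pair<int, int>>, 64> buckets_;
public:
    void put(int key, int value) {
        std::lock_guard<std::mutex> guard(lock_);   // whole table locked
        buckets_[static_cast<unsigned>(key) % buckets_.size()].push_back({key, value});
    }
};

// Fine-grained: one lock per bucket, so threads working on different
// buckets no longer wait on each other.
class FineMap {
    struct Bucket {
        std::mutex lock;
        std::list<std::pair<int, int>> items;
    };
    std::array<Bucket, 64> buckets_;
public:
    void put(int key, int value) {
        Bucket& b = buckets_[static_cast<unsigned>(key) % buckets_.size()];
        std::lock_guard<std::mutex> guard(b.lock);  // only this bucket locked
        b.items.push_back({key, value});
    }
};
```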

Another problem with locks is that they have “terrible semantics,” Shavit said. Programmers may set a lock in one place in a program but release it at some number of other, unpredictable places, making it hard to read and debug the code.

While lock-free algorithms solve some problems, they in turn can be tricky to code and debug. Although the move to parallel computing is leading to increased interest in these algorithms, most programmers need not be concerned with this latest analysis of lock-free versus wait-free, said Marc Snir, a computer scientist at Argonne National Laboratory and a professor at the University of Illinois at Urbana-Champaign. “It’s a concern for people who work at the bottom of the software stack, say people writing the runtime of C++ or the parallel framework in C#,” he said.

Snir said the actual results obtained from wait-free algorithms often fall short of theoretical promises for various reasons. “When you implement very low-level coordination algorithms, there are a bunch of low-level issues that come into play, and the use of wait-free algorithms or not is a small part of those considerations.”

Enter Transactions

In recent years, another method of concurrency control that avoids locks has attracted considerable attention. Called “transactional memory,” it works a little like database transactions, which are required to maintain the integrity of databases accessed by multiple simultaneous users. In transactional memory, a unit of code that reads and writes to shared memory is treated as a single atomic transaction, but the memory is not locked against access by other threads. If conflicting changes to memory are made by different threads at the same time, the transaction is aborted and the changes reversed. The net result is that parallelism becomes possible at the granularity of individual machine instructions.
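
As one concrete example of the idea at the software level, GCC ships an experimental transactional-memory extension enabled with -fgnu-tm; the sketch below (an illustration under that assumption, not code from any system mentioned here) marks a region as an atomic transaction and leaves conflict detection, rollback, and retry to the TM runtime.

```c++
// Minimal sketch using GCC's experimental transactional memory extension
// (compile with: g++ -fgnu-tm transfer.cpp). No lock is ever named; if two
// threads' transactions conflict, the runtime aborts one, rolls back its
// writes, and re-executes it.

static long balance_a = 1000;
static long balance_b = 0;

void transfer(long amount) {
    __transaction_atomic {
        // All reads and writes in this block take effect atomically.
        balance_a -= amount;
        balance_b += amount;
    }
}
```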

Hardware-based transactional memory was introduced last year in Intel’s Haswell and IBM’s POWER8 microprocessors. Transactional memory also exists at the software level in, for example, some C, C++, C#, and Java compilers.
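
On the hardware side, Intel exposes its Haswell-era transactional support (TSX/RTM) through compiler intrinsics. The following is a simplified sketch, assuming a TSX-capable processor and a compiler flag such as -mrtm; a production version would also have to check the fallback lock inside the transaction so the two paths cannot race.

```c++
#include <immintrin.h>   // RTM intrinsics: _xbegin, _xend
#include <mutex>

std::mutex fallback_lock;  // taken only when the hardware transaction fails
long shared_total = 0;

void add_to_total(long amount) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Inside a hardware transaction: the processor tracks our reads and
        // writes, and a conflicting access from another core aborts us.
        shared_total += amount;
        _xend();                       // try to commit the transaction
        return;
    }
    // Abort path: fall back to an ordinary lock so progress is guaranteed.
    // (Simplified; real lock-elision code also reads the fallback lock
    // inside the transaction so the two paths cannot interleave unsafely.)
    std::lock_guard<std::mutex> guard(fallback_lock);
    shared_total += amount;
}
```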

Herlihy at Brown said it is too early to tell how well the new transaction-enabled chips will work. He predicts the emergence of a “transactional programming model” that avoids locks by a hybrid of hardware and software transactions. “It’s my hope that the right mix of fast low-level hardware and simpler high-level software will give us something both efficient and well-structured,” he said.

Such systems might be augmented with new programming languages specifically geared for parallel applications. These “functional” languages treat computation as if it were a series of mathematical functions whose output depends only on input and not on the values of other program or system variables. Examples include Haskell and Scala, and Herlihy predicts ideas from those languages will permeate more mainstream languages.


Explains Herlihy, “Transactional memory is a little like the ball bearings, where the concurrent parts synchronize. But languages where it’s very tightly controlled which parts of the program can be shared and modified concurrently can go a long way to making large parallel systems easier to understand and maintain.”

Going Forward

Debate and research surrounding the various methods of concurrency control center on the delays that can occur during program execution, and the impact of those delays on system performance. However, said supercomputer user Snir, other criteria, including power consumption, reliability, availability, and maintainability, increasingly trump the performance metric. The largest system now at Argonne has 750,000 processor cores, and Snir predicts the arrival of exascale computing, in which single applications run in anywhere from 100 million to one billion threads. That will require new codes that are hybrids of shared-memory and message-passing architectures, he said. And at those levels, system designers may worry less about which algorithm produces the smallest delays than about which ones are easiest to debug and which ones consume the least power.
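
In practice, a hybrid of shared memory and message passing usually means something like MPI between nodes plus threads (for example, OpenMP) within each node. The skeleton below is only a hypothetical sketch of that structure, not Argonne code; it assumes an MPI implementation and OpenMP support (for example, built with mpicxx -fopenmp).

```c++
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    // Request thread support: only the main thread will make MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    std::vector<double> local(1000000, 1.0);
    double local_sum = 0.0;

    // Shared-memory parallelism inside the node.
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = 0; i < (long)local.size(); ++i)
        local_sum += local[i];

    // Message passing between nodes.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```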

Shavit agrees hybrid transactional systems will become the norm, with fast hardware handling smaller, simpler transactions and software handling larger, more complex ones. He predicts such hybrid systems will find support in the next version of the GNU C compiler within the next two years. “Once in GCC, they will show up everywhere,” he said. “People will do more and more with transactions and less and less with locking.”

The concepts of lock-free, wait-free, and transactional memory are at present mostly of interest to programmers writing high-performance parallel system software, such as in the operating system or communications software layers. “But this will change as the number of cores on a chip increases,” Shavit said. “What is now the realm of database people and operating system people will become the day-to-day business of other programmers. These programmers will have languages with all these mechanisms built in. They will be able to use transactions, and the compiler will generate code without locks and with sufficiently low granularity.”

Herlihy said there has never been a more exciting time for research in parallel computation. “The economics has changed, the technology has changed, and there are all kinds of exciting new applications like machine learning and graph search—all these things are crying out for better parallelization.”

Further Reading

Alistarh, D., Censor-Hillel, K., Shavit, N.
Are lock-free concurrent algorithms practically wait-free? 46th ACM Symposium on Theory of Computing (STOC 2014), New York, NY, May 2014. http://bit.ly/1rUSxTz/

Amdahl, G.
Validity of the single processor approach to achieving large-scale computing capabilities. Proceedings of the AFIPS Spring Joint Conference, 1967. http://bit.ly/1rUSzL7

Ferri, C., Wood, S., Moreshet, T., Bahar, R.I., Herlihy, M.
Embedded-TM: energy and complexity-effective hardware transactional memory for embedded multicore systems. Journal of Parallel and Distributed Computing (JPDC), Volume 70, Issue 10, October 2010. http://bit.ly/1pLwZ4g

Herlihy, M., Shavit, N.
The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, March 2008. http://bit.ly/1sbBYRo

Intel Corp.
Introduction to Parallel Programming video lecture series, 2012. http://intel.ly/1AuM2pc
