Technical Perspective: Balancing At All Loads

servers on each end of an imbalanced seesaw

Large-scale distributed parallel computing has become necessary for handling machine learning and other algorithms with ever-increasing complexity and data requirements. For example, Google TensorFlow can execute distributed algorithms that require thousands of computing nodes to work simultaneously. However, computing systems suffer from random fluctuations in service times. Power management, software or hardware failures, maintenance, and resource sharing are the primary causes of service time variability. Failures and maintenance are inevitable, and power management is crucial for reducing energy consumption. Multiplexing and balancing many applications over shared hardware and software enable high resource utilization. Therefore, random service time fluctuations are inherent to computing environments, as much as noise is unavoidable in communication systems.

Distributed parallel computing provides simultaneous execution of smaller tasks that make up larger computing jobs. However, because of the random fluctuations in service, some tasks take much more time to complete. We call these tasks stragglers. Large-scale systems split jobs into many tasks. Thus, even small-variance random service times will, with high probability, result in a non-negligible number of straggling tasks. Redundancy in task execution has emerged as a powerful way to curtail the overall service variability for two reasons: It attains load balancing without monitoring task execution speeds, and it disposes of the need to move data between nodes quickly.

To understand these benefits, consider a two-task job of multiplying vector x by two matrices, A and B, with identical dimensions. The system has four computing nodes, and thus we can execute the two tasks redundantly in various ways. A straight-forward way, which we refer to as replication, is to have two nodes compute Ax and the other two Bx. Therefore, the job executes when one of the two nodes computing Ax finishes and one of the two nodes computing Bx finishes. Another way to introduce redundancy, which we refer to as coding, is to have each of the four servers compute a different product from the set Ax, Bx, (A+B)x, and (A-B)x. Observe that when any two out of four servers finish their tasks, the job execution ends, with some simple postprocessing. Thus, coding provides more flexibility than replication.

However, with added redundancy, the scale of resource sharing grows. The fluctuations in service time (that redundancy is trying to counteract) increase. How much redundancy should we add? Much of the recent work on redundancy in distributed systems has focused on code design. Different codes have been proposed for many job types and various numbers of stragglers. However, we know very little about what exact code rate (level of redundancy) we should select for a specific system to optimize the system's performance. The following paper addresses this problem by making the codes rateless. This scheme automatically adapts redundancy levels to varying node computing speeds and offered loads. It thus achieves near-perfect load balancing for a range of system parameters without having to know their values.

Rateless codes can generate an infinite sequence of coded symbols from a set of source symbols and can guarantee a high probability of recovery of the source symbols from any subset of the coded symbols of size a little larger than the number of source symbols. Since the number of coded symbols is arbitrary, such codes do not have a fixed code rate; hence, the term rateless. A canonical class of rateless codes is random linear codes. These codes produce coded symbols as random linear combinations of source symbols. Despite their rate efficiency, random linear codes are rarely used in practice because of the computational complexity of their decoders. However, several standards for mobile wireless networks have adopted rateless coding schemes such as fountain and raptor codes because their clever design enables low complexity decoding.

The following paper focuses on matrix-vector multiplication, an essential operation in many machine algorithms.

Traditional rateless schemes adapt redundancy to varying channel gains and thus achieve better throughput than ordinary coding over channels with fluctuating conditions. Similarly, the rateless strategy proposed in this paper adapts its redundancy to unpredictable variations in computing environments, for example, the unknown number of stragglers. It thus achieves the minimum latency with minor overhead. The paper focuses on matrix-vector multiplication, an essential operation in many machine algorithms, including gradient descent.

An additional benefit of the scheme proposed by the authors is its computational efficiency. The goal of introducing redundancy is to reduce latency in job execution. However, we expect redundant job execution to cost more in computing time, network bandwidth, expanded energy. Platforms that offer computing services, for example, Amazon EC2 and MS Azure, charge for their service proportionally to the renter's cumulative computing time. Common fixed-rate coding strategies reduce job execution latency, but that involves a lot of, often unused, redundant computation. In large-scale scenarios, the coding strategy proposed in the paper approaches the latency of ideal load balancing, and it performs nearly zero redundant computations.

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/3524298