Technical Perspective: Jupiter Rising

We all read regularly about the growth of the Internet, whether in capacity, number of users, or breadth of applications. Yet beyond the proliferation and ever-shortening upgrade cycle of our mobile phones, we rarely observe a physical manifestation of that growth. It is within warehouse-sized computer facilities, the datacenters that attract the superlative "hyper," that Internet growth is physically manifest. These facilities must not only face the operational challenges arising from a large and dynamic number of users; they must also accommodate continuous growth in demand. For the hyper-scale datacenter, such growth translates to a modest-sized facility covering 15 acres, housing over 200,000 servers, and consuming mega-gallons of water and perhaps hundreds of megawatts of electricity.

The nature and role of the datacenter have also grown. It was once familiar only as an organization’s own facility, or as the ever-popular co-location and hosting facility where organizations might securely site their equipment. The warehouse or hyper-scale datacenter of the type discussed in the following paper is a wholly recent beast, with ownership and control reminiscent of a single-task supercomputer. Yet this is only a first-glance similarity: the warehouse computer approaches superlative-rich scales of computers, bandwidth needs, memory, and storage, and is used for far more than the single application of providing a popular set of Web pages.

Any datacenter may be thought of as a set of computers, some serving storage and most providing processing, combined with a network that permits both internal and external communication. The number and density of machines may vary, and the network operation seems simple; what gives rise to the complexity of the hyper-scale datacenter is scale. The hyper-scale datacenter is an optimization: minimizing the cost of capital and operation compels ever larger datacenters, as large as is practical to manage, and the hyper-scale facilities are the current logical conclusion. Provided the operations are also scalable, the design will seek the largest system possible.

When dealing with 100,000 servers, the hyper-scale datacenter must accommodate the certainty of equipment failure, the operational variability of supporting many different systems, and the practical issues of such huge collections of computers, such as the complex network needed to support communication among them. Designing for systems at scale has long been the raison d’être of solutions employed within the Internet itself. Yet as this paper makes clear, many of the Internet mechanisms for maintaining large-scale networks are suboptimal when, as in the case of this work, the datacenter is largely homogeneous, exhibits strong regularity within its structure, and has bandwidth needs that exceed the capabilities of equipment from any available vendor.

The tale of developing datacenter network fabrics able to accommodate the rapid pace of change is hugely influenced by properties of the datacenter itself: regular modular organization of systems, well-known network organization, and well-understood network needs. The authors show how their solutions are enabled by these selfsame properties. This does not prevent the lessons and process described in this paper from being more widely applicable to those outside the arena of the hyper-scale datacenter architect.

Some solutions may be a confluence of circumstance. Opportunity is created by a measurement-derived understanding of traffic growth and the lack of any practical vendor offering. The authors were keen to strip any solution of unnecessary features, and the availability of cheap commodity (merchant) silicon, together with a software-centered network control plane, leads to a scalable, bespoke network design. Migration from old systems to new would incur mammoth challenges for even a single deployment or upgrade, and yet the authors describe rolling out a deployment five times between 2004 and 2012, all while limiting the impact on services. Successive network implementations are tangibly improved by this experience.

The reader is presented with other operational insights, including how to achieve practical scaling by focusing on an architecture designed in sympathy with the peculiarities of the datacenter. One example is how the design of a robust distributed system, the switch-fabric management itself, can be balanced against the benefit of centralized control. The authors illustrate how a centralized approach need not discard distributed advantages, an idea that is core to many scale-out designs. In turn, this provides a facility or mechanism that uses the resources of multiple machines while coping with the failures inherent in any large design.

As datacenter services are central in all our lives, I commend "Jupiter Rising" to you. While the paper is steeped in operational experience, it is far more than a simple anecdote-ridden history; the authors describe solutions born of a unique confluence of need, ability, and circumstance, transposed into a fascinating set of insights for the reader.

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/2975159

September 2016 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
