Last night I co-hosted a Birds of a Feather (BoF) session at SC13 in Denver with Lucy Nowell of the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. As we looked around the room early in the session, Lucy and I were reminded how significant the interest in all things "big data" is right now: even though the BoF started at 5:30pm and ran until after 7pm, over a hundred people came and stayed for the entire session.
Lucy and I organized the session on behalf of the Networking and Information Technology Research and Development (NITRD) High-End Computing working group, which we co-chair. The working group includes representatives of agencies across the U.S. government with significant interests in high performance computing. The BoF was our response to the observation that the "big data" discussion is too often disjoint from any discussion of the resources required to manage and gain insight from the data.
The BoF featured four speakers who each came at big data from a different perspective.
Terry Moore from the University of Tennessee at Knoxville, one of the organizers of the NSF workshop on Big Data and Extreme-scale Computing (BDEC), discussed the general findings of that workshop as a way to help frame our discussion. Terry touched on many of the issues that need to be addressed as exascale resources are developed to manage and explore big data, including the workflows and tools that users need to process data and the need for "in flow" processing of data to speed discovery of interesting features or reduce the amount of data that needs to be stored. One particularly interesting point raised during Moore’s discussion is the general need for a shared storage infrastructure that mirrors the shared transport infrastructure — the Internet — we have developed. While everyone has access to a shared platform to move data, there is not yet an equivalent shared storage platform that can assist in managing the time-sensitive positioning of data with respect to users and computational resources.
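As one concrete, if simplified, illustration of the "in flow" idea, here is a minimal Python sketch (my own, not from the BDEC workshop) that keeps running statistics over a stream and retains only samples that cross a crude "interesting" threshold, rather than storing everything:

    import math
    import random

    def stream(n):
        """Stand-in for an instrument or simulation emitting values."""
        for _ in range(n):
            yield random.gauss(0.0, 1.0)

    # Process the data "in flow": keep running summary statistics and retain
    # only the samples that look interesting, instead of storing the full stream.
    count, mean, m2, retained = 0, 0.0, 0.0, []
    for x in stream(1_000_000):
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if abs(x) > 4.0:          # crude stand-in for "interesting feature" detection
            retained.append(x)

    std = math.sqrt(m2 / (count - 1))
    print(f"seen {count} samples, retained {len(retained)}; mean={mean:.3f}, std={std:.3f}")

Real in-flow pipelines are far more sophisticated than this, but the payoff is the same: less data that needs to be stored and faster discovery of the features that matter.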
Peter Kogge from the University of Notre Dame discussed the ways in which architectures for handling big data effectively may differ from the architectures used to create that data. In particular, he noted that in many big data applications getting timely answers requires systems with global shared memory, very fast pointer following, and low-overhead threads. He compared conventional HPC architectures with notional exascale systems being considered today, including the U.S. Department of Energy-funded X-Caliber effort, on a class of problems of interest to the insurance industry, and found that such designs can deliver performance 67 times faster than conventional systems while using one-tenth the computing resources. If we are going to deal effectively with big data, we may have to design hardware with big data in mind.
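To make that contrast concrete, here is a toy Python sketch of the two access patterns (my own illustration, not code from Kogge’s talk); the point is the shape of the memory traffic, not the performance of the interpreted code:

    import random

    N = 1_000_000

    # Dense, regular arithmetic: contiguous access and lots of floating point
    # work, the pattern conventional HPC systems are built to do well.
    values = [random.random() for _ in range(N)]
    dot = sum(v * v for v in values)

    # Pointer chasing: each load depends on the previous result, the access
    # pattern is irregular, and there is almost no arithmetic. On real hardware
    # this is bound by memory latency, which is why workloads like this favor
    # global shared memory, fast pointer following, and low-overhead threads.
    next_node = list(range(N))
    random.shuffle(next_node)
    i = 0
    for _ in range(N):
        i = next_node[i]          # dependent, irregular memory access

    print(f"dot product = {dot:.1f}, final node = {i}")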
Ron Bewtra, the Chief Technology Officer of the U.S. National Oceanic and Atmospheric Administration (NOAA), presented his agency’s current challenges with big data and discussed some of the ways it is meeting those challenges today. The agency has a big job just dealing with the over three billion observations it receives every day from satellites and terrestrial platforms; when you factor in the substantial HPC simulations that incorporate those data and feed the operational products NOAA delivers dozens of times every day, it truly has a large challenge. The organization moves 20-30 TB a day across the country among its observation stations and HPC centers, and just one center moves several petabytes a week between disk stores and data visualization systems. Ron’s talk was interesting because it grounded the conversation in a very real problem that exists today, reminding us that as we move to new computing platforms in the future we cannot break the missions we are serving today.
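Some back-of-the-envelope arithmetic with the figures Ron quoted shows the sustained rates those volumes imply (treating "several petabytes" as roughly 3 PB for illustration):

    TB = 1e12   # bytes
    PB = 1e15   # bytes
    SECONDS_PER_DAY = 86_400

    # Nationwide movement: 20-30 TB/day (using the upper end).
    wan_gbps = 30 * TB * 8 / SECONDS_PER_DAY / 1e9
    print(f"30 TB/day is roughly {wan_gbps:.1f} Gb/s sustained")

    # One center: "several petabytes a week" (assume ~3 PB for illustration).
    center_gbps = 3 * PB * 8 / (7 * SECONDS_PER_DAY) / 1e9
    print(f"3 PB/week is roughly {center_gbps:.0f} Gb/s sustained")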
The final speaker was Alok Choudhary from Northwestern University. Alok was part of the study team that produced the recent Department of Energy report Synergistic Challenges in Data-Intensive Science and Exascale Computing, a document commissioned to help the DoE Office of Science understand how next generation computing activities align with the needs of data-intensive science. Alok opened with extant examples of big data challenges the science community faces today, including the observation that today’s advanced light sources produce about 1 TB of data per day, whereas next generation sources may generate as much as 1 TB per second. The study explicitly identifies the intertwining of compute and data, acknowledging that big compute creates big data and big data often needs to be analyzed with big compute, at least in the science domains. Alok reviewed the investments DoE must make to navigate the present computational shift effectively, including novel approaches to memory management and data movement, the integration of analysis and computation, an explicit focus on workflow tools that enable user productivity, and next generation workforce development.
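The jump from 1 TB per day to 1 TB per second is easy to read past; a quick calculation makes the scale explicit:

    SECONDS_PER_DAY = 86_400

    # Today's light sources: ~1 TB/day.  Next generation: up to 1 TB/s.
    rate_increase = SECONDS_PER_DAY            # (1 TB/s) / (1 TB/day)
    daily_volume_pb = SECONDS_PER_DAY / 1000   # TB accumulated in one day, in PB

    print(f"rate increase: {rate_increase:,}x")
    print(f"one day at 1 TB/s: about {daily_volume_pb:.1f} PB")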
The science and engineering communities have worked together through several 1,000x leaps in computing capability, each time managing to push the boundaries of how these technologies are applied to the problems our society faces. Last night made clear to me that the transition from pan-petaFLOPS to exascale computing is interesting because it is both accelerating the move into the big data era and being shaped by the arrival of big data from disciplines well outside the traditional supercomputing community. If we are thoughtful as we proceed into this new era, we have a unique opportunity to create something much more effective, relevant, and useful than we ever have before.
John West is the Director of the Department of Defense High Performance Computing Modernization Program (www.hpc.mil), and a member of the executive committee for SC13.