The growth in the power of computers, driven for decades by Moore's Law and more recently by increased core counts, coupled with huge improvements in network bandwidth, disk densities, and other metrics, is nothing short of astonishing. Yet practitioners at the high end say we are losing the battle against the data deluge.
"Our society is literally drowning in data," notes Allan Snavely, associate director of the San Diego Supercomputer Center at the University of California at San Diego (UCSD). "Data acquisition devices ranging from space telescopes to genomic sequencing machines, and the Internet itself, are producing data almost faster than it can be written to disk."
Indeed, disks are a major part of the problem. Disk-based storage systems are 10 to 100 times slower than a network and thousands of times slower than main memory in delivering data to an application, in part because the data comes from a relatively slow electromechanical device. In recent years, fast flash memories have begun replacing disks in some applications. And manufacturers have started showing promising prototypes of more exotic non-volatile storage devices such as phase-change memory (PCM).
While substituting fast media for slow disks can help, it is by no means the whole solution. In fact, computer scientists at UCSD argue that new technologies such as PCM will hardly be worth developing for storage systems unless the hidden bottlenecks and faulty optimizations inherent in storage systems are eliminated.
A team at UCSD led by Steven Swanson, assistant professor of computer science and engineering, is doing just that. A recent prototype, called Moneta, bypasses a number of functions in the operating system (OS) that typically slow the flow of data to and from storage. These functions were developed years ago to organize data on disk and manage input and output (I/O). The overhead they introduce was so overshadowed by the inherent latency of a rotating disk that it seemed not to matter much. But with new technologies such as PCM, which are expected to approach dynamic random-access memory (DRAM) in speed, these delays stand in the way of the technologies reaching their full potential. Linux, for example, takes 20,000 instructions to perform a simple I/O request.
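The scale of that fixed software cost is easy to observe from user space. A rough, hedged illustration (not the Moneta measurement): time a 4 KiB read that traverses the kernel's I/O path against a plain in-memory copy of the same data. Even with the file fully cached, the kernel path pays the per-request software overhead that rotating disks used to hide.

```python
import os, tempfile, time

BLOCK = 4096
N = 10_000

# create a small file to read back
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK))
    path = f.name

fd = os.open(path, os.O_RDONLY)

t0 = time.perf_counter()
for _ in range(N):
    os.pread(fd, BLOCK, 0)      # full syscall + VFS path on every read
t_kernel = (time.perf_counter() - t0) / N

buf = os.pread(fd, BLOCK, 0)    # same data, now resident in user space
t0 = time.perf_counter()
for _ in range(N):
    bytearray(buf)              # an in-memory copy, no OS involvement
t_mem = (time.perf_counter() - t0) / N

os.close(fd)
os.unlink(path)
print(f"kernel I/O path: {t_kernel*1e6:.2f} us/op, memory copy: {t_mem*1e6:.2f} us/op")
```

On a spinning disk the gap between these two numbers is noise next to millisecond seek times; against a DRAM-speed device it becomes the dominant cost.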
Moneta is a prototype high-performance storage array. It uses field-programmable gate arrays to implement a scheduler and a distributed set of memory controllers attached to conventional DRAM emulating PCM. (A similar prototype from the same researchers, called Onyx, uses PCM.) The scheduler orchestrates Moneta's operations and is able to extract parallelism from some I/O requests. By redesigning the Linux I/O stack and by optimizing the hardware/software interface, researchers were able to reduce storage latency by 60% and increase bandwidth as much as 18 times.
The I/O scheduler in Linux performs various functions, such as assuring fair access to resources. Moneta bypasses the scheduler entirely, reducing overhead. Further gains come from removing the locks in the low-level driver, which block parallelism, and substituting more efficient mechanisms that do not.
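The locking point can be sketched in miniature. The code below is an illustrative invention, not the Moneta driver: instead of a single lock serializing every request-tag allocation, the tag space is pre-partitioned so each thread owns a disjoint range, and the per-I/O fast path touches no shared state at all.

```python
import threading

TAGS_PER_THREAD = 4  # illustrative value

class PartitionedTags:
    """Each thread claims a private, disjoint block of request tags."""
    def __init__(self):
        self._next_block = 0
        self._init_lock = threading.Lock()   # taken once per thread, not per I/O
        self._local = threading.local()

    def alloc(self):
        pool = getattr(self._local, "pool", None)
        if pool is None:
            with self._init_lock:            # rare: first I/O from this thread
                base = self._next_block
                self._next_block += TAGS_PER_THREAD
            pool = list(range(base, base + TAGS_PER_THREAD))
            self._local.pool = pool
        return pool.pop()                    # fast path: no shared state touched

    def free(self, tag):
        self._local.pool.append(tag)

tags = PartitionedTags()
seen = []

def worker():
    t = tags.alloc()                         # no contention with other threads
    seen.append(t)
    tags.free(t)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(seen))                          # four distinct tags, one per thread
```

The design choice mirrors the article's point: shared access serialized by a lock is not the same thing as parallelism, and a small amount of partitioning removes the serialization entirely.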
"The OS gives you parallelism as an illusion," says Rajesh Gupta, a professor of computer science and engineering at UCSD. "It was designed to allow multiple users to support concurrency while actually doing shared access. But shared access and parallelism are two different things."
A further reduction in latency comes from bypassing an interrupt needed to wake up a thread that sleeps while waiting for completion of an I/O request. Instead, the thread does not sleep but spins in a busy loop.
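A hedged sketch of that trade-off, with illustrative names rather than Moneta's actual mechanism: the waiting thread polls a completion flag in a tight loop, so when the simulated device finishes, the completion is observed without the interrupt-and-wakeup path a sleeping thread would need.

```python
import threading, time

# The sleeping alternative would use a threading.Event: the waiter blocks
# in event.wait() and the completer calls event.set() -- the analogue of
# the interrupt-driven wakeup that Moneta's spin-wait avoids.

flag = [False]                   # completion flag the simulated "device" sets

def device_completes():
    time.sleep(0.001)            # simulated I/O service time
    flag[0] = True               # completion: no wakeup machinery needed

threading.Thread(target=device_completes).start()

t0 = time.perf_counter()
while not flag[0]:               # busy loop: burns a core, but skips the
    pass                         # interrupt + scheduler wakeup latency
spin_time = time.perf_counter() - t0
print(f"spin wait observed completion after {spin_time*1e3:.2f} ms")
```

Spinning only pays off when the expected wait is short, which is exactly the regime fast non-volatile memories create.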
In the hardware, I/O bandwidth is increased by providing separate queues for reads and for writes, and I/O is balanced by breaking up large requests so that smaller ones are not forced to wait behind them.
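One way to picture the balancing, as an illustrative sketch rather than Moneta's actual scheduler: split each request into fixed-size chunks and round-robin between separate read and write queues, so a large transfer cannot starve a small request queued behind it.

```python
from collections import deque

CHUNK = 8  # KB per scheduling quantum (illustrative value)

def split(request):
    """Break one (name, size_kb) request into CHUNK-sized pieces."""
    name, size = request
    return [(name, min(CHUNK, size - off)) for off in range(0, size, CHUNK)]

def schedule(reads, writes):
    """Interleave chunks drawn from separate read and write queues."""
    rq = deque(c for r in reads for c in split(r))
    wq = deque(c for w in writes for c in split(w))
    order = []
    while rq or wq:
        if rq: order.append(rq.popleft())   # alternate between the two
        if wq: order.append(wq.popleft())   # queues, one chunk at a time
    return order

order = schedule(reads=[("big-read", 24)], writes=[("small-write", 4)])
print(order)
# the 4 KB write is serviced after the first 8 KB read chunk,
# not after the entire 24 KB read
```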
As a result of these optimizations to software and hardware, Moneta performs I/O benchmarks 9.5 times faster than a RAID array of conventional disks, 2.8 times faster than a RAID array of flash-based solid-state drives (SSDs), and 2.2 times faster than Fusion-io's flash-based SSD, a high-end flash technology.
Actual speedups on large, data-intensive jobs could be even more dramatic, says Snavely. "For the right kind of random data access, such as examining a social network graph, flash is between 10 times and 100 times faster than spinning disk, and Moneta is 10 times faster than flash," he says. "That means some data-mining calculations that might take 24 hours today would only take 15 minutes or even less."
But, cautions Swanson, getting that kind of improvement will require rewriting or significantly reengineering the application software.
While Moneta is optimized for PCM, Swanson says many of Moneta's principles could be used with other fast non-volatile memories, such as spin-transfer torque random access memory. "The key takeaway is that when these new memories appear, we will shift to a place where the software really plays a critical role in the overall performance of the system," Swanson says. "In Moneta, we focused strongly on software for minimizing latency and maximizing concurrency."
Swanson's team is now moving parts of the Moneta software out of the OS and into the storage array itself, in a special library that can be accessed by applications. Additional enhancements to reduce latency will be possible, Swanson says. For example, a database typically uses the standard I/O calls, with their inherent overhead, which are provided by the OS. But software in the new library bypasses that overhead by taking over those calls and talking to the storage hardware directly. "I can now address non-volatile storage directly from my application, just like DRAM," Gupta says. "That's the broader vision: a future in which the memory system and the storage system are integrated into one."
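The direction can be sketched from user space with standard memory mapping; the FastFile class and the backing file below are illustrative inventions, not the UCSD library. After one up-front kernel crossing to map the "device," reads and writes become plain loads and stores, which is the spirit of addressing storage like DRAM.

```python
import mmap, os, tempfile

class FastFile:
    """Accesses 'storage' through one mmap instead of per-I/O syscalls."""
    def __init__(self, path, size):
        self.fd = os.open(path, os.O_RDWR)
        os.ftruncate(self.fd, size)
        self.mem = mmap.mmap(self.fd, size)  # one kernel crossing, up front

    def write(self, offset, data):
        self.mem[offset:offset + len(data)] = data       # plain memory store

    def read(self, offset, length):
        return bytes(self.mem[offset:offset + length])   # plain memory load

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

ff = FastFile(path, 4096)
ff.write(0, b"hello, storage")       # no write() syscall on the data path
data = ff.read(0, 14)                # no read() syscall either
print(data)
```

With a disk behind the mapping this buys little, because page faults still reach the device; with DRAM-speed non-volatile memory, the stores and loads themselves are the I/O.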
Swanson says taking advantage of these capabilities will require extensive revamping of application software. "We have talked with companies like Oracle and Teradata, and they realize they will have to change a lot of the way their systems work," he says. A great deal of the complexity in database management systems lies in the buffer management and query optimization to minimize I/O, and much of that might be eliminated.
Also, storage systems will greatly benefit from changes in the way they access data over a network. Now, when a storage system accesses remote data on disk, it must navigate through the network stack, through file services software on both the local and remote machine, then do all that again coming back. "This change in storage performance is going to force us to look at all these different aspects of computer system design," Swanson says. "The reach of this is going to be surprisingly broad."
Swanson and colleagues at other universities are attempting to "catalyze these changes" by forming a consortium, not yet named, of storage system researchers. They include experts from the low levels of the OS, through the application layers, and on up to data center and network architectures, he says. "The idea is to attack all these layers at once," says Swanson, "and hopefully demonstrate that it's worth industry's time to make these changes to commercial systems."
Swanson's group is looking out five to eight years, he says. "The end point of this is you'll have non-volatile solid-state storage that's about as fast as DRAM," he notes. "That would be an improvement of about 2,500 times in latency and bandwidth. This is much faster than Moore's Law increases. I think it's the largest increase in any aspect of system performance, in the shortest time, ever. Fully exploiting these memories is going to require making changes throughout the system, but it's going to be a very exciting time."
Further Reading

Caulfield, A., De, A., Coburn, J., Mollov, T., Gupta, R., and Swanson, S.
Moneta: A high-performance storage array architecture for next-generation, non-volatile memories, Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Atlanta, GA, Dec. 4-8, 2010.

Akel, A., Caulfield, A., Mollov, T., Gupta, R., and Swanson, S.
Onyx: A prototype phase-change memory storage array, Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems, Portland, OR, June 14, 2011.

Mollov, T., et al.
Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, Nov. 13-19, 2010.

Lee, B., et al.
Phase-change technology and the future of main memory, IEEE Micro 30, 1, Jan./Feb. 2010.

Qureshi, M.K., Srinivasan, V., and Rivers, J.A.
Scalable high performance main memory system using phase-change memory technology, Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), Austin, TX, June 20-24, 2009.
©2012 ACM 0001-0782/12/0100 $10.00
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.