Communications of the ACM

ACM TechNews

HPC Experts Meet to Discuss Fault Tolerance

At the Fault Tolerance for Extreme Scalability Workshop, co-sponsored by the National Science Foundation Office of Cyberinfrastructure's Blue Waters and TeraGrid projects, national experts gathered to discuss the fault tolerance of petascale and exascale computing systems.

As increasingly powerful systems are built from an ever-larger number of components, even a small failure rate for an individual component can cripple a large-scale, long-running application, forcing it to be restarted. Strategies are needed to ensure that such large systems and massive applications can tolerate these faults and continue to operate. Over two days, a group of experts that included high-performance computing (HPC) center staff, systems analysts, middleware specialists, and fault tolerance experts explored past practices and common problems. Presentations prompted discussion of challenges, successes, and opportunities in fault tolerance, and speakers shared strategies to stretch the limits of capability-class computational and storage systems.

"It is invaluable for the systems specialists, middleware designers, and applications scientists to share their experiences and to talk about their expectations for other parts of the HPC ecosystem," said TeraGrid Grid Infrastructure Group's Daniel S. Katz. "This is the only way we will know what works, what doesn't work, and what we still need to do. Although the issues vary from platform to platform, there are many common experiences, tools, and techniques that, when shared, can lead to the development of best practices."

From HPC Wire