
Communications of the ACM

ACM TechNews

Diagnosing Performance Problems in Supercomputers


Overseeing maintenance on a supercomputer.


Credit: Sandia National Laboratories

Researchers at Sandia National Laboratories and Boston University (BU) have spent more than a year developing the Lightweight Distributed Metric Service (LDMS), a framework to automatically monitor and diagnose performance issues in supercomputers.

Using LDMS to diagnose supercomputer problems should help systems administrators allocate resources and schedule jobs to maximize performance.

The team says it used supervised machine learning, writing programs to reproduce known anomalies likely to affect two systems: a Cray XC30m supercomputer at Sandia and BU's Mass Open Cloud. With LDMS, the supercomputer collected more than 700 metrics each second for each computer, and the cloud collected about 50 metrics at two- or three-second granularity.

Sandia's Vitus Leung notes the difference stems from the "noisiness of the data on the BU cloud, because it's not nearly as dedicated."

The researchers collated statistical characteristics of the data, reducing it to about 10% of the raw volume, and fed the result to machine-learning algorithms.
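
As a rough illustration of that reduction step, the sketch below summarizes windows of raw readings with a few statistics. It is not the researchers' code: the function names, the choice of statistics, the window length, and the sample values are all assumptions made for this example, and the window length was picked so that the output happens to be one-tenth the size of the input.

```python
import statistics


def window_features(samples):
    """Summarize one window of raw readings for a single metric."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "stdev": statistics.pstdev(samples),
    }


def summarize(series, window):
    """Split a series into non-overlapping windows and summarize each one."""
    return [
        window_features(series[i:i + window])
        for i in range(0, len(series) - window + 1, window)
    ]


# Hypothetical one-per-second readings of a single metric (made-up values).
readings = [float(i % 7) for i in range(120)]

# Assumed window of 40 samples; each window collapses to 4 numbers.
features = summarize(readings, window=40)

raw_count = len(readings)
reduced_count = sum(len(f) for f in features)
print(reduced_count / raw_count)  # fraction of the raw volume that remains
```

In a supervised setup like the one described, each summarized window would carry a label naming the anomaly that was injected while it was recorded, and the labeled windows would form the training set.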

From Government Computer News

 

Abstracts Copyright © 2017 Information Inc., Bethesda, Maryland, USA


 
